What is false disk failure, and why is it a problem?

By Rob Ober on March 25, 2013, 2:56 AM
Editor’s Note:
This is a guest post by Rob Ober, corporate strategist at LSI. Prior to joining LSI, Rob was a fellow in the Office of the CTO at AMD. He was also a founding board member of OLPC ($100 laptop.org) and OpenSPARC.

I’ve spent a lot of time with mega datacenters (MDCs) around the world trying to understand their problems – and I really don’t care what area those problems are in, as long as they’re important to the datacenter. What is the #1 Real Problem for many large-scale mega datacenters? It’s something you’ve probably never heard about, and probably have not even thought about. It’s called false disk failure. Some mega datacenters have crafted their own solutions – but most have not.

Why is this important, you ask? Many large datacenters today have 1 million to 4 million hard disk drives (HDDs) in active operation. In anyone’s book that’s a lot. It’s also a very interesting statistical sample size of HDDs. MDCs get great pricing on HDDs. Probably better than OEMs get, and certainly better than the $79 for buying 1 HDD at your local Fry’s store. So you would imagine if a disk fails – no one cares – they’re cheap and easy to replace. But the burden of a failed disk is much more than the raw cost of the disk:

  • Disk rebuild and/or data replication for a 2TB or 3TB drive
    • Performance overhead of a RAID rebuild makes it difficult to justify, and can take days
    • Disk capacity must be added somewhere to compensate: ~$40-$50
    • Redistribute replicated data across many servers
    • Infrastructure overhead to rebalance workloads to other distributed servers
    • Person to service disk: remove and replace
      • And then ensure the HDD data cannot be accessed – wipe it or shred it

Let’s put some scale to this problem, and you’ll begin to understand the issue. One modest-size MDC has been very generous in sharing its real numbers. (When I say modest, they are ~1/4 to 1/2 the size of many other MDCs, but they are still huge – more than 200k servers.) Other MDCs I have checked with say – yep, that’s about right. And one engineer I know at an HDD manufacturer said – “wow – I expected worse than that. That’s pretty good.” To be clear – these are very good HDDs they are using, it’s just that the numbers add up.

The raw data:

RAIDed SAS HDDs

  • 300k SAS HDDs
  • 15-30 SAS failed per day
    • SAS false fail rate is about 30-45% (10-15 per day)
    • About 1/1000 HDD annual false failure rate

Non-RAIDed (direct map) SATA drives behind HBAs

  • 1.2M SATA HDDs
  • 60-80 SATA failed disks per day
    • SATA false fail rate is about 40-55% (24-40 per day)
    • About 1/100 HDD annual false failure rate

What’s interesting is the relative failure rate of SAS drives vs. SATA. It’s about an order of magnitude worse in SATA drives than SAS. Frankly, some of this is due to protocol differences. SAS allows far more error recovery capabilities, and because SAS drives also tend to be more expensive, I believe manufacturers invest in slightly higher quality electronics and components. I know the electronics we ship into SAS drives are certainly more sophisticated than those in SATA drives.

False fail? What? Yeah, that’s an interesting topic. It turns out that about 40% of the time with SAS and about 50% of the time with SATA, the drive didn’t actually fail. It just lost its marbles for a while. When they pull the drive out and put it into a test jig, everything is just fine. And more interesting, when they put the drive back into service, it is no more statistically likely to fail again than any other drive in the datacenter. Why? No one knows. I have my suspicions, though.

I used to work on engine controllers. That’s a very paranoid business. If something goes wrong and someone crashes, you have a lawsuit on your hands. If a controller needs a recall, that’s millions of units to replace, with a multi-hundred dollar module, and hundreds of dollars in labor for each one replaced. No one is willing to take that risk. So we designed very carefully to handle soft errors in memory and registers. We incorporated ECC like servers use, background code checksums and scrubbing, and all sorts of proprietary techniques, including watchdogs and super-fast self-resets that could get operational again in less than a full revolution of the engine. Why go to that trouble? The events were statistically rare: the average controller might see 1 or 2 events in its lifetime, and a turn of the ignition would reset that state. But the events do happen, and so do recalls and lawsuits… HDD controllers don’t have these protections, which is reasonable. It would be an inappropriate cost burden for their price point.

You remember the Toyota Prius accelerator problems? I know that controller was not protected for soft errors. And the source of the problem remained a “mystery.”  Maybe it just lost its marbles for a while? A false fail if you will. Just sayin’.

Back to HDDs. False fail is especially frustrating, because half the HDDs actually didn’t need to be replaced. All the operational costs were paid for no reason. The disk just needed a power cycle reset. (OK, that introduces all sorts of complex management by the RAID controller or application to manage that 10 second power reset cycle and the application traffic created in that time – but we can handle that.)
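The reset-first idea can be sketched as a small policy: when a drive is reported as failed, try a power cycle, and only escalate to physical replacement if the same drive keeps failing. This is a minimal sketch under stated assumptions, not how any particular RAID controller or datacenter handles it. The class name `DrivePolicy`, the action strings and the limit of two resets per drive per day are all hypothetical.

```python
from collections import defaultdict
from datetime import date


class DrivePolicy:
    """Hypothetical reset-before-replace policy for drives reported as failed."""

    def __init__(self, max_resets_per_day=2):
        # Assumed threshold; the article does not specify one.
        self.max_resets_per_day = max_resets_per_day
        # (drive_id, day) -> number of power cycles already attempted
        self._resets = defaultdict(int)

    def on_failure(self, drive_id, day):
        """Return the action to take when `drive_id` reports a failure on `day`."""
        key = (drive_id, day)
        if self._resets[key] < self.max_resets_per_day:
            self._resets[key] += 1
            return "power_cycle"
        # The drive kept failing after the allowed resets: treat it as a real failure.
        return "replace"


policy = DrivePolicy()
today = date(2013, 3, 25)
actions = [policy.on_failure("sda", today) for _ in range(3)]
print(actions)  # ['power_cycle', 'power_cycle', 'replace']
```

The sketch leaves out the hard part mentioned above, which is holding or redirecting application traffic during the roughly 10 second reset window.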

Daily, this datacenter has to:

  • Physically replace 100 disk drives
    • Individually destroy or recycle the 100 failed drives
    • Replicate or rebuild 200-300 TBytes of data – just think about that
    • Rebalance the application load on at least 100 servers – more likely 100 clusters of servers – maybe 20,000 servers?
    • Handle the network traffic load of ~200 TBytes of replicated data
      • That’s on the order of 50 hours of 10GBit Ethernet traffic…

And 1/2 of that is for no reason at all.
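The "50 hours of 10GBit Ethernet" figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes decimal terabytes and a single link running at full line rate with no protocol overhead, so real transfers would take somewhat longer. The function name `transfer_hours` is just for illustration.

```python
def transfer_hours(terabytes, link_gbit_per_s=10.0):
    """Hours needed to move `terabytes` over one fully utilized link."""
    bits = terabytes * 1e12 * 8               # decimal terabytes to bits
    seconds = bits / (link_gbit_per_s * 1e9)  # link speed in bits per second
    return seconds / 3600


hours = transfer_hours(200)
print(f"{hours:.1f} hours")  # 44.4 hours
```

At the ideal line rate, 200 TBytes comes to about 44 hours, which is consistent with "on the order of 50 hours" once real-world overhead is included.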

First – why not rebuild the disk if it’s RAIDed? Usually MDCs use clustered applications. A traditional RAID rebuild drives the server performance to ~50%, and for a 2TByte drive under heavy application load (the definition of an MDC) it can truly take up to a week. 50% performance for a week? In a cluster that means the overall cluster is running at ~50% performance. Say 200 nodes in a cluster – that means you just lost ~100 nodes of work – or 50% of cluster performance. It’s much simpler to just take the node with the failed drive offline, get 99.5% cluster performance, and operationally redistribute the workload across multiple nodes (because you have replicated data elsewhere). But after rebuild, the node will have to be re-synced or re-imaged. There are ways to fix all this. We’ll talk about them another day. Or you can simply run direct-mapped storage and unmount the failed drive.
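The 50% versus 99.5% comparison can be written out as a toy model. It assumes, as the paragraph above implies, that a tightly coupled cluster is paced by its slowest node, and that removing a node simply removes that node's share of the work. Both function names are hypothetical.

```python
def paced_by_slowest(node_speeds):
    """Relative cluster throughput when every node waits on the slowest one."""
    return min(node_speeds)


def with_node_offline(node_speeds, index):
    """Relative cluster throughput after taking one node out of the cluster."""
    remaining = node_speeds[:index] + node_speeds[index + 1:]
    return sum(remaining) / len(node_speeds)


# 200 healthy nodes, except node 0 is rebuilding at ~50% performance.
speeds = [0.5] + [1.0] * 199

rebuild_in_place = paced_by_slowest(speeds)
node_removed = with_node_offline(speeds, 0)
print(rebuild_in_place, node_removed)  # 0.5 0.995
```

Under this model, rebuilding in place costs the equivalent of about 100 nodes of work, while taking the node offline costs one.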

Next – Why replicate data over the network, and why is that a big deal? For geographic redundancy (say a natural disaster at one facility) and regional locality, MDCs need multiple data copies. Often 3 copies so they can do double duty as high-availability copies, or in the case of some erasure coding, 2.2 to 2.5 copies (yeah – weird math – how do you have 0.5 copy…). When you lose one copy, you are down to 2, possibly 1. You need to get back to a reliable number again. Fast. Customers are loyal because of your perfect data retention. So you need to replicate that data and re-distribute it across the datacenter on multiple servers. That’s network traffic, and possibly congestion, which affects other aspects of the operations of the datacenter. In this datacenter it’s about 50 hours of 10G Ethernet traffic every day.

To be fair, there is a new standard in SAS interfaces that will facilitate resetting a disk in-situ. And there is the start of discussion of the same around SATA – but that’s more problematic. Whatever the case, it will be years before the ecosystem is in place to handle the problems this way.

What’s that mean to you?

Well. You can expect something like 1/100 of your drives to really fail this year. And you can expect another 1/100 of your drives to fail this year, but not actually be failed. You’ll still pay all the operational overhead of a failed drive even though the drive has not actually failed – rebuilds, disk replacements, management interventions, scheduled downtime/maintenance time, and the OEM replacement price for that drive (what, $600 or so?). Depending on your size, that’s either a don’t care, or a big deal. There are ways to handle this, and they’re not expensive – much less than the disk carrier you already pay for to allow you to replace that drive – and it can be handled transparently, with just a log entry and no visible performance hiccups. You just need to convince your OEM to carry the solution.
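To see whether this is a "don't care" or a "big deal" for a given shop, the numbers above can be plugged into a rough estimate. The sketch below uses the article's 1-in-100 false failure rate and the roughly $600 OEM replacement price. The fleet size of 10,000 drives is a made-up example, and the estimate ignores labor, rebuild time and downtime.

```python
def false_failure_cost(drive_count, one_in=100, replacement_price=600):
    """Expected yearly false failures and their drive replacement cost in dollars."""
    expected_false_failures = drive_count / one_in
    return expected_false_failures, expected_false_failures * replacement_price


# Hypothetical fleet of 10,000 drives.
false_failures, dollars = false_failure_cost(10_000)
print(f"{false_failures:.0f} false failures, about ${dollars:,.0f} per year")
# 100 false failures, about $60,000 per year
```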

Rob Ober drives LSI into new technologies, businesses and products as an LSI fellow in Corporate Strategy. Prior to joining LSI, he was a fellow in the Office of the CTO at AMD, responsible for mobile platforms, embedded platforms and wireless strategy. He was a founding board member of OLPC ($100 laptop.org) and OpenSPARC.

Republished with permission.




User Comments: 14

cliffordcooley, TechSpot Paladin, said:

Excellent article!!!

Edit:

For some reason or other as long as I can remember, I only have access to the first row of smilies. I wanted to use the Thumbs Up smiley, but clicking the image(using Opera) will not insert it into the comment.

Guest said:

Excellent article indeed!!!

In regards to the comment about the Smilies.... Maybe Opera lost its marbles for a while :D

Chazz said:

I'd look forward to reading an article about workarounds that people use to combat this.

hitech0101 said:

Nice article even though I didn't understand a lot of it

Zeromus said:

I was thinking maybe the testing environment isn't the same as when the disks failed? I've had external hard drives fail sector reads during a massive copy to backup, but it's always when it's been hours into operation and has already had a significant increase in temperature.

Rick, TechSpot Staff, said:

While "lost their marbles" hardly serves as a meaningful explanation, it would seem that some types of "intermittent" failure can span inexplicably long periods of time and perhaps even correct themselves.

I've had a few clients with ostensibly bad HDDs -- ones that failed SMART tests *and* had either failed to boot or would cause the system to periodically hang/crash. In the three instances I can recall, I know at least two of those clients are still using the same HDDs years later and haven't had additional issues.

The best example of this, a Macbook I had worked on in 2009, had a HDD that would intermittently (once or twice a month) freeze during operation and produce repeated clicks. Mac OS would hang and the laptop would need to be restarted.

The drive failed SMART tests. I urged her to replace it, but she decided she didn't have any important data and decided she would run it into the ground until it became unusable.

As it happens, I spoke to her a couple months ago regarding another issue. I asked her about her old HDD and she told me the problem occurred with decreasing frequency until about 2010 -- when the symptoms stopped entirely. She hasn't experienced the issue in about three years, which may serve as empirical evidence that HDDs possibly do "lose their marbles" sometimes. :-)

JC713 said:

Very good article.

yukka, TechSpot Paladin, said:

Best article I've read for a while

Guest said:

I would have every production server have each drive be in a 2-drive Linux software RAID1. RAID1 rebuilds are extremely fast, and then you could store the backup for that RAID1 on single non-raided drives in another facility as either a compressed disk image or use something like rsync to sync file changes, using VFS to ensure that you get a consistent snapshot of the drive.

Then I'd set up some sort of shell script which listens for SMART events and resets the drive up to twice a day if it detects a drive failure, and only if it exceeds that count does it alert someone via email to replace the drive.

I'm sure it's not that simple, but it's a start. Drives are cheap, even cheaper if you buy cheaper ones and just add more redundancy. I don't know how often they're doing manual data recovery off failed drives, but if you're having to actually rebuild more than a day's worth of data over 1% of the time when you have drive failures, to me that means you're doing something wrong.

St1ckM4n said:

A very interesting read, thanks.

Guest said:

Amazing article. It really opened my eyes to how a false fail or even a real fail of HDDs can impact MDCs like this.

SCJake said:

@Guest pertaining to RAID1.

The problem with RAID1 is performance and corruption. Most huge Datacenters use something to the point of RAID5(5+2)+1 aka Each physical server has a RAID5 array with 2 hot spares sitting there waiting to go and every rack/server array/aisle is in "RAID" 5 with a hot server sitting there waiting to accept the load of a failed server

Granted, what you have going on for smaller clients is perfect (oddly similar to what I do for my clients!) but with a large datacenter (200k servers in one room!) that doesn't really cut it. You have to remember that these guys often have 100 clients each with 4-5 offices all coming to this one physical server for their data.

Guest said:

I guess the cockroach moved to a different part of the drive or got squashed by the test and now the drive is fine. Got to wait till the others hatch and it is all over again.

Phr3d said:

It would be an interesting experiment - re-think the clean room at the manufacturer AND the data center itself. Could it simply be 'bugs', given the areal density that we are using now? A microscopic 'bug' could live, thrive and die in nanoseconds, to be washed away by the mechanicals and untraceable at even a forensic level.

If an entirely new process of clean room was introduced and tested, could the failure rate improve?

Else, and this seems to be your reference, we just Expect that these failures will occur and agree on a standard to, simplistically speaking, reset the drives every so often and performance aim at, again simplistically, 4 of 5 drives Ever being available at one time in an array.

The performance hit could be offset by the elimination of reaction-to-fail costs.
