What is false disk failure, and why is it a problem?

By Julio Franco
Mar 25, 2013
  1. What is the #1 Real Problem for many large scale mega datacenters? It’s something you’ve probably never heard about, and probably have not even thought about. It’s called false disk failure. Some mega datacenters have crafted their own solutions –...

    cliffordcooley likes this.
  2. cliffordcooley

    cliffordcooley TechSpot Paladin Posts: 5,082   +1,184

    Excellent article!!!

    Edit:
    For as long as I can remember, I have only had access to the first row of smilies. I wanted to use the Thumbs Up smiley, but clicking the image (using Opera) will not insert it into the comment.
  3. Excellent article indeed!!!

    Regarding the comment about the smilies... Maybe Opera lost its marbles for a while :D
    Chazz and cliffordcooley like this.
  4. Chazz

    Chazz TechSpot Enthusiast Posts: 617   +54

    I'd look forward to reading an article about workarounds that people use to combat this.
  5. hitech0101

    hitech0101 TechSpot Enthusiast Posts: 423   +19

    Nice article, even though I didn't understand a lot of it.
  6. Zeromus

    Zeromus TechSpot Enthusiast Posts: 230   +7

    I was thinking maybe the testing environment isn't the same as when the disks failed? I've had external hard drives fail sector reads during a massive copy to backup, but it's always when the drive has been running for hours and its temperature has already risen significantly.
  7. Rick

    Rick TechSpot Staff Posts: 6,304   +52 Staff Member

    While "lost their marbles" hardly serves as a meaningful explanation, it would seem that some types of "intermittent" failure can span inexplicably long periods of time and perhaps even correct themselves.

    I've had a few clients with ostensibly bad HDDs -- ones that failed SMART tests *and* had either failed to boot or would cause the system to periodically hang/crash. In the three instances I can recall, I know at least two of those clients are still using the same HDDs years later and haven't had additional issues.

    The best example of this, a MacBook I had worked on in 2009, had an HDD that would intermittently (once or twice a month) freeze during operation and produce repeated clicks. Mac OS would hang and the laptop would need to be restarted.

    The drive failed SMART tests. I urged the owner to replace it, but she decided she didn't have any important data and would run it into the ground until it became unusable.

    As it happens, I spoke to her a couple months ago regarding another issue. I asked her about her old HDD and she told me the problem occurred with decreasing frequency until about 2010 -- when the symptoms stopped entirely. She hasn't experienced the issue in about three years, which may serve as empirical evidence that HDDs possibly do "lose their marbles" sometimes. :)
  8. JC713

    JC713 TechSpot Evangelist Posts: 6,120   +732

    Very good article.
  9. yukka

    yukka TechSpot Paladin Posts: 669   +23

    Best article I've read for a while :)
  10. I would put every drive in every production server into a 2-drive Linux software RAID1. RAID1 rebuilds are extremely fast, and then you could store the backup for that RAID1 on single non-RAID drives in another facility, either as a compressed disk image or by using something like rsync to sync file changes, using VFS to ensure that you get a consistent snapshot of the drive.

    Then I'd set up some sort of shell script that listens for SMART events and resets the drive up to twice a day if it detects a drive failure. Only if the drive exceeds that count does it alert someone via email to replace the drive.

    I'm sure it's not that simple, but it's a start. Drives are cheap, even cheaper if you buy cheaper ones and just add more redundancy. I don't know how often they're doing manual data recovery off failed drives, but if you're having to actually rebuild more than a day's worth of data over 1% of the time when you have drive failures, to me that means you're doing something wrong.
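
The "reset up to twice a day, then alert" rule from this post could be sketched roughly as follows. This is only an illustration of the counting logic, written in Python rather than shell; the class name `ResetPolicy`, the method `on_failure`, and the drive names are all made up for the example, and the actual SMART monitoring, drive reset, and email steps are deliberately left out.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class ResetPolicy:
    """Hypothetical helper: decides whether a reported failure should
    trigger another drive reset or an alert to a human."""

    def __init__(self, max_resets=2, window=timedelta(days=1)):
        self.max_resets = max_resets      # resets allowed per window
        self.window = window              # rolling window (one day here)
        self._resets = defaultdict(deque) # drive name -> reset timestamps

    def on_failure(self, drive, now):
        """Return "reset" or "alert" for a failure seen on `drive` at `now`."""
        history = self._resets[drive]
        # Forget resets that have aged out of the rolling window.
        while history and now - history[0] >= self.window:
            history.popleft()
        if len(history) < self.max_resets:
            history.append(now)
            return "reset"
        # Too many resets already: time to ask someone to swap the drive.
        return "alert"


if __name__ == "__main__":
    policy = ResetPolicy()
    start = datetime(2013, 3, 25, 8, 0)
    for hours in (0, 1, 2):
        when = start + timedelta(hours=hours)
        print(when, policy.on_failure("sda", when))
```

A rolling 24-hour window is used here instead of a calendar day, so two failures just before midnight and two just after still count together; that choice is an assumption, not something the post specifies.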
  11. St1ckM4n

    St1ckM4n TechSpot Evangelist Posts: 3,197   +555

    A very interesting read, thanks.
  12. Amazing article. Really opened my eyes to how a false fail, or even a real fail, of HDDs can impact MDCs like this.
  13. SCJake

    SCJake Newcomer, in training Posts: 80

    @Guest pertaining to RAID1.

    The problem with RAID1 is performance and corruption. Most huge datacenters use something along the lines of RAID5(5+2)+1, i.e. each physical server has a RAID5 array with 2 hot spares sitting there waiting to go, and every rack/server array/aisle is in "RAID" 5 with a hot server sitting there waiting to accept the load of a failed server.

    Granted, what you have going on is perfect for smaller clients (oddly similar to what I do for my clients!), but with a large datacenter (200k servers in one room!) that doesn't really cut it. You have to remember that these guys often have 100 clients, each with 4-5 offices, all coming to this one physical server for their data.
    cliffordcooley likes this.
     
  14. I guess the cockroach moved to a different part of the drive or got squashed by the test and now the drive is fine. Got to wait till the others hatch and it is all over again.
  15. Phr3d

    Phr3d Newcomer, in training Posts: 16

    It would be an interesting experiment: re-think the clean room at the manufacturer AND at the data center itself. Could it simply be 'bugs', given the areal density that we are using now? A microscopic 'bug' could live, thrive and die in nanoseconds, to be washed away by the mechanicals and untraceable at even a forensic level.
    If an entirely new process of clean room was introduced and tested, could the failure rate improve?
    Else, and this seems to be your reference, we just Expect that these failures will occur and agree on a standard to, simplistically speaking, reset the drives every so often and plan performance around, again simplistically, only 4 of 5 drives Ever being available at one time in an array.
    The performance hit could be offset by the elimination of reaction-to-fail costs.

