Enterprise SSD flaw bricks drives and renders data unrecoverable after 40,000 hours

Cal Jeffrey

TS Evangelist
Staff member

Hewlett Packard Enterprise (HPE) has issued a critical warning for some of the solid-state drives it uses in a number of its enterprise server and storage products. The “flaw” causes the SSDs to brick at exactly 40,000 hours (4 years, 206 days, 16 hours). HPE warns that this is a catastrophic failure that will render all stored data unrecoverable.
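The parenthetical conversion above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming 365-day years (leap days ignored); the function name `split_hours` is made up for illustration.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap days are ignored


def split_hours(total_hours):
    """Break a count of hours into (years, days, hours)."""
    days, hours = divmod(total_hours, HOURS_PER_DAY)
    years, days = divmod(days, DAYS_PER_YEAR)
    return years, days, hours


print(split_hours(40_000))  # (4, 206, 16)
```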

Equipment installed with firmware prior to HPD7 is subject to this issue. So far, these drives should be in working order as most shipped less than five years ago. The company predicts that SAS SSDs that have not been updated should start experiencing failure no earlier than October 2020.

Four products have been identified as susceptible to this flaw: HPE model numbers EK0800JVYPN, EO1600JVYPP, MK0800JVYPQ, and MO1600JVYPR. These are 800GB and 1.6TB drives.

The defect is apparently not isolated to HPE equipment and could affect other OEMs as well. Hewlett Packard says it was notified of the flaw by an unnamed SSD manufacturer, which some have speculated is SanDisk.

It is also not the first glitch of this kind. In January, HPE issued a similar warning for SAS SSDs that would fail after 32,768 hours. That problem had a much broader scope affecting 20 different SKUs.

Administrators should update firmware immediately and contact HPE support if they run into any issues. The firmware version to look for is, as previously mentioned, HPD7. The company has fixes available for VMware, Windows, and Linux on its website. It also has documentation and tools for determining the total uptime of affected products.

Masthead credit: Sergiy Palamarchuk via Shutterstock


umbala

TS Guru
One of the key tests for any PC hardware is to change the internal clock far backward and forward, to check that it does not affect the product's functionality.

But who am I to lecture HP, they know better, they just don't bother.
That's not how this issue works. Think of it like an odometer. The flaw is based on how many hours the hardware has been running. Changing the clock so that it's 4 years ahead would make no difference. Example: your SSD has been running for 1000 hours and you change your computer clock ahead by 4 years, it would still show as 1000 hours and wouldn't simply add another 4 years to it.
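The odometer analogy can be modeled in a few lines. This is only a toy sketch: the `Drive` class and its `run` method are hypothetical and do not represent real firmware. The point is that the power-on counter and the system clock are separate values.

```python
from datetime import datetime, timedelta


class Drive:
    """Toy model: the counter advances only while the drive is running."""

    def __init__(self):
        self.power_on_hours = 0

    def run(self, hours):
        self.power_on_hours += hours


drive = Drive()
drive.run(1000)

# Move the (hypothetical) system clock four years ahead.
system_clock = datetime(2019, 11, 1)  # arbitrary starting date
system_clock += timedelta(days=4 * 365)

# The drive's counter is untouched by the clock change.
print(drive.power_on_hours)  # 1000
```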
 

texasrattler

TS Evangelist
SanDisk has been owned by Western Digital for the past few years. WD is known for server-related equipment. Obviously HP had these put in before SD was bought by WD. I guess now we know why you don't use products that aren't known for that kind of use. HP didn't get that memo, or they simply cared about the price. Well, hopefully this will be a lesson for anyone: you get what you pay for.

I use an SD 1TB SSD. I haven't seen a single issue. I've had it over a year. Granted, I am not using it in a server or anything like that. Just for games.
 

jobeard

TS Ambassador
See another article at
 
I'd be interested to know if there were a utility that one could run to see if their SSD drive is affected. Would certainly be helpful to those who want to know if they are affected in case their particular OEM re-brand decides not to make a patch available.
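As a rough idea of what such a check could look like, here is a minimal sketch. It covers only the four HPE model numbers named in the article and assumes you already have the drive's model, firmware version, and power-on hours from some SMART tool. The function name `assess_drive` and the example firmware string "HPD5" are hypothetical. Anything other than exactly "HPD7" is treated as unpatched, which is a conservative assumption.

```python
# Model numbers and threshold as given in the article.
AFFECTED_MODELS = {"EK0800JVYPN", "EO1600JVYPP", "MK0800JVYPQ", "MO1600JVYPR"}
FAILURE_HOURS = 40_000
FIXED_FIRMWARE = "HPD7"


def assess_drive(model, firmware, power_on_hours):
    """Return a short status string for one drive."""
    if model not in AFFECTED_MODELS:
        return "not on the HPE list"
    # Assumption: only an exact "HPD7" counts as patched.
    if firmware == FIXED_FIRMWARE:
        return "patched"
    remaining = FAILURE_HOURS - power_on_hours
    if remaining <= 0:
        return "past the failure threshold"
    return f"at risk: {remaining} hours remaining"


# "HPD5" is a made-up pre-fix version string used for illustration.
print(assess_drive("EK0800JVYPN", "HPD5", 39_000))
```

This does not answer the question for other OEM rebrands, since the article does not list their model numbers.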
 

sac39507

TS Maniac
Enterprise environments have the adequate backups in place so they don't have to worry (at least they are supposed to and if not, fire their IT people)
 

Scshadow

TS Evangelist
Enterprise environments have the adequate backups in place so they don't have to worry (at least they are supposed to and if not, fire their IT people)
And? I fail to see a point provided here. First off, you should have offline backups and offsite backups. But I still wouldn't want to see a storage array go down. What if several of the same drives were installed at the same time? You may not have enough parity data to recover if the failures are really close together. I wouldn't want to have to recover from an offline or offsite backup to restore data. That's time consuming. It would be better to just fix the firmware.
 

trparky

TS Evangelist
What exactly happens at 40,000 hours? Does the firmware attempt to apply some sort of maintenance that had a huge bug in it? Curious minds want to know...
It probably has something to do with some kind of internal timer where if the incremented number exceeds the capacity of its storage location it bricks the firmware because it doesn't know how to handle it. Think of it as a fatal exception or BSOD for the firmware.
 

sac39507

TS Maniac
And? I fail to see a point provided here. First off, you should have offline backups and offsite backups. But I still wouldn't want to see a storage array go down. What if several of the same drives were installed at the same time? You may not have enough parity data to recover if the failures are really close together. I wouldn't want to have to recover from an offline or offsite backup to restore data. That's time consuming. It would be better to just fix the firmware.
The point is that it's not a total data loss situation if proper backups are in place. Of course they should implement the fix ASAP to avoid all the headaches of recovering from backup. I guess my wording of "they shouldn't worry" painted the wrong picture. They should worry because of all the labor and problems it can cause.

Why are you even mentioning data parity and recovery? I'm not even talking about rebuilding a bad array but rather a full recovery from backup to freshly implemented healthy array. I get your point but don't understand why you didn't get my clear and obvious one.
 

brucek

TS Guru
It probably has something to do with some kind of internal timer where if the incremented number exceeds the capacity of its storage location it bricks the firmware because it doesn't know how to handle it. Think of it as a fatal exception or BSOD for the firmware.
Maybe, except 40,000 hours sounds suspiciously like a human-defined threshold, not a computer one (it's not a power of two, nor is it if you multiply by 3,600 to get seconds.)
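The power-of-two point is easy to verify. The sketch below also shows why the earlier 32,768-hour figure looks like a computer-defined limit: 32,768 is 2^15, one past the largest value a signed 16-bit integer can hold. Whether either firmware actually used such a counter is not stated in the article, so treat the wraparound part as speculation. The helper names are made up for illustration.

```python
def is_power_of_two(n):
    """True if n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def wrap_int16(n):
    """Simulate storing n in a signed 16-bit integer (two's complement)."""
    return ((n + 2**15) % 2**16) - 2**15


print(is_power_of_two(32_768))         # True  (2**15)
print(is_power_of_two(40_000))         # False (hours)
print(is_power_of_two(40_000 * 60))    # False (minutes)
print(is_power_of_two(40_000 * 3600))  # False (seconds)

# A signed 16-bit hour counter would wrap on reaching 32,768.
print(wrap_int16(32_767 + 1))  # -32768
```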
 

Darth Shiv

TS Evangelist
What exactly happens at 40,000 hours? Does the firmware attempt to apply some sort of maintenance that had a huge bug in it? Curious minds want to know...
Numeric overflow... stops the firmware from being able to operate I presume. Meaning the data should still be intact on the drive just some intervention required to unbrick the drive.
 

Ben Myers

TS Booster
I like umbala's analogy with a car's odometer. Here is how the article should have explained it:

HPE SSDs, like all modern SSDs and hard drives, have SMART (Self-Monitoring, Analysis, and Reporting Technology) built into the drive firmware. Part of SMART is a counter that measures the number of hours a drive has been powered on. For whatever reason, when this counter hits 40,000 hours of metered use, the drive bricks.

QUICK! If you have HPE drives, go to the HP website and see if there is a firmware update. Between which dates were these drives placed in service?
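A rough answer to the date question can be worked backward from the article's figures. This sketch assumes a drive powered on continuously from the day it entered service, and it takes October 1 as the start of "October 2020" since the article gives no day. The function name is hypothetical.

```python
from datetime import datetime, timedelta

FAILURE_HOURS = 40_000


def earliest_power_on(earliest_failure):
    """Date a drive must have started running to hit 40,000 hours by then."""
    return earliest_failure - timedelta(hours=FAILURE_HOURS)


# Assumption: "October 2020" is taken as October 1, 2020.
start = earliest_power_on(datetime(2020, 10, 1))
print(start.date())  # 2016-03-09
```

Under those assumptions, the oldest affected drives would have gone into service around March 2016. Drives with less uptime per day would reach the threshold later.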