The first exascale supercomputer suffers hardware failures every few hours

mongeese

In brief: Frontier, the world's most powerful supercomputer, is online but still far from operational. Its director has confirmed reports that it is experiencing a system failure every few hours, but insists that's par for the course.

Frontier is in a class of its own. It has 9,408 HPE Cray EX235a nodes, each powered by a 64-core AMD Epyc 7A53 "Trento" CPU with 512 GB of DDR4 and four AMD Instinct MI250X GPUs/accelerators with 128 GB of HBM2e apiece. In total, the system has 602,112 CPU cores, 8,138,240 GPU cores, and 4.6 PB each of DDR4 and HBM2e.
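
As a quick sanity check, the CPU-core and memory totals follow directly from the per-node figures above; the sketch below simply multiplies them out (treating GB as GiB, which is how 512 GB per node works out to the quoted 4.6 PB). GPU "core" counts are left out, since they depend on whether you count compute units or stream processors.

```python
# Aggregate totals from the per-node figures quoted above (a back-of-the-envelope
# check, not an official system inventory).
NODES = 9_408                # HPE Cray EX235a nodes
CPU_CORES_PER_NODE = 64      # one 64-core Epyc 7A53 "Trento" per node
DDR4_PER_NODE_GIB = 512      # CPU-attached memory per node
GPUS_PER_NODE = 4            # AMD Instinct MI250X accelerators per node
HBM2E_PER_GPU_GIB = 128      # memory per MI250X

cpu_cores = NODES * CPU_CORES_PER_NODE                            # 602,112 cores
ddr4_pib = NODES * DDR4_PER_NODE_GIB / 1024**2                    # ~4.59 PiB
hbm2e_pib = NODES * GPUS_PER_NODE * HBM2E_PER_GPU_GIB / 1024**2   # same again

print(f"CPU cores: {cpu_cores:,}")
print(f"DDR4: {ddr4_pib:.2f} PiB, HBM2e: {hbm2e_pib:.2f} PiB")
```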

In May, Frontier joined the TOP500 as the first supercomputer to break the exascale barrier after it completed the HPL benchmark with a score of 1.102 exaflops. Since then, Oak Ridge National Laboratory in Tennessee, which manages the supercomputer, has been readying it for scientific research scheduled to start in January.

However, there have been reports that the launch of Frontier could be delayed by excessive hardware failures. Seeking answers, Inside HPC interviewed Oak Ridge's Program Director, Justin Whitt. In the interview, he confirmed that Frontier was failing every few hours but asserted that this was inevitable in a system of its size.

"Mean time between failure on a system this size is hours, it's not days," he said. "So you need to make sure you understand what those failures are and that there's no patterns to those failures that you need to be concerned with." Whitt added that going a day without a failure "would be outstanding."

"Our goal is still hours."

There were rumors that the hardware problems were being caused by the new AMD Instinct MI250X, but Whitt dismissed them. The MI250X is AMD's most powerful GPU/accelerator, and AMD only sells it to select partners. It has 220 compute units containing 14,080 cores clocked at up to 1,700 MHz in a 500 W package.
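
As a rough cross-check against the 1.102-exaflop HPL result mentioned earlier, the sketch below turns those specs into a naive FP64 peak. It assumes each of the 14,080 cores retires one FP64 fused multiply-add per cycle at the full 1,700 MHz clock, which real runs never sustain, so treat the result as an illustrative ceiling rather than an official figure.

```python
# Illustrative FP64 peak estimate from the published MI250X numbers; the
# one-FMA-per-core-per-cycle assumption is a simplification, not a spec sheet.
NODES = 9_408
GPUS_PER_NODE = 4
CORES_PER_GPU = 14_080        # stream processors across 220 compute units
CLOCK_HZ = 1.7e9              # quoted 1,700 MHz clock
FLOPS_PER_CORE_CYCLE = 2      # one fused multiply-add counts as two FLOPs

gpu_peak = CORES_PER_GPU * CLOCK_HZ * FLOPS_PER_CORE_CYCLE   # ~47.9 TFLOPS
system_peak = NODES * GPUS_PER_NODE * gpu_peak               # ~1.8 EFLOPS

print(f"Per-GPU FP64 peak: {gpu_peak / 1e12:.1f} TFLOPS")
print(f"System GPU peak:   {system_peak / 1e18:.2f} EFLOPS")
# The measured 1.102 EFLOPS HPL score lands below this naive ceiling, as
# expected once sustained clocks and benchmark efficiency are accounted for.
```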

"The issues span a lot of different categories, the GPUs are just one," Whitt remarked. "It's been a pretty good spread among common culprits of parts failures that have been a big part of it. I don't think that at this point that we have a lot of concern over the AMD products," he added.

"We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary."

Whitt conceded that the unprecedented scale of Frontier had made fine-tuning it "a little bit harder" but said the lab was still following the schedule set back in 2018-19, despite delays caused by the pandemic.

Head over to Inside HPC to read the full interview.


 
On the one hand it makes sense: even very low-probability failures become probable when you have so many of them (a rough sketch of the numbers follows this post). This is why fault-tolerant design is important: building the system so it can suffer hardware failures and keep operating.

It isn't clear from the article whether the hardware failures they are talking about are tolerable failures (failures that don't stop the machine from performing at its designed specifications) or not. If the former, the only impact is higher maintenance/running costs. If the latter, that is bad, and would point to more of a design failure.
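
To put that first point in numbers: if failures are independent, failure rates add, so the system-level mean time between failures is roughly a component's MTBF divided by the number of components. The figures below are made-up round numbers used purely for illustration, not Frontier's actual reliability data.

```python
# Illustrative only: how per-part reliability turns into system-level MTBF.
# The MTBF values below are assumed round numbers, not Frontier's real data.
NODES = 9_408
GPUS_PER_NODE = 4

ASSUMED_GPU_MTBF_H = 500_000      # hypothetical: one GPU fails every ~57 years
ASSUMED_NODE_MTBF_H = 1_000_000   # hypothetical: everything else on a node

# With independent failures, rates (1/MTBF) add across all parts.
gpu_rate = NODES * GPUS_PER_NODE / ASSUMED_GPU_MTBF_H    # failures per hour
node_rate = NODES / ASSUMED_NODE_MTBF_H
system_mtbf_h = 1 / (gpu_rate + node_rate)

print(f"System-level MTBF: about {system_mtbf_h:.1f} hours")
# With these assumptions the machine as a whole sees a failure roughly every
# 12 hours, even though each individual part is expected to last for decades.
```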
 
Sounds normal

It took a few years for my 5 Windows XP machines to settle down and run reliably

I went through bad power supplies, bad RAM chips, bad USB cables, bad SATA cables, hard drives, monitors, motherboards... etc.

Now, they have been running reliably for several years without constant problems

My primary XP box has been online since 2014 without hardware problems or malware problems

Still running Service Pack 2 in a full Admin Account without any Microsoft Security Updates online

No malware, ransomware or wiper problems for the past 8 years and no hardware failures for the past 3 years on any of them...........and ya, I use them to study malware!
 
The cluster I work with seems to have a hard drive failure at least every week, sometimes more often, so this doesn't surprise me. (Naturally, my company doesn't buy SSDs in order to save money; perhaps it does save some, but the maintenance costs seem to offset any advantage.)
 
No malware, ransomware or wiper problems for the past 8 years and no hardware failures for the past 3 years on any of them ... and ya, I use them to study malware!
You have wipers on ur windows?
 
My primary XP box has been online since 2014 without hardware problems or malware problems ... and ya, I use them to study malware!
Are these connected to the internet?
 
malware, over the decades, has become incompatible with the old OS :)
and yet, my XP machines remain invulnerable to all the compatible versions of malware like WannaCry and every other form of ransomware / trojan / wiper / rootkit and BIOS attack
 
This is interesting because I had no idea that supercomputers went through such massive teething problems. On a computer that size, however, it makes sense that something will go wrong: with so many parts involved, the law of averages for defective parts alone makes it inevitable.
 
This is interesting because I had no idea that supercomputers went through such massive teething problems. On a computer that size, however, it makes sense that something will go wrong: with so many parts involved, the law of averages for defective parts alone makes it inevitable.
Indeed. Large supercomputers need ECC memory or they can't even boot (or stay working very long afterwards if they manage to boot): the RAM has so much surface area that there are too many bit errors due to cosmic rays for the computer to stay functional. This article (which I thought used to be free but apparently is behind a paywall now) describes that: https://spectrum.ieee.org/how-to-kill-a-supercomputer-dirty-power-cosmic-rays-and-bad-solder
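
To give a sense of the scale being described, here is a rough estimate of how often a machine with roughly 4.6 PB of DRAM would see a single-bit upset. The soft-error rate used is an assumed order-of-magnitude figure (reported rates vary enormously with altitude, process, and shielding), so the output is illustrative only.

```python
# Illustrative only: expected single-bit memory upsets across ~4.6 PB of DRAM.
# ASSUMED_FIT_PER_MBIT is an assumed order-of-magnitude value, not a measured
# rate for Frontier's memory; published figures span orders of magnitude.
DRAM_BYTES = 4.6e15             # ~4.6 PB of DDR4 (the HBM2e adds as much again)
ASSUMED_FIT_PER_MBIT = 1.0      # failures per billion hours, per megabit

megabits = DRAM_BYTES * 8 / 1e6
upsets_per_hour = megabits * ASSUMED_FIT_PER_MBIT / 1e9

print(f"Expected single-bit upsets: about {upsets_per_hour:.0f} per hour")
# Even at this modest assumed rate that's tens of flipped bits every hour across
# the machine, which is why ECC (codes that detect and correct bit errors on
# every memory word) is non-negotiable at this scale.
```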
 
Indeed. Large supercomputers need ECC memory or they can't even boot: the RAM has so much surface area that there are too many bit errors due to cosmic rays for the computer to stay functional.
Maybe they should try lining the building in which it sits with lead, iron and/or concrete.
 
As others have already said, I'm not surprised they're having teething issues given how complex this system is. Simply put, if you want the bleeding edge, you will bleed from it all the time. If you want stability, you do what the Pentagon does and stick with 386 CPUs for much of your hardware.
 