The first exascale supercomputer has a hardware failure every day

mongeese · Oct 9, 2022

In brief: Frontier, the world's most powerful supercomputer, is online but still far from operational. Its director has confirmed reports that it is experiencing a system failure every few hours, but insists that's par for the course.

Frontier is in a class of its own. It has 9,408 HPE Cray EX235a nodes, each powered by an AMD Trento 7A53 Epyc 64-core CPU equipped with 512 GB of DDR4, and four AMD Instinct MI250X GPUs / accelerators each equipped with 128 GB of HBM2e. In total, the system has 602,112 CPU cores and 8,138,240 GPU cores, and 4.6 PB each of DDR4 and HBM2e.
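The CPU-core and memory totals follow from the per-node figures. Below is a minimal sketch of that arithmetic. It assumes the 4.6 PB figure uses binary units (1 PB = 1024 × 1024 GB), an assumption that reproduces the number rather than something the article states. The constant and function names are illustrative.

```python
# Per-node figures as quoted in the article.
NODES = 9_408
CPU_CORES_PER_NODE = 64
DDR4_GB_PER_NODE = 512
GPUS_PER_NODE = 4
HBM_GB_PER_GPU = 128


def totals():
    """Return (total CPU cores, total DDR4 in GB, total HBM2e in GB)."""
    cpu_cores = NODES * CPU_CORES_PER_NODE
    ddr4_gb = NODES * DDR4_GB_PER_NODE
    hbm_gb = NODES * GPUS_PER_NODE * HBM_GB_PER_GPU
    return cpu_cores, ddr4_gb, hbm_gb


def gb_to_pb_binary(gb):
    """Convert GB to PB using binary steps (assumption: 1 PB = 1024**2 GB)."""
    return gb / 1024**2


cpu_cores, ddr4_gb, hbm_gb = totals()
print(f"CPU cores: {cpu_cores:,}")
print(f"DDR4:  {gb_to_pb_binary(ddr4_gb):.1f} PB")
print(f"HBM2e: {gb_to_pb_binary(hbm_gb):.1f} PB")
```

Each node carries 512 GB of DDR4 and 4 × 128 GB of HBM2e, so the two memory totals come out equal.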

In May, Frontier joined the TOP500 as the first supercomputer to break the exascale barrier after it completed the HPL benchmark with a score of 1.102 Exaflop/s. Since then, the Oak Ridge National Laboratory in Tennessee, which manages the supercomputer, has been readying it for scientific research scheduled to start in January.

However, there have been reports that the launch of Frontier could be waylaid by excessive hardware failures. Seeking answers, Inside HPC organized an interview with the Program Director at Oak Ridge, Justin Whitt. In the interview, he confirmed Frontier was experiencing daily system failures but asserted that was inevitable in such a large system.

"Mean time between failure on a system this size is hours, it's not days," he said. "So you need to make sure you understand what those failures are and that there's no patterns to those failures that you need to be concerned with." Whitt added that going a day without a failure "would be outstanding."

"Our goal is still hours."

says Justin Whitt, Program Director at the OLCF
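A rough sketch of why a system this size can have a mean time between failures measured in hours. It assumes node failures are independent and exponentially distributed, so failure rates add. The five-year per-node MTBF is a hypothetical figure chosen for illustration, not a number from Whitt or Oak Ridge.

```python
import math

NODES = 9_408           # node count quoted in the article
HOURS_PER_YEAR = 8_760


def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """System MTBF when any single node failure counts as a system failure.

    With independent exponential failures the rates add, so the system
    MTBF is the per-node MTBF divided by the number of nodes.
    """
    return node_mtbf_hours / nodes


def prob_failure_free(hours: float, mtbf_hours: float) -> float:
    """Probability of seeing no failure over the given number of hours."""
    return math.exp(-hours / mtbf_hours)


node_mtbf = 5 * HOURS_PER_YEAR  # hypothetical: one failure per node every 5 years
mtbf = system_mtbf_hours(node_mtbf, NODES)
print(f"System MTBF: {mtbf:.1f} hours")
print(f"Chance of a failure-free day: {prob_failure_free(24, mtbf):.2%}")
```

Under these assumptions, even nodes that individually fail only once in five years give a system-level MTBF of roughly four to five hours, and a full day without a failure becomes rare.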

There were rumors that the hardware problems were being caused by the new AMD Instinct MI250X, but Whitt dismissed them. The MI250X is AMD's most powerful GPU/accelerator, and AMD sells it only to select partners. It has 220 CUs containing 14,080 cores clocked at 1700 MHz in a 500 W package.

"The issues span a lot of different categories, the GPUs are just one," Whitt remarked. "It's been a pretty good spread among common culprits of parts failures that have been a big part of it. I don't think that at this point that we have a lot of concern over the AMD products," he added.

"We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary."

Whitt conceded that the unprecedented scale of Frontier had made fine-tuning it "a little bit harder" but said they were still following the schedule set back in 2018-19 despite delays caused by the pandemic.

Head over to Inside HPC to read the full interview.

Permalink to story.

https://www.techspot.com/news/96254-first-exascale-supercomputer-has-hardware-failure-every-day.html

Tom Yum · Oct 9, 2022

On the one hand it makes sense: even very low-probability failures become probable when you have so many components. This is why fault-tolerant design is important: designing a system so it can have hardware failures and continue to operate.

It isn't clear from the article whether the hardware failures they are talking about are tolerable failures (failures that don't stop the machine from performing at its designed specifications) or not. If the former, then the only impact is higher maintenance/running costs. If the latter, then that is bad and would indicate more of a design failure.
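The point about low-probability failures can be made concrete with a short sketch. It assumes parts fail independently with the same probability over some fixed period; the probability and part counts below are hypothetical, not figures for Frontier.

```python
import math


def prob_at_least_one_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent parts fails.

    Computes 1 - (1 - p)**n using log1p/expm1, which stays accurate
    when p is tiny and n is huge.
    """
    return -math.expm1(n * math.log1p(-p))


p = 1e-6  # hypothetical per-part failure probability over the period
for n in (1, 1_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} parts: {prob_at_least_one_failure(p, n):.6f}")
```

A one-in-a-million event per part is negligible for a single machine but close to certain once the part count reaches the tens of millions.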

Bullwinkle M · Oct 9, 2022

Sounds normal

It took a few years for my 5 Windows XP machines to settle down and run reliably

I went through bad power supplies, bad RAM chips, bad USB cables, bad SATA cables, hard drives, monitors, motherboards....etc

Now, they have been running reliably for several years without constant problems

My primary XP box has been online since 2014 without hardware problems or malware problems

Still running Service Pack 2 in a full Admin Account without any Microsoft Security Updates online

No malware, ransomware or wiper problems for the past 8 years and no hardware failures for the past 3 years on any of them...........and ya, I use them to study malware!

human7 · Oct 10, 2022

The cluster I work with at work seems to have a hard drive failure at least every week, sometimes more often, so this doesn't surprise me. (Naturally, my company doesn't buy SSDs in order to save money, which perhaps they do, but the maintenance costs seem to offset any advantage.)

msroadkill612 · Oct 10, 2022

Bullwinkle M said:
No malware, ransomware or wiper problems for the past 8 years and no hardware failures for the past 3 years on any of them...........and ya, I use them to study malware!

You have wipers on ur windows?

kira setsu · Oct 10, 2022

Those AMD drivers are at it again!!!

I'm joking, but it must be painful to track down issues in a system that big tho.

Puiu · Oct 10, 2022

Bleeding edge technology has always been like this. Eventually they'll figure it out.

Raytrace3D · Oct 10, 2022

Bullwinkle M said:
No malware, ransomware or wiper problems for the past 8 years and no hardware failures for the past 3 years on any of them...........and ya, I use them to study malware!

Are these connected to the internet?

Bullwinkle M · Oct 10, 2022

Raytrace3D said:
Are these connected to the internet?

Yes, of course

Malware of all types has failed to damage my XP machines in any way for the past 8 years........ while doing malware research online

Puiu · Oct 11, 2022

Bullwinkle M said:
Yes, of course

Malware of all types has failed to damage my XP machines in any way for the past 8 years........ while doing malware research online

Malware, over the decades, has become incompatible with the old OS

Bullwinkle M · Oct 11, 2022

Puiu said:
Malware, over the decades, has become incompatible with the old OS

and yet, my XP machines remain invulnerable to all the compatible versions of malware like wannacry and every other form of ransomware / trojan / wiper / rootkits and BIOS attack

Avro Arrow · Oct 11, 2022

This is interesting because I had no idea that supercomputers went through such massive teething problems. On a computer that size however, it makes sense that something will go wrong because there are so many parts involved that just the law of averages for defective parts makes it inevitable.

human7 · Oct 11, 2022

Avro Arrow said:
This is interesting because I had no idea that supercomputers went through such massive teething problems. On a computer that size however, it makes sense that something will go wrong because there are so many parts involved that just the law of averages for defective parts makes it inevitable.

Indeed. Large supercomputers need ECC memory or they can't even boot (or stay working very long afterwards if they manage to boot): the RAM has so much surface area that there are too many bit errors due to cosmic rays for the computer to stay functional. This article (which I thought used to be free but apparently is behind a paywall now) describes that: https://spectrum.ieee.org/how-to-kill-a-supercomputer-dirty-power-cosmic-rays-and-bad-solder

Avro Arrow · Oct 11, 2022

human7 said:
Indeed. Large supercomputers need ECC memory or they can't even boot (or stay working very long afterwards if they manage to boot): the RAM has so much surface area that there are too many bit errors due to cosmic rays for the computer to stay functional. This article (which I thought used to be free but apparently is behind a paywall now) describes that: https://spectrum.ieee.org/how-to-kill-a-supercomputer-dirty-power-cosmic-rays-and-bad-solder

Maybe they should try lining the building in which it sits with lead, iron and/or concrete.

Fastturtle · Oct 12, 2022

As others have already said, I'm not surprised they're having teething issues simply due to how complex this system is. Simply put, if you want the bleeding edge, you will bleed all the time from it. If you want stability, then you do as the Pentagon does and stick with 386 CPUs for much of your hardware.
