Nvidia assembled the world's 7th fastest supercomputer in one month

mongeese

In a nutshell: Nvidia has detailed the assembly process of the Selene supercomputer, which became the world’s seventh-fastest supercomputer in June. The entire machine was assembled amid the pandemic in just three and a half weeks by a socially distanced team of six, plus a handy robot named Trip.

Selene is an unusual supercomputer. It uses Nvidia’s commercially available, GPU-accelerated DGX SuperPOD architecture instead of the custom CPU-heavy designs that dominate most of the Top500 list. It also ranks second on the Green500 list of the most power-efficient supercomputers.

In numbers, Selene uses 560 AMD Epyc 7742 CPUs (64 cores each) and 2,240 Nvidia A100 GPUs. Its peak theoretical performance is just under 35,000 teraflops (35 petaflops).

Nvidia’s previous supercomputers took months to construct and were extremely difficult to maintain and upgrade. When it came to designing Selene, the company tried to make it as simple and modular as possible. Each of Selene’s 280 nodes is a standardized DGX pod containing eight Nvidia A100 GPUs and two AMD Epyc CPUs. A handful of pods are stacked in a glorified filing cabinet (just being honest), and the cabinets are strung together in groups of sixteen to form a SuperPOD.
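As a quick sanity check, the per-node figures multiply out to the totals quoted for the whole machine. The short Python sketch below uses only numbers from the article; the constant and function names are illustrative.

```python
# Figures quoted in the article; names are illustrative only.
NODES = 280          # standardized DGX pods
GPUS_PER_NODE = 8    # Nvidia A100 GPUs per pod
CPUS_PER_NODE = 2    # AMD Epyc 7742 CPUs per pod
CORES_PER_CPU = 64   # cores per Epyc 7742


def selene_totals(nodes=NODES):
    """Multiply the per-node hardware counts out to system-wide totals."""
    cpus = nodes * CPUS_PER_NODE
    return {
        "gpus": nodes * GPUS_PER_NODE,
        "cpus": cpus,
        "cpu_cores": cpus * CORES_PER_CPU,
    }


totals = selene_totals()
print(totals)
```

The GPU and CPU totals match the 2,240 GPUs and 560 CPUs reported for Selene.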

Selene’s homogeneity is what enabled it to be assembled so quickly. It was mostly a matter of moving each DGX pod into the right spot, wiring it up, and checking that it worked. Wiring a supercomputer is always a tricky job (particularly six feet apart), but Nvidia used Mellanox’s InfiniBand switches to reduce the number of cables required while simultaneously increasing bandwidth.

Selene is cooled on a per-SuperPOD basis. All of the SuperPODs reside in one giant air-conditioned warehouse. They’re raised off the ground with fans underneath to push the cool air up into the DGX pods. Nvidia’s tiny assembly team only needed to install the flooring and seal up the SuperPODs to control the flow of air.

Nvidia got creative with the monitoring equipment for Selene. They purchased a little robot called Trip, who can be controlled remotely and wheeled around to observe the goings-on inside Selene. They also built a bot for Slack that sends them notifications when the hardware is misbehaving or when a cable has come loose.
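Nvidia has not described how its Slack bot works, so the following is only a minimal sketch of the general idea, assuming Slack's incoming-webhook mechanism (a JSON body with a `text` field POSTed to a webhook URL). The node name, message wording, and function names are hypothetical.

```python
import json
import urllib.request


def build_alert(node, problem):
    """Build the JSON payload for a Slack incoming-webhook message."""
    return {"text": f":warning: {node}: {problem}"}


def send_alert(webhook_url, node, problem):
    """POST the alert to a Slack incoming-webhook URL supplied by the caller."""
    body = json.dumps(build_alert(node, problem)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status  # HTTP status code returned by Slack


# Hypothetical example payload; no network call is made here.
print(build_alert("node-017", "InfiniBand cable unplugged"))
```

A monitoring script could call `send_alert` whenever a health check fails; how Selene actually detects a loose cable is not covered in the article.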

Selene is currently working on about a thousand tasks, mostly oriented around AI development and neural network training. Its spare cycles are dedicated to coronavirus research.


 
7th fastest. WoW! LOL Who was 7th fastest in the 100 meters at the last Olympics? No one knows.
 
If you're in charge of one of the 1,000 tasks being worked on, what is the difference between having 1/1,000th of a supercomputer vs. any ordinary project that just orders up however many resources it needs from AWS or similar?
 
Well, that makes it kind of embarrassing that our team of 8 hasn’t been able to install a new blade array since the pandemic started in March...
 