Next-gen Nvidia GPUs will power the Big Red 200 supercomputer by fall

mongeese

In a nutshell: Indiana University has announced that its Big Red 200 supercomputer will be upgraded with new, unreleased Nvidia hardware by fall this year. Consider the Ampere architecture confirmed.

The Big Red 200 is a new $9.6 million supercomputer being built for Indiana University. It was originally designed to use Tesla V100 GPUs, but at the last minute, Nvidia offered to upgrade it with the V100’s successor, probably called the Tesla A100. The upgrade brought its theoretical FP64 performance up from 6 PetaFLOPS to just under 8 PetaFLOPS.

The new GPUs won’t arrive in time for Big Red’s original schedule, so Indiana University has split its construction into two phases. The first is the CPU-powered half. Built on the Cray Shasta architecture, each of its 672 nodes will have two 64-core AMD Epyc 7742 processors. That’s 128 cores per node and 86,016 cores in total. The first phase is expected to come online soon.

The second phase will introduce 256 of Nvidia’s new “Tensor Core GPUs” in additional nodes. These nodes are described as being “architecturally similar” to those of Perlmutter (another supercomputer). If that’s accurate, then each node will have four GPUs and one AMD Epyc 7742, making for 64 new nodes. This would bring Big Red’s total CPU core count to 90,112, as the arithmetic below shows.
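For the curious, here is the core-count arithmetic as a quick sketch. The four-GPUs-plus-one-CPU node layout for phase two is the Perlmutter-like assumption described above, not a confirmed spec:

```python
# Core-count arithmetic for Big Red 200's two phases.
phase1_nodes = 672
phase1_cores = phase1_nodes * 2 * 64    # two 64-core Epyc 7742s per node
print(phase1_cores)                     # 86016

# Assumed Perlmutter-like layout: four GPUs and one Epyc 7742 per node.
phase2_nodes = 256 // 4                 # 256 GPUs, four per node -> 64 nodes
phase2_cores = phase2_nodes * 64
print(phase1_cores + phase2_cores)      # 90112
```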

"The combination of the AMD Rome CPUs and the next-generation NVIDIA Tensor Core GPUs are well-matched to the needs of IU researchers for simulation, AI, and many forms of research” - Brad Wheeler

Intriguingly, Indiana University’s CIO Brad Wheeler has described the new GPUs as being 70-75% more powerful than a Tesla V100. The V100 delivers between 7 and 8.2 TeraFLOPS at FP64, implying that the new card could offer around 13±1 TeraFLOPS.

It’s also possible to work backward from a total system performance bracket of ~8 PetaFLOPS. Doing so yields a per-GPU theoretical performance bracket of 14.5±0.8 TeraFLOPS.
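Here is a rough sketch of how both brackets can be reproduced. The 16 FP64 FLOPS per core per clock figure for the Epyc 7742 (its two 256-bit FMA pipes) and the clock choices are assumptions for illustration, not figures from the article:

```python
# Forward estimate: scale the V100's FP64 range by Wheeler's 70-75% figure.
v100_lo, v100_hi = 7.0, 8.2                   # TFLOPS, model-dependent
print(v100_lo * 1.70, v100_hi * 1.75)         # ~11.9 to ~14.4, i.e. 13±1

# Backward estimate: subtract an assumed CPU contribution from ~8 PFLOPS and
# split the remainder across 256 GPUs. The answer swings with the clock you
# assume for the 1,408 Epyc 7742s (16 FP64 FLOPS per core per clock).
for ghz in (2.25, 2.8, 3.4):                  # base / mid / max boost clocks
    cpu_tflops = 1408 * 64 * ghz * 16 / 1000  # total CPU FP64, in TFLOPS
    print(ghz, round((8000 - cpu_tflops) / 256, 1))  # per-GPU TFLOPS
```

The 14.5±0.8 TeraFLOPS bracket sits inside the spread this produces, so the backward estimate is only as good as the CPU clock assumption behind it.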

Both of those estimates would be too unreliable to be worth mentioning if they weren’t so close to each other. As it is, though, they’re food for thought.


 
I wonder if Nvidia are planning to switch back to the shader unit structure used in Volta - the GV100 chip has 32 FP64 units per SM, whereas the TU102 has just 2 FP64 units per SM. This is why the Tesla V100 can churn out 7 TFLOPs of FP64 FMA calculations (1380 MHz boost clock x 80 SMs x 32 FP64 units x 2 ops in an FMA calc) while the Quadro RTX 8000 can only do 0.51 TFLOPs (1770 MHz x 72 SMs x 2 FP64 units x 2 ops).
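That arithmetic as a quick sketch, using the figures above:

```python
def fp64_tflops(boost_mhz, sms, fp64_per_sm):
    # Peak rate = clock x SM count x FP64 units per SM x 2 ops per FMA.
    return boost_mhz * sms * fp64_per_sm * 2 / 1e6

print(fp64_tflops(1380, 80, 32))  # Tesla V100 (GV100): ~7.07 TFLOPs
print(fp64_tflops(1770, 72, 2))   # Quadro RTX 8000 (TU102): ~0.51 TFLOPs
```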

To get the RTX architecture up to 14 TFLOPs, for the same boost clock it currently has, would need something like 120-odd SMs with 32 FP64 units per SM, or some variant thereof. This would be an almighty increase in die size, even on TSMC's 7nm process node. Or, it could be a case of no FP64 units at all, just more SMs/FP32 units and they're combined in the same way that the FP32 SIMD units in the current Navi GPUs are to do FP64 calcs. But then you'd need almost double the number of FP32 units found in the TU102 to go from 16 TFLOPs FP32 rate to 14 TFLOPs FP64 rate (at a ratio of 1:2). Again, you end up with another huge chip.
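Back-solving with the same formula shows where "120-odd" comes from (the 1770 MHz clock is just the RTX 8000's boost clock reused as a stand-in):

```python
from math import ceil

# SMs needed for 14 TFLOPs with Volta-style SMs (32 FP64 units each).
print(ceil(14e6 / (1770 * 32 * 2)))  # 124 SMs

# The no-FP64-units route: pairing FP32 units at a 1:2 rate means 14 TFLOPs
# of FP64 needs 28 TFLOPs of FP32, versus the TU102's 16.
print(28 / 16)                       # 1.75x the FP32 units at equal clocks
```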

It would have been nice to have known exactly what nextplatform.com was referring to when they said "[t]he newer silicon is expected to deliver 70 percent to 75 percent more performance than that of the current generation."

Edit: A thought just occurred to me. Maybe Wheeler was comparing the new incoming Tesla cards to the current ones, which are still Volta GV100 models. So the performance quote might be about that, and not new architecture versus Turing.

Yeah? But can it play DOOM? If it can't play DOOM, it's hardly worth using! ;)
Play it? Sure! No visual outputs on those V100 cards though, and the replacement will be the same - so nobody gets to see it. Of course, it could run a wee bit of code to write the output in ASCII :)
 
It would have been nice to have known exactly what nextplatform.com was referring to when they said "[t]he newer silicon is expected to deliver 70 percent to 75 percent more performance than that of the current generation."

Edit: A thought just occurred to me. Maybe Wheeler was comparing the new incoming Tesla cards to the current ones, which are still Volta GV100 models. So the performance quote might be about that, and not new architecture versus Turing.
I think most data center people view Turing and Volta as the same generation. From an architectural standpoint, the two have more in common than they do with Pascal. A Turing SM just swaps out the 32 FP64 cores in a GV100 SM for an extra 32 FP32 ones. They also have the same INT32 core configuration.

I suspect that the Ampere Tesla card could use a similar structure to the GP100, with 64 FP64 cores per SM. They might go for an equal distribution of cores, i.e. 64 FP64, 64 FP32, 64 INT32. Or they could reduce the number of INT32 cores from Volta, and have 64 FP64, 32 FP32, 32 INT32 (preserving the 128-core total they have an affinity for).

Either would probably result in a ~70% FP64 increase within the same 72-84 SM range, depending on how Nvidia want to set that up, and on clock speeds - see the sketch below.
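Plugging the hypothetical 64-FP64-per-SM layout into the earlier formula, with a GV100-like 1380 MHz clock assumed:

```python
def fp64_tflops(mhz, sms, fp64_per_sm):
    # Peak rate = clock x SMs x FP64 units x 2 ops per FMA.
    return mhz * sms * fp64_per_sm * 2 / 1e6

v100 = fp64_tflops(1380, 80, 32)        # ~7.1 TFLOPs baseline
for sms in (72, 80, 84):                # speculative SM counts
    t = fp64_tflops(1380, sms, 64)      # 64 FP64 cores per SM
    print(sms, round(t, 1), f"+{(t / v100 - 1):.0%}")
```

At that clock the uplift comes out a little above 70% even at 72 SMs, so slightly lower clocks would land it in Wheeler's bracket.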

Edit: There's also an interesting leak by @/corgikitty on Twitter. I don't know about its reliability, but they have some GA10x architecture diagrams. They say consumer Ampere has 64 INT32 cores and two 64-core FP32 pipelines per SM, so 128 FP32 cores in total per SM. This would probably mean that the Ampere Tesla chip would have an equal distribution of core types.
 
Edit: There's also an interesting leak by @/corgikitty on Twitter. I don't know about its reliability, but they have some GA10x architecture diagrams. They say consumer Ampere has 64 INT32 cores and two 64-core FP32 pipelines per SM, so 128 FP32 cores in total per SM. This would probably mean that the Ampere Tesla chip would have an equal distribution of core types.
This is true for Turing now, so I'd be surprised if Nvidia moved away from that layout for Ampere. It could well have larger SMs (i.e. 8 blocks of 32 FP/INT units, compared to the current 4) but fewer of them in total. However, this would lead to a decrease in warp occupancy of the SMs in current GPU workloads such as games, leading to a lower level of efficiency.

Looking at corgikitty's Twitter feed, the following diagram is shown:

[Image: purported GA10x SM diagram from @corgikitty's tweet]


If we assume that this is genuine, then there's not a significant change from Turing:

[Image: Turing SM diagram for comparison]


You can see that the numbers of FP32 and Tensor cores have both doubled, whereas the number of INT32 units remains the same, which is a sensible choice. L1 cache has increased a touch, but I'm surprised (again, assuming the Twitter image is genuine) that the register file hasn't increased in size.
 
Phase 1 CPUs: 672 nodes * 2 CPUs * 64 cores * 2.25 GHz * 16 FP64 FLOPS per clock = 3.096576 PFLOPS
Phase 2 CPUs: 64 * 1 * 64 * 2.25 * 16 = 0.147456 PFLOPS
Phase 2 GPUs: 64 nodes * 4 GPUs * 8.2 TFLOPS * (1 + 0.75) = 3.6736 PFLOPS

Altogether = 6.92 PFLOPS. Still less than 8 PFLOPS.
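The same tally in runnable form, keeping the poster's assumptions (Epyc 7742 at its 2.25 GHz base clock with 16 FP64 FLOPS per core per clock, and the new GPUs at 8.2 TFLOPS x 1.75):

```python
phase1_cpu = 672 * 2 * 64 * 2.25 * 16 / 1e6  # 3.096576 PFLOPS
phase2_cpu = 64 * 1 * 64 * 2.25 * 16 / 1e6   # 0.147456 PFLOPS
phase2_gpu = 64 * 4 * 8.2 * 1.75 / 1e3       # 3.6736 PFLOPS
print(phase1_cpu + phase2_cpu + phase2_gpu)  # ~6.92 PFLOPS, short of 8
```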
 