Nvidia's data center customers are postponing Blackwell chip orders due to overheating and other issues

Skye Jacobs

What just happened? Some of Nvidia's top enterprise customers are reportedly delaying orders for the latest Blackwell chip racks due to overheating issues and glitches in chip connectivity. The news has sent ripples through the tech industry and financial markets, with Nvidia shares falling roughly four percent in early trading.

The Information reports that Blackwell GB200 racks, crucial components in data centers, have exhibited problems during initial deployments. The root cause is the unprecedented power consumption of these cutting-edge GPUs: each rack draws a staggering 120 to 132 kW, and that extreme power density has pushed traditional cooling systems to their limits.
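
For scale, here is a rough back-of-the-envelope comparison; the 10 to 15 kW budget for a traditional air-cooled rack is a common rule-of-thumb assumption, not a figure from The Information.

# Rough sketch: compare the reported GB200 rack draw with an assumed
# budget for a conventional air-cooled rack. The 10-15 kW figure is an
# assumption about legacy racks, not a number from the report.
gb200_rack_kw = (120, 132)      # reported draw per GB200 NVL72 rack
air_cooled_rack_kw = (10, 15)   # assumed conventional air-cooled rack budget

low_ratio = gb200_rack_kw[0] / air_cooled_rack_kw[1]    # 120 / 15 = 8x
high_ratio = gb200_rack_kw[1] / air_cooled_rack_kw[0]   # 132 / 10 = 13.2x
print(f"Roughly {low_ratio:.0f}x to {high_ratio:.0f}x the heat of a typical air-cooled rack")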

Additionally, initial shipments of Blackwell racks revealed interconnect glitches, hampering efficient heat distribution and creating problematic hotspots. The complex multi-chip module design, which integrates two large GPU dies on a single package, further exacerbates the heat management challenges.

As deployments scale, with configurations featuring up to 72 Blackwell chips per rack, these thermal inefficiencies compound dramatically. The current server rack designs have proven insufficient to handle the extreme thermal output, prompting Nvidia to request multiple design modifications from its suppliers. Resolving these issues will likely require a combination of chip-level optimizations, the development of more advanced cooling solutions, and a complete overhaul of server rack infrastructure.

Some of Nvidia's biggest buyers, including Microsoft, Amazon Web Services, Google, and Meta Platforms, have reduced their orders for the Blackwell GB200 racks. These hyperscalers had placed orders worth $10 billion or more for the new technology. The impact of these order reductions could be significant.

For instance, Microsoft had initially planned to install GB200 racks with at least 50,000 Blackwell chips in one of its Phoenix facilities. However, as delays emerged, Microsoft's key partner, OpenAI, requested Nvidia's older generation 'Hopper' chips instead.

Despite these setbacks, how these order reductions ultimately affect Nvidia's sales remains unclear. Other potential buyers for the GB200 server racks may exist, even with the reported issues.

Nvidia CEO Jensen Huang has denied earlier media reports of overheating problems during initial testing of a flagship liquid-cooled server containing 72 of the new chips. In November, Huang also stated that the company was on track to exceed its earlier target of recording several billion dollars in revenue from Blackwell chips in its fourth fiscal quarter.

Nvidia and Amazon have declined to comment on the situation, while Microsoft, Google, and Meta have not yet responded to requests for comment.

Don't worry folks, serial AI hyper Huang will soon start hyping Rubin, the Blackwell successor that will see power figures rise dramatically again.
 
Why not just put fewer GPUs in each rack in the meantime?
[Image: Nvidia Blackwell GB200 NVL72 compute and interconnect nodes]

Each rack contains 18 compute nodes and 9 interconnect nodes with a switch on top. I guess it's the system design. NVLink scales up to 576 GPUs.
576 / 4 = 144 nodes. 144 / 18 = 8 racks.
8 racks * 120 kW = 960 kW. That's almost one megawatt. Kookoo!
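
A quick sanity check of that arithmetic; the 4-GPUs-per-compute-node and 18-nodes-per-rack layout is taken from the comment above and treated as an assumption about the NVL72 design.

# Verify the rack and power arithmetic from the comment above.
gpus_total = 576        # NVLink domain size cited in the comment
gpus_per_node = 4       # assumed GPUs per compute node
nodes_per_rack = 18     # compute nodes per rack
rack_power_kw = 120     # low end of the reported per-rack draw

nodes = gpus_total / gpus_per_node      # 144 compute nodes
racks = nodes / nodes_per_rack          # 8 racks
total_kw = racks * rack_power_kw        # 960 kW, just shy of a megawatt
print(nodes, racks, total_kw)           # 144.0 8.0 960.0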
 
132 kW means roughly 450,400 BTU/hr of cooling is required for a single rack. That's the capacity of 37.5 house chillers rated at 12,000 BTU/hr each. Just crazy.
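
The conversion behind that figure, sketched below; the "house chiller" comparison uses the commenter's 12,000 BTU/hr unit size, which is an informal benchmark rather than a data center cooling spec.

# Convert the per-rack load from kW to BTU/hr, then express it as a
# count of 12,000 BTU/hr household air conditioners.
rack_kw = 132
btu_per_hr_per_kw = 3412.14
btu_per_hr = rack_kw * btu_per_hr_per_kw   # about 450,400 BTU/hr
window_units = btu_per_hr / 12_000         # about 37.5 units
print(round(btu_per_hr), round(window_units, 1))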

And likely still more efficient, since it packs more compute into a far denser space.

The waste heat could be used for other purposes - I used to live in a house that got its heating from a factory a few miles away (waste heat), and the water was pretty much at boiling point by the time it arrived.

It was a water-in, water-out setup, where a heat exchanger turned that heat into usable heating for my house.
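
As a purely hypothetical illustration of that idea (the 10 kW per-home heating demand below is my own assumption, not a figure from the thread), one liquid-cooled rack's waste heat could in principle warm a handful of homes.

# Hypothetical: heat rejected by one ~120 kW rack versus an assumed
# 10 kW winter heating demand per home, ignoring transport losses.
rack_heat_kw = 120
home_heating_demand_kw = 10
homes_heated = rack_heat_kw / home_heating_demand_kw
print(homes_heated)     # 12.0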

 
Each rack contains 18 compute nodes and 9 interconnect nodes with a switch on top. I guess it's the system design. NVLink scales up to 576 GPUs.
576 / 4 = 144 nodes. 144 / 18 = 8 racks.
8 racks * 120 kW = 960 kW. That's almost one megawatt. Kookoo!

There are also B200 server chassis with 8 GPUs per chassis. I work at a data center where a customer will be installing and putting into production some of the first, if not the first, B200s.
I think the configuration will be 4 Nvidia servers per cab at around 60 kW. The current cooling capacity at my site is probably no more than 75 kW per cab.
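
The headroom implied by those numbers, sketched below; the 60 kW cabinet load and 75 kW cooling capacity are the commenter's rough site figures, not Nvidia specifications.

# Cooling headroom per cabinet using the rough figures above.
cab_load_kw = 60        # assumed: 4 B200 chassis per cabinet
cab_cooling_kw = 75     # approximate cooling capacity per cabinet
headroom_kw = cab_cooling_kw - cab_load_kw     # 15 kW of margin
utilization = cab_load_kw / cab_cooling_kw     # 0.8 -> 80%
print(headroom_kw, f"{utilization:.0%}")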

 