Nvidia's data center customers are postponing Blackwell chip orders due to overheating and other issues

Skye Jacobs

What just happened? Some of Nvidia's top enterprise customers are reportedly delaying orders for the latest Blackwell chip racks due to overheating issues and glitches in chip connectivity. The news has sent ripples through the tech industry and financial markets, with Nvidia shares falling roughly four percent in early trading.

The Information reports that Blackwell GB200 racks, crucial components in data centers, have exhibited problems during initial deployments. The root cause is the unprecedented power consumption of these cutting-edge GPUs: each rack draws a staggering 120 to 132 kW, and that extreme power density has pushed traditional cooling systems to their limits.
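
For scale, here is a rough back-of-the-envelope comparison; the 10 to 15 kW budget for a traditional air-cooled rack is a common rule-of-thumb assumption, not a figure from The Information.

# Rough sketch: compare the reported GB200 rack draw with an assumed
# budget for a conventional air-cooled rack. The 10-15 kW figure is an
# assumption about legacy racks, not a number from the report.
gb200_rack_kw = (120, 132)      # reported draw per GB200 NVL72 rack
air_cooled_rack_kw = (10, 15)   # assumed conventional air-cooled rack budget

low_ratio = gb200_rack_kw[0] / air_cooled_rack_kw[1]    # 120 / 15 = 8x
high_ratio = gb200_rack_kw[1] / air_cooled_rack_kw[0]   # 132 / 10 = 13.2x
print(f"Roughly {low_ratio:.0f}x to {high_ratio:.0f}x the heat of a typical air-cooled rack")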

Additionally, initial shipments of Blackwell racks revealed interconnect glitches, hampering efficient heat distribution and creating problematic hotspots. The complex multi-chip module design, which integrates two large GPU dies on a single package, further exacerbates the heat management challenges.

As deployments scale, with configurations featuring up to 72 Blackwell chips per rack, these thermal inefficiencies compound dramatically. The current server rack designs have proven insufficient to handle the extreme thermal output, prompting Nvidia to request multiple design modifications from its suppliers. Resolving these issues will likely require a combination of chip-level optimizations, the development of more advanced cooling solutions, and a complete overhaul of server rack infrastructure.

Some of Nvidia's biggest buyers, including Microsoft, Amazon Web Services, Google, and Meta Platforms, have reduced their orders for the Blackwell GB200 racks. These hyperscalers had placed orders worth $10 billion or more for the new technology. The impact of these order reductions could be significant.

For instance, Microsoft had initially planned to install GB200 racks with at least 50,000 Blackwell chips in one of its Phoenix facilities. However, as delays emerged, Microsoft's key partner, OpenAI, requested Nvidia's older generation 'Hopper' chips instead.

Despite these setbacks, how these order reductions ultimately affect Nvidia's sales remains unclear. Other potential buyers for the GB200 server racks may exist, even with the reported issues.

Nvidia CEO Jensen Huang has denied earlier media reports of overheating problems during initial testing of a flagship liquid-cooled server containing 72 of the new chips. In November, Huang also stated that the company was on track to exceed its earlier target of recording several billion dollars in revenue from Blackwell chips in its fourth fiscal quarter.

Nvidia and Amazon have declined to comment on the situation, while Microsoft, Google, and Meta have not yet responded to requests for comment.

Don't worry folks, serial AI hyper Huang will soon start hyping Rubin, the Blackwell successor that will see power figures rise dramatically again.
 
Why not just put fewer GPUs in each rack in the meantime?
[Image: Nvidia Blackwell GB200 NVL72 compute and interconnect nodes]

Each rack contains 18 compute nodes and 9 interconnect nodes with a switch on top. I guess it's the system design. NVLink scales up to 576 GPUs.
576 / 4 = 144 nodes. 144 / 18 = 8 racks.
8 racks * 120 kW = 960 kW. That's almost one megawatt. Kookoo!
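
A quick sanity check of that arithmetic; the 4-GPUs-per-compute-node and 18-nodes-per-rack layout is taken from the comment above and treated as an assumption about the NVL72 design.

# Verify the rack and power arithmetic from the comment above.
gpus_total = 576        # NVLink domain size cited in the comment
gpus_per_node = 4       # assumed GPUs per compute node
nodes_per_rack = 18     # compute nodes per rack
rack_power_kw = 120     # low end of the reported per-rack draw

nodes = gpus_total / gpus_per_node      # 144 compute nodes
racks = nodes / nodes_per_rack          # 8 racks
total_kw = racks * rack_power_kw        # 960 kW, just shy of a megawatt
print(nodes, racks, total_kw)           # 144.0 8.0 960.0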
 
132 kW means roughly 450,400 BTU/hr of cooling is required for a single rack. That's the capacity of 37.5 house chillers rated at 12,000 BTU/hr each. Just crazy.
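
The conversion behind that figure, sketched below; the "house chiller" comparison uses the commenter's 12,000 BTU/hr unit size, which is an informal benchmark rather than a data center cooling spec.

# Convert the per-rack load from kW to BTU/hr, then express it as a
# count of 12,000 BTU/hr household air conditioners.
rack_kw = 132
btu_per_hr_per_kw = 3412.14
btu_per_hr = rack_kw * btu_per_hr_per_kw   # about 450,400 BTU/hr
window_units = btu_per_hr / 12_000         # about 37.5 units
print(round(btu_per_hr), round(window_units, 1))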

And likely still more efficient, since it packs more compute into a far denser space.

The waste heat could be used for other purposes - I used to live in a house that got its heating from a factory a few miles away (waste heat), and the water was pretty much at boiling point by the time it arrived.

It was a water-in, water-out setup, where a heat exchanger turned that heat into usable heating for my house.
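
As a purely hypothetical illustration of that idea (the 10 kW per-home heating demand below is my own assumption, not a figure from the thread), one liquid-cooled rack's waste heat could in principle warm a handful of homes.

# Hypothetical: heat rejected by one ~120 kW rack versus an assumed
# 10 kW winter heating demand per home, ignoring transport losses.
rack_heat_kw = 120
home_heating_demand_kw = 10
homes_heated = rack_heat_kw / home_heating_demand_kw
print(homes_heated)     # 12.0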

 
Each rack contains 18 compute nodes and 9 interconnect nodes with a switch on top. I guess it's the system design. NVLink scales up to 576 GPUs.
576 / 4 = 144 nodes. 144 / 18 = 8 racks.
8 racks * 120 kW = 960 kW. That's almost one megawatt. Kookoo!

There are also B200 server chassis with 8 GPUs per chassis. I work at a data center where a customer will be installing and putting into production some of the first, if not the first, B200s.
I think the configuration will be 4 Nvidia servers per cab at around 60 kW. The current cooling capacity at my site is probably no more than 75 kW per cab.
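
The headroom implied by those numbers, sketched below; the 60 kW cabinet load and 75 kW cooling capacity are the commenter's rough site figures, not Nvidia specifications.

# Cooling headroom per cabinet using the rough figures above.
cab_load_kw = 60        # assumed: 4 B200 chassis per cabinet
cab_cooling_kw = 75     # approximate cooling capacity per cabinet
headroom_kw = cab_cooling_kw - cab_load_kw     # 15 kW of margin
utilization = cab_load_kw / cab_cooling_kw     # 0.8 -> 80%
print(headroom_kw, f"{utilization:.0%}")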

 