Nvidia RTX 5000 cards show PCB hotspots that threaten longevity, says Igor's Lab

midian182

A hot potato: It's not like Nvidia's RTX 5000 series needs any more negative press, but here we are again. Igor Wallossek of Igor's Lab has discovered a problem that appears to be present in most Blackwell AIB partner cards: local hotspots at the rear of the PCBs. This could potentially lead to the cards being damaged over time due to heavy use.

During a sustained "torture loop" on a PNY RTX 5070 OC and Palit RTX 5080 Gaming Pro OC, Wallossek recorded temperature spikes in the power delivery areas of the cards. The RTX 5070 reached 107°C while the GPU core sat at a much cooler 70°C, and the RTX 5080 Gaming Pro OC peaked at 80.5°C.

Wallossek writes that the problem lies with the tightly clustered arrangement of FETs, chokes, drivers and via arrays in the area that funnel hundreds of amps through a postage-stamp-sized patch of copper layers. PCB traces only 35 – 70 µm thick must share the load through stacked power planes, concentrating heat vertically and laterally.

While the power delivery components have been placed this close together to keep the card design as small as possible, the increased temperatures could have a severe impact on the cards' long-term durability – they could last only a few years with heavy use.

Wallossek also points to shortcomings in Nvidia's Thermal Design Guide, made available to all manufacturers of AIB cards and components. Key parameters in the guide are validated under ideal lab conditions rather than worst-case, high-humidity, poorly ventilated chassis scenarios that real gamers endure.

Igor's quick-and-dirty fix – adding a strip of thermal putty and a thicker pad to bridge the hotspot to the backplate – led to significant temperature reductions. The RTX 5080's hotspot dropped from 80.5°C to 70.3°C, while the RTX 5070 dropped from 107.3°C to "well below 95°C," which is still quite hot.

Igor's measurements show VRM zones flirting with the electromigration onset temperature of around 80°C, or even exceeding the roughly 105°C glass-transition temperature of standard PCB resin, during heavy gaming loads.
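To put the reported readings next to those two thresholds, here is a minimal Python sketch. The threshold constants are the approximate figures quoted in the article, not datasheet values, and the `classify` helper and dictionary layout are purely illustrative. The post-mod RTX 5070 reading is left out because the article only gives it as "well below 95°C".

```python
# Approximate thresholds as quoted in the article (not datasheet values).
ELECTROMIGRATION_ONSET_C = 80.0
PCB_GLASS_TRANSITION_C = 105.0

# Hotspot readings reported in the article, in degrees Celsius.
readings = {
    "RTX 5080 stock": 80.5,
    "RTX 5080 with putty/pad": 70.3,
    "RTX 5070 stock": 107.3,
}


def classify(temp_c: float) -> str:
    """Place a hotspot temperature relative to the two quoted thresholds."""
    if temp_c >= PCB_GLASS_TRANSITION_C:
        return "above glass transition"
    if temp_c >= ELECTROMIGRATION_ONSET_C:
        return "above electromigration onset"
    return "below both thresholds"


for name, temp_c in readings.items():
    print(f"{name}: {temp_c:.1f} C, {classify(temp_c)}")
```

On these numbers, only the modified RTX 5080 lands below both thresholds.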

Nvidia has not yet responded to the reports. Several engineers posting in Igor's Lab's comment thread suggest the company relies heavily on partner self-certification, with GPU and memory sensors the only metrics checked in its Green Light program – an internal compliance-and-certification system that every AIB partner must pass before it can ship a GeForce graphics card.

 
You know, as Nvidia gets worse and worse at QC, we can probably stop treating the individual vendors and partners as the only ones at fault for their designs. It's also been lingering across multiple generations, so there was time to fix these issues.

What I think is that Nvidia is probably now designing its GPUs exclusively for a data center environment, where extremely effective but extremely loud cooling solutions are the norm. This is mostly uneducated conjecture on my end, I know, but I'm thinking Nvidia's PCB design (which seems to be potentially at fault here) is built with a brute-force cooling solution in mind, then stepped down in total power, given stricter safeguards like more aggressive underclocking and undervolting, and shipped to consumers.

That would explain why consumer cards are degrading and exhibiting all these problems already: they were initially conceived as server products and then get shoved into next-to-no-airflow situations. Or worse, the PCB was designed from the ground up to be water cooled and now has to somehow manage on air with (compared to a server) basically no airflow.
 
This explains why Nvidia "removed" the Hot Spot sensor from their RTX 50 GPUs.

I guess if something is not looking good, you'd better not pay attention anymore :)

Same goes for the heat and the amp headroom on the 12VHPWR connector pins when going 600W+ with the 5090. If you do not own the ASUS Astral 5090, you are completely in the dark about what is going on with your 5090, the load balance and the heat on the pins and wires. Unless, of course, something is burning, then you know instantly ;-)

Blackwell is really BY FAR the worst and most rushed out Nvidia GPU gen ever.
 
I think this issue started surfacing with the introduction of Ampere. It's just getting worse as Nvidia attempts to make the boards more compact while power requirements rise. But of course, this is good for them, since components on the card age faster and will need to be replaced sooner.
 
I didn't totally get it. Is this a worst case scenario, or should we be worried by regular daily use?

It will likely break at some point and won't last as long as some cards, which manage a lifespan of more than 10 years. In short, capacitor ratings are based on temperature: the hotter they run, the shorter they live. If you constantly run caps at or beyond their ideal rating (e.g. 60 degrees or so), they will not last as long as caps running at 30 to 40 degrees.
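As a rough illustration of "the hotter, the shorter", here is a minimal Python sketch of the common rule of thumb for aluminum electrolytic capacitors: expected life roughly doubles for every 10°C below the rated temperature. This is an approximation, not a datasheet model, and polymer or ceramic capacitors on modern cards may follow different curves. The function name and the 2,000-hour / 105°C rating are hypothetical example values, not figures from this thread.

```python
def estimated_life_hours(rated_life_hours: float,
                         rated_temp_c: float,
                         actual_temp_c: float) -> float:
    """Rule-of-thumb estimate: life doubles per 10 C below the rated temperature."""
    return rated_life_hours * 2 ** ((rated_temp_c - actual_temp_c) / 10)


# Hypothetical capacitor rated for 2,000 hours at 105 C.
RATED_LIFE_HOURS = 2000
RATED_TEMP_C = 105

for temp_c in (105, 95, 85, 65):
    hours = estimated_life_hours(RATED_LIFE_HOURS, RATED_TEMP_C, temp_c)
    print(f"{temp_c} C: about {hours:,.0f} hours of continuous operation")
```

The takeaway is the ratio, not the absolute numbers: under this rule, every 10 degrees shaved off a component's temperature roughly doubles its expected life.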

VRMs can run at well over 100 degrees before the card starts to limit clocks or power consumption. But whether that's healthy is a different story. I think Nvidia's whole design is like an Apple iPhone design: super complex, expensive and difficult. It benefits Nvidia, yes, but not the end consumer.

If you're worried about it, ramp up the fan speed manually, for example to a fixed 55% to 75% while gaming.
 
All vendors / AIBs want to screw you, the customer, by taking shortcuts in the design process to "cost save" on your behalf. Real R&D is just a distant dream in 2025; every company is run by losers with business manuals in their heads, and they fired the real R&D people.
 
AMD, Intel and Nvidia all create a reference design, a sort of blueprint for the PCB that AIBs can use. Some improve on it, some leave it as is and slap their own sticker on it, and some skimp on components such as VRMs, capacitors and chokes, either to make a little more profit per card sold or to be cheaper, since they can't get the discounts the larger AIBs get.

Now, there's nothing wrong with a card that has lower-grade VRM components. It will still run at stock speeds and do everything a reference design does, but things like overclocking could be a bit more difficult, since the voltage supplied at higher clocks might carry a bit more noise or ripple.

Since OC'ing is dead anyway, I'd say just grab the card you're good with and call it a day. I always run my cards at a manual fan speed, which keeps things below 70 degrees. As I understand it, that helps preserve the lifetime of the components or card. Not that I'm going to use them for 10+ years, but it's peace of mind knowing the card won't die from heat.
 
Also, undervolting is always a good idea for longevity, as is setting a power limit (PL) well under 100% if you don't need all the horsepower of the card (you can always raise it over the years as game requirements rise).

Also, cases like the Antec C8 or Fractal Design Torrent (or any case with lots of unobstructed bottom intake options directly to the GPU) will greatly help to keep the hardware healthy for as long as possible.
 
The whole reason AMD, Intel or Nvidia apply higher voltages than seems necessary is to guarantee the chips work everywhere. In your environment or mine things might be cool, but in cases with zero airflow those cards run hot, and hot chips need a higher voltage to stay stable. They are designed to operate at 80 degrees or higher, and that's where the extra voltage margin comes from.

When chips run cooler, they consume less power. When chips run hot, they actually consume more. This is why there's headroom once you start dialing down voltages here and there, and the chip may even run faster than it does out of the box. The same applies to CPUs: when you cool them sufficiently, they actually need lower voltages than the default.
 
Rushed, but atypically about a quarter later than normal. Typically Nvidia would have launched a GPU succession around autumn/Q4 2024; this dropped on the last day of January 2025. It's not like they had 27 months, a similar node, bottomless cash on hand and all the AI power, and still came up with this.
This really foreshadows that the rise of AI is likely coming at the cost of everything: quality, software/drivers, reputation, customer service, etc.!
 
True. Any idea how this whole Chinese ban will affect them?
Also, it seems that gaming revenue took a significant hit, from 3.27 billion down to 2.55 billion.
It will affect their revenue greatly, like it did with the 4090/4090D under the Biden administration. But this needs to be seen in context: their revenue spiked a lot, and they are trying to navigate the political waters between the US and China as well as they can, I guess.

And with this administration and the tariff war with China, things will get even more complex. Politically, China seems to be the main antagonist for the next few years.

Nvidia is just trying to stay on good terms with Lutnick/Trump to consolidate their data center sales (with China, they could increase to about 20B USD in revenue p.a.). Also, China has its ways of importing GPUs through dark channels, and they are getting very good at actively re-engineering hardware (soldering MUCH more VRAM onto existing cards, upgrading the 4090D, etc.).

A couple of days ago Nvidia took down the 5090D without being forced to. That should tell us something:
https://wccftech.com/nvidia-reportedly-prepares-for-a-ban-on-the-geforce-rtx-5090d-in-china/
 
True. Any idea how much revenue Nvidia is losing from lack of supply, with scalpers charging as much as double and currently 50% over MSRP? Luckily for Nvidia, there is a gaming market to fall back on to lighten the impact of the Chinese ban. They can remarket the 5090D as a 5080 Ti in western markets and watch gamers eat it up, imo.
 
To be honest I think nowadays they are only serving the gaming market for sentimental reasons and because this department simply (still) exists.

But if 90%+ of revenue comes from the data center business, most of the focus is not on B2C, and it won't come back, I fear. They don't give a cr** about supply, scalpers, 12VHPWR shortcomings at 600W, insufficient in-house testing or even the ROP disaster, because this is B2C and it's of minor importance in 2025. They don't even comment on most of those issues, neither denying them nor putting the blame on others. Simple silence.

Just sad how they treat their private customers and the gaming branch as a whole (which kept them alive through all the years they were struggling, before Nov 2022 and later).
 