Nvidia says resolution upscaling like DLSS (and not native resolution) is the future


Yes, they started using chiplets in the Instinct line. But creating an efficient design for games requires overcoming shortcomings such as latency between chips, bandwidth, etc.
Yeah, chiplets for gaming cards are a mirage, imo. Using ML for a more efficient single-die design will ultimately win in the end.
And if anyone gets MCM to work as intended in games, it'll be Nvidia, with their R&D budget.
AMD couldn't do that with RDNA3, and it doesn't look like they're even planning it for RDNA4 either.
Reality is, I don't suppose a GPU maker with single-digit market share will be interested in developing such expensive manufacturing technology for gaming GPUs. I mean, putting two/three/four dies on an interposer is easy, but getting them to scale is another thing. AMD can't even get their dual-issue FP32 to work as intended...
 
Yeah, chiplets for gaming cards are a mirage, imo. Using ML for a more efficient single-die design will ultimately win in the end.
And if anyone gets MCM to work as intended in games, it'll be Nvidia, with their R&D budget.
AMD couldn't do that with RDNA3, and it doesn't look like they're even planning it for RDNA4 either.
Reality is, I don't suppose a GPU maker with single-digit market share will be interested in developing such expensive manufacturing technology for gaming GPUs. I mean, putting two/three/four dies on an interposer is easy, but getting them to scale is another thing. AMD can't even get their dual-issue FP32 to work as intended...
AMD based its entire strategy, from the development of Zen onward, on chiplets and flexible designs. From then until now they have focused on this, and AMD now has more time invested, more IP, and more experience in this area than any other company on the planet. If AMD can't create a modular design for gaming, I wouldn't bet my chips on any other company doing it.

The RDNA3 design works exactly as intended, but brute force requires appropriate development on the devs' side; Starfield was one of the first games to use this, as the technical analysis shows: https://chipsandcheese.com/2023/09/...erformance-on-nvidias-4090-and-amds-7900-xtx/
 
Well, sure, Nvidia is going to want to push DLSS. They want you to see those high FPS with DLSS on current-gen cards, not compare full-resolution rendering, where (other than the RTX 4090) you have 40x0 cards that are "somewhat" but not much faster than the 30x0 equivalents, and in some cases even the 20x0 equivalents.

 
What does it mean for small game studios?
Will it be as easy as pressing a button to add upscaling to a game?
If not, they will not be able to use it, or rather will not be able to afford to do it.
In any case, I would rather avoid it on purpose, just to make sure I do
not support their evil plans.
Except DLSS looks way worse in reality; very blurry. I never use it; native looks far better and sharper.
Fully agree. With all this so-called upscaling I see no improvement at all, and I always end up running all games at native settings; the so-called 4K is not visible at all, nor does it improve anything.
So all this hype is worthless for me; I stay at native res and am happy with the clear image.
 
The RDNA3 design works exactly as intended

The 7900 GRE has 46 TFLOPS, the 6950 XT has 23.5 and less bandwidth.
Guess which one is faster.

brute force requires appropriate development on the devs' side; Starfield was one of the first games to use this
The 28% the 7900 XT gains over the 6950 XT is indeed close to the 30-35% that the 3080 brought over the 2080 Ti with 29.7 TF vs 13.5 TF (the same 2.2x ratio as 7900 XT vs 6950 XT), but I still don't get why AMD's dual-issue FP32 only works in Starfield, while Nvidia's Ampere dual-issue FP32 worked on launch day, in games that were obviously released before the dual-issue architecture existed.

If AMD's RDNA3 requires developers to optimize specifically for AMD's version of dual-issue FP32, I don't think we'll see many games performing like Starfield.
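
For context, here's roughly where those paper numbers come from (a back-of-the-envelope sketch, not official math: the shader counts are public specs, the clocks are approximate boost clocks, and the dual-issue factor is exactly the part that only shows up when instructions can actually be paired):

```cpp
#include <cstdio>

int main() {
    // Peak FP32 = shaders * 2 ops per FMA * clock, with an extra *2 on
    // RDNA 3 for the dual-issue (VOPD) path. Clocks are approximate.
    const double gre_shaders = 5120.0,   gre_clock = 2.245e9;  // RX 7900 GRE
    const double xt6950_shaders = 5120.0, xt6950_clock = 2.31e9; // RX 6950 XT

    const double gre_tflops = gre_shaders * 2.0 /*FMA*/ * 2.0 /*dual issue*/
                              * gre_clock / 1e12;              // ~46 TFLOPS
    const double xt6950_tflops = xt6950_shaders * 2.0 /*FMA*/
                                 * xt6950_clock / 1e12;        // ~23.6 TFLOPS

    std::printf("7900 GRE peak FP32 (dual issue counted): ~%.1f TFLOPS\n", gre_tflops);
    std::printf("6950 XT  peak FP32:                      ~%.1f TFLOPS\n", xt6950_tflops);
    // On paper the GRE has ~2x the throughput, but the second half of that
    // figure only materialises when the compiler can pair FP32 instructions.
    return 0;
}
```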
 
If NVIDIA believes that, then they ought to be looking at open standards. Otherwise it's self-serving.
 
What does it mean for small game studios?
Will it be as easy as pressing a button to add upscaling to a game?
If not, they will not be able to use it, or rather will not be able to afford to do it.
In any case, I would rather avoid it on purpose, just to make sure I do
not support their evil plans.
Yes. I actually found this doc for Unity, https://docs.unity3d.com/Packages/com.unity.render-pipelines.high-definition@12.0/manual/deep-learning-super-sampling-in-hdrp.html and it looks like it's almost exactly that -- you turn on a DLSS flag for your assets and a DLSS flag for your camera(s). I don't know what it does after that -- I found info from 4 or 5 years ago suggesting data was sent to Nvidia and the DLSS training ran on what was basically an AI supercomputer they operate, but perhaps it can do this training locally now. Either way, it sounds like the engine takes care of it for you; you just wait a while and get your results.
Fully agree. With all this so-called upscaling I see no improvement at all, and I always end up running all games at native settings; the so-called 4K is not visible at all, nor does it improve anything.
Indeed. After all, it's taking a lower-res render that will be missing visual detail and using a pre-trained model to "guesstimate" what that missing detail should look like. It seems logical that the BEST it could do, if DLSS were 100% perfect, is look as good as native. As much as Nvidia may want it to be the case... if it looks any "sharper" or "better" than native, it means DLSS is applying excess sharpening compared to the game engine's intent, or adding extra details the game engine is not actually putting into the scene.
 
but I still don't get why AMD's dual-issue FP32 only works in Starfield, while Nvidia's Ampere dual-issue FP32 worked on launch day, in games that were obviously released before the dual-issue architecture existed.

If AMD's RDNA3 requires developers to optimize specifically for AMD's version of dual-issue FP32, I don't think we'll see many games performing like Starfield.
This sounds like a shader compiler problem to me. CPUs tend to have aggressive instruction reordering built into them (usually... the Atoms didn't, and I'm not sure the Atom-based Intel "E-cores" do either) to make sure the CPU is running things as quickly as possible. GPUs don't. I suspect Starfield has shaders that issue FP32 instructions back-to-back, while some other games (designed assuming you don't have dual-issue FP32) issue some FP32 work, go do something else (on the assumption that a second set of FP32 work would have stalled, so they do something else instead), then issue the second set of FP32 work. I'm assuming Nvidia's shader compiler reorders the instructions to get back-to-back FP32 instructions while AMD's currently doesn't.

The good news there: most likely you'll get an AMD driver update with an improved shader compiler... you'll have to wait around while the game in question precompiles its shaders again... at which point you get the benefit of dual-issue FP32 in those games where right now you aren't, without the games having to change their shaders at all.
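
As a purely illustrative sketch (not actual shader code from any game, and the function names are made up), the difference between an instruction stream the scheduler can pair as-is and one that needs reordering might look like this:

```cpp
// Version A: two independent FP32 multiplies sit back-to-back, so a
// dual-issue-capable GPU can co-issue them without any reordering.
float shadeA(float a, float b, float c, float d) {
    float x = a * b;  // FP32 mul
    float y = c * d;  // FP32 mul, independent of x
    return x + y;
}

// Version B: the same math, but unrelated work is interleaved between the
// two multiplies. A compiler that aggressively reorders can still pair
// them; one that mostly emits code in program order may not.
float shadeB(float a, float b, float c, float d, int& counter) {
    float x = a * b;  // FP32 mul
    counter += 1;     // unrelated integer work in between
    float y = c * d;  // FP32 mul
    return x + y;
}
```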
 
Indeed. After all, it's taking a lower-res render that will be missing visual detail and using a pre-trained model to "guesstimate" what that missing detail should look like. It seems logical that the BEST it could do, if DLSS were 100% perfect, is look as good as native. As much as Nvidia may want it to be the case... if it looks any "sharper" or "better" than native, it means DLSS is applying excess sharpening compared to the game engine's intent, or adding extra details the game engine is not actually putting into the scene.
Yes, I have been trying to explain this to people: the claim that DLSS looks better than native is probably down to the person mistaking an over-sharpened image for a better-looking one.
 
This sounds like a shader compiler problem to me. CPUs tend to have aggressive instruction reordering built into them (usually... the Atoms didn't, and I'm not sure the Atom-based Intel "E-cores" do either) to make sure the CPU is running things as quickly as possible. GPUs don't. I suspect Starfield has shaders that issue FP32 instructions back-to-back, while some other games (designed assuming you don't have dual-issue FP32) issue some FP32 work, go do something else (on the assumption that a second set of FP32 work would have stalled, so they do something else instead), then issue the second set of FP32 work. I'm assuming Nvidia's shader compiler reorders the instructions to get back-to-back FP32 instructions while AMD's currently doesn't.

The good news there: most likely you'll get an AMD driver update with an improved shader compiler... you'll have to wait around while the game in question precompiles its shaders again... at which point you get the benefit of dual-issue FP32 in those games where right now you aren't, without the games having to change their shaders at all.
There are two main shader pipelines in RDNA 3 -- vector (handles almost all of the math ops for float and integer values) and scalar (for all of the logic stuff) -- but dual issue only works for one specific vector ALU instruction format: VOPD. It offers a fairly decent range of instructions (MUL, ADD, SUB, MOV, MAX, MIN, etc.), but there are some tight restrictions placed on the use of VOPD -- for example, the paired instructions must be completely independent, and the first instruction in the pair supports 5 fewer opcodes than the second. It's a challenge for the compiler to get this right every time, without additional support/interference.

For Ampere/Ada Lovelace, there are three primary shader pipelines in the SM: ALU, FMA and FMAHeavy. The first does all of the bit manipulation, logic processing, and INT32 stuff bar MUL instructions, the second handles FP32 & FP16 MUL/ADD, and the last one does FP32 MUL/ADD and INT32 MUL. So the compiler has an easier job of it, as it can almost always 'dual issue'.
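
To make that pairing rule concrete, here's a conceptual check a compiler might apply before fusing two vector ops into one VOPD pair. This is a hypothetical model, not real ISA or driver code:

```cpp
// Hypothetical model of the VOPD constraint described above: the two ops
// must be fully independent, and each opcode must be legal in its slot
// (the first slot accepts a slightly smaller opcode list than the second).
struct VecOp {
    int opcode;          // e.g. MUL, ADD, SUB, MOV, MAX, MIN ...
    int dst, src0, src1; // register numbers
    bool allowedInSlotX; // opcode permitted in the first VOPD slot
    bool allowedInSlotY; // opcode permitted in the second VOPD slot
};

bool canPairAsVOPD(const VecOp& x, const VecOp& y) {
    const bool independent =
        y.src0 != x.dst && y.src1 != x.dst &&  // y must not read x's result
        x.src0 != y.dst && x.src1 != y.dst &&  // and vice versa
        x.dst != y.dst;                        // no shared destination
    return independent && x.allowedInSlotX && y.allowedInSlotY;
}
```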
 
There are two main shader pipelines in RDNA 3 -- vector (handles almost all of the math ops for float and integer values) and scalar (for all of the logic stuff) -- but dual issue only works for one specific vector ALU instruction format: VOPD. It offers a fairly decent range of instructions (MUL, ADD, SUB, MOV, MAX, MIN, etc.), but there are some tight restrictions placed on the use of VOPD -- for example, the paired instructions must be completely independent, and the first instruction in the pair supports 5 fewer opcodes than the second. It's a challenge for the compiler to get this right every time, without additional support/interference.

For Ampere/Ada Lovelace, there are three primary shader pipelines in the SM: ALU, FMA and FMAHeavy. The first does all of the bit manipulation, logic processing, and INT32 stuff bar MUL instructions, the second handles FP32 & FP16 MUL/ADD, and the last one does FP32 MUL/ADD and INT32 MUL. So the compiler has an easier job of it, as it can almost always 'dual issue'.
So, a design obstacle on AMD's part.
Thanks for explaining -- you never disappoint.
 
So, a design obstacle on AMD's part.
Not a design obstacle -- just how it all works. In all three versions of RDNA, threads are issued in groups called waves, and each SIMD unit (or rather, SIMT unit) can work on waves that involve 32 or 64 data points.

In the case of the latter, this requires the instruction from the shader that the wave will process to be issued twice -- once for the first 32 points and again for the second lot of 32. For the most part, compute and vertex/mesh/hull shaders are compiled as wave32 and pixel shaders are generally wave64.

In RDNA and RDNA 2, this would take at least 2 cycles because each SIMD unit could only work on 32 points at a time. With RDNA 3, the width has doubled, so it's all single cycle -- a form of dual issuing to get this done (it's the same instruction, though, so not 'true' dual issuing). VOPD instructions have strict limitations because the SIMD units aren't really two completely separate 32-wide ALUs -- it's a single structure with 64 32-bit ALUs.

The Compute Units already comprise two SIMDs, and in RDNA 3, they have large multi-banked and cached register files (192 kB) to facilitate the use of VOPD, but mostly just to improve shader occupancy. Removing the limitations on VOPD would essentially make the CUs 4-wide and every other resource would have to be significantly altered to account for this.

As it is, VOPD and the associated hardware changes are a nice extra and add very little to the overall transistor count, but the payback is the reliance on the compiler being able to identify when VOPD can be used. But this also means that a decent software engineer or two can hand-tweak a few shaders to avoid all of this and reap the benefit.

One could argue that Nvidia's system is 'better' but the payback there is that the chips are far larger -- even more so because Nvidia uses extra SMs to make up for the fact that the smaller register files (64 kB) limit things somewhat and shader occupancy tends to be a lot lower than in AMD cards.
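
A toy model of the wave64 point, just to make the cycle counts concrete (this ignores VOPD and everything else -- it's only the issue-width arithmetic):

```cpp
#include <cstdio>

// Cycles needed to push one wave's lanes through a SIMD of a given width.
int cyclesToIssue(int waveSize, int simdWidth) {
    return (waveSize + simdWidth - 1) / simdWidth; // ceil(waveSize / simdWidth)
}

int main() {
    std::printf("wave64 on a 32-wide SIMD (RDNA/RDNA 2): %d cycles\n",
                cyclesToIssue(64, 32));  // 2 cycles
    std::printf("wave64 on a 64-wide datapath (RDNA 3):  %d cycle\n",
                cyclesToIssue(64, 64));  // 1 cycle
    return 0;
}
```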
 
One could argue that Nvidia's system is 'better' but the payback there is that the chips are far larger -- even more so because Nvidia uses extra SMs to make up for the fact that the smaller register files (64 kB) limit things somewhat and shader occupancy tends to be a lot lower than in AMD cards.
Still, the die space they saved didn't allow them to make a 4090-equivalent card, because of the power draw that Navi 31 already requires.

[Image: power-gaming.png -- gaming power draw comparison]


Maybe doing it the same way as Nvidia would indeed require more die space, but at the same time the performance advantage over the RX 6000 series would probably double.
The earliest we'll see AMD's method getting its specific dual-issue optimization on a larger scale might be RDNA3-based (or later) consoles.
 
Still, the die space they saved didn't allow them to make a 4090-equivalent card, because of the power draw that Navi 31 already requires.
Don't forget that all of the Ada chips are on a custom N4 node -- there are obviously no details as to what this entails, but TSMC's standard N4P process is 22% more power efficient than N5 (along with a small boost to performance and overall transistor density). Add in the fact that N6 is even less efficient, plus the power cost of the GCD-MCD fabric, and it's no surprise that the Navi 31 chip is pretty hefty on power consumption.

AMD needed to make its Gaming division more cost-effective, and the only way to do that is by improving yields and operating margins. The former is certainly better than with Navi 2x, and while it's not possible to tell exactly how much better Navi 3x does on the latter, the fact that the division has had a consistently positive operating margin for two years shows that something is working, at least.

The bulk of that sector's revenue comes from selling APUs to Microsoft and Sony, and the margins on those chips are likely to be very small. The division's mean margin is 16% (since inception) so AMD is getting a decent return on its GPUs.

Maybe doing it the same way as Nvidia would indeed require more die space, but at the same time the performance advantage over the RX 6000 series would probably double.
It's just not AMD's way of doing things. Every update from RDNA through to RDNA 3 has involved relatively minor changes in the architecture -- the fundamental graphics core hasn't changed all that much.

One thing to perhaps note is that RDNA doesn't seem to scale as well as Nvidia's design in terms of unit count. Navi 10 had 40 CUs, and that first card, on release, performed around the same as a 2060 Super/2070 -- cards with 34 and 36 SMs respectively.

The Navi 31 has 96 CUs -- 2.4 times more than the Navi 10. Scale those Turing chips by the same level and one has a GPU with 82 to 86 SMs, in the region of 40 fewer than that in the 4090 (even more so for the full AD102). Using the same level of scaling from a 2060 Super to a 4090 on the Navi 31 would give it 150 or so CUs.

Of course, Ampere and Ada chips have two FP pipelines per SM partition, so the gain in processing is even greater. To make Navi in the same way would push the die size far too big and, to try and bring this back around to the topic of the article, this is also why AMD's gaming GPUs don't have large, complex ASICs for tensor calculations or BVH traversal acceleration.

In many ways, AMD is fortunate that Nvidia chose to uplift GPUs into higher SKU tiers than in previous generations and then charge a small fortune for them. They were always going to be more expensive due to using a custom version of the most advanced node available on the market, but if the 4080 had used a 100 SM AD102, the 4070 the AD103, and so on, AMD's RDNA 3 product lineup wouldn't have looked half as good as it does.
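
Spelling that scaling comparison out in numbers (the SM/CU counts are public specs; treating Navi 10 as roughly equal to a 2060 Super/2070 is the launch-performance observation above, not a formal metric):

```cpp
#include <cstdio>

int main() {
    const double navi10_cus = 40.0, navi31_cus = 96.0;             // RDNA / RDNA 3
    const double sm_2060s = 34.0, sm_2070 = 36.0, sm_4090 = 128.0; // Turing / Ada

    const double amd_scale = navi31_cus / navi10_cus;              // 2.4x
    std::printf("Turing scaled by AMD's 2.4x: %.0f-%.0f SMs (the 4090 has %.0f)\n",
                sm_2060s * amd_scale, sm_2070 * amd_scale, sm_4090);

    const double nv_scale = sm_4090 / sm_2060s;                    // ~3.8x
    std::printf("Navi 10 scaled by Nvidia's 2060S->4090 factor: ~%.0f CUs\n",
                navi10_cus * nv_scale);                            // ~150 CUs
    return 0;
}
```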
 
This dropped a lot with the end of (profitable) mining on GPUs. Furthermore, leaving aside the 8-10% that are workstation GPUs, some 60-70% of the remaining volume is midrange, plus a lot of entry-level GPUs.

To anyone thinking that Nvidia has brought something beneficial to the industry since the launch of Turing and the beginning of the RT and magical upscaling saga: look at the AAA games from that time (pre-Turing) and their requirements, then compare them to today's games. Ask yourself this question: has there been a reasonable graphical evolution between what is presented on the screen and the 4-5x higher requirements?
It did drop in 2022, for a few reasons. First, we no longer had Covid keeping people at home; second, mining ended; and third, prices were ridiculous. That said, it still doesn't mean that discrete GPUs are a "niche".

I would say that there has been a gradual evolution of graphics quality in AAA games over many years. Whether game developers take full advantage of GPU features is on them, not Nvidia.
 
I would say that there has been a gradual evolution of graphics quality in AAA games over many years. Whether game developers take full advantage of GPU features is on them, not Nvidia.
Exactly -- a card is a tool for doing work; the hardware is not to blame for whatever someone does with it.
 
AMD also has integrated GPUs that are more powerful than some Nvidia discrete ones. For market share numbers, one RTX 4090 adds as much market share as a GT 710 does.

In other words, the more powerful the integrated GPUs AMD makes, the less demand there is for weak discrete GPUs. To be more precise, if someone wants a GPU for non-gaming, non-heavy loads, a Ryzen 7000-series iGPU (NOT a discrete one) is good enough from AMD. However, if someone wants an Nvidia GPU, it MUST be discrete, since Nvidia does not sell iGPUs.

And again, in sales figures any discrete GPU is one unit sold, no matter if it's an RTX 4090 or a 40-dollar trash one.

That's why looking at discrete share by units sold alone is plain stupid.
In the context of this discussion, we're not talking about non-gaming or non-heavy workloads. Why would you ever want to do frame gen on an app that doesn't require it?

CPU vendors get the benefit of iGPUs from a market share perspective, but that doesn't mean people (gamers) are actually using the integrated GPUs for anything other than emergency purposes, like when your main GPU fails. When it comes to gaming and frame gen, I don't think anyone is thinking about current iGPUs.

So, back to my original point. Discrete GPUs are not "niche", not by a long shot.
 
In the context of this discussion, we're not talking about non-gaming or non-heavy workloads. Why would you ever want to do frame gen on an app that doesn't require it?

CPU vendors get the benefit of iGPUs from a market share perspective, but that doesn't mean people (gamers) are actually using the integrated GPUs for anything other than emergency purposes, like when your main GPU fails. When it comes to gaming and frame gen, I don't think anyone is thinking about current iGPUs.

So, back to my original point. Discrete GPUs are not "niche", not by a long shot.
Some were talking about market share as if only discrete mattered. But again, nowhere near all discrete cards can do frame gen, so looking at discrete share alone tells very little. AMD of course gets more total share just because all Ryzen 7000-series CPUs have a GPU, but since Nvidia does not sell iGPUs, most of the trash discrete cards are Nvidia ones.

And again, some iGPUs are much faster than low-end trash Nvidia cards. So debating discrete market share without knowing which cards can actually do frame gen is pretty much pointless.
 
Some were talking about market share as if only discrete mattered. But again, nowhere near all discrete cards can do frame gen, so looking at discrete share alone tells very little. AMD of course gets more total share just because all Ryzen 7000-series CPUs have a GPU, but since Nvidia does not sell iGPUs, most of the trash discrete cards are Nvidia ones.

And again, some iGPUs are much faster than low-end trash Nvidia cards. So debating discrete market share without knowing which cards can actually do frame gen is pretty much pointless.
AMD has 1/3 the market share compared to Intel, so if you're going on iGPU, Intel wins. But, of course, integrated GPUs do not matter in this discussion. Even if you exclude discrete GPUs that can't do frame gen, Nvidia still wins.

And, once again, the point is that discrete GPUs are not niche.
 
AMD has 1/3 the market share compared to Intel, so if you're going on iGPU, Intel wins. But, of course, integrated GPUs do not matter in this discussion. Even if you exclude discrete GPUs that can't do frame gen, Nvidia still wins.

And, once again, the point is that discrete GPUs are not niche.
And what about consoles, which have iGPUs? They are expected to support FSR3 frame gen.

Point being, saying Nvidia this and Nvidia that just because of discrete market share is mostly trash talk.
Except they aren't; you should fact-check the things you're discussing.
The cheapest GT 1030 beats every iGPU, and the GTX 1630 beats the APUs by the same ~30%.
[Chart: igp-relative-performance-1920-1080.png -- iGPU relative performance at 1920x1080]
Oh, so an APU is not an iGPU now, or what? :confused:

The GT 1030 is more expensive (and faster) than the GT 710. It barely qualifies as low-end trash, and still, like I said, there are iGPUs that are faster.

The GTX 1630 costs around $150 and still barely beats an iGPU. No wonder TechSpot gave it 20/100.

Thanks for proving me right.
 