AMD Navi vs. Nvidia Turing: An Architecture Comparison

"Unlike the ALUs, they won't be programmable by the end user; instead, the hardware vendor will ensure this process is managed entirely by the GPU and its drivers."

Um, actually, both AMD and Nvidia do provide software for programming their GPUs directly - partly in the hope of enticing vendor lock-in, and partly for people doing GPU computing on a specific system. In general, though, so that games will work regardless of which brand of video card you have, game programmers will indeed use the DirectX, OpenGL, or Vulkan drivers provided by the GPU maker - but that isn't because some encryption feature is locking them out.
If you take the full paragraph from the section you've quoted, it becomes a little clearer as to what I was talking about:
Now, these logic units are going to need something to organize them, by decoding and issuing instructions to keep them busy, and this will be in the form of at least one dedicated group of logic units. Unlike the ALUs, they won't be programmable by the end user; instead, the hardware vendor will ensure this process is managed entirely by the GPU and its drivers.
One can configure the warps handled by schedulers, but not the schedulers themselves. In the case of Turing, for example, the way that SMs handle a block of threads is the same every time: the threads are partitioned into warps and each warp is handled by the scheduler for dispatch. One has control over the dimensions of the thread blocks, but not the warp sizes; one has control over the instruction behaviour of the threads, but not how the dispatch units handle warp divergence and so on.
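
To make that concrete, here's a minimal host-side sketch in C++ (not actual GPU code - just the bookkeeping, with made-up block dimensions and a made-up thread index): the programmer picks the block dimensions, while the 32-thread warp size is a fixed hardware constant, so the partitioning into warps follows automatically.

```cpp
#include <cstdio>

// Fixed by the hardware (32 on Turing); not something software can change.
constexpr int kWarpSize = 32;

int main() {
    // Hypothetical block dimensions a programmer might choose.
    const int blockDimX = 16, blockDimY = 16, blockDimZ = 1;
    const int threadsPerBlock = blockDimX * blockDimY * blockDimZ;   // 256

    // The SM linearises the threads and carves them into warps of 32;
    // a partially filled warp still occupies a whole warp slot.
    const int warpsNeeded = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    std::printf("Block of %d threads -> %d warps\n", threadsPerBlock, warpsNeeded);

    // Which warp and lane a given thread lands in is likewise fixed:
    // linear ID = x + y*blockDimX + z*blockDimX*blockDimY, x varying fastest.
    const int x = 8, y = 12, z = 0;                                   // an arbitrary thread
    const int linear = x + y * blockDimX + z * blockDimX * blockDimY; // 200
    std::printf("Thread (%d,%d,%d) -> lane %d of warp %d\n",
                x, y, z, linear % kWarpSize, linear / kWarpSize);
    return 0;
}
```

Change the block dimensions and the warp count changes with them; the warp size and the way threads map to warps never do.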
 
...
The idea that such complex technology is available to people, but all they know about it is the brand, and what it cost to get it. No passion at all imo.
...

If I may continue your off-topic tangent for a moment ...

Yes ... no passion - good way of putting it ...

I see the issue as caused by "alignment": a lazy-butt replacement for knowledge-seeking and critical thinking.

People tend to "align" themselves to (or against) a brand, an ideology, or say, a political party, etc.

However, that "alignment" for some becomes so strong that any information that may appear to threaten their "justification" for that alignment is immediately denied without thought or logic. It becomes a program that runs automatically, with no brain needed.

This feeds into the nature of human laziness -- alignments don't take any energy, don't require any thought, reasoning, or critical thinking -- all they need is some injected emotion here and there. Research, education, learning, truth-finding, sense-making and objectivity all take work, and let's face it, we live in the age of lazy phone zombies with a 5-second attention span.

We all have alignments to some extent - this is normal and to be expected, but it's the strength of that alignment that determines the level of blindness and "laziness" inflicted on an individual.

My problem is that I give people too much benefit of the doubt and find myself disappointed with them outright, although I do try to keep things light when interacting in a way that points it out (I've ticked off a few people on this forum already in a short amount of time without even trying, just by pointing out objective facts).

People with strong alignments tend to get all bent out of shape when you force facts or critical considerations on them that they perceive as threatening to the strength of their alignments. They don't consciously do this - it's a program that runs automatically ...

Anyway, back to topic ....
 
They are not called Asynchronous Compute Engines; they are called Shader Arrays.
Additionally, those aren't 2 CUs; it's one Dual Compute Unit with partly shared resources.

RDNA Whitepaper https://www.amd.com/system/files/documents/rdna-whitepaper.pdf, page 7, quote:
"The two shader engines house all the programmable compute resources and some of the dedicated graphics hardware. Each of the two shader engines include two shader arrays, which comprise of the new dual compute units, a shared graphics L1 cache, a primitive unit, a rasterizer, and four render backends (RBs)."

"Site 12" Cite:
When using the more efficient wave32 wavefronts, the new SIMDs boosts IPC and cuts latency by 4X. cite-end

Do we know if wave32 wavefronts are used automatically in released and older games?
It could be that Navi runs in wave64 mode in every game if not everything is tweaked right!
 
They are not called Asynchronous Compute Engines; they are called Shader Arrays.
In some earlier documents, AMD specifically used the term ACE for the aforementioned workgroup clusters (i.e. the shader arrays); however, in the document you linked to, the ACEs are explicitly stated to manage the compute shaders, something that wasn't overly clear in the earlier documents. If I get a chance, I'll edit the article accordingly.

Additionally, those aren't 2 CUs; it's one Dual Compute Unit with partly shared resources.
AMD themselves class them as being 2 CUs - they count them individually in the product specifications:


You can see that AMD state 40 CUs for the RX 5700 XT. And also see the following presentation:


The same 40 CU count is noted on slide 5. Lastly, on page 21 of the RDNA Whitepaper, the table also states 40.

I suspect AMD calls them 'dual compute units' because the structure contains two independent sets of schedulers, 2x SIMD32 units, and TMUs that operate independently of one another. Interestingly, earlier documents, including the above presentation, used Workgroup Processors in preference to Dual Compute Units - the former term appears nowhere in the whitepaper, which suggests AMD have moved to streamline the nomenclature.

"Site 12" Cite:
When using the more efficient wave32 wavefronts, the new SIMDs boosts IPC and cuts latency by 4X. cite-end

Do we know if wave32 wavefronts are used automatically in released and older games?
It could be that Navi runs in wave64 mode in every game if not everything is tweaked right!
The choice of wavefront size is determined by the shader compiler within the drivers, so the choice was never really part of the game code itself. That said, how the shader code is written, especially in terms of instruction parallelisation, will have an impact on how the code is compiled, so any games designed with GCN in mind would have to be tweaked in order to gain the maximum advantage of RDNA's architectural improvements over GCN.

In the presentation linked above, AMD state that wave32 is usually for vertex and compute shaders, whereas pixel shaders are usually wave64, so it will be a mixed bag anyway.
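
Just to illustrate that last point (and this is purely illustrative - the real heuristics live inside AMD's shader compiler and are far more involved than a per-stage default), the stage-based rule of thumb above boils down to something like:

```cpp
#include <cstdio>

// Purely illustrative: a toy version of the stage-based defaults AMD describe
// (wave32 for vertex/compute shaders, wave64 for pixel shaders). The real
// choice is made inside the driver's shader compiler and depends on much more.
enum class ShaderStage { Vertex, Pixel, Compute };

int defaultWaveSize(ShaderStage stage) {
    switch (stage) {
        case ShaderStage::Pixel:   return 64;  // pixel shaders usually wave64
        case ShaderStage::Vertex:
        case ShaderStage::Compute: return 32;  // vertex/compute usually wave32
    }
    return 64;  // fall back to the GCN-style wavefront size
}

int main() {
    std::printf("vertex: wave%d, pixel: wave%d, compute: wave%d\n",
                defaultWaveSize(ShaderStage::Vertex),
                defaultWaveSize(ShaderStage::Pixel),
                defaultWaveSize(ShaderStage::Compute));
    return 0;
}
```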
 
Because pointless fortune-telling, and responding to pointless fortune-telling, is the absolute best some people can muster. I used to think the majority of people participating in tech forums were intelligent ... that opinion is changing rather rapidly ...

If you think you can judge someone's intelligence based on one forum post, you are mistaken. There's mostly nothing particularly intelligent about talking tech, but it's fun to some people.
Also: it wasn't warranted at all to start denigrating people. A bit toxic, actually.
 
It's worth mentioning that Pascal does decode VP9 up to 8K60 in hardware via DXVA2 (without pixel-shader processing, unlike AMD), even if it would not be able to display it over its DisplayPort or HDMI outputs.
 
Thank you Nick,
the wave32 vs wave64 differentiation in those use cases was very enlightening for me.

I hadn't read the older naming; I only knew them as shader arrays.
The little compute scheduler parts were called ACE and HWS for so long that I didn't believe AMD could mix things up like that, sorry.

Also, I didn't want to be harsh in any way, as if it were forbidden to sum up the CUs of all the WGPs or Dual Compute Units. I just wanted to say that it's not 2 separate CUs anymore when describing the architecture.

Nick, I appreciate your effort putting those articles together,
kind regards
 
Thank you for the feedback and kind words, @deathtrap :)

On the point of the compute units and wavefronts, it's worth comparing GCN to Navi more closely:

[image: 2019-08-03-image.png]


The first thing to consider is the SIMD logic units - GCN has four groups of 16 ALUs per CU, whereas Navi has two groups of 32 ALUs. So in both cases, wave64 uses the whole CU, but GCN only has one scheduler for all four SIMD16 groups, whereas Navi has one scheduler for each SIMD32 group.

This results in each GCN SIMD16 group being able to issue an instruction only once every 4 cycles, compared to Navi's 1 instruction per cycle (because each SIMD32 group has its own scheduler). This means a wave64 wavefront needs far more cycles to work through on GCN than on Navi (which is why the likes of the RX 5700 XT performs at least as well as, and often better than, the Vega 64, despite having fewer ALUs).
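
To put rough numbers on that, here's a tiny back-of-the-envelope calculation in C++ using only the figures above (SIMD width, and how often each SIMD can be issued an instruction). It's just a sketch to make the issue rates concrete, not a model of the real pipelines - the struct and function names are made up for illustration:

```cpp
#include <cstdio>

struct SimdConfig {
    const char* name;
    int laneWidth;      // ALU lanes in one SIMD group
    int issueInterval;  // cycles between instructions issued to that SIMD
};

int cyclesPerInstruction(const SimdConfig& c, int waveSize) {
    // Cycles to push all the wavefront's work items through the SIMD lanes...
    int executeCycles = (waveSize + c.laneWidth - 1) / c.laneWidth;
    // ...but the SIMD also can't be fed faster than its scheduler allows.
    return executeCycles > c.issueInterval ? executeCycles : c.issueInterval;
}

int main() {
    const SimdConfig gcn  {"GCN SIMD16 (one scheduler per 4 SIMDs)", 16, 4};
    const SimdConfig navi {"Navi SIMD32 (own scheduler)",            32, 1};

    // GCN only ever runs wave64; Navi can use either wave32 or wave64.
    std::printf("%s, wave64: %d cycles per instruction\n", gcn.name,  cyclesPerInstruction(gcn, 64));
    std::printf("%s, wave64: %d cycles per instruction\n", navi.name, cyclesPerInstruction(navi, 64));
    std::printf("%s, wave32: %d cycles per instruction\n", navi.name, cyclesPerInstruction(navi, 32));
    return 0;
}
```

Running it gives 4 cycles per instruction for a GCN wave64, versus 2 for a Navi wave64 and 1 for a Navi wave32 - which lines up with the "cuts latency by 4X" line quoted from the whitepaper.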
 
The 2 occurrences of "Asynchronous Compute Engine" should be "Shader Array".
I don't think there is such a thing as "waves of 16 threads".
See comments #16 and #54 - older AMD documents on RDNA, on which this article was based, labelled the entire shader arrays as ACEs, although this has now changed in the latest documents. There's also no longer any reference to wave16, so it's possible that either it was a mistake in their original material, or support for it was deemed unnecessary (wave size is determined by the shader compiler in the drivers, so it would be easy enough to enable/disable).
 