Watch AMD's 'Where Gaming Begins' Zen 3/Ryzen 5000 announcement right here at 9 AM PT

midian182

Posts: 6,072   +50
Staff member
Highly anticipated: After what's felt like a never-ending wait, AMD is finally unveiling its Zen 3 architecture and the next-generation Ryzen 'Vermeer' desktop CPUs later today. You can watch the "Where Gaming Begins" livestream right here, starting at 9 AM PT / 11 AM CT / 12 PM ET / 5 PM BST.

Follow-up coverage: AMD Ryzen 5000 launch: "Fastest gaming CPU", higher clocks, higher prices

After months of assuming the new Ryzen desktop CPUs would use the 4000-series moniker, AMD has confirmed recent reports that they will use the 5000-series name, thereby avoiding any confusion with the Zen 2-based Ryzen 4000 mobile processors.

Intel has dominated the CPU industry for years. Many will remember when AMD was considered the inferior choice, especially when it came to gaming, but that started to change with the launch of its Zen architecture—Ryzen processors now account for over 25 percent of CPUs among Steam users, and their many cores combined with excellent price vs. performance results have seen AMD eat away at Intel's market share.

While Intel still leads when it comes to gaming performance, a recent leak suggests there'll be a Ryzen 7 5800X that could outperform team blue's flagship Core i9-10900K. Considering the name of AMD's event, and CEO Lisa Su's declaration that "it's going to be an exciting fall for gamers," the company appears to be aiming at a specific group.

We're expecting a number of upgrades in Zen 3, the most significant being the increase in instructions per clock (IPC). There's also a clock speed boost that could reach 4.9 GHz, along with numerous other improvements that come from the new architecture. The chips are built on TSMC's N7P or N7+ process, feature 32+ MB of unified cache, and use a Multi-Chip Module (MCM) design.


 

Nobina

Posts: 2,663   +2,292
Didn't really wanna wait for new Zens so I got me a Ryzen 5 3600 a few months ago. If I wanted to wait I'd have to wait at least a couple of months more.
 

Maxiking

Posts: 131   +150
Big Navi matches the 3080 at 4K in Gears 5. Compare TechSpot's 3080 result (72 FPS at 4K Ultra) with the 6000 series' 73 FPS average shown on the stream.

Looking good for Big Navi, since Gears of War 5 does not seem to favor AMD.

https://www.techspot.com/review/2099-geforce-rtx-3080/
It is barely faster than the 2080 Ti in Borderlands 3.

Also, the RTX 3080 review from this site means nothing: they used a 3950X, a huge bottleneck at 1080p and 1440p. You got played.
 

Evernessince

Posts: 5,413   +5,998
It is barely faster than the 2080 Ti in Borderlands 3.

Also, the RTX 3080 review from this site means nothing: they used a 3950X, a huge bottleneck at 1080p and 1440p. You got played.
TechPowerUp has the 3080 only 20% ahead of the 2080 Ti at 1440p, and even less at 1080p, and they tested with a 10900K.

It's not a CPU architecture bottleneck; Nvidia's Ampere doubled theoretical FP32 performance without also scaling up other parts of the pipeline (like the ROPs).

If AMD's performance is consistent across the board, Nvidia will be in trouble.
 

neeyik

Posts: 1,331   +1,416
Staff member
It's not a CPU architecture bottleneck; Nvidia's Ampere doubled theoretical FP32 performance without also scaling up other parts of the pipeline (like the ROPs).
Ampere does have some additional changes over Turing other than just more FP32 units. The ROPs are now embedded in the GPCs, rather than tied to the L2 cache, which is why the 3080 has 96 to the 2080 Ti's 88. The L1 data/texture cache and shared memory are larger (in graphics and async compute mode, up to 64 kB can be allocated by the SM for data, 48 kB for shared, and 16 kB is reserved; Turing allowed 64 and 32, or 32 and 64, respectively), and the L1 bandwidth has doubled too.

The processing blocks in an Ampere SM maintain the separate INT/FP datapaths that Turing introduced, so it's not like it's worse in the new design; the additional float units simply provide more flexibility. Where a Turing SM was fixed to doing 64 FP32 and 64 INT32 ops per cycle, Ampere's offers the option of doing 128 float ops per cycle if required.

Nvidia claimed at the Turing launch that, in the games it profiled, the typical distribution of INT to FP operations performed by the SMs had a rough ratio of 9:25. In other words, there are roughly 2.8 times more FP ops than INT ones in the shaders of a modern AAA title. Based on that figure, it makes sense to have both data paths support float calculations.
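
For anyone who wants to check that arithmetic, here is a minimal Python sketch. The only input is the 9:25 ratio quoted above; the function names are made up for illustration and don't come from any Nvidia material.

```python
from fractions import Fraction

# Ratio quoted in the post: 9 INT ops for every 25 FP ops.
INT_OPS = 9
FP_OPS = 25


def fp_per_int(int_ops: int, fp_ops: int) -> Fraction:
    """How many FP ops are issued for each INT op."""
    return Fraction(fp_ops, int_ops)


def int_per_100_fp(int_ops: int, fp_ops: int) -> Fraction:
    """The same ratio expressed as INT ops per 100 FP ops."""
    return Fraction(100 * int_ops, fp_ops)


def fp_share(int_ops: int, fp_ops: int) -> float:
    """Fraction of all INT + FP ops that are floating point."""
    return fp_ops / (int_ops + fp_ops)


if __name__ == "__main__":
    print(f"FP ops per INT op:      {float(fp_per_int(INT_OPS, FP_OPS)):.2f}")
    print(f"INT ops per 100 FP ops: {int_per_100_fp(INT_OPS, FP_OPS)}")
    print(f"FP share of the mix:    {fp_share(INT_OPS, FP_OPS):.1%}")
```

So 9:25 is the same thing as 36 INT ops per 100 FP ops, with floats making up a bit under three quarters of the combined mix.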

And given that, per SM, Ampere doesn't have more data paths than Turing, the underlying architecture doesn't require significantly more changes to support the extra FP32 units, as the data sizes are still the same.
 

Evernessince

Posts: 5,413   +5,998
Oh, I was not saying that's the only thing Nvidia did; I was just pointing out what I thought was the likely issue.

There was only a 27.27% increase in the number of ROPs compared to a doubling of FP32 capable processing cores. If you averaged 1080p, 1440p, and 4K performance you'd get close to this number in performance gain.

Certainly the performance gains are not in line with the theoretical FP32 performance. Assuming Nvidia's FP:Int ratio is correct, these games should be able to immediately use that extra FP performance yet we are only seeing a fraction of that in practice. To me it appears Ampere is laying the groundwork for Nvidia's next gen GPUs.
 

neeyik

Posts: 1,331   +1,416
Staff member
Certainly the performance gains are not in line with the theoretical FP32 performance. Assuming Nvidia's FP:Int ratio is correct, these games should be able to immediately use that extra FP performance yet we are only seeing a fraction of that in practice. To me it appears Ampere is laying the groundwork for Nvidia's next gen GPUs.
Well, at 1080p, the likes of an RTX 3080 absolutely isn't shader bound, nor ROP, TMU, or anything else GPU related. At the opposite end of the spectrum, i.e. 4K res or higher, the shaders aren't any more complex; there's just a greater number of them. Each processing block still only contains a single warp scheduler and dispatch unit, and warps are still 32 threads in size. Nvidia's SMs are SIMT (single instruction, multiple thread), so each block handles the same instruction at any one instant (they can take overlapping instructions, but these get delayed until the block is 'free').

So if a warp involved a stack of float operations, and nothing else, then the Ampere chip would indeed process them twice as fast as Turing (each processing block can do 32 FP ops per cycle, compared to Turing's 16). But since the warps will be a mixture of instructions, only a certain portion of each thread is going to be processed faster on Ampere. Once that part is done, the thread sits idle until the required units and/or data become available.
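
To put rough numbers on that, here is a toy throughput model in Python. It is only a sketch under big assumptions: it treats one processing block as two 16-lane datapaths (FP32 plus INT32 on Turing, FP32 plus FP32-or-INT32 on Ampere), assumes both paths can always be kept busy, and ignores scheduling, latency, and memory entirely. The function names are hypothetical, not anything from Nvidia's documentation.

```python
LANES_PER_PATH = 16  # assumed lanes per datapath in one processing block


def _check(fp_ops: float, int_ops: float) -> None:
    if fp_ops < 0 or int_ops < 0 or fp_ops + int_ops == 0:
        raise ValueError("need a non-negative, non-empty instruction mix")


def turing_cycles(fp_ops: float, int_ops: float) -> float:
    """Ideal cycles when FP and INT each have their own dedicated path."""
    _check(fp_ops, int_ops)
    return max(fp_ops, int_ops) / LANES_PER_PATH


def ampere_cycles(fp_ops: float, int_ops: float) -> float:
    """Ideal cycles when the second path can run either FP or INT.

    INT work is still limited to the one shared path, hence the max().
    """
    _check(fp_ops, int_ops)
    both_paths = (fp_ops + int_ops) / (2 * LANES_PER_PATH)
    int_only_path = int_ops / LANES_PER_PATH
    return max(both_paths, int_only_path)


def speedup(fp_ops: float, int_ops: float) -> float:
    """Idealised Ampere-over-Turing speedup for a given op mix."""
    return turing_cycles(fp_ops, int_ops) / ampere_cycles(fp_ops, int_ops)


if __name__ == "__main__":
    print(f"pure FP32:        {speedup(100, 0):.2f}x")
    print(f"9:25 INT:FP mix:  {speedup(25, 9):.2f}x")
    print(f"1:1 INT:FP mix:   {speedup(1, 1):.2f}x")
```

Even in this best case, the 2x figure only shows up for pure float work. With a 9:25 INT-to-FP mix the ceiling is already below 1.5x, before any of the stalls described above are taken into account.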

The ROP processing rate on the 3080 is a fraction under 21% faster than it is on the 2080 Ti, whereas the texture processing rate is just under 11% faster. At higher resolutions, these become increasingly important, simply because there are more pixels to blend, sample data for, and so on. The 3080 has the same number of SMs as the 2080 Ti, so it's not able to handle more warps at any one time, and the warps themselves are no different at 4K compared to 1080p.
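
Those two percentages can be reproduced with a short Python sketch. The ROP counts (96 and 88) are the ones mentioned earlier in the thread; the TMU count (272 on both cards) and the reference boost clocks (1710 MHz and 1545 MHz) are assumptions taken from the public spec sheets, not from this thread, so treat the result as a sanity check rather than a measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    rops: int       # render output units
    tmus: int       # texture mapping units (assumed from public specs)
    boost_mhz: int  # reference boost clock (assumed from public specs)

    def pixel_rate_gpix(self) -> float:
        """Peak pixel fill rate in gigapixels per second."""
        return self.rops * self.boost_mhz / 1000

    def texel_rate_gtex(self) -> float:
        """Peak texture fill rate in gigatexels per second."""
        return self.tmus * self.boost_mhz / 1000


RTX_2080_TI = Gpu("RTX 2080 Ti", rops=88, tmus=272, boost_mhz=1545)
RTX_3080 = Gpu("RTX 3080", rops=96, tmus=272, boost_mhz=1710)


def gain_percent(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1) * 100


if __name__ == "__main__":
    rop_gain = gain_percent(RTX_3080.pixel_rate_gpix(),
                            RTX_2080_TI.pixel_rate_gpix())
    tex_gain = gain_percent(RTX_3080.texel_rate_gtex(),
                            RTX_2080_TI.texel_rate_gtex())
    print(f"Pixel rate gain:   {rop_gain:.1f}%")
    print(f"Texture rate gain: {tex_gain:.1f}%")
```

With the same TMU count on both cards, the texture rate gain comes purely from the clock difference, while the pixel rate gain combines the extra ROPs with the higher clock.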

If a game's performance were entirely bound to the GPU's number of float ops per second, then Ampere would clearly stand out. But since games aren't like this, the performance is weighted towards factors that are more pixel-bound, such as ROP and TMU rates. The GA102 is designed to be a chip for multiple markets and usage scenarios - check out Puget Systems' review of the 3080 and 3090:


Specifically look at the V-Ray results to see the benefit that the extra FP32 cycles are bringing (the 3080 is 75% faster than the 2080 Ti).
 

Evernessince

Posts: 5,413   +5,998
Excellent explanation. Thank you.