AMD explains how the new 3D V-Cache improves over the original

mongeese

In context: AMD launched the Ryzen 9 7950X3D at the end of last month and welcomed an enthusiastic response to its second-gen 3D V-Cache, despite some mixed opinions about its usefulness in a 16-core CPU. Now the company has shared some of the technical details that explain its performance.

AMD started mixing nodes in 2019 when it used the 7 nm node for the core complex die (CCD) and the 12 nm node for the IO die of the Zen 2 microarchitecture. AMD recently confirmed to Tom's Hardware that Zen 4 steps it up to three nodes: the 5 nm node for the CCD, the 6 nm node for the IO die, and the 7 nm node for the V-Cache.

AMD explained some of the challenges it faced stacking one node onto another during its recent ISSCC presentation. Both the 7950X3D and the original 5800X3D have their V-Caches positioned over their regular L3 caches to allow them to be connected. The arrangement also keeps the V-Cache away from the heat produced by the cores. However, while the V-Cache fits neatly over the L3 cache in the 5800X3D, it overlaps with the L2 caches on the edges of the cores in the 7950X3D.

Also read: AMD Ryzen 9 7950X3D Memory Scaling Benchmark

Part of the problem was that AMD doubled the amount of L2 cache in each core from 0.5 MB in Zen 3 to 1 MB in Zen 4. But it worked around the additional space constraints by punching holes through the L2 caches for the through-silicon vias (TSVs) that deliver power to the V-Cache. The signal TSVs still come from the controller in the center of the CCD, but AMD tweaked them as well, reducing their footprint by 50%.

AMD shrank the V-Cache down from 41 mm² to 36 mm² but maintained the same 4.7 billion transistors. TSMC fabricates the cache on a new version of the 7 nm node that it developed especially for SRAM. As a result, the V-Cache has 32% more transistors per square millimeter than the CCD despite the CCD being manufactured on the much smaller 5 nm node.
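As a quick sanity check on those figures (a back-of-the-envelope Python sketch using only the numbers quoted above; the implied CCD density is derived from them, not an AMD-published value):

vcache_transistors = 4.7e9          # transistors on the V-Cache die
vcache_area_mm2 = 36                # die area in mm^2
vcache_density = vcache_transistors / vcache_area_mm2 / 1e6
print(f"V-Cache density: {vcache_density:.0f} MTr/mm^2")              # ~131 million transistors per mm^2

# If that is 32% denser than the CCD, the CCD's density works out to roughly:
print(f"Implied CCD density: {vcache_density / 1.32:.0f} MTr/mm^2")   # ~99 MTr/mm^2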

All of the refinements and workarounds AMD implemented add up to a 25% increase in bandwidth to 2.5 TB/s and an unspecified increase in efficiency. Not bad for nine months between the first and second generations of a supplemental chiplet. Hopefully it shows its value when the Ryzen 7 7800X3D arrives in a month's time.
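Working backwards from the article's own numbers (a small Python sketch; the first-gen figure is implied here, not stated):

second_gen_bw_tbs = 2.5                        # TB/s, quoted for the 7950X3D's V-Cache
first_gen_bw_tbs = second_gen_bw_tbs / 1.25    # a 25% increase implies the original figure
print(f"Implied first-gen V-Cache bandwidth: {first_gen_bw_tbs:.1f} TB/s")   # 2.0 TB/s, in line with the ~2 TB/s commonly cited for the 5800X3D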


 
"AMD shrunk the V-Cache down from 41 mm2 to 36 mm2 but maintained the same 4.7 B transistors. TSMC fabricates the cache on a new version of the 7 nm node that it developed especially for SRAM. As a result, the V-Cache has 32% more transistors per square millimeter than the CCD despite the CCD being manufactured on the much smaller 5 nm node."

TSMC is always surpassing itself and showing why there's no competition that can match it.

 
I'm really looking forward to the 7800X3D. I've seen simulated benchmarks where they disable the non-V-Cache CCD, effectively turning the 7950X3D into a 7800X3D, and it increased performance numbers by no small margin.

The 7950X is still a great gaming CPU, so if you need more cores I still think that would be the way to go over the 7950X3D. I've read that the 7950X3D and 7900X3D are basically products pushed by AMD's marketing department, although those are just rumors from "insiders". Still, the simulated benchmarks I've seen of the 7800X3D would explain the month's delay between those higher-core-count parts and the 7800X3D.

With all of the scheduling issues in the multi-CCD X3D parts and the hack-job solutions such as Microsoft's Xbox Game Bar, I see those higher-end parts as a solution in search of a problem. I know I'm in the minority here, but these parts don't seem like they would perform very well in Linux due to the scheduling and the workarounds it requires.

Even if they put 3D V-Cache on both dies there would be a latency issue between the CCDs, similar to the Zen 1 Threadripper issues. I don't know how they solved this in later generations or with EPYC CPUs, so maybe it's less of an issue than I'm making it out to be.
 
"AMD shrunk the V-Cache down from 41 mm2 to 36 mm2 but maintained the same 4.7 B transistors. TSMC fabricates the cache on a new version of the 7 nm node that it developed especially for SRAM. As a result, the V-Cache has 32% more transistors per square millimeter than the CCD despite the CCD being manufactured on the much smaller 5 nm node."

TSMC is always surpassing itself and showing why there's no competition that can match it.
Not so fast. Intel gave us high-k metal gate, FinFETs, and could beat TSMC to RibbonFETs.
 
Not so fast. Intel gave us high-k metal gate, FinFETs, and could beat TSMC to RibbonFETs.
True, but they were stuck on their "14nm" process for several generations, although it's not that simple. Intel didn't really get competitive again until AMD forced Intel's hand by using TSMC to make its chips.

Intel has made enormous contributions to chip design, but they can only stand on their past accomplishments for so long. They all but stagnated until basically the 12000 series.
 
True, but they were stuck on their "14nm" process for several generations, although it's not that simple. Intel didn't really get competitive again until AMD forced Intel's hand by using TSMC to make its chips.

Intel has made enormous contributions to chip design, but they can only stand on their past accomplishments for so long. They all but stagnated until basically the 12000 series.

Intel wasn't competitive? They were ahead until Ryzen 5000, when AMD got a small gaming lead, then went ahead again with 12th gen, which was also a lot cheaper at release. 13th gen was ahead of Ryzen 7000 until these 7000X3D CPUs. Basically we're getting competition, and whichever released last has been the better choice recently.

Wouldn't surprise me if 14th gen mostly pulls ahead again. For the most part, none of these recent CPU gens will bottleneck gaming; 3D V-Cache can give huge gains when it helps, but for many games there is no difference, or it's even worse, because clock speed and the ring bus tend to be better for Intel, as well as L2 cache.

For productivity, Intel is pretty far ahead with those E-cores. Since my use case is gaming, I'd rather just have the 3D V-Cache, so the 7800X3D is going to be my choice; then I'll swap that out for whatever the last V-Cache CPU on the platform is years from now. Otherwise I'm content to sit on CPUs for several years at a time; I'm not going to swap in a new CPU every year or two and have no desire to do so.
 
Intel wasn't competitive? They were ahead until Ryzen 5000, when AMD got a small gaming lead, then went ahead again with 12th gen, which was also a lot cheaper at release. 13th gen was ahead of Ryzen 7000 until these 7000X3D CPUs. Basically we're getting competition, and whichever released last has been the better choice recently.

Wouldn't surprise me if 14th gen mostly pulls ahead again. For the most part, none of these recent CPU gens will bottleneck gaming; 3D V-Cache can give huge gains when it helps, but for many games there is no difference, or it's even worse, because clock speed and the ring bus tend to be better for Intel, as well as L2 cache.

For productivity, Intel is pretty far ahead with those E-cores. Since my use case is gaming, I'd rather just have the 3D V-Cache, so the 7800X3D is going to be my choice; then I'll swap that out for whatever the last V-Cache CPU on the platform is years from now. Otherwise I'm content to sit on CPUs for several years at a time; I'm not going to swap in a new CPU every year or two and have no desire to do so.
In my opinion, Intel came out ahead only by means of brute force, i.e. pushing clock speed, and we all know the side effects of doing so. As for the E-cores, I would say it is a genius and at the same time a cheapo approach. Genius because they are using the E-cores to make up for the lack of threads, which is why they spam so many of them, up to 16, just to try and close the multi-threading performance gap. So essentially they are using them to mask the lack of multithreaded performance, not really for any "efficiency" purpose. Single-thread wise, the performance won't be lacking because the P-cores are powerful and pushed very hard to achieve high clock speeds. But at the end of the day, people buying an i7 or i9 are paying a lot for cheap E-cores and half the number of cutting-edge cores. As an Alder Lake user, the increase in E-cores and power consumption doesn't make Raptor Lake look attractive.
 
Intel wasn't competitive? They were ahead until Ryzen 5000, when AMD got a small gaming lead, then went ahead again with 12th gen, which was also a lot cheaper at release. 13th gen was ahead of Ryzen 7000 until these 7000X3D CPUs. Basically we're getting competition, and whichever released last has been the better choice recently.

Wouldn't surprise me if 14th gen mostly pulls ahead again. For the most part, none of these recent CPU gens will bottleneck gaming; 3D V-Cache can give huge gains when it helps, but for many games there is no difference, or it's even worse, because clock speed and the ring bus tend to be better for Intel, as well as L2 cache.

For productivity, Intel is pretty far ahead with those E-cores. Since my use case is gaming, I'd rather just have the 3D V-Cache, so the 7800X3D is going to be my choice; then I'll swap that out for whatever the last V-Cache CPU on the platform is years from now. Otherwise I'm content to sit on CPUs for several years at a time; I'm not going to swap in a new CPU every year or two and have no desire to do so.
I would not be surprised at all if the 14th series pulls ahead, either. The thing was that Intel was only getting a few percentage points of performance increase per generation until AMD started putting pressure on them. AMD narrowed the gap fairly quickly, and Intel actually had to go for more than ~5% increases between generations. People forget that Intel did very little between the 3000 and 9000 series. The difference between the 3000 and 9000 series is almost the same as between the 11th and 13th series.

For a very long time we weren't seeing much movement in the CPU space. AMD had the dumpster fire that was Bulldozer. The last interesting CPU to come out before AMD's 3000 series was probably the 4770K; that was the last CPU I actually found interesting before Zen. Does anyone even really remember the CPUs between the 4000 and 9000 series?
 
For a very long time we weren't seeing much movement in the CPU space. AMD had the dumpster fire that was Bulldozer. The last interesting CPU to come out before AMD's 3000 series was probably the 4770K; that was the last CPU I actually found interesting before Zen. Does anyone even really remember the CPUs between the 4000 and 9000 series?

Yes, the 6700K was quite interesting, not so much by itself but for the longevity of the Skylake derivatives, though the arch itself was a masterpiece of productivity, efficiency and power. Especially the Skylake IMC, which remains the best DDR4 controller yet, in the form of the Comet Lake IMC that runs it at Gear 1. Actually, without "gears" at all.
The 2600K was the last outstanding CPU; the 8700K, aka "finally, Intel", was nice. The 10900K was Skylake's swansong: as noted, one of the best IMCs, and 10 monolithic cores for the desktop.
On the other hand... the 2700K, aka "refresh"; the 3770K, a "meh" 22nm refresh with crappy thermal paste under the IHS; Haswell ("Hasfail") with an IVR that was supposed to be marvelous and turned out to be a disaster; the dire 7700K, which remains the worst quad-core per dollar of all time; the 9900K, aka "Brown Dwarf"; the 11900K, aka... I dunno, Haswell already took the fail award :D
 
Yes, the 6700K was quite interesting, not so much by itself but for the longevity of the Skylake derivatives, though the arch itself was a masterpiece of productivity, efficiency and power. Especially the Skylake IMC, which remains the best DDR4 controller yet, in the form of the Comet Lake IMC that runs it at Gear 1. Actually, without "gears" at all.
The 2600K was the last outstanding CPU; the 8700K, aka "finally, Intel", was nice. The 10900K was Skylake's swansong: as noted, one of the best IMCs, and 10 monolithic cores for the desktop.
On the other hand... the 2700K, aka "refresh"; the 3770K, a "meh" 22nm refresh with crappy thermal paste under the IHS; Haswell ("Hasfail") with an IVR that was supposed to be marvelous and turned out to be a disaster; the dire 7700K, which remains the worst quad-core per dollar of all time; the 9900K, aka "Brown Dwarf"; the 11900K, aka... I dunno, Haswell already took the fail award :D
The 6700K was partially interesting, but Intel is capable of so much more when they put their minds to it. When one of the main talking points is that the 7000 series was so bad it made the 6000 series notable, was it really that good? It was certainly better than anything AMD had to offer at the time, but if you were on a 2000 series, was the difference great enough to upgrade to the 6000 series? I can see how some people might justify it, but I went from a 3770K to an 1800X. It was just a boring time for CPUs. The 8000 series was a decent line of chips, but then we got stuck with the 9000 series. The 10th series was actually better than most people give it credit for, but Intel quickly dropped back a node due to yield problems to release the 11th series.

Thankfully we have the 12th and 13th series, which are both fantastic lines of products in both price and performance. I am sticking with AMD for reasons outside of performance alone, mainly because I'm a Linux user, and the new X3D chips with 2 CCDs have scheduling problems. I'd much rather have a 7800X3D than a 7950X3D, but if I needed that many threads I'd go with the 7950X in a heartbeat, not the 3D variant.
 

The increased L3 cache in Intel/AMD processors is just a crutch, like the SLC cache in an SSD. Outside it, the read/write speed immediately drops by an order of magnitude. That is no real solution.

They have two options:
1. Place 8-10 GB of 512-1024-bit HBM memory on the SoC package as dedicated VRAM.
But in this case, the whole system, taking into account the transition to PCIe 5.0, still turns out to be a bottleneck for a bunch of high-performance devices in terms of memory-controller speed.

2. A 512+ bit memory controller like Apple's in the M2 Max, with 400 GB/s of throughput versus a shameful 80 GB/s for the top Raptor Lake i9.
In this case, a minimum of 8 SO-DIMM channels would be required, or the channel width of a single SO-DIMM would have to be increased to 256 bits or more.

All this greatly complicates the motherboards of laptops and PCs, and it is especially difficult to do well and reliably with socketed memory.

And serializing the interface will not help here, as it did in the transition from IDE to SATA or from DVI to DP. Memory has another problem: a gradual increase in latency. Look at what happens even with the L3 cache in modern processors - it's a shame! L3 cache latency has almost doubled in terms of access time! Caches and memory are getting faster only in linear read/write/copy operations, not in random ones, where they already lose significantly to old DDR3 memory and to the L1-L3 caches of old processors.

As a result, we only increase the linear speed of memory, but not random access. It's getting worse and worse...

That's why I keep writing that silicon chips have no future. The entire current industry is already at a complete standstill. It is necessary to switch to photonics and other schemes for implementing RAM that make it possible to reduce latency several times over and, at the same time, increase linear speeds by an order of magnitude. Apple did this, but at the cost of soldering the memory down.

If x86 comes to this, it will be a problem for system memory upgrades, for when more is needed...
 
The increased L3 cache in Intel/AMD processors is just a crutch, like the SLC cache in an SSD. Outside it, the read/write speed immediately drops by an order of magnitude. That is no real solution.

They have two options:
1. Place 8-10 GB of 512-1024-bit HBM memory on the SoC package as dedicated VRAM.
But in this case, the whole system, taking into account the transition to PCIe 5.0, still turns out to be a bottleneck for a bunch of high-performance devices in terms of memory-controller speed.

2. A 512+ bit memory controller like Apple's in the M2 Max, with 400 GB/s of throughput versus a shameful 80 GB/s for the top Raptor Lake i9.
In this case, a minimum of 8 SO-DIMM channels would be required, or the channel width of a single SO-DIMM would have to be increased to 256 bits or more.

All this greatly complicates the motherboards of laptops and PCs, and it is especially difficult to do well and reliably with socketed memory.

And serializing the interface will not help here, as it did in the transition from IDE to SATA or from DVI to DP. Memory has another problem: a gradual increase in latency. Look at what happens even with the L3 cache in modern processors - it's a shame! L3 cache latency has almost doubled in terms of access time! Caches and memory are getting faster only in linear read/write/copy operations, not in random ones, where they already lose significantly to old DDR3 memory and to the L1-L3 caches of old processors.

As a result, we only increase the linear speed of memory, but not random access. It's getting worse and worse...

That's why I keep writing that silicon chips have no future. The entire current industry is already at a complete standstill. It is necessary to switch to photonics and other schemes for implementing RAM that make it possible to reduce latency several times over and, at the same time, increase linear speeds by an order of magnitude. Apple did this, but at the cost of soldering the memory down.

If x86 comes to this, it will be a problem for system memory upgrades, for when more is needed...
So I would like to pick your brain for a moment. I see photonics as the future, but there is still a lot of room for improvement in the silicon environment. AMD has created something that is in excess of 2 TB/s; where do you think we can immediately go from here? It is my opinion that we are at least a decade away from photonics. Once we get to that point we'll have RAM faster than L3 cache.
 
Expert systems based on neural networks require huge computing resources, huge memory bandwidth and a huge amount of RAM, because compared to the level of abstraction of the human brain, modern concepts of neural networks, and even more so their implementations, are like walking to the moon. The level of abstraction of the human brain allows it to pack a huge amount of information from the real world into compact biological structures. But even the human brain is an extremely flawed machine in terms of data integrity and long-term reliable storage.

"AI" is now being heavily peddled to the public, but it is not even close to that and will not be in the near future, especially with mass, personal access.

The x86 industry, which was so embarrassed by the success of Apple, now needs to at least catch up with Apple in terms of the linear speed of RAM. After all, the impasse with slow memory on x86 is already obvious to all experts. All these desperate attempts to increase cache are useless.
 
silicon chips have no future.
Give me a gallium/indium (give or take an element or two) arsenide chip fab that can place the molecules one by one (molecule-level construction) and we'll see. No more issues with defects and impurities, plus advanced AI to outperform chip designers in the way NASA engineers are being shown up with designs for mechanical devices (strength to weight and such).

That should provide quite the boost for traditional chip performance.

You say AI is not even close, and yet the results NASA is getting disagree.
 
True, but they were stuck on their "14nm" process for several generations, although it's not that simple. Intel didn't really get competitive again until AMD forced Intel's hand by using TSMC to make its chips.

Intel has made enormous contributions to chip design, but they can only stand on their past accomplishments for so long. They all but stagnated until basically the 12000 series.
Intel's past bosses thought buying expensive ASML EUV machines wasn't a good business decision.
 
The increased L3 cache in Intel/AMD processors is just a crutch, like the SLC cache in an SSD. Outside it, the read/write speed immediately drops by an order of magnitude. That is no real solution.

They have two options:
1. Place 8-10 GB of 512-1024-bit HBM memory on the SoC package as dedicated VRAM.
But in this case, the whole system, taking into account the transition to PCIe 5.0, still turns out to be a bottleneck for a bunch of high-performance devices in terms of memory-controller speed.

2. A 512+ bit memory controller like Apple's in the M2 Max, with 400 GB/s of throughput versus a shameful 80 GB/s for the top Raptor Lake i9.
In this case, a minimum of 8 SO-DIMM channels would be required, or the channel width of a single SO-DIMM would have to be increased to 256 bits or more.
Neither solution is an ideal replacement for what's currently used in desktop systems, though. Any permanently soldered system of modules wouldn't offer the user any means of upgrading the RAM capacity -- it works for Apple, simply because it's Apple.

The wider bus of the M2 Max is simply because Apple used more memory controllers in its design, specifically 8 x 64-bit LPDDR5. Desktop CPUs make do with two, whereas workstation/server models such as AMD's Epyc have eight. So more bandwidth is available, but again, how many everyday users are going to buy 8 DIMMs just to get that kind of bandwidth?

Of course, this is where HBM comes in, as the modules (and associated memory controllers) are much wider. Samsung's HBM3 Icebolt, currently its fastest stacked DRAM, is cumulatively 1024 bits wide and runs at 6.4 Gbps, for just under 820 GB/s per module. Sounds great (and it is!) but the need to integrate a controller die into the module and the use of an additional interposer for the packaging only adds to the cost. HBM, regardless of its form, is way more expensive than non-stacked DRAM.
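Those bandwidth figures can be sanity-checked with a little arithmetic (a hedged Python sketch; the 6.4 Gbps LPDDR5 rate for the M2 Max and the 5.2 Gbps DDR5 rate for the desktop example are assumptions, not numbers from this thread):

def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    # Peak bandwidth in GB/s = bus width (bits) x per-pin data rate (Gbps) / 8
    return bus_width_bits * data_rate_gbps / 8

print(peak_bandwidth_gbs(8 * 64, 6.4))   # Apple M2 Max, 8 x 64-bit LPDDR5: ~409.6 GB/s (the ~400 GB/s quoted)
print(peak_bandwidth_gbs(2 * 64, 5.2))   # desktop dual-channel DDR5-5200: ~83.2 GB/s (the ~80 GB/s quoted)
print(peak_bandwidth_gbs(1024, 6.4))     # Samsung HBM3 Icebolt stack: ~819.2 GB/s (the ~820 GB/s above)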

L3 cache latency has almost doubled in terms of access time!
Doubled since when, though? It's pretty bad on Intel's E-cores in Alder/Raptor Lake (60 to 65 cycles) but one has to go all the way back to the likes of Prescott to see L3 latencies half that value (and even then, only in linear transactions). Besides, AMD's K10 L3 latency was 40 cycles on average, whereas Zen 3 is around 50, so it's not like it's universally doubled across every CPU micro-architecture.
 
Doubled since when, though?
Tests show that Raptor Lake's L3 cache access time has almost doubled, to nearly 20 ns, instead of the 10 ns that was typical of older processors from 5-10 years ago.

All this adds up to increased overall system latency, since the overall latency of DDR5 is also two times higher than that of DDR3.

Linear speeds are growing, but the speed of random access in small blocks keeps falling. And it is neural networks that this hurts the most, with their myriads of independent nodes...

We need a new computing architecture. The von Neumann and Turing architectures are dead.
 
Tests show that Raptor Lake's L3 cache access time has almost doubled, to nearly 20 ns, instead of the 10 ns that was typical of older processors from 5-10 years ago.
Which particular tests? This one (Chips and Cheese) shows 64 cycles of L3 latency for the P-cores in an i7-12700K with a 4.7 GHz ring clock, whereas this one (ixbtlabs) shows 35 to 61 cycles for a 3.2 GHz P4. So that's 13.6 ns for the Alder Lake L3 and 10.9 ns to 19 ns for the Gallatin L3 -- that's not double at all.
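The conversion used there is simply cycles divided by clock frequency (a small Python sketch of that arithmetic):

def cycles_to_ns(cycles, freq_ghz):
    # latency in ns = cycles / frequency in GHz
    return cycles / freq_ghz

print(cycles_to_ns(64, 4.7))   # ~13.6 ns: Alder Lake P-core L3 at a 4.7 GHz ring clock
print(cycles_to_ns(35, 3.2))   # ~10.9 ns: low end of the Gallatin L3 range
print(cycles_to_ns(61, 3.2))   # ~19.1 ns: high end of the Gallatin L3 range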
 
AIDA64 Cache & Memory.
AIDA64's latency metric is somewhat volatile, though. For example:

12600K - 17.1 ns
11600K - 11.2 ns

12900K - 15.3 ns
13900K - 13.1 ns (engineering sample by the looks of it)

4770 - 16.2 ns

While that's a very random selection of results and each version of AIDA64 used is different, the cache test didn't significantly change (if at all) across those versions. So one has a 3.9 GHz 4770, with 8MB of L3 cache, with a latency 45% longer than a 4.9 GHz 12600K with 20MB of L3 -- yet it only has a 20% slower clock and 60% less cache.

I have no doubt that one can find a set of AIDA64 results that seemingly shows a Raptor Lake with a latency double that of an old Pentium from years ago, but if one goes too far back in time, and uses AIDA64 from that period, then the figures aren't comparable, as the test was changed in 2013, increasing the latency test results.

The test itself doesn't seem to be particularly sensitive to cache size or, at the very least, the cache controllers cope very well with what the test is demanding of them. For example:


5800X - 11.0 ns (32MB L3), 3.8 to 4.7 GHz
5800X3D - 13.4 ns (32+64MB L3), 3.4 to 4.5 GHz

That's a 22% increase in latency, yet there's 200% more L3 cache and the difference in clock speeds is only 4%.
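For reference, those percentages fall straight out of the figures above (a quick Python sketch):

latency_5800x, latency_5800x3d = 11.0, 13.4   # ns, AIDA64 results quoted above
l3_5800x, l3_5800x3d = 32, 32 + 64            # MB of L3
clock_5800x, clock_5800x3d = 4.7, 4.5         # GHz boost clocks

print(f"Latency increase: {(latency_5800x3d / latency_5800x - 1) * 100:.0f}%")   # ~22%
print(f"Extra L3 cache: {(l3_5800x3d / l3_5800x - 1) * 100:.0f}%")               # 200%
print(f"Clock difference: {(1 - clock_5800x3d / clock_5800x) * 100:.0f}%")       # ~4%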
 
There are no random results; that's not true. In the absence of background processes, AIDA64 shows consistent results every time.

My old i5 750 @ 3.8 GHz from 2009 gets 9.5-10 ns on the L3 cache. An i7 4700HQ gets ~12.9 ns. An i5 8300H gets ~11.7 ns. It keeps getting worse and worse with the newest generations. At 17-18 ns, we have a two-fold deterioration in L3 cache latency. Q.E.D.

It's the same with the DDR3 RAM on the i5 750: latency is only 51 ns. And on the new ones? Almost twice as bad (90-100 ns) is the typical case...
 
There are no random results; that's not true.
I said it was a random selection, i.e. I just pulled a handful of AIDA64 results off the web, with no selection criteria other than that they covered Intel CPUs and involved a large gap in architecture release year.

My old i5 750 @ 3.8 GHz from 2009 gets 9.5-10 ns on the L3 cache. An i7 4700HQ gets ~12.9 ns. An i5 8300H gets ~11.7 ns. It keeps getting worse and worse with the newest generations. At 17-18 ns, we have a two-fold deterioration in L3 cache latency. Q.E.D.
It's only demonstrated (Q.E.D.) if there's actual proof of the results, such as screenshots of the various AIDA64 results being stated.

It's the same with the DDR3 RAM on the i5 750: latency is only 51 ns. And on the new ones? Almost twice as bad (90-100 ns) is the typical case...
And yet, the CapFrameX screenshots showed 40 to 60 ns for an Alder Lake and Raptor Lake. It's entirely possible that AIDA64 just isn't a reliable enough micro-benchmark for cache and DRAM latency.

Anyway, this is taking the comments thread off-topic. If you wish to discuss this further, feel free to create a new thread in the CPU forum section.
 
I read an article that said an Apple II has much less latency than modern PC hardware, which is one of the reasons some scientific instruments still use them. Then add the sloth of the Windows operating system on top of the PC hardware. Look at the size of the driver just for the sound hardware on motherboards these days.

I would also suspect that AIDA, as it runs atop Windows on a platform with a complex motherboard, isn't as accurate as a benchmark that is more low-level, even if its results are fairly repeatable (which I'm not claiming, as I am no AIDA expert). CPUs also have their invisible piggybacked CPUs for 'security' which might have some influence in fine-grained benchmarks. NICs take CPU time, et cetera.

And, this is on-topic because without benchmarks to prove performance claims, how can we meaningfully discuss them?
 
It's only demonstrated (Q.E.D.) if there's actual proof of the results, such as screenshots of the various AIDA64 results being stated.
If you do not trust me, how can you trust a screenshot that I can easily change to my advantage?

I assume that you either trust my numbers or verify them against the many similar screenshots on the web.

The only way to convince you would be to post a video, shot on a smartphone, of these tests running on the platforms I indicated, but I'm too lazy to do that.

Trust people. Or check the results yourself against a variety of sources. These figures are real results taken by me literally yesterday on two of my laptops and an old PC.

It is not possible to insert a picture from a file on your forum; otherwise I could easily attach screenshots from my equipment.
And yet, the CapFrameX screenshots showed 40 to 60 ns for an Alder Lake and Raptor Lake. It's entirely possible that AIDA64 just isn't a reliable enough micro-benchmark for cache and DRAM latency.
AIDA64 results are stable over time and accurate. The spread comes from the disgusting memory tuning done by the manufacturers of different laptops and motherboards (something I write about all the time on many forums, and I even got some major testing sites to include these tests in their reviews). That is, it is the result of poorly chosen settings in the BIOS.
 