Explainer: L1 vs. L2 vs. L3 Cache

Dyper

I want to hear how these layers overcome latency and how AMD’s upcoming unified cache will make the 12 and 16 core designs as fast as the 8 core Matisse CPUs.
The 3700X is a great chip for audio and gaming, but it bogs down in audio apps at a much lower usage level than Intel chips do. AMD’s 12 and 16 core chips scale the same way: more chips, more latency. Unified cache will overcome this, but I would love to hear how.
Fascinating topic for an end user like me. Cheerz.
 

BoboOOZ

I've had an AMD K6-2, with the cache located on the motherboard, outside the processor. And then I had a K7 Duron with a much faster, but smaller, on-chip L2 cache. Both were great value in their time...
 

neeyik

Staff member
  • Thread Starter
I want to hear how these layers overcome latency and how AMD’s upcoming unified cache will make the 12 and 16 core designs as fast as the 8 core Matisse CPUs.
The key to overcoming latency with cache (or any computing structure, for that matter) is hiding it underneath other parallel processes. The average access-to-use time for the various levels is non-linear: in Zen 2, L1 is roughly 4 to 7 cycles, L2 is about 12, and L3 is around 40. L1 and L2 cache are inclusive and private to a particular core, i.e. the contents are available only to the logic units in that core. Only the L3 cache is accessible to other cores, but it's a victim cache, so it only stores data that's been evicted from L2.
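
To make the victim cache idea a little more concrete, here is a rough Python sketch. It is a toy model only, not how Zen 2 actually manages its lines: the `VictimHierarchy` class, the LRU replacement policy, the tiny capacities, and the choice to move a line out of L3 when it is hit are all assumptions made for illustration.

```python
from collections import OrderedDict


class VictimHierarchy:
    """Toy model: a private L2 backed by an L3 that is filled only by L2 evictions."""

    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()  # address -> None, ordered oldest to newest
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def access(self, addr):
        """Return which level served the request: 'L2', 'L3' or 'RAM'."""
        if addr in self.l2:
            self.l2.move_to_end(addr)  # mark as most recently used
            return "L2"
        if addr in self.l3:
            del self.l3[addr]          # assumption: the line moves back up to L2
            source = "L3"
        else:
            source = "RAM"             # fetched from memory straight into L2
        self._fill_l2(addr)
        return source

    def _fill_l2(self, addr):
        self.l2[addr] = None
        if len(self.l2) > self.l2_lines:
            victim, _ = self.l2.popitem(last=False)  # evict least recently used
            self.l3[victim] = None                   # the victim lands in L3
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)          # L3 is full, drop its oldest


h = VictimHierarchy(l2_lines=2, l3_lines=4)
trace = [h.access(a) for a in ("A", "B", "C", "A")]
print(trace)  # ['RAM', 'RAM', 'RAM', 'L3']
```

Line A never enters L3 on its first fetch. It only shows up there once C pushes it out of the two-line L2, which is why the second request for A is served from L3 rather than RAM.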

So let's say core #5 wants a particular piece of data - L1 cache is always checked first and if it's there, it gets used. If not, a check of L2 is done automatically, and if that misses too, the memory sub-system goes on to work through L3, RAM, and eventually the main storage. The thread that required the data will be stalled until the information is available, but if SMT is enabled, the core can still be used to process another thread.
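
As a minimal sketch of that lookup order, the Python below tries each level in turn and reports where the data was found and roughly what it cost. The `lookup` function and the `contents` mapping are made up for this example, the cycle counts are ballpark figures, and the RAM number in particular is just an assumption.

```python
# Ballpark load-to-use latencies in core clock cycles. The RAM figure is an
# assumption for illustration only; real values depend on clocks and memory.
LEVELS = [("L1", 4), ("L2", 12), ("L3", 40), ("RAM", 200)]


def lookup(addr, contents):
    """Walk the hierarchy in order and return (level, cycles) for the first hit.

    `contents` maps a level name to the set of addresses it currently holds.
    RAM is treated as always holding the data.
    """
    for name, cycles in LEVELS:
        if name == "RAM" or addr in contents.get(name, set()):
            return name, cycles
    raise AssertionError("unreachable: RAM always hits")


contents = {"L1": {0x10}, "L2": {0x10, 0x20}, "L3": {0x30}}
print(lookup(0x10, contents))  # ('L1', 4)
print(lookup(0x20, contents))  # ('L2', 12)
print(lookup(0x40, contents))  # ('RAM', 200)
```

The gap between the last two results is the stall that SMT tries to cover by running the other thread.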

The branch predictor and other cache in the core also help to mitigate this problem, by trying to determine what data is going to be required ahead of time, and holding branches in the cache until they're needed and/or the data is ready.

As for unified cache, technically this just means it's not separated in any way (e.g. L1 is a split cache, L2 is unified). L3 already is, by this definition, unified as there isn't a data/instruction/constants type of separation.

However, when we talk about Zen 3 and unified cache, this is referring to the fact that in Zen 2, cross-core access to L3 is confined to a CCX - so cores 0 to 3 in CCX 0 can directly access each other's L3, whereas cores 4 to 7 in CCX 1 can't reach it directly. In order for, say, core 6 to access the L3 in CCX 0, the request has to be managed (along with the data flow) by the I/O die.

The general expectation for Zen 3 is that either the two CCXs' L3 cache will be managed by a system within the CCD, or the chiplet will contain just a single 8 core CCX. Both methods reduce the overall latency, but the former will be simpler to implement than the latter.
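
A small Python sketch of the difference, assuming nothing more than "a core reaches its own CCX's L3 directly and any other CCX's L3 indirectly". The `ccx_of` and `l3_path` helpers are hypothetical names, and the 8 core case is only the speculated single-CCX layout, not a confirmed design.

```python
def ccx_of(core, cores_per_ccx):
    """Index of the CCX that a given core belongs to."""
    return core // cores_per_ccx


def l3_path(requesting_core, owning_core, cores_per_ccx):
    """Say whether one core can reach the L3 next to another core directly."""
    same_ccx = ccx_of(requesting_core, cores_per_ccx) == ccx_of(owning_core, cores_per_ccx)
    return "direct" if same_ccx else "indirect"


# Zen 2 style layout: two 4 core CCXs per chiplet
print(l3_path(2, 0, cores_per_ccx=4))  # direct
print(l3_path(6, 0, cores_per_ccx=4))  # indirect

# Speculated alternative: one 8 core CCX per chiplet
print(l3_path(6, 0, cores_per_ccx=8))  # direct
```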
 

corrosive23

Pentium 2 did NOT have an L2 cache on the motherboard. They were on the processor just not on the die. They were on the cartridge next to the die.


The chip on the side is the cache.

The original K6-2 had a 64 KB primary cache and a much larger amount of motherboard-mounted cache (usually 512 KB or 1024 KB but varying depending on the choice of the motherboard). The K6-III, with its 256 KB on-die secondary cache, re-purposed the variable-size external cache on the motherboard as the L3 cache. This scheme was termed "TriLevel Cache" by AMD. The L3 cache has a capacity of up to 2 MB.
 

Julio Franco

Staff member
Pentium 2 did NOT have an L2 cache on the motherboard. They were on the processor just not on the die. They were on the cartridge next to the die.
Correct, no one suggested L2 was on the motherboard. Pentium II and Celerons of that era came in a daughterboard.
 

Lew Zealand

TechSpot Elite
I have a 32K L2 cache card for a Mac IIci in a box around here somewhere. That was the first time I'd heard of the concept of a cache.

I also have a 1MB PCI Mac L2 cache card as a Christmas ornament from OWC. To think I'd spent $160 or so a half decade earlier to speed up my Mac by ~25% and by then they were trash to be added to another sale for free.

A clever way that CPU upgrade companies added G3 CPUs to older non-G3 Mac clones was to design a G3 CPU upgrade that slotted into the L2 cache slot. I have a 300 MHz one of those probably in the same box as the IIci cache card.
 

corrosive23

Correct, no one suggested L2 was on the motherboard. Pentium II and Celerons of that era came in a daughterboard.

Hell, the first 2 Celerons did not even have L2 cache. It was not until the 300A that there was L2, and then it was a quarter of the Pentium II's size (128 KB versus 512 KB) but at full speed.
 

Dyper

Nice history lessons. Any information on cache expected to show itself in Tiger Lake would be most appreciated too.
I still have a spare i7 5775C that runs at 3.3GHz and performs on par with my i7 4790k PCs.
It had the 128MB of L4 (victim cache) for the Iris GFX. By disabling the Graphics in the BIOS and using a GTX discrete graphics card, that 128MB was used on my audio apps and helped immensely.
This is why when I see Intel or AMD boosting their cache I tend to get a trifle excited.

Thanks
 

neeyik

Staff member
  • Thread Starter
Any information on cache expected to show itself in Tiger Lake would be most appreciated too.
Not much is out in the wild at the moment. This news report, if showing genuine details on Tiger Lake, suggests some notable changes:


Compared to Comet Lake, L1 instruction is the same size at 32 kB, but L1 data has increased by 50% to 48 kB. L2 is 1.25 MB, though - 5 times greater. L3 is 12 MB in total, for 4 cores, so that's gone up by 50% too.
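
For anyone who wants to check the arithmetic, here is a quick Python sketch. The Tiger Lake figures are the rumoured ones quoted above, converted to kB per core; the Comet Lake baseline (32 kB L1 data, 256 kB L2 and 2 MB of L3 per core) is my assumption for the comparison.

```python
# Per-core cache sizes in kB.
# Assumed Comet Lake baseline: 32 kB L1I, 32 kB L1D, 256 kB L2, 2 MB L3 per core.
comet_lake = {"L1I": 32, "L1D": 32, "L2": 256, "L3": 2048}
# Rumoured Tiger Lake: 1.25 MB L2 = 1280 kB, 12 MB L3 over 4 cores = 3072 kB each.
tiger_lake = {"L1I": 32, "L1D": 48, "L2": 1280, "L3": 12 * 1024 // 4}

ratios = {level: tiger_lake[level] / comet_lake[level] for level in comet_lake}

for level, ratio in ratios.items():
    print(f"{level}: {ratio:.2f}x")
# L1I: 1.00x
# L1D: 1.50x
# L2: 5.00x
# L3: 1.50x
```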

It had the 128MB of L4 (victim cache) for the Iris GFX. By disabling the Graphics in the BIOS and using a GTX discrete graphics card, that 128MB was used on my audio apps and helped immensely.
It was a shame that Intel dropped it, even though it added a fair bit to the package costs.
 

Dyper

It was a shame that Intel dropped it, even though it added a fair bit to the package costs.
Intel had a winner there but as you noted the profit margin wasn’t the model they sought.
Tiger Lake Desktop would really be nice if the cache stayed consistent.

Have a feeling Intel seeks revenge more than profit next year.
Being whooped by AMD, then chasing the leader dropping your profit margins the entire way has to be motivating.

Thanks for your replies, much appreciated.
 

candle_86

Real fast, you missed early computers. The 8088 and 286 did not have cache at all, as the CPU was not overly bottlenecked by DRAM at the time. Most if not all 386SX boards also lacked cache, but a lot of 386DX boards gave you the ability to socket between 32 KB and 128 KB of cache. The 486 introduced 8 KB of on-chip cache, with most boards supporting 64-256 KB, while later boards could support up to 1 MB of L2.
The reason for this is that memory speeds were stuck at 70-80ns and needed wait states to work, so cache was the solution.
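
A back-of-the-envelope Python sketch of why that mattered more as clocks rose. It simply divides the DRAM access time by the CPU cycle time and rounds up; the `cycles_for_access` helper is a made-up name, the clock speeds are example values I picked (4.77 MHz for an 8088, 33 MHz for a 486), and real bus timings were more involved than this.

```python
import math


def cycles_for_access(dram_ns, cpu_mhz):
    """Whole CPU clock cycles that elapse during one DRAM access."""
    cycle_ns = 1000 / cpu_mhz  # length of one clock cycle in nanoseconds
    return math.ceil(dram_ns / cycle_ns)


print(cycles_for_access(70, 4.77))  # 1
print(cycles_for_access(70, 33))    # 3
print(cycles_for_access(80, 33))    # 3
```

At 4.77 MHz a 70ns access fits inside a single clock cycle, while at 33 MHz the same memory takes around three cycles, which is the gap that wait states (and then cache) had to cover.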
 
Pentium 2 did NOT have an L2 cache on the motherboard. They were on the processor just not on the die. They were on the cartridge next to the die.


Quite right. I still have my PII system running to this day. The cache is indeed mounted to the right of the CPU (took the cover off to check).
 

Markoni35

A long time ago I thought that one day each RAM location would have the ability to do basic math calculations. Or at least, each block of RAM would have its own ALU. That would bring massive parallelism, incomparable to even 64-core CPUs, but it would also greatly increase the complexity, power consumption and price.
 

Dyper

Love the sharing of your knowledge. Just a lowly synthesist who knows the importance of cache for audio apps.
I loved the i7 5775C because it had 128MB of L4 victim cache.
It was designed to supplement the on die Iris GFX.

I put a discrete GFX card in it and was surprised to see the L4 was now dedicated to my audio apps. Indirectly I suppose, but I got the same performance as from my i7 4790k’s running 1GHz faster. If that cache had been meant for all instructions it would’ve been a monster.

Thanks Again For Chiming In.