How CPUs are Designed, Part 4: Where is Computer Architecture and Design Headed?

I remember Ivy Bridge breaking down walls when it launched with 3D transistors, as reported by TechSpot. That was a significant update and one of the bigger ones across the nine or ten generations of Intel's Core architecture.
 
An interesting discussion can be had about the industry and market requirements for the future of computing. With existing technology, industries can already do many great things, most of which companies are not yet using to their fullest capability. I don't see big jumps in tech as likely until there has been time to learn how to optimize and implement existing technology comfortably.
 
What a fantastic series. Beautifully researched and written start to finish. So many sources today are just waffle to fill web pages. It's nice to come across something so readable and sharp, yet still dense and well-structured over four installments.
 
This was a GREAT series. I think you should do the same for GPUs. I get that they work rather similarly, but I bet you can spin it so that we can better understand what all those numbers on our GPU's spec sheet mean.

GREAT JOB!!!
 
Dear Sirs:

I read your study of CPU technology with great interest. However, there is one point I do not understand. Maximum x86 CPU speed has gone from 3.8 GHz to ~5.5 GHz (+45%) over the past ~7 years. During the same period, maximum SDRAM speed has seemingly hit a wall at about 500 MHz (2 ns). Can you please explain why you believe this very similar semiconductor process technology (10 nm features, ion-implanted on 300 mm silicon wafers) cannot seem to improve any more, as CPU technology has? How likely do you feel it is that any maker (e.g. Micron, Samsung, SK Hynix, etc.) will be able to finally design or buy a significantly faster SDRAM transistor and mass-produce it, say in the next 2 to 3 years?

It is our experience that going beyond 4 cores / 8 threads gains little in throughput, only a lot of waste heat and marketing hyperbole. As you said, CPU vendors have been compensating for slow main-memory RAM with larger L3 caches. These are so big now (80 MB) that hit rates are up to ~96%, so adding more will do little good, just wasted heat and transistors. At the same time, the tag lookup structures must grow too, lengthening lookup times and slowing SRAM access down toward DRAM speeds. The result is a negative cost/benefit ratio for using any more SRAM!

Sincerely,
dennis k. b.
msee/cs cne
 
Memory access routines are still relatively slow. CAS latency in DDR4 SDRAM, for example, is still in the region of 12 to 20 ns, because the number of cycles required increases with clock rate. There's also the matter that raising the clock requires every part of the SDRAM module to be faster, DRAM capacitors still need to be refreshed regularly, and so on. Hence all SDRAM development has centred on keeping throughput as efficient as possible, through pipelining and the like, as this is cheaper to achieve for something that needs to be mass-manufactured than pushing for higher clock rates.
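As a quick illustration of why that latency figure has barely moved between generations, here is a small Python sketch; the module timings below are typical retail examples chosen for illustration, not figures from the article:

```python
# CAS latency in nanoseconds stays roughly flat across DDR generations,
# because the cycle count (CL) rises along with the clock rate.
# The module timings below are typical retail examples, not measured data.

modules = [
    # (name, transfer rate in MT/s, CAS latency in clock cycles)
    ("DDR3-1600 CL9",  1600,  9),
    ("DDR4-3200 CL16", 3200, 16),
    ("DDR5-5200 CL40", 5200, 40),
]

for name, mt_s, cl in modules:
    io_clock_mhz = mt_s / 2            # DDR transfers data twice per I/O clock
    cycle_time_ns = 1000 / io_clock_mhz
    print(f"{name}: CAS latency = {cl * cycle_time_ns:.1f} ns")
```

Run it and all three land in the same ~10 to 15 ns band, despite the headline transfer rates more than tripling.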

Edit: L3 caches are significantly larger than they once were because of the multicore approach to CPU design; per core, the amount hasn't changed a great deal for the past few years.
 
Hi Neeyik,

Yes, technically. However, that does not explain the huge discrepancy. With MOSFETs using virtually the same materials, fab equipment, process technology (masking, lithography, implanting, etching, wash, rinse, repeat) and design tools, the best any SDRAM vendor can do is ~500 MHz. They use folded bit lines to attain higher frequencies. Capacitors tied to Vcc/2, not ground, are faster; I'd bet they do this too. True, they use extremely tiny capacitors to save space and cost, but this does increase latency unless one can cut the parasitic capacitance of the bit lines. That only explains part of it; the rest has to be in the transistor itself. Meanwhile, CPU makers are able to get their passive components (capacitors, resistors, vias, et al.) and transistors (the passives are fundamentally alike; only the transistor is significantly different) to slew at 10 to 12 times the rate (up to ~1,100% faster), as if by some kind of newfound magic!

The claim regarding total cache per core is inaccurate. For example, total cache per core: 80486 8 KB, Pentium 32 KB, Pentium II 128 KB, Pentium III 512 KB, Pentium 4 1 MB, Core i5 1st-3rd gen 1 MB, i5 4th-6th gen 1.5 MB, i5 7th-9th gen 2.25 MB, and Ryzen 5 3600X 6 MB.

If AMD doubles the cache at 5 nm, the hit rate may go from 95% to ~97%, but the lookup time roughly doubles too, so the average throughput gain is halved, for a net gain of almost none (law of diminishing returns). If instead they go to 5 nm with the same caches and optimize most every other item as much as they can (likely a gain over 10%, based on past history), then they may turn the clock (CLK) up to ~5.5 GHz while staying within the same heat envelope. The net increase in the first case is at most 2%; for the second, >30%, as best approximations. (See the sketch below.)
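For what it is worth, the standard way to frame this trade-off is average memory access time (AMAT). A minimal Python sketch of the diminishing-returns argument above, using purely illustrative hit rates and latencies rather than measured values:

```python
# AMAT = hit_time + miss_rate * miss_penalty
# Illustrative numbers only: a doubled cache raises the hit rate slightly
# but also lengthens the lookup, which can erase (or reverse) the gain.

def amat(hit_time_ns, hit_rate, miss_penalty_ns):
    """Average memory access time in nanoseconds."""
    return hit_time_ns + (1.0 - hit_rate) * miss_penalty_ns

DRAM_PENALTY_NS = 70.0   # assumed cost of going to main memory

baseline = amat(hit_time_ns=10.0, hit_rate=0.95, miss_penalty_ns=DRAM_PENALTY_NS)
doubled  = amat(hit_time_ns=20.0, hit_rate=0.97, miss_penalty_ns=DRAM_PENALTY_NS)

print(f"baseline cache: {baseline:.1f} ns")   # 10 + 0.05 * 70 = 13.5 ns
print(f"doubled cache:  {doubled:.1f} ns")    # 20 + 0.03 * 70 = 22.1 ns
```

With these (assumed) numbers, the two extra points of hit rate don't come close to paying for the doubled lookup time.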

By cheaper, I must assume you meant the higher cost of switching substrates from SiO2 to GaAs? After ~40 years of research, scientists at MIT recently discovered that by plating an ordinary silicon wafer with a very thin layer of graphene and then flowing pure gallium and then pure arsenic over it, they can match or beat the cost of silicon dioxide; the question is how long before it is perfected and ready for mass production? With 5 nm production now certain and 3 nm nearly so, we will have much faster CPUs soon. As ever, I am just trying to see how SDRAMs can be improved, to catch up and hopefully keep up. HBM would be better than DDR5 (denser, faster, and less power), were it not for its persistently high price.

DDR5-5200 sounds impressive, except that 90% of that data is waste, so it is actually more like DDR-520. It was done out of desperation: they could not make QDR work after all, and they still cannot seem to get the charge pump to toggle the states any quicker. Even today the typical compiled x86 program only executes 8 to 10 instructions before it is forced to branch. Despite all Intel's contrary PR, x86 cores do not yet do branch prediction / out-of-order execution too well, and so they just dump more and more unused data into the bit bucket. They also turned up the word clock (WCK), only the latency (CL) goes up right along with it (so they must add a zillion T-states, i.e. >23), for a tiny net gain, unless hyperbole counts? (The arithmetic is sketched below.)
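The arithmetic behind the "DDR-520" quip, for what it is worth; note that the 10% useful-data figure is the commenter's assumption, not a measured utilisation number:

```python
# Back-of-the-envelope version of the "DDR5-5200 is really DDR-520" quip.
# The 10% useful-fraction figure is an assumption from the comment above,
# not a measured utilisation number.

data_rate_mt_s = 5200      # DDR5-5200 nominal transfer rate (MT/s)
useful_fraction = 0.10     # assumed share of fetched data actually used

print(f"effective rate: DDR-{data_rate_mt_s * useful_fraction:.0f}")  # DDR-520
```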

As CPUs, GPUs, PCIe, SSDs and USB all become so swift, SDRAM will be the growing chokepoint. Memory makers spend $25+ billion a year on new fabs and R&D; it seems as if they could afford to create a much more elegant solution? For perspective, Applied Materials estimates it will cost $10+ billion just to develop the first 450 mm prototype. You are quite right, it is all so complex and costly nowadays. They did do a great job of upping density and longevity, cutting power use and error rates, and lowering the cost per bit stored! JEDEC has hinted that their DDR5 specification may finally support QDR, by about Xmas.


Thank you, dennis
 
@denniskb

First of all, I am not a semiconductor fabrication expert; however, simple physics helps us understand the difference in speed between RAM and CPU cache.

1. The speed of light is fast, but it is no longer negligible, especially at the clock speeds in use now.
2. A signal in a solid conductor propagates *much* more slowly than light in a vacuum; roughly half the speed, for typical circuit-board traces.
3. The distance between a memory chip/stick and the CPU is ten to a hundred times further than the distance between an on-chip cache and its associated CPU core.
4. Even if we assume that the memory access itself is infinitely fast, the signal's trip down the individual circuit-board traces isn't, and it can take several clock cycles simply to get to the CPU (see the quick calculation after this list). To guarantee that all the bits arrive at the same time, length-matching delay lines are built onto the circuit board. (These are the odd-shaped "squiggly" traces.)
5. Memory manufacturers have to choose between sheer size and speed. (Dense memory is slower.) At current manufacturing densities, a relatively small amount of fast on-die memory is hugely faster than the gobs and gobs of dense memory in a typical RAM device. Even if a 16 GB RAM module were built on the processor die itself, it would still be substantially slower than the cache on the same die.
6. Memory devices use their own on-chip control logic to manage memory access and timing on the device. This helps make the device faster, but it is still another layer of complexity.

And so on.
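A quick calculation of the flight time mentioned in point 4; the trace length, signal speed, and clock rate are assumed typical values, not measurements of any specific board:

```python
# Signal flight time from a DIMM slot to the CPU socket, in CPU cycles.
# All three constants are assumed typical values, not board measurements.

TRACE_LENGTH_M   = 0.15    # ~15 cm of motherboard trace (assumed)
SIGNAL_SPEED_M_S = 1.5e8   # ~half the speed of light, typical for FR-4 boards
CPU_CLOCK_HZ     = 5.0e9   # 5 GHz core clock

one_way_ns = TRACE_LENGTH_M / SIGNAL_SPEED_M_S * 1e9
round_trip_cycles = 2 * (one_way_ns * 1e-9) * CPU_CLOCK_HZ

print(f"one-way flight time: {one_way_ns:.2f} ns")        # ~1 ns
print(f"round trip: {round_trip_cycles:.0f} CPU cycles")  # ~10 cycles
```

Even before the DRAM array does any work, the wire alone costs a handful of core cycles per round trip.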

Processor, motherboard, and memory manufacturers know all about this and have been beating their heads against the wall for decades trying to solve these problems, while manufacturing devices that mere mortals can afford.

It's not a simple task.

What say ye?

Jim "JR"
 