hwertz
Re: Itanium. When I was a student at the U of I, the engineering department got an Itanium-based Superdome to replace their HP Superdome (my understanding is this was to bolster Itanium sales; they got something like a $200,000 system for something like $1,000). They found the several-year-old PA-RISC-based Superdome was faster than the Itanium-based Superdome that was meant to replace it! Compiler improvements did help here to some extent (eventually the Itanium narrowly outran the previous system), but even if they'd doubled or tripled the speed, HP would have had to put out faster Itanium models at a fairly rapid pace to keep up with the fastest chips coming out from other vendors.
I will note, the one thing Itanium did REALLY well was knock out many of the competing CPUs -- the two fastest CPU lines on the planet at that point were PA-RISC and Alpha, and HP (which had inherited Alpha by way of Compaq and DEC) dropped development on both in favor of Itanium; SGI abandoned MIPS development in favor of Itanium; Sun flirted with Itanium (a planned Solaris-on-Itanium port, later cancelled) before sticking with SPARC, which Oracle kept developing after acquiring Sun -- though given the cost and bespokeness of Oracle systems I don't know much about those. Basically, the CPU lines that survived the fallout were IBM POWER (IBM was not interested in Itanium), ARM (which was not a factor in desktops or servers back then), MIPS (in embedded systems, not desktops or servers), and of course x86/x86-64 -- with AMD shipping the first x86-64 chips, which is why some Linux distros refer to the architecture as "amd64"; Intel had assumed its 64-bit chip would be Itanium.
As for the new multi-threading technique -- neat! And the nice thing is, it sounds like something that could be added to TensorFlow, PyTorch, and the like, where the user may not have to set up much at all: split jobs between the AI accelerator (if any), GPU, and CPU, rather than pushing EVERYTHING onto the GPU and leaving your sometimes quite powerful CPUs nearly idle -- or onto the AI accelerator, leaving both GPU and CPU largely idle, as the case may be.
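To make that concrete, here's a minimal sketch of the idea in PyTorch -- my own illustration, not the technique from the article. It statically splits a batch of matrix multiplications between GPU and CPU instead of pushing everything onto one device; the 80/20 split ratio and the `split_matmul` name are assumptions for illustration.

```python
# Hedged sketch: naive static split of a batch between GPU and CPU.
# NOT the article's technique -- just the "don't leave the CPU idle" idea.
import torch

def split_matmul(a: torch.Tensor, b: torch.Tensor, gpu_fraction: float = 0.8):
    """Batched matrix multiply, sending part of the batch to the GPU
    and the rest to the CPU, then gathering the results."""
    if not torch.cuda.is_available():
        return a @ b  # no GPU: run everything on the CPU

    n = a.shape[0]
    k = int(n * gpu_fraction)  # first k items go to the GPU (assumed ratio)

    # Launch the GPU chunk first; CUDA kernel launches are asynchronous,
    # so the CPU chunk below genuinely overlaps the GPU work.
    out_gpu = a[:k].cuda(non_blocking=True) @ b[:k].cuda(non_blocking=True)

    out_cpu = a[k:] @ b[k:]  # CPU does its share meanwhile
    # The .cpu() copy is what synchronizes with the GPU stream.
    return torch.cat([out_gpu.cpu(), out_cpu])

batch_a = torch.randn(100, 512, 512)
batch_b = torch.randn(100, 512, 512)
print(split_matmul(batch_a, batch_b).shape)  # torch.Size([100, 512, 512])
```

A real implementation would pick the split ratio from measured throughput of each device rather than hard-coding it, but even this naive version shows the shape of the scheduling problem.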
Having used a Tegra K1, I can say the GPU on there is about the speed of an Nvidia GTX 650 (I had a K1 and a GTX 650, and they were about dead even). And the quad-core ARM -- you know about how fast that is (not very fast, but not terrible, more or less). You might have trouble with Amdahl's law (some single-threaded portion of the code making it slow and difficult to dispatch work fast enough to many CPU, GPU, and AI cores), but even if some dispatch thread is not fast enough to fully feed all GPU and CPU cores, it's still going to be a bit faster to feed the GPU and *some* CPU cores than to run on the GPU alone.
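For a rough sense of why that single-threaded dispatch portion matters, here's the Amdahl's-law arithmetic; the 95%-parallel figure is just an assumption for illustration.

```python
# Amdahl's law: speedup(N) = 1 / ((1 - p) + p / N), where p is the
# fraction of the work that parallelizes across N execution units.
def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_units)

# If even 5% of the job is a serial dispatch thread (p = 0.95), speedup
# is capped at 1/0.05 = 20x no matter how many CPU/GPU/AI cores you add:
for n in (4, 16, 64, 1024):
    print(f"{n:4d} units -> {amdahl_speedup(0.95, n):.2f}x")
# 4 -> 3.48x, 16 -> 9.14x, 64 -> 15.42x, 1024 -> 19.64x
```

Which backs up the point: partially feeding extra cores still helps, but the serial dispatch fraction sets the ceiling.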