New multi-threading technique promises to double processing speeds

zohaibahd

Forward-looking: New research details a process that allows a CPU, GPU, and AI accelerator to work seamlessly in parallel on separate tasks. The pioneering breakthrough could provide blazing-fast, energy-efficient computing – promising to double overall processing speed at less than half the energy cost.

Researchers at the University of California, Riverside have developed a technique called Simultaneous and Heterogeneous Multithreading (SHMT), which builds on contemporary simultaneous multithreading. Simultaneous multithreading lets a single CPU core execute multiple threads at once, but SHMT goes further by incorporating the graphics and AI processors.

The key benefit of SHMT is that these components can simultaneously crunch away on entirely different workloads, optimized to each one's strength. The method differs from traditional computing, where the CPU, GPU, and AI accelerator work independently. This separation requires data transfer between the components, which can lead to bottlenecks.

Meanwhile, SHMT uses what the researchers call a "smart quality-aware work-stealing (QAWS) scheduler" to manage the heterogeneous workload dynamically between components. This part of the process aims to balance performance and precision by assigning tasks requiring high accuracy to the CPU rather than the more error-prone AI accelerator, among other things. Additionally, the scheduler can seamlessly reassign jobs to the other processors in real time if one component falls behind.
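Going only by that description, the core idea can be sketched in a few lines of Python. Everything below (the task fields, device names, and stealing rule) is a hypothetical illustration of a quality-aware work-stealing policy, not the researchers' actual implementation.

# Hypothetical sketch of a quality-aware work-stealing scheduler.
# Field names, device types, and rules are illustrative only, not from the paper.
from collections import deque
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Task:
    name: str
    needs_high_accuracy: bool      # accuracy-critical work should stay off the error-prone unit
    run: Callable[[], object]      # the actual computation

class Device:
    def __init__(self, name, error_prone=False):
        self.name = name
        self.error_prone = error_prone
        self.queue = deque()

    def idle(self):
        return not self.queue

def assign(task: Task, cpu: Device, gpu: Device, tpu: Device) -> None:
    """Initial placement: high-accuracy tasks go to the CPU, the rest to whichever unit is free."""
    if task.needs_high_accuracy:
        cpu.queue.append(task)
    elif tpu.idle():
        tpu.queue.append(task)
    else:
        gpu.queue.append(task)

def steal(thief: Device, victims: List[Device]) -> Optional[Task]:
    """An idle device takes queued work from a busier one, unless the quality rule forbids it."""
    for victim in victims:
        for task in list(victim.queue):
            if thief.error_prone and task.needs_high_accuracy:
                continue           # never hand accuracy-critical work to the error-prone accelerator
            victim.queue.remove(task)
            thief.queue.append(task)
            return task
    return None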

In testing, SHMT boosted performance by 95 percent and sliced power usage by 51 percent compared to existing techniques. The result is an impressive 4x efficiency uplift. Early proof-of-concept trials utilized Nvidia's Jetson Nano board containing a 64-bit quad-core Arm CPU, 128-core Maxwell GPU, 4GB RAM, and an M.2 slot housing one of Google's Edge TPU AI accelerators. While it's not precisely bleeding-edge hardware, it does mirror standard configurations. Unfortunately, there are some fundamental limitations.
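For reference, the 4x figure follows from those two numbers if efficiency is read as performance per unit of energy (our arithmetic, not a quote from the paper): $\frac{1 + 0.95}{1 - 0.51} = \frac{1.95}{0.49} \approx 4$.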

"The limitation of SHMT is not the model itself but more on whether the programmer can revisit the algorithm to exhibit the type of parallelism that makes SHMT easy to exploit," the paper explains.

In other words, it's not a simple universal hardware implementation that any developer can use. Programmers have to learn how to do it or develop tools to do it for them.

If the past is any indication, this is no easy feat. Remember Apple's switch from Intel to Arm-based silicon in Macs? The company had to invest significantly in its developer toolchain to make it easier for devs to adapt their apps to the new architecture. Unless there's a concerted effort from big tech and developers, SHMT could end up a distant dream.

The benefits also depend heavily on problem size. While the peak 95-percent uplift required maximum problem sizes in testing, smaller loads saw diminishing returns. Tiny loads offered almost no gain since there was less opportunity to spread parallel tasks. Nonetheless, if this technology can scale and catch on, the implications could be massive – from slashing data center costs and emissions to curbing freshwater usage for cooling.

Many unanswered questions remain concerning real-world implementations, hardware support, code optimizations, and ideal use-case applications. However, the research does sound promising, given the explosion in generative AI apps over the past couple of years and the sheer amount of processing power it takes to run them.


 
This sounds easy on paper but in practice:

- It must be pretty much a hardware-based solution that requires no programming input.

- Moving data around is slow. Basically, the scheduler must determine whether it's worth transferring data to another "computing unit" or whether it should just pick the nearest one, even if that isn't the fastest for the task (rough sketch of that tradeoff below).
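A rough sketch of that decision, with made-up numbers and a purely hypothetical cost model (not anything AMD or the paper actually uses):

# Hypothetical cost model; just illustrates the transfer-vs-locality tradeoff.
def pick_unit(task_flops, candidates, data_bytes, current_unit):
    """Return the unit with the lowest estimated completion time, counting the cost of moving the data."""
    best_unit, best_time = None, float("inf")
    for unit in candidates:
        compute_time = task_flops / unit["flops_per_s"]
        transfer_time = 0.0 if unit["name"] == current_unit else data_bytes / unit["copy_bytes_per_s"]
        if compute_time + transfer_time < best_time:
            best_unit, best_time = unit["name"], compute_time + transfer_time
    return best_unit

units = [
    {"name": "cpu", "flops_per_s": 5e10, "copy_bytes_per_s": float("inf")},  # data already in system RAM
    {"name": "gpu", "flops_per_s": 5e11, "copy_bytes_per_s": 8e9},
    {"name": "accel", "flops_per_s": 2e12, "copy_bytes_per_s": 4e8},         # fastest compute, slowest link
]
print(pick_unit(task_flops=1e9, candidates=units, data_bytes=5e8, current_unit="cpu"))  # -> "cpu"

With these made-up numbers the accelerator wins on raw compute but loses once the copy time is counted, so the work stays on the CPU, which is exactly the "nearest unit, even if slower" case.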

AMD has been trying this for over a decade now without much success. So yeah, good luck.
 
As physics starts to limit die shrinks, these types of ideas are the way forward to further processing performance.
I've been hearing about photonics being the future for nearly 30 years, be nice if it actually made some progress. I also remember the gigahertz wars. "We'll hit 10ghz by 2010!"
 
I've been hearing about photonics being the future for nearly 30 years, be nice if it actually made some progress. I also remember the gigahertz wars. "We'll hit 10ghz by 2010!"
That 10 GHz promise was indeed realistic, and I have no doubt Intel could have achieved it. However, Intel decided that clock speed without a performance gain is useless and abandoned the NetBurst architecture, which also meant 10 GHz by 2010 never happened.
 
I've been hearing about photonics being the future for nearly 30 years, be nice if it actually made some progress. I also remember the gigahertz wars. "We'll hit 10ghz by 2010!"
True, but some type of parallelism is the best bet for massive gains (even if not in a single go). We just need to figure out a way to do it for things that aren't as straightforwardly parallel.
 
That 10 GHz promise was indeed realistic, and I have no doubt Intel could have achieved it. However, Intel decided that clock speed without a performance gain is useless and abandoned the NetBurst architecture, which also meant 10 GHz by 2010 never happened.
I love how everyone said the Pentium 4 ran too hot and used too much power, but these days we have 300 W+ chips that don't thermally throttle until ~80 °C. It also feels like we are on the cusp of another GHz war.
 
I love how everyone said the Pentium 4 ran too hot and used too much power, but these days we have 300 W+ chips that don't thermally throttle until ~80 °C. It also feels like we are on the cusp of another GHz war.
I don't see AMD CPUs consuming too much power. Thermal limits are high, but power consumption stays in check.

I doubt AMD will start a GHz war; the Zen architecture has always been server-first, and servers don't really need high clock speeds. Intel, on the other hand, wants to win at least the single-thread crown at any cost, so it's a different story there.
 
I don't see AMD CPUs consuming too much power. Thermal limits are high, but power consumption stays in check.

I doubt AMD will start a GHz war; the Zen architecture has always been server-first, and servers don't really need high clock speeds. Intel, on the other hand, wants to win at least the single-thread crown at any cost, so it's a different story there.
Idk, AMD recently changed the spec on AM5 from 170W to 230W, and we have been seeing a bump of a few hundred MHz every new generation. Maybe not the GHz wars of the 2000s, but clock speeds have been consistently going up for the last ~8 years.
 
Idk, AMD recently changed the spec on AM5 from 170W to 230W, and we have been seeing a bump of a few hundred MHz every new generation. Maybe not the GHz wars of the 2000s, but clock speeds have been consistently going up for the last ~8 years.
Natural, since manufacturing technologies also keep developing. Also, because making a core wider no longer gives that much improvement, more clock speed is the obvious route to more performance. However, as said, AMD is clearly focused on servers, and Intel has manufacturing problems, which explains why clock speed gains are so low. Both AMD and Intel could get much more clock speed but see no way to do it without sacrificing IPC.
 
I've been hearing about photonics being the future for nearly 30 years, be nice if it actually made some progress. I also remember the gigahertz wars. "We'll hit 10ghz by 2010!"

The problem with Photonics remains the same: How are you generating the light? Because guess what: Generating light is *very* expensive.
 
True, but some type of parallelism is the best bet for massive gains (even if not in a single go). We just need to figure out a way to do it for things that aren't as straightforwardly parallel.

The issue is that, for problems that are not embarrassingly parallel, there's only so much you can do past a certain point to extract more gains via improved parallel code.
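That ceiling is basically Amdahl's law (a standard formula, not something from the article): if only a fraction $p$ of the runtime parallelizes, then with $n$ processors the speedup is capped at $S(n) = \frac{1}{(1 - p) + p/n} \le \frac{1}{1 - p}$, so with $p = 0.9$ even infinitely many processors top out at a 10x speedup.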
 
The problem with Photonics remains the same: How are you generating the light? Because guess what: Generating light is *very* expensive.

Temu: assorted LEDs, 0.4 cents each in bulk

A chip-based solution would be a thousand times cheaper still. Generating light is cheap; computing with light is an entirely different hairball.

Natural, since manufacturing technologies also keep developing.
The days of large-scale manufacturing gains are over. NVidia has stated that, in the last 10 years, they've gotten only a 2.5X gain from process technology: the rest has come from architectural improvements. Over the next 10 years, we'll probably see about half that.

This sounds easy on paper but in practice:

- It must be pretty much a hardware-based solution that requires no programming input.
If you read the source paper, it's a library/compiler based solution. In essence, it's simply a lower-level approach to what's already being done. Today, a programmer decides what to run on the CPU vs. the GPU, then the underlying library(ies) generally parallelize from there. In this approach, that parallelization is made at the same time as allocation to hardware, allowing for a more efficient distribution.
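A toy way to picture that difference, using a stand-in dispatcher rather than the paper's actual library (the shmt_matmul name and the device strings are hypothetical):

# Hypothetical illustration, not the paper's API: today the programmer picks the device,
# while an SHMT-style runtime splits one operation into sub-tasks and places each of them itself.
import numpy as np

def run_on(device, fn, *args):
    """Stand-in dispatcher; a real runtime would offload to the CPU, GPU, or TPU here."""
    print(f"running {fn.__name__} on {device}")
    return fn(*args)

a, b = np.random.rand(256, 256), np.random.rand(256, 256)

# Status quo: the split is hard-coded by the programmer.
c = run_on("gpu", np.matmul, a, b)   # programmer chose the GPU
s = run_on("cpu", np.sum, c)         # programmer chose the CPU

# SHMT-style: the library chops the work into sub-operations and assigns each one
# at the same time as it parallelizes, so all three units can chew on the same job.
def shmt_matmul(a, b, devices=("cpu", "gpu", "tpu")):
    rows = np.array_split(a, len(devices))                    # split into sub-operations
    parts = [run_on(d, np.matmul, r, b) for d, r in zip(devices, rows)]
    return np.vstack(parts)                                   # combine the partial results

c2 = shmt_matmul(a, b)

The interesting part is that the split and the placement happen in one step, which is what gives the scheduler room to keep every unit busy.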
 
The days of large-scale manufacturing gains are over. NVidia has stated that, in the last 10 years, they've gotten only a 2.5X gain from process technology: the rest has come from architectural improvements. Over the next 10 years, we'll probably see about half that.
2.5X gain means what?

GTX980Ti (2015): 8B transistors
RTX4090 (2022): 76.3B transistors

Like :confused:
If you read the source paper, it's a library/compiler based solution. In essence, it's simply a lower-level approach to what's already being done. Today, a programmer decides what to run on the CPU vs. the GPU, then the underlying library(ies) generally parallelize from there. In this approach, that parallelization is made at the same time as allocation to hardware, allowing for a more efficient distribution.
That's why I said it looks good on paper. As a reminder, Intel's Itanium was also supposed to be a fast CPU "because the compiler makes code that the CPU can easily execute, and because of that the CPU front end could be simple". Problem is, there was never that "ultra compiler".

To put it another way, AMD has been developing the same thing for over a decade now. And they want a hardware-based solution for some reason. Not hard to guess that reason. This kind of supercompiler solution looks good on paper, but making it actually work is super hard.
 
Papa Intel 16900k: Hey kiddo, take care of that for papa, Ok?
Sister RTX 7600x: No way dad... I can't handle it, give it to Junior, I'm playing Candy Crush Remake now.
Junior Intel AIA 2350: It's your turf, deal with your SHMT!
Sister RTX 7600x: Dad, Junior is not doing his chores... Again. And he is cursing.
Grandpa PSU: Say no more kid...
*Junior starts to give smoke and smell funny*
Passerby Crysis: *Process Crysis.exe terminated - family issues*
 
2.5X gain means what?

GTX980Ti (2015): 8B transistors
RTX4090 (2022): 76.3B transistors
Like :confused:
My sentence continued "..the rest has come from architectural improvements." Jensen claimed a 1000-fold increase in AI performance over the last 10 years. If we chalk up 2.5X of that to process nodes, 10X to the transistor count increase, the remaining 40X increase came from improvements in tensor core IPC.
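(For what it's worth, those three factors do multiply out to the headline number: $2.5 \times 10 \times 40 = 1000$.)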

That's why I said it looks good on paper. As a reminder, Intel's Itanium was also supposed to be a fast CPU "because the compiler makes code that the CPU can easily execute, and because of that the CPU front end could be simple". Problem is, there was never that "ultra compiler".
I think you're confusing the rationale for RISC with VLIW. Itanium code wasn't supposed to be "easier for the CPU to run", but rather explicitly parallel. And compilers did exist ... or "a" compiler, at least.

Itanium died for one reason and one reason alone: AMD's x86-64 ran IA-32 code natively, as well as or even better than earlier processors, whereas IA-64 ran IA-32 code dozens of times slower, if at all. Companies had an evolutionary road forward when migrating software to x86-64 which didn't exist for Itanium.
 
I think you're confusing the rationale for RISC with VLIW. Itanium code wasn't supposed to be "easier for the CPU to run", but rather explicitly parallel. And compilers did exist ... or "a" compiler, at least.

Itanium died for one reason and one reason alone: AMD's x86-64 ran IA-32 code natively, as well as or even better than earlier processors, whereas IA-64 ran IA-32 code dozens of times slower, if at all. Companies had an evolutionary road forward when migrating software to x86-64 which didn't exist for Itanium.
Pretty much this.

From a pure design perspective, Itanium was well thought out. And while Itanium ran x86 code at about a 20% hit, the intent was that x86 would become legacy, and over time faster Itanium processors would eventually run x86 faster than the fastest released x86 CPU.

The problem is that rather than focusing on gaining share in the consumer market, Intel focused on servers. As a result, when AMD released x86-64, which ran x86 natively (and thus faster), Itanium quickly lost in the marketplace and became an afterthought.

Much like we'd be in a much better place if the Motorola 68000 had beaten the 386, we'd be in a much better situation if Itanium had beaten x86-64.
 
My sentence continued "..the rest has come from architectural improvements." Jensen claimed a 1000-fold increase in AI performance over the last 10 years. If we chalk up 2.5X of that to process nodes, 10X to the transistor count increase, the remaining 40X increase came from improvements in tensor core IPC.
What Jensen said makes no sense at all. The transistor count has grown at least tenfold, and that has also enabled much more IPC than just 2.5X. No idea what Jensen is trying to say here. Perhaps he is talking about clock speeds, which, again, make no sense at all when talking about GPUs.
I think you're confusing the rationale for RISC with VLIW. Itanium code wasn't supposed to be "easier for the CPU to run", but rather explicitly parallel. And compilers did exist ... or "a" compiler, at least.

Itanium died for one reason and one reason alone: AMD's x86-64 ran IA-32 code natively, as well as or even better than earlier processors, whereas IA-64 ran IA-32 code dozens of times slower, if at all. Companies had an evolutionary road forward when migrating software to x86-64 which didn't exist for Itanium.
Basically yes, but Itanium was also an in-order CPU, mainly because the compiler was supposed to produce code that didn't require the CPU to reorder instructions. So the compiler was also supposed to make the job easier for the CPU. Not the only reason, but a reason too.

With a proper compiler, Itanium would (probably) have been fast enough to mitigate that problem. ARM is getting good on servers despite not running x86-64 natively; it's just fast enough, and other factors come into play. Itanium was not fast enough to overcome the other issues.

Pretty much this.

From a pure design perspective, Itanium was well thought out. And while Itanium ran x86 code at about a 20% hit, the intent was that x86 would become legacy, and over time faster Itanium processors would eventually run x86 faster than the fastest released x86 CPU.

The problem is that rather than focusing on gaining share in the consumer market, Intel focused on servers. As a result, when AMD released x86-64, which ran x86 natively (and thus faster), Itanium quickly lost in the marketplace and became an afterthought.

Much like we'd be in a much better place if the Motorola 68000 had beaten the 386, we'd be in a much better situation if Itanium had beaten x86-64.
Agreed. Itanium also released 4 years late and still didn't meet its promises. Also, x86-64 CPUs were faster than Itanium, so what exactly was Itanium's advantage over x86-64? Some special software, but that was it. Basically there were no reasons to go with Itanium. If it had been much faster than x86-64, things would probably be different.
 
What Jensen said makes no sense at all. The transistor count has grown at least tenfold, and that has also enabled much more IPC than just 2.5X. No idea what Jensen is trying to say here. Perhaps he is talking about clock speeds, which, again, make no sense at all when talking about GPUs.
You're confusing concepts here. IPC is an architectural feature, not a process improvement. If a current CPU design is migrated to a new process node, its IPC won't change whatsoever, but it will (hopefully) hit higher clocks at lower power consumption.

Agreed. Itanium also released 4 years late and still didn't meet its promises. Also, x86-64 CPUs were faster than Itanium, so what exactly was Itanium's advantage...
Itanium was released more than a year before X86-64, and it did deliver mainly on promises. And x86-64 was not "faster" than Itanium, not when running 64-bit code. But as we've pointed out already, x86-64 had that all-important legacy support ... and a far lower price.
 
You're confusing concepts here. IPC is an architectural feature, not a process improvement. If a current CPU design is migrated to a new process node, its IPC won't change whatsoever, but it will (hopefully) hit higher clocks at lower power consumption.


Itanium was released more than a year before X86-64, and it did deliver mainly on promises. And x86-64 was not "faster" than Itanium, not when running 64-bit code. But as we've pointed out already, x86-64 had that all-important legacy support ... and a far lower price.

Um, improvement in process technology is key to improving architecture and IPC. It's not that you cannot make an extra-complex CPU or GPU, but process technology sets limits on what is achievable or even realistic. The RTX4090 has much higher IPC than the GTX980 Ti, and the main reason is that the RTX4090 also has more transistors. Even the Pentium 4 was supposed to be much more complex but was cut down just because the manufacturing tech of the time would have meant too big a chip. So manufacturing tech is also IPC, unless you're doing a pure die shrink of course.

Itanium was supposed to be released in 1997 @ 1 GHz. It released in 2001 @ 800 MHz. I remember this well. Basically, Itanium was late from the beginning, and not much later (a few years) AMD's x86-64 CPU was faster, cooler, cheaper, and backwards compatible. The advertisement that said "Given how slow and hot our competitor CPUs are it's no wonder they Rhymes with Hell" pretty much summarized it.

Backwards compatibility was just the final nail in the coffin. There was basically nothing on a large scale that the Itanic did better.
 
I think I finally understood wtf Jensen was trying to say with that 2.5X improvement BS. The GTX980Ti had a clock speed of around 1000 MHz; the RTX4090 has a clock speed of around 2500 MHz. That makes a 2.5X improvement.

What that BS is trying to say is that Nvidia has made huge leaps in architecture (they have not) and that explains their awesome products. That's total BS of course, since even without any IPC improvements, the RTX4090 would be around 8 times faster in non-memory-limited scenarios at the same clock speed and 20 times faster at nominal clock speeds. Jensen is just trying to downplay the fact that Nvidia has done more die shrinks than real improvements.

Typical Nvidia BS from Jensen, as usual.
 
Agreed. Itanium also released 4 years late and still didn't meet its promises. Also, x86-64 CPUs were faster than Itanium, so what exactly was Itanium's advantage over x86-64? Some special software, but that was it. Basically there were no reasons to go with Itanium. If it had been much faster than x86-64, things would probably be different.
The advantage would have been faster execution through implicit parallelization, but the poor Itanium compiler, combined with the fact that we didn't really have good toolchains for writing parallel code at the time (remember: this was early C2D days), limited what Itanium could do.

Also worth noting that all those CPU-side security bugs that have shown up over the past two decades or so? Itanium is immune to pretty much all of them.
 
The advantage would have been faster execution through implicit parallelization, but the poor Itanium compiler, combined with the fact that we didn't really have good toolchains for writing parallel code at the time (remember: this was early C2D days), limited what Itanium could do.

Also worth noting that all those CPU-side security bugs that have shown up over the past two decades or so? Itanium is immune to pretty much all of them.

Intel did support Itanium for a while, but perhaps Intel saw that Itanium would never be a success. They had the money to keep supporting it rather than abandoning it sooner. I don't believe Itanium would have been a success even with proper software. Cell is another example of a "hardware is fast IF the code somehow suits it" type of CPU, and not surprisingly it was trashed by products that are easier for coders.

Itanium might have its own bugs, but of course nothing related to speculative execution, which is what most bugs today seem to be. Also, since Itanium was not a popular CPU, it barely made sense to hunt for bugs in it.
 
I think I finally understood wtf Jensen was trying to say with that 2.5X improvement BS.
Great. It was evident to everyone else from the first post. Process improvements are just that: smaller, faster transistors. Nothing else.

What that BS is trying to say is that Nvidia has made huge leaps in architecture (they have not) ... total BS of course, since even without any IPC improvements, the RTX4090 would be around 8 times faster in non-memory-limited scenarios at the same clock speed and 20 times faster at nominal clock speeds.
Yet on model training and other AI tasks, their newest products are as much as one thousand times faster. Not just a mere 20x. Seems Jensen actually understands his products after all.
 