TSMC's stacked wafer tech could double the power of Nvidia and AMD GPUs

midian182


The Taiwan Semiconductor Manufacturing Company, better known as TSMC, is holding its 24th annual Technology Symposium in Santa Clara right now, and it’s just unveiled a process that could spell a revolution for graphics cards: Wafer-on-Wafer (WoW) technology.

As the name suggests, WoW works by stacking wafers vertically rather than placing dies side by side across a board, much like how 3D NAND flash memory is stacked in modern solid-state drives. The result is more powerful GPUs for Nvidia and AMD without increasing their physical size or shrinking the fabrication process.

The wafers are bonded back to back and connected through 10-micron holes that form through-silicon via (TSV) connections. TSMC partner Cadence explains that WoW designs can be placed onto an interposer—an electrical interface that routes one connection to another—creating a two-die cube. It’s even possible to stack more than two wafers using the WoW method.

The technology allows more cores to be crammed into a single package and lets the stacked wafers communicate with one another extremely quickly and with minimal latency. What’s especially interesting is that manufacturers could use WoW to place two GPU dies in a single package and release it as a product refresh, creating what is essentially two GPUs in one without it appearing as a multi-GPU setup to the operating system.

The biggest issue with WoW right now is wafer yield. Because the wafers are bonded together, if just one of them is bad the whole stack has to be discarded, even if there’s no problem with the other one. This means the process needs to be used on production nodes with high yield rates, such as TSMC’s 16nm process, in order to be cost efficient. However, the company aims to use WoW with future 7nm and 5nm fabrication processes.
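A quick, purely illustrative sketch (in Python, with made-up yield numbers rather than anything from TSMC) shows why yield is the sticking point: if defects on each wafer are independent, the effective yield of a two-wafer stack is roughly the product of the individual yields, so the penalty grows fast on newer, lower-yield nodes.

```python
# Illustrative sketch: effective yield of a wafer-on-wafer stack.
# Assumes defects on each wafer are independent, so the chance that a
# bonded pair is fully good is the product of the individual yields.
# All yield figures below are made-up examples, not TSMC data.

def stacked_yield(per_wafer_yield: float, layers: int = 2) -> float:
    """Probability that every layer in an n-layer stack is good."""
    return per_wafer_yield ** layers

for node, y in [("mature 16nm-class node", 0.90), ("newer 7nm-class node", 0.70)]:
    print(f"{node}: single yield {y:.0%} -> 2-layer stack {stacked_yield(y):.0%}")

# mature 16nm-class node: single yield 90% -> 2-layer stack 81%
# newer 7nm-class node: single yield 70% -> 2-layer stack 49%
```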


 
And how do they plan on cooling 2 dies put on top of each other when we are already hitting thermal limits pretty hard with existing GPUs?


Agreed. If you look at the image, it's clear that TSMC is aiming for a lower layer of low-power stuff like cache and I/O, with the cores on the top layer. This is still great, as it could make a chip like Ryzen half the size. Imagine getting ThreadRipper in an AM4-sized package.
 
I don't think that's how it works (what will the cache latency be if another layer of complexity is added to it?), and we'd still be putting more heat on top of already hot spots.
 
Hmmm good luck cooling that.

Yeah, basically EMIB turned on its side. A slowing Moore's Law means chip vendors need to get more creative with packaging solutions to compensate (just as the end of NetBurst was balanced out by the move to multicore).
 
Yeah, and this is mostly GPU tech; GlobalFoundries makes Ryzen chips anyway. This exists for CPUs AFAIK, but from the little I've read it's harder to lay out the processor components without hurting IPC.
 
I don't know how it will be done, but cache latency won't necessarily be higher, and could be lower. Part of the point of the idea is that by using layers you're shortening trace lengths.
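To put rough numbers on the trace-length point, here's a toy Python comparison of a lateral on-die route versus a ~10-micron vertical TSV hop; the distances and the signal speed are assumptions for illustration, not measured figures.

```python
# Toy comparison of signal path length: a lateral route across a large die
# versus a ~10-micron vertical TSV hop. The distances and the effective
# signal speed are rough assumptions for illustration only.

SIGNAL_SPEED_M_PER_S = 1.5e8      # assume roughly half the speed of light

def delay_ps(distance_m: float) -> float:
    """Pure time-of-flight delay in picoseconds (ignores RC, buffers, etc.)."""
    return distance_m / SIGNAL_SPEED_M_PER_S * 1e12

lateral_path = 10e-3   # ~10 mm across a big die (assumption)
tsv_hop      = 10e-6   # ~10 microns, the TSV size the article mentions

print(f"lateral ~10 mm route : {delay_ps(lateral_path):.2f} ps")
print(f"vertical ~10 um TSV  : {delay_ps(tsv_hop):.4f} ps")

# The vertical hop is about 1000x shorter, which is the shorter-traces point;
# real cache latency also depends on the SRAM array, routing congestion and
# clocking, not just wire length.
```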

As for the heat, it seems like you're saying this won't work in general, which I think is unfounded. If TSMC offers this process, it should be practical, and it mentioned L3 cache specifically. L3 cache is active only a small percentage of the time, so I'd say there's a good chance it won't be a serious hot spot.

I'm sure that chip designers will have to take such considerations into account, but it seems to me like this is a possible use for it. For GPUs, this would likely help less, because they are more homogeneous and therefore it would be harder to separate them into low power and high power layers.

I just gave Ryzen as an example. It might be even more relevant to mobile chips, which already offer a high core count and could offer even more with a smaller die.
 
Well, you could do as my neighbor did. He was so upset with his computer overheating that he took an old 2.5-ton window A/C unit, extended the coils, and used a miniature AHU (air handling unit) to circulate cold air throughout the case. Needless to say, he had to regulate it heavily because he risked freezing the system up, but I have to say it was one of the more interesting do-it-yourself projects. Now if you could scale that down to roughly 0.01 to 0.05 tons, it could be a pretty interesting approach, although it would certainly be pricey...
 
I would agree that cooling could be an interesting challenge with this type of wafer layout, but... running chips as hard as they can handle is typically the reason for needing heavy cooling. Every bit of power you shave off a chip reduces the thermal output disproportionately (which is usually how you get fanless compute and GPU components: by running them lower/slower).

If you double the GPU circuitry with this wafer stacking, and then run the whole thing at 75% of max power, you'd still have a huge net gain in GPU performance with a much lower thermal effect. In theory, at least.
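As a hedged back-of-envelope (the cube-law power approximation and the 75% figure are assumptions, not measurements), the math works out roughly like this:

```python
# Back-of-envelope: two stacked dies at reduced clocks vs. one die flat out.
# Uses the rough rule of thumb that dynamic power scales with f * V^2 and
# that voltage scales roughly with frequency in this range, i.e. P ~ f^3.
# The exponents and the 75% figure are assumptions, purely illustrative.

def relative_power(freq_scale: float, dies: int = 1) -> float:
    return dies * freq_scale ** 3

def relative_throughput(freq_scale: float, dies: int = 1) -> float:
    # Assumes the workload scales perfectly across both dies.
    return dies * freq_scale

single_power  = relative_power(1.00, dies=1)    # baseline: 1.00
stacked_power = relative_power(0.75, dies=2)    # 2 * 0.42 ~= 0.84
stacked_perf  = relative_throughput(0.75, dies=2)

print(f"stacked power vs. one die at full clocks : {stacked_power / single_power:.2f}x")
print(f"stacked throughput (ideal scaling)       : {stacked_perf:.2f}x")

# Roughly 1.5x the ideal throughput for slightly *less* total power than a
# single die at full clocks -- though that heat is now concentrated in a
# smaller footprint, which is the cooling concern raised earlier.
```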
 
I never said it wouldn't work, but in my opinion it's pretty simple: this is similar to HBM, where you have stacked memory. It works there because the priority is bandwidth, so temperatures can be kept in check while still getting good performance. But when it comes to chips that depend not only on density but also on clock speeds for acceptable performance, you're forced to lower one to increase the other. They can probably stack them if they lower the clocks to keep the temps down.

What this means is that they can probably create monsters for parallelisation (heavily multithreaded workloads like rendering), but IPC will be lowered by a significant margin. It's probably great for specialised servers, but I highly doubt we'll see it in the wild (aka mainstream/workstation PCs). I'm not sure how much this will help with games either, since you're effectively putting two GPUs together to get somewhat higher compute performance than a single higher-clocked GPU can deliver. (I can see this working for some games that don't respond well to GPU clocks but do scale with compute cores.)
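How much that trade-off hurts depends on how parallel the workload is; here's a hedged Amdahl's-law sketch (the 0.85 clock factor and the parallel fractions below are illustrative guesses):

```python
# Amdahl's-law sketch: 2x the execution units at a lower clock only wins
# if enough of the workload actually scales across them. The 0.85 clock
# factor and the parallel fractions below are illustrative guesses.

def speedup(parallel_fraction: float, units: int, clock_scale: float) -> float:
    serial = 1.0 - parallel_fraction
    return clock_scale / (serial + parallel_fraction / units)

for p in (0.50, 0.90, 0.99):
    print(f"parallel fraction {p:.2f}: {speedup(p, units=2, clock_scale=0.85):.2f}x")

# parallel fraction 0.50: 1.13x
# parallel fraction 0.90: 1.55x
# parallel fraction 0.99: 1.68x
# Rendering-style workloads gain a lot; clock-bound games gain much less,
# which matches the point above.
```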
 
Or, instead of a window A/C unit, you could buy a cheap $10 box fan at Walmart and stick it to the side of your PC tower. I did that for two years straight with an AMD Phenom 965 BE at 4.2-4.4 GHz.
 
God damn, you had me thinking they make window units that weigh 2.5 tons. Then I realized it's a unit of heat extraction; I'm too used to BTUs, I guess.

Actually, 2.5 tons is the standard size for a house of 1,200-1,500 sq ft, so it's a big unit. You're right that most window units are rated from 5,000 BTU to over 50,000 BTU. I asked him why he didn't use a significantly smaller unit, but he said he had to go with what he had. Just FYI, 2.5 tons is also the standard size for most automobiles; it's so large because of all the window area and the need for "immediate" cooling for the passengers...
 
"WoW" is right. Though I... and plenty of others I suspect... have long considered "stacked" semiconductors as a means of speeding things up (no need to send a signal to the other side of the chip when you can just go "up" a few nanometers to another layer.)

But the problem of "cooling" has always made this a no-go. You can't apply a crushing heatsink to the lid of a "stacked" processor without crushing the layers.

With modern liquid cooling, I suppose you could flow non-conductive coolant between the layers (a coolant "bath"), but it wouldn't be very effective.
 
When you say tons, you mean the unit's cooling capacity, not its weight. I was thinking the unit somehow weighed 4,500 lbs, and there's no way you're fitting an SUV-sized AC unit in a window lol.
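For reference, a refrigeration "ton" is a rate of heat removal: one ton is defined as 12,000 BTU/hr, about 3.5 kW. A quick conversion of the 2.5-ton unit mentioned above:

```python
# Convert refrigeration tons to BTU/hr and kilowatts.
# By definition, 1 ton of refrigeration = 12,000 BTU/hr ~= 3.517 kW.

BTU_PER_HR_PER_TON = 12_000
KW_PER_TON = 3.517

def cooling_capacity(tons: float) -> tuple:
    """Return (BTU/hr, kW) of heat removal for a given tonnage."""
    return tons * BTU_PER_HR_PER_TON, tons * KW_PER_TON

btu_hr, kw = cooling_capacity(2.5)
print(f"2.5 tons = {btu_hr:,.0f} BTU/hr = {kw:.1f} kW of heat removal")

# 2.5 tons = 30,000 BTU/hr = 8.8 kW -- massive overkill for a PC, which is
# why the system risked freezing up.
```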
 
"WoW" is right. Though I... and plenty of others I suspect... have long considered "stacked" semiconductors as a means of speeding things up (no need to send a signal to the other side of the chip when you can just go "up" a few nanometers to another layer.)

But the problem of "cooling" has always made this a no-go. You can't apply a crushing heatsink to the lid of a "stacked" processor without crushing the layers.

With modern liquid cooling, I suppose you could flow non-conductive coolant between the layers (a coolant "bath"), but it wouldn't be very effective.
I'm sure they can figure out torque specs. You don't really need the most killer pressure anyway. They already mentioned that temps might not be that significant because of the unified design; each layer won't be working as hard as before. Two layers at 70% load might be around the same temperature as one layer fully loaded, or even OC'd and voltage-bumped.
 
Actually, rather than ThreadRipper in an AM4 package, it would be more like a 3rd-gen Ryzen APU with 4 cores and 8 threads (hopefully more). Think of the Vega 11 graphics (11 CUs) in the Ryzen 5 2400G AM4 APU, or whichever it is, but with Navi stacked (if possible) up to four 11-CU dies: 11 compute units x 4 = 44 CUs. That would make AM4 APUs monster CPUs with an iGPU many times faster in 3D gaming than any integrated GPU today. And another user said AMD uses GloFo for Ryzen CPUs; wrong, everything in AMD's portfolio from 7nm onward has been moved to TSMC's foundries...
 
This is probably why AMD moved to TSMC from GloFo in the first place. Damn, I can't wait to see this happen for AMD CPUs and GPUs. I bet it could use Infinity Fabric between die packages too; you know AMD is going to build Infinity Fabric around this tech as well, and it may even enhance it and reduce latency. Who knows, I'm hoping it's a great thing once it can be used on 7nm AMD dies.
 