Cerebras unveils first trillion-transistor chip - the world's largest

Er, don't people usually quote GPU power draw as what a single GPU itself draws, rather than the platform as a whole? I.e. the GV100 draws 250W on its own, not counting the entire system supporting it, no?

Add a nice heavy server platform for each GPU, or bank of GPUs, and that power figure rises sharply. So I'm guessing Cerebras is as efficient as it can get for so many (potential) GPUs, going by your supposition.
Cerebras themselves quote 15kW for their WSE chip, just as Nvidia quotes 250W for their GV100; both require a system around them to function - i.e. a controlling CPU or two, motherboard, system memory, storage, network functionality and so on. This is why I used Nvidia's DGX-2 as a further example, as it's 10kW peak. The WSE will need a similar system around it too and, at the moment, Cerebras have given no information about this (they design and manufacture such systems themselves); there's no information about its theoretical performance yet, either, which is a shame.
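To put those two vendor figures side by side, here's a quick back-of-the-envelope sketch (Python; it uses only the numbers quoted above and deliberately makes no performance claim):

```python
# Back-of-the-envelope comparison of the vendor-quoted chip powers.
# These are rated figures for the chips alone, not full-system draw.
WSE_CHIP_W = 15_000    # Cerebras's quoted figure for the WSE
GV100_CHIP_W = 250     # Nvidia's quoted figure for a single GV100
DGX2_PEAK_W = 10_000   # Nvidia's quoted peak for a complete DGX-2 system

# The WSE's chip-level power envelope is equivalent to this many GV100s:
print(f"WSE ~= {WSE_CHIP_W / GV100_CHIP_W:.0f}x GV100, chip vs chip")

# Neither figure includes the host system (CPUs, memory, storage,
# networking), and Cerebras haven't published a system-level number yet.
```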

Clearly big wafers are the future. Or they wouldn't have bothered.
Intel, AMD, IBM, Google, Nvidia, and Qualcomm haven't bothered - they're very much of the opinion that exceptionally large monolithic dies are not the future. Obviously, the team at Cerebras believe there is a market for the product they've designed (or at the very least, this is the advertised intention behind it; sometimes, unique designs are created to generate investment interest rather than retail demand), and the AI/TPU market is still very young, with no overall approach being immediately better than the others - for example, Intel uses an entirely CPU-based approach; AMD a mixture of CPU and GPU; Nvidia almost entirely GPU; Google and Qualcomm designed their own tensor processing units. This would suggest that there is a further niche for Cerebras to fit into; it certainly doesn't suggest that their direction is the best solution for the AI market.
 
Cerebras themselves quote 15kW for their WSE chip, just as Nvidia quotes 250W for their GV100; both require a system around them to function - i.e. a controlling CPU or two, motherboard, system memory, storage, network functionality and so on. This is why I used Nvidia's DGX-2 as a further example, as it's 10kW peak. The WSE will need a similar system around it too and, at the moment, Cerebras have given no information about this (they design and manufacture such systems themselves); there's no information about its theoretical performance yet, either, which is a shame.


Intel, AMD, IBM, Google, Nvidia, and Qualcomm haven't bothered - they're very much of the opinion that exceptionally large monolithic dies are not the future. Obviously, the team at Cerebras believe there is a market for the product they've designed (or at the very least, this is the advertised intention behind it; sometimes, unique designs are created to generate investment interest rather than retail demand), and the AI/TPU market is still very young, with no overall approach being immediately better than the others - for example, Intel uses an entirely CPU-based approach; AMD a mixture of CPU and GPU; Nvidia almost entirely GPU; Google and Qualcomm designed their own tensor processing units. This would suggest that there is a further niche for Cerebras to fit into; it certainly doesn't suggest that their direction is the best solution for the AI market.
Time will tell... and more than likely, we won't see the systems that do use them.
 
You are comparing the power consumption of the GV100 to Cerebras, but you're forgetting interchip communication. I think I've said it three times already...
Also, we don't know how one Cerebras chip stacks up against 56 GV100s in AI workloads. I'd bet it's orders of magnitude faster, and hence orders of magnitude more efficient.
I don't get why it's so hard for you to understand. Comparing Cerebras and the GV100 purely on power consumption and performance is like saying a Core 2 Duo using an FSB is equivalent in consumption/performance/efficiency to an IMC solution. Communication outside the chip is always very expensive from an efficiency and performance standpoint. Given 56 GV100 chips, just think how much power will be wasted moving data between them...
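To put rough numbers on that, here's an illustrative sketch; the pJ/bit figures are order-of-magnitude assumptions, not measured values for either product:

```python
# Illustrative cost of moving data on-die vs between chips. The pJ/bit
# figures below are order-of-magnitude ASSUMPTIONS, not measured values
# for the WSE or the GV100 - they're only here to show the gap's scale.
PJ_PER_BIT_ON_DIE = 0.1     # short on-die hop (assumed)
PJ_PER_BIT_OFF_CHIP = 10.0  # chip-to-chip link through a package (assumed)

def transfer_energy_j(gigabytes, pj_per_bit):
    """Energy in joules to move the given amount of data."""
    return gigabytes * 8e9 * pj_per_bit * 1e-12

GB_MOVED = 100.0  # an arbitrary 100 GB of weights/activations
print(f"on-die:   {transfer_energy_j(GB_MOVED, PJ_PER_BIT_ON_DIE):.2f} J")
print(f"off-chip: {transfer_energy_j(GB_MOVED, PJ_PER_BIT_OFF_CHIP):.2f} J")
# Same data, ~100x the energy once it has to leave the die.
```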
 
You are comparing the power consumption of the GV100 to Cerebras, but you're forgetting interchip communication. I think I've said it three times already...
I haven't forgotten at all - reread what I've put down, specifically the part where the DGX-2 needs 12 NVSwitch chips, which are rated at 100W each alone. The entire unit is rated at 10 kW peak, with the 16 GPUs in it accounting for 40% of that total and the NVSwitches around 12% - this isn't a trivial amount. I've also pointed out that Nvidia themselves have acknowledged that there is a power consumption problem to overcome with interchip communications.
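Running those numbers as a quick sanity check (Python; figures as above):

```python
# DGX-2 peak power budget, using the figures discussed above.
DGX2_PEAK_W = 10_000
GPUS, GPU_W = 16, 250            # 16 GV100s at their rated 250W each
NVSWITCHES, NVSWITCH_W = 12, 100 # 12 NVSwitch chips at ~100W each

gpu_total = GPUS * GPU_W
switch_total = NVSWITCHES * NVSWITCH_W
rest = DGX2_PEAK_W - gpu_total - switch_total
print(f"GPUs:      {gpu_total} W ({gpu_total / DGX2_PEAK_W:.0%})")
print(f"NVSwitch:  {switch_total} W ({switch_total / DGX2_PEAK_W:.0%})")
print(f"Everything else (CPUs, RAM, storage, cooling): {rest} W")
```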

The WSE will have this problem as well, albeit to a much lesser extent. It has a total of 18 GiB of SRAM, distributed across the TPU blocks, of which there are 84 in total - giving each block around 220 MiB of SRAM. The blocks have no cache hierarchy, so other than registers, this is the entire local memory each block has to work with; the rest can still be accessed, but via the Fabric Switch, and on-die interconnects aren't power-free (nor latency-free, for that matter). It's still preferable to the latency and power drawbacks of inter-module communication, though.

Just on the point of the SRAM, the WSE has a decent amount to work with per block: by comparison, the GV100 has just 16 MiB, and it's hierarchical. This is why it has 32 GiB of HBM2 on-module (which gets included in the 250W power value for the GPU). So in terms of total local memory per block, the WSE has far less than the GV100, but with better latency; the per-block bandwidth is roughly 110 GiB/s, about 8 times less than the HBM2's. But what happens if the data set being used is larger than the total amount of on-die memory? In the case of the WSE, it can utilise other blocks' memory, obviously up to the total on-die amount; after that, it's the same as for the GV100 - off to system memory, cue the associated latency and power usage.
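The per-block numbers above, side by side (again, just the figures quoted in this thread):

```python
# Per-block local memory and bandwidth, from the figures quoted above.
WSE_SRAM_GIB, WSE_BLOCKS = 18, 84
GV100_ONDIE_MIB = 16          # GV100's total on-die SRAM (hierarchical)
GV100_HBM2_GIB = 32           # on-module HBM2, inside the 250W figure
WSE_BLOCK_BW_GIB_S = 110      # per-block bandwidth, as quoted
HBM2_FACTOR = 8               # HBM2 is "about 8 times" that figure

per_block_mib = WSE_SRAM_GIB * 1024 / WSE_BLOCKS
print(f"WSE SRAM per block:   ~{per_block_mib:.0f} MiB")
print(f"GV100 on-die SRAM:     {GV100_ONDIE_MIB} MiB (whole GPU)")
print(f"GV100 local memory:    {GV100_HBM2_GIB} GiB incl. HBM2")
print(f"Implied HBM2 bandwidth: ~{WSE_BLOCK_BW_GIB_S * HBM2_FACTOR} GiB/s")
```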

Let me just say that I'm not disputing your point about the power wasted on intermodule/interchip communication, nor am I disputing that the WSE has a clear advantage in this area. I have said, though, that the current market for AI computation isn't all that concerned about power, given the choice of systems in use. Cerebras have said that the product is "in use" and, although I'm not sure if they meant commercially or just as a functional unit, somebody somewhere will want it. I'm sure there are sectors where power is important: the question to consider is whether there are enough of them for the WSE to be a successful product. Note that the WSE can't be manufactured at anything like the scale or cost-effectiveness of GPUs, given that it requires an entirely custom production system. That cost will either have to be absorbed somewhat by Cerebras, passed down to the purchaser, or a combination of both.
---
Edit: Although I knew wafer-scale processors have been in research for quite some time now, I was surprised to come across this article, dated just February of this year:

https://www.nextplatform.com/2019/02/11/giving-waferscale-processors-another-shot/

At first I wondered if the author of that research paper was part of the Cerebras team and, while he's not, a subsequent search turned up his and TSMC's commentary on the WSE announcement:

https://economictimes.indiatimes.co...mputer-chip/articleshow/70748303.cms?from=mdr

The process is a “lot more labor intensive,” said Brad Paulsen, a senior vice president with TSMC... “This is a challenge for us,” Paulsen said. “And it is a challenge for them.”

“It is not that people have not been able to build this kind of a chip,” said Rakesh Kumar, a professor at the University of Illinois who is also exploring large chips for AI. “The problem is that they have not been able to build one that is commercially feasible.”
---
Also, we don't know how one Cerebras chip stacks up against 56 GV100s in AI workloads. I'd bet it's orders of magnitude faster, and hence orders of magnitude more efficient.
Indeed it could be; unfortunately, we don't know anything yet about the operation of the WSE's TPUs: we don't know basics such as clock speed, nor do we know anything about the scheduler, instruction issue rate, instruction latency, etc. What we do know is that the scheduler analyses the data to ensure that no zero-value calculations take place, so the logic units only ever work on non-zero values. That has lots of potential - though it will naturally depend on how good the scheduler is. I don't expect it to be a poor performer, especially compared to the GV100, as its Tensor cores are pretty basic (even those in Turing aren't that much better).
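As a toy illustration of what that zero-filtering buys (this is just a sketch of the general idea, not Cerebras's actual scheduler):

```python
import numpy as np

# Toy illustration of zero-skipping ("sparsity harvesting") - NOT
# Cerebras's actual scheduler, just the general idea: never dispatch
# a multiply whose activation operand is zero, since it adds nothing.
def sparse_dot(weights, activations):
    """Dot product that skips zero activations; returns (result, multiplies)."""
    total, ops = 0.0, 0
    for w, a in zip(weights, activations):
        if a == 0.0:        # the "scheduler" filters this operand out
            continue
        total += w * a
        ops += 1
    return total, ops

rng = np.random.default_rng(0)
acts = rng.random(1000) * (rng.random(1000) > 0.7)  # ~70% zeros, e.g. post-ReLU
weights = rng.random(1000)
result, ops = sparse_dot(weights, acts)
print(f"{ops} multiplies instead of {acts.size}, "
      f"{1 - ops / acts.size:.0%} of the work skipped")
```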
 