You are comparing power consumption of GV100 to Cerebras, but you forget interchip communication. Think I said it three times already...
I haven't forgotten at all - reread what I've put down, specifically the part about the DGX-2 needing 8 NVLink switch chips, rated at 100W each. The entire unit is rated at 10 kW peak, with the GPUs in it accounting for 40% of that total and the NVLink chips for 8% - that isn't a trivial amount. I've also pointed out that Nvidia themselves have acknowledged there's a power consumption problem with interchip communication to overcome.
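Just to make the split explicit, here's the arithmetic behind those percentages (using the figures above, which are this thread's numbers rather than an official Nvidia spec sheet):

```python
# Rough power budget of a DGX-2-class system, from the figures quoted above.
TOTAL_PEAK_W = 10_000                     # rated peak for the whole unit

gpu_w    = TOTAL_PEAK_W * 40 // 100       # GPUs: 40% of the total
nvlink_w = TOTAL_PEAK_W * 8 // 100        # NVLink chips: 8% of the total
other_w  = TOTAL_PEAK_W - gpu_w - nvlink_w

print(gpu_w)     # 4000 W for the GPUs
print(nvlink_w)  # 800 W -> consistent with 8 chips at ~100 W each
print(other_w)   # 5200 W left for CPUs, RAM, storage, fans, PSU losses
```

So the interconnect alone draws as much as several desktop CPUs flat out.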
The WSE will have this problem as well, albeit to a much lesser extent. It has a total of 18 GiB of SRAM, distributed across the TPU blocks, of which there are 84 in total - giving each block around 220 MiB of SRAM. The blocks have no cache hierarchy, so other than registers, this is the entire local memory each block has to work with; the rest can, of course, still be accessed, but via the Fabric Switch, and on-die interconnects aren't power-free (nor latency-free, for that matter). It's still preferable to the latency and power drawbacks of inter-module communication, though.
Just on the point of the SRAM, the WSE has a decent amount to work with per block: by comparison, the GV100 has just 16 MiB, and it's hierarchical. That's why the GV100 carries 32 GiB of HBM2 on-module (which is included in the GPU's 250W power figure). So in terms of total local memory per block, the WSE has far less than the GV100, but with better latency; the per-block bandwidth is roughly 110 GiB/s, about 8 times less than the HBM2's. But what happens if the data set is larger than the total on-die memory? In the case of the WSE, it can use other blocks' memory, obviously up to the total on-die amount; after that, it's the same as for the GV100 - off to system memory, cue the associated latency and power usage.
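For anyone who wants to check the arithmetic, here's the back-of-envelope version (the 18 GiB, 84 blocks, and ~110 GiB/s figures are as quoted above; the ~900 GB/s HBM2 bandwidth is the GV100's headline figure):

```python
# Back-of-envelope per-block memory and bandwidth on the WSE vs the GV100.
GiB = 1024**3
MiB = 1024**2

wse_sram_total = 18 * GiB     # total on-die SRAM, as quoted
wse_blocks     = 84           # TPU blocks, as quoted
per_block_sram = wse_sram_total / wse_blocks
print(per_block_sram / MiB)   # ~219.4 MiB per block, i.e. the "around 220 MiB"

hbm2_bw  = 900                # GB/s, GV100's HBM2 (approximate headline figure)
block_bw = 110                # GiB/s per WSE block, as quoted
print(hbm2_bw / block_bw)     # ~8.2 -> "about 8 times less" per block
```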
Let me just say that I'm not disputing your point about the power wasted on intermodule/interchip communication, nor am I disputing that the WSE has a clear advantage in this area. I have said, though, that the current market for AI computation isn't all that concerned about power, given the choice of systems in use. Cerebras have said that the product is "in use" and, although I'm not sure whether they meant commercially or just as a functional unit, somebody somewhere will want it. I'm sure there are sectors where power is important: the question is whether there are enough of them for the WSE to be a successful product. Note that the WSE can't be manufactured at anything like the scale or cost-effectiveness of GPUs, given that it's an entirely custom production system. That cost will either have to be absorbed somewhat by Cerebras, passed on to the purchaser, or a combination of both.
---
Edit: Although I knew wafer-scale processors have been a research topic for quite some time, I was surprised to come across this article, dated just February of this year:
https://www.nextplatform.com/2019/02/11/giving-waferscale-processors-another-shot/
At first I wondered if the author of that research paper was part of the Cerebras team; he's not, but a subsequent search turned up his and TSMC's commentary on the WSE announcement:
https://economictimes.indiatimes.co...mputer-chip/articleshow/70748303.cms?from=mdr
The process is a “lot more labor intensive,” said Brad Paulsen, a senior vice president with TSMC...“This is a challenge for us,” Paulsen said. “And it is a challenge for them.”
“It is not that people have not been able to build this kind of a chip,” said Rakesh Kumar, a professor at the University of Illinois who is also exploring large chips for AI. “The problem is that they have not been able to build one that is commercially feasible.”
---
Also, we don't know how one Cerebras chip stacks up against 56 GV100s in AI workloads. I could bet it's an order of magnitude faster, hence orders of magnitude more efficient.
Indeed it could be; unfortunately, we don't yet know anything about the operation of the WSE's TPUs: we don't know basics such as clock speed, nor anything about the scheduler, instruction issue rate, instruction latencies, and so on. What we do know is that the scheduler filters the data so that no zero-value calculations take place, ensuring the logic units only ever work on non-zero values. That has a lot of potential - though it will naturally depend on how good the scheduler is. I don't expect it to be a poor performer, especially compared to the GV100, as its Tensor cores are pretty basic (even those in Turing aren't much better).
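To illustrate what that zero-filtering buys, here's a toy sketch - purely my own illustration of the idea, not anything Cerebras has published about how their scheduler actually works:

```python
# Toy illustration of zero-skipping in a sparse dot product: operand pairs
# where either side is zero are filtered out before reaching the multiply
# units, so no cycles are spent producing results that are known to be zero.
def sparse_dot(a, b):
    work = [(x, y) for x, y in zip(a, b) if x != 0 and y != 0]
    skipped = len(a) - len(work)          # multiplies avoided entirely
    return sum(x * y for x, y in work), skipped

acc, skipped = sparse_dot([0, 3, 0, 2, 0], [5, 4, 7, 0, 1])
print(acc, skipped)   # 12 4 -> only one of five products was worth computing
```

With the heavily sparse tensors typical of neural network workloads, the fraction of skipped work can be large - which is presumably the whole point of the design.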