Has Nvidia won the AI training market?

Jay Goldberg


AI chips serve two functions. AI builders first take a large (or truly massive) set of data and run complex software to look for patterns in that data. Those patterns are expressed as a model, and so we have chips that "train" the system to generate a model.

Then this model is used to make a prediction from a new piece of data, and the model infers some likely outcome from that data. Here, inference chips run the new data against the model that has already been trained. These two purposes are very different.

Training chips are designed to run full tilt, sometimes for weeks at a time, until the model is completed. Training chips thus tend to be large, "heavy iron."

Inference chips are more diverse: some are used in data centers, others at the "edge" in devices like smartphones and video cameras. These chips are designed to optimize different attributes, like power efficiency at the edge. And, of course, there are all sorts of in-between variants. The point is that there are big differences between "AI chips."
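To make the split concrete, here is a minimal sketch in PyTorch (the tiny model, data, and sizes are invented purely for illustration): training loops over a dataset many times to fit the model's weights, while inference is a single forward pass of new data through the finished model.

```python
import torch
import torch.nn as nn

# Toy dataset and model, purely illustrative.
X = torch.randn(1024, 16)            # 1,024 samples, 16 features
y = torch.randint(0, 2, (1024,))     # binary labels
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

# --- Training: many passes over the data, heavy sustained compute ---
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                  # gradient computation is what makes training expensive
    opt.step()

# --- Inference: one forward pass on a new sample, no gradients ---
model.eval()
with torch.no_grad():
    new_sample = torch.randn(1, 16)
    prediction = model(new_sample).argmax(dim=1)
```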

For chip designers, these are very different products, but as with all things semiconductors, what matters most is the software that runs on them. Viewed in this light, the situation is much simpler, but also dizzyingly complicated.

Simple because inference chips generally just need to run the models that come from the training chips (yes, we are oversimplifying). Complicated because the software that runs on training chips is hugely varied. And this is crucial: there are hundreds, probably thousands, of frameworks now used for training models. There are some incredibly good open-source libraries, but many of the big AI companies/hyperscalers also build their own.

Because the field for training software frameworks is so fragmented, it is effectively impossible to build a chip that is optimized for all of them. As we have pointed out in the past, small changes in software can effectively neuter the gains provided by special-purpose chips. Moreover, the people running the training software want that software to be highly optimized for the silicon on which it runs. The programmers running this software probably do not want to muck around with the intricacies of every chip; their life is hard enough building those training systems. They do not want to learn low-level code for one chip only to have to re-learn the hacks and shortcuts for a new one later. Even if that new chip offers "20%" better performance, the hassle of re-optimizing the code and learning the new chip renders that advantage moot.

Which brings us to CUDA -- Nvidia's low-level chip programming framework. By this point, any software engineer working on training systems probably knows a fair bit about using CUDA. CUDA is not perfect, or elegant, or especially easy, but it is familiar. On such whimsies are vast fortunes built. Because the software environment for training is already so diverse and changing rapidly, the default solution for training chips is Nvidia GPUs.
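To illustrate why that default is so sticky, here is a minimal sketch of the device-selection idiom that most framework code uses today (PyTorch shown; the fallback logic and tiny model are just illustrative). In practice, "use the GPU" means "use CUDA," and anything else requires a different build and backend.

```python
import torch

# The common idiom: use CUDA if it is there, otherwise fall back to the CPU.
# A non-Nvidia accelerator needs a different backend and a different build.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(16, 2).to(device)
batch = torch.randn(8, 16).to(device)
output = model(batch)
```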

The market for all these AI chips is a few billion dollars right now and is forecast to grow 30% or 40% a year for the foreseeable future. One study from McKinsey (maybe not the most authoritative source here) puts the data center AI chip market at $13 billion to $15 billion by 2025 -- by comparison, the total CPU market is about $75 billion right now.

That $15 billion breaks down to roughly two-thirds inference and one-third training, so this is a sizable market. One wrinkle in all this is that training chips are priced in the thousands or even tens of thousands of dollars, while inference chips are priced in the hundreds of dollars and up, which means training chips account for only a small share of total units, roughly 10%-20%.
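A rough back-of-the-envelope check on that unit split, using average selling prices that are our own illustrative assumptions rather than figures from the study:

```python
# Assumed split of a ~$15B data center AI chip market.
training_revenue = 5e9        # ~one-third of $15B
inference_revenue = 10e9      # ~two-thirds of $15B

# Illustrative average selling prices (assumptions, not sourced figures).
training_asp = 2_000          # training chips priced in the $1,000s-$10,000s
inference_asp = 800           # inference chips priced in the $100s and up

training_units = training_revenue / training_asp      # ~2.5M units
inference_units = inference_revenue / inference_asp   # ~12.5M units

share = training_units / (training_units + inference_units)
print(f"Training share of units: {share:.0%}")         # lands in the 10%-20% range
```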

Over the long term, this is going to be important to how the market takes shape. Nvidia is going to have a lot of training margin, which it can bring to bear in competing for the inference market, similar to how Intel once used PC CPUs to fill its fabs and data center CPUs to generate much of its profits.

To be clear, Nvidia is not the only player in this market. AMD also makes GPUs, but it never developed an effective (or at least widely adopted) alternative to CUDA. AMD has a fairly small share of the AI GPU market, and we do not see that changing any time soon.

Also read: Why is Amazon building CPUs?

There are a number of startups that tried to build training chips, but these mostly got impaled on the software problem above. And for what it's worth, AWS has also deployed its own internally designed training chip, cleverly named Trainium. From what we can tell, this has met with modest success; AWS does not have any clear advantage here other than its own (massive) internal workloads. However, we understand they are moving forward with the next generation of Trainium, so they must be happy with the results so far.

Some of the other hyperscalers may be building their own training chips as well, notably Google, which has new variants of its TPU coming soon that are specifically tuned for training. And that is the market. Put simply, we think most people in the market for training compute will look to build their models on Nvidia GPUs.


 
This isn't new; it's been the state of the AI industry for quite a while. And by a while, I mean I remember this being the case as far back as 2015 (before then I didn't pay much attention). For example, TensorFlow only supports CUDA-enabled GPUs out of the box (you can do some fancy tricks and compile from source to get it to run on AMD). Most people just found OpenCL too difficult to work with, so it never took off like CUDA did.

But things are slowly changing. AMD might pull off a "Zen" moment in the AI space, but I don't expect it to topple Nvidia in the next decade.
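For what it's worth, a quick way to see which accelerators a stock TensorFlow 2.x build has registered (a minimal sketch; on a machine without a CUDA GPU the GPU list typically comes back empty unless a special build or plugin is installed):

```python
import tensorflow as tf

# Stock pip builds of TensorFlow only register Nvidia/CUDA GPUs;
# on other hardware the GPU list is usually empty.
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))
print("All devices:", tf.config.list_physical_devices())
```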
 
This isn't new; it's been the state of the AI industry for quite a while. And by a while, I mean I remember this being the case as far back as 2015 (before then I didn't pay much attention). For example, TensorFlow only supports CUDA-enabled GPUs out of the box (you can do some fancy tricks and compile from source to get it to run on AMD). Most people just found OpenCL too difficult to work with, so it never took off like CUDA did.

But things are slowly changing. AMD might pull off a "Zen" moment in the AI space, but I don't expect it to topple Nvidia in the next decade.
They better hurry or it may be too late for them. Nvidia chips already offer great performance in stuff like Stable Diffusion, leaving AMD in the dust. I know some people spend a lot on an RTX 4090/3090 Ti just for Stable Diffusion.
 
Companies that will never again co-operate with Nvidia on any hardware:

- Apple
- Sony
- Microsoft

Doesn't look so clear anymore...
From the looks of it, nVidia has pissed off enough billion and even trillion dollar companies that they're willing to spend the money out of spite. And even with all the money involved, companies like Apple, Amazon, and Intel may find it not financially viable in the long term to let nVidia exercise its position in the AI space.
 
Yes, Nvidia totally dominates the hardware AI market for one reason: CUDA. I don't see how they can lose short of a miraculous new open-source alternative that is far easier to implement, just as fast, and hardware agnostic.

AMD and Intel have the hardware to compete, but unless they can run CUDA just as fast or faster it won't matter much.
 
This isn't new; it's been the state of the AI industry for quite a while. And by a while, I mean I remember this being the case as far back as 2015 (before then I didn't pay much attention). For example, TensorFlow only supports CUDA-enabled GPUs out of the box (you can do some fancy tricks and compile from source to get it to run on AMD). Most people just found OpenCL too difficult to work with, so it never took off like CUDA did.

But things are slowly changing. AMD might pull off a "Zen" moment in the AI space, but I don't expect it to topple Nvidia in the next decade.
OpenCL difficult? I've been writing OpenCL code for 11 years, and even back then I chose it over CUDA for its ease, adaptability, and portability between systems and hardware (not performance portability; that required a little more work). I have written OpenCL even for mobile. I would never stick with software tied to hardware from a single manufacturer. I need something that takes advantage of whatever is available, as long as it is compatible: CPU, GPU, FPGA, DSP, etc., from a mobile phone to a supercomputer.

I use TensorFlow with OpenCL on my AMD cards, without needing to recompile, using PlaidML as the backend (veeeery simple, a two-line change in the code) or DirectML.
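For reference, that two-line backend switch for a Keras workload looks roughly like this (a minimal sketch; it assumes the plaidml-keras package is installed, and the tiny model is just for illustration):

```python
# The "two line change": install PlaidML as the Keras backend
# before importing keras itself.
import plaidml.keras
plaidml.keras.install_backend()

from keras.models import Sequential
from keras.layers import Dense

# From here on, an ordinary Keras model runs through PlaidML's OpenCL path.
model = Sequential([
    Dense(32, activation="relu", input_shape=(16,)),
    Dense(2, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```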
 
Yes, Nvidia totally dominates the hardware AI market for one reason: CUDA. I don't see how they can lose short of a miraculous new open-source alternative that is far easier to implement, just as fast, and hardware agnostic.

AMD and Intel have the hardware to compete, but unless they can run CUDA just as fast or faster it won't matter much.
OpenCL, since 2009 at least, but Nvidia pu$hed with the wallet.
 
They better hurry or it may be too late for them. Nvidia chips already offer great performance in stuff like Stable Diffusion, leaving AMD in the dust. I know some people spend a lot on an RTX 4090/3090 Ti just for Stable Diffusion.
The thing is that PyTorch (the AI framework used by SD) has never been optimized for AMD. Right now you have to use a DirectML backend, or ROCm for translation and execution (Linux only), or SHARK for RX 6000/7000. The RX 7000 series finally has some special units and functions that help accelerate certain neural network functions and layers, so I hope for more adoption and better implementations and optimizations.
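For anyone trying the DirectML route on Windows, the setup looks roughly like this (a sketch that assumes the torch-directml package is installed; the tiny model is just for illustration). ROCm builds of PyTorch instead expose AMD GPUs through the usual torch.cuda calls.

```python
import torch
import torch_directml  # Microsoft's DirectML backend for PyTorch

# Pick the DirectML device (typically the first GPU, AMD or otherwise).
dml = torch_directml.device()

model = torch.nn.Linear(16, 2).to(dml)
batch = torch.randn(8, 16).to(dml)
output = model(batch)  # runs through DirectML rather than CUDA
```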
 
Of course, it's so simple. Everything is a zero sum game. If big tech company X succeeds at solving problem Y, then we the consumers lose.

The problem is that Nvidia, having solved the initial problem of building the first large framework for AI training, can now rely on the fact that everyone already uses its framework, and make it so that everyone keeps using it even if it stagnates and stops solving future problems well.
These large tool monopolies present enormous problems for growth and innovation in every sector, through sheer market inertia and the retraining and switching costs involved in moving to other tools. Excel, AutoCAD, Matlab, the Adobe suite. They introduce forces that stifle industries and actively prevent breakthroughs, because they create effective dead zones of competition. It's great for a company's stock price to have such a zone though xD investors love a monopoly, which is kinda funny considering the system is meant to encourage innovation and competition.
 
I think there is a serious chance that each of the big tech companies will use their own silicon instead of having Nvidia in the mix. Nvidia is at the top, but its main advantage is CUDA. With enough resources from a big company like Google, that could be remedied. The funny thing is that there is a chance that Nvidia, if cornered, will pass ML power down to retail. Which is a good thing.
There is something scary about a huge corporation like Microsoft controlling the future ChatGPT, which could be customer support, writer, artist, accountant and programmer. Please, let's not talk about democratising small businesses that could pay for ChatGPT.

I think with enough time we will start seeing huge subscription charges for using machine learning, equal to the monthly salary of the human who could do the job. The better it becomes, the more expensive it will get.
So there will be a lot of regulation very soon. What if ML could replace 40% of the workforce in 15 years' time? The funny thing is that these models will very soon start training on their own results, which already flood the internet. That will probably lead to even more bizarre results.
 
The thing is that PyTorch (the AI framework used by SD) has never been optimized for AMD. Right now you have to use a DirectML backend, or ROCm for translation and execution (Linux only), or SHARK for RX 6000/7000. The RX 7000 series finally has some special units and functions that help accelerate certain neural network functions and layers, so I hope for more adoption and better implementations and optimizations.
Another question mark about ROCm is which Radeon cards it officially supports. I read the documentation and couldn't get a clear answer. Vega and Polaris cards are supported, but I'm not sure which Navi cards are. At the moment I don't know if my RX 6600 is supported or not.
Also, as of now ROCm works only on Linux; OpenCL is what you use on Windows.
AMD's solutions are convoluted. Consumers don't like this.
Hardware and Software Support Reference Guide
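If you do end up installing a ROCm build of PyTorch on Linux, one sanity check is that it reuses the torch.cuda API on top of HIP, so you can see whether your card was actually picked up (a minimal sketch; behavior on officially unsupported cards varies):

```python
import torch

# ROCm builds of PyTorch expose AMD GPUs through the torch.cuda namespace (HIP).
print("HIP/ROCm version:", torch.version.hip)   # None on CUDA or CPU-only builds
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))
```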
 
Another question mark about ROCm is which Radeon cards it officially supports. I read the documentation and couldn't get a clear answer. Vega and Polaris cards are supported, but I'm not sure which Navi cards are. At the moment I don't know if my RX 6600 is supported or not.
Also, as of now ROCm works only on Linux; OpenCL is what you use on Windows.
AMD's solutions are convoluted. Consumers don't like this.
Hardware and Software Support Reference Guide
My understanding is that they clearly said that RDNA is for gaming and CDNA (Compute DNA) is for the rest.

Not sure if it's good or bad, but that's what they do.

Here is more info about ROCm (granted, I know almost nothing on that matter, but what I understood is that it allows CUDA code to be reused on their supported GPUs, and yes, RDNA is supported, but for some reason they don't post that):

 
OpenCL difficult? I've been writing OpenCL code for 11 years, and even back then I chose it over CUDA for its ease, adaptability, and portability between systems and hardware (not performance portability; that required a little more work). I have written OpenCL even for mobile. I would never stick with software tied to hardware from a single manufacturer. I need something that takes advantage of whatever is available, as long as it is compatible: CPU, GPU, FPGA, DSP, etc., from a mobile phone to a supercomputer.

I use TensorFlow with OpenCL on my AMD cards, without needing to recompile, using PlaidML as the backend (veeeery simple, a two-line change in the code) or DirectML.
I may be mistaken about the difficult part. I don't program in that space; I just remember reading in a few places that it was difficult to optimize OpenCL compared to CUDA, but that could've been hardware/driver support issues and not a problem with the framework per se.
 
From the looks of it, nVidia has pissed off enough billion and even trillion dollar companies that they're willing to spend the money out of spite.
A poor analysis. Companies don't hold grudges. NVidia is attracting enormous competition because those firms realize that AI is --the-- growth sector of the next decade.
 
A poor analysis. Companies don't hold grudges. NVidia is attracting enormous competition because those firms realize that AI is --the-- growth sector of the next decade.
My analysis is that nVidia has pissed off enough companies and shown their true colors to the point where companies see investing in their own infrastructure as financially beneficial rather than buying nVidia's off-the-shelf solution.

They don't want to let nVidia have them by the financial balls, so they front the cost now rather than risk it in the future. And it's not like nVidia hasn't shown they're willing to do that.
 