AMD AI chips are nearly as fast as Nvidia's, MosaicML says

Alfonso Maruccia

Why it matters: MosaicML is an AI startup that was recently acquired by Databricks for $1.3 billion. Both companies advocate for a DIY approach to AI systems and LLM training platforms, enabling companies to maintain control over their AI applications. Regarding hardware, MosaicML claims AMD chips can deliver nearly equivalent performance to Nvidia chips.

As Nvidia's recent surge in market capitalization clearly demonstrates, the AI industry is in desperate need of new hardware to train large language models (LLMs) and other AI-based algorithms. While server and HPC GPUs are of little use for gaming, they serve as the foundation for the data centers and supercomputers that perform the highly parallelized computations these systems require.

When it comes to AI training, Nvidia's GPUs have been the most desirable to date. In recent weeks, the company briefly achieved an unprecedented $1 trillion market capitalization for this very reason. However, MosaicML now emphasizes that Nvidia is just one choice in a multifaceted hardware market, suggesting that companies investing in AI should not blindly spend a fortune on Team Green's highly sought-after chips.

The AI startup tested AMD MI250 and Nvidia A100 cards, both of which are one generation behind each company's current flagship HPC GPUs. For testing, it used its own software tools along with the Meta-backed open-source PyTorch framework and AMD's ROCm software platform.

MosaicML trained an LLM without making any changes to the underlying model code and found that AMD chips performed nearly as well as Nvidia's.
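For context, that hands-off portability comes from PyTorch's ROCm builds exposing AMD GPUs through the same torch.cuda interface used on Nvidia hardware. The following is a minimal illustrative sketch - a toy stand-in rather than MosaicML's actual training stack, with placeholder model sizes and hyperparameters:

```python
import torch
import torch.nn as nn

# Minimal sketch, not MosaicML's actual training stack: the exact same
# PyTorch code path runs on an Nvidia A100 or an AMD MI250, because ROCm
# builds of PyTorch expose AMD GPUs through the familiar torch.cuda API.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-in for an LLM; a real run would build an MPT-style transformer.
vocab = 32000
model = nn.Sequential(nn.Embedding(vocab, 512), nn.Linear(512, vocab)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (8, 128), device=device)   # fake input batch
targets = torch.randint(0, vocab, (8, 128), device=device)  # fake next tokens

logits = model(tokens)                                   # forward pass
loss = loss_fn(logits.view(-1, vocab), targets.view(-1))
loss.backward()                                          # backward pass
optimizer.step()                                         # one training step
print(f"loss: {loss.item():.3f} on {device}")
```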

In real workload-based tests, MosaicML reports that the LLM training stack remained stable and performed well without any additional configuration. AMD MI250 GPUs were "competitive," the company stated, delivering 80 percent of the per-GPU data throughput of Nvidia's A100 40 GB model and 73 percent of that of the A100 80 GB model.

Hanlin Tang, chief technology officer at MosaicML, says that for most companies manufacturing chips to accelerate ML algorithms, the major weakness lies in their software. AMD excelled in this area, and MosaicML expects even better performance from newer HPC GPUs as the software tools continue to improve. It should be mentioned, however, that CUDA – Nvidia's low-level chip programming framework – has become a de facto standard in the industry, at least for now. CUDA is not perfect, or elegant, or especially easy, but it is familiar and it is Nvidia-only.

AMD is understandably pleased with MosaicML's findings, which seemingly validate the company's strategy of supporting an "open and easy to implement software ecosystem" for AI training and inference on its chips. Nvidia, meanwhile, declined to comment.


 
Interesting article.
So AI and LLMs run comparably on AMD and Nvidia hardware, and what matters most for the performance difference is how intelligently the AI is implemented at the software level?
 
Interesting article.
So AI and LLMs run comparably on AMD and Nvidia hardware, and what matters most for the performance difference is how intelligently the AI is implemented at the software level?

Well, I don't know if the AI software level has as much to do with it (since MosaicML didn't change the high-level source code at all) as the software *stack*. I'm guessing the main difference is that CUDA has been around longer and is more optimized and in tune with the Nvidia hardware backend. But I'd claim that is an artificial difference (AMD can get some more ninja programmers to close the gap). The bigger newsworthy thing here is that AI and LLM workloads can actually leverage AMD hardware now, which was previously a huge blocker because of how much CUDA was in control of the software stack.
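For what it's worth, you can check at runtime which stack a given PyTorch build is sitting on - a quick sketch (the version strings in the comments are just examples):

```python
import torch

# Sketch: the same torch.cuda front end is backed by different stacks
# depending on the build. A CUDA wheel sets torch.version.cuda and leaves
# torch.version.hip as None; a ROCm wheel does the reverse, yet the
# torch.cuda calls below work unchanged on both.
print("CUDA runtime:", torch.version.cuda)     # e.g. "11.8" on an Nvidia build
print("ROCm/HIP runtime:", torch.version.hip)  # e.g. "5.4" on an AMD build

if torch.cuda.is_available():
    # Reports "NVIDIA A100..." or "AMD Instinct MI250..." as appropriate.
    print("Device:", torch.cuda.get_device_name(0))
```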
 
AMD MI250 GPUs were "competitive," the company stated, delivering 80 percent of the per-GPU data throughput of Nvidia's A100 40 GB model and 73 percent of that of the A100 80 GB model.

According to this link - which covers the same story but is, believe it or not, more informative - those numbers for the Nvidia parts actually come from a system with double the number of Nvidia GPUs, yet that isn't mentioned here because Nvidia must be protected or something.

The AI Training Throughput was done on a range of models from 1 Billion to 13 Billion parameters. Testing showed that the AMD Instinct MI250 delivered 80% of the performance of NVIDIA's A100 40GB and 73% performance of the 80GB variant. NVIDIA did retain its leadership position in all of the benchmarks but it should be mentioned that they also had twice as many GPUs running in the tests. Furthermore, it is mentioned that further improvements on the training side are expected for AMD Instinct accelerators in the future.
Oh well, it is what it is.
 
AI will not be constrained to proprietary solutions - there is too much money, too much at stake, and they are too limiting.

I know nothing about CUDA - however, I guarantee Nvidia will in the future need a CUDA 2 and an open AI platform.

The whole AI vertical production line will be analysed - weakness at the lower levels will not ultimately be acceptable - and a lot of this will be analysed by AI itself.
I mean, can CUDA give fuzzy output - that is, not 0 or 1? IMHO AI needs fuzzy inputs - people already add noise to algorithms to get smarter answers.

As an aside, I was thinking the other day about how people dismissed ChatGPT as just a dumb big statistical language model - then how do we do it - speak 100 words a minute? Do we do something different?

We also must use ChatGPT-like skills - but we have additional meta-abilities, emotional value systems, etc., plus consciousness.
 
According to this link - which covers the same story but is, believe it or not, more informative - those numbers for the Nvidia parts actually come from a system with double the number of Nvidia GPUs, yet that isn't mentioned here because Nvidia must be protected or something.

Oh well, it is what it is.
They do say that the numbers are per-GPU.
 
Jim Keller is going to start making AI chips.

If you have NV stock, it's time to look at the market, not the hype...
 
I thought you would use AI to write better AI code, whether it be CUDA or another language...
 
I have a really hard time believing that AMD software "excels" at AI, given that software has historically been AMD's Achilles heel and it's orders of magnitude behind Nvidia.

BTW, CUDA is fairly easy, very capable, and diverse. It may not be "perfect" (compared to what?) but there's a reason it's the de facto standard. It makes OpenCL pretty much non-existent by comparison.
 
Almost as fast as Nvidia - The story of AMD.

Yeah, but you missed the crucial point: it's almost as fast as NV with just a recent change to the ROCm software itself, but no underlying code change (a.k.a. it's still running CUDA code that was directly translated).

Essentially, the AI software was built to run off CUDA in the first place. Had it been designed and optimized to run off the open standards AMD uses (which are always better than proprietary solutions), the results would likely be completely equal, if not in AMD's favour altogether (the articles themselves also mention that with further software optimizations to ROCm, it's likely performance will equalize completely or shift to AMD's side).

Also, these tests were done on the MI250... not the MI250X (which is even more powerful).
 
Yeah, but you missed the crucial point: it's almost as fast as NV with just a recent change to the ROCm software itself, but no underlying code change (a.k.a. it's still running CUDA code that was directly translated).

Essentially, the AI software was built to run off CUDA in the first place. Had it been designed and optimized to run off the open standards AMD uses (which are always better than proprietary solutions), the results would likely be completely equal, if not in AMD's favour altogether (the articles themselves also mention that with further software optimizations to ROCm, it's likely performance will equalize completely or shift to AMD's side).

Also, these tests were done on the MI250... not the MI250X (which is even more powerful).
You two also missed that the Ngreedia system had double the number of GPUs compared to the AMD one.
 
You two also missed that the Ngreedia system had double the number of GPUs compared to the AMD one.
That is correct, but Mosaic reported its findings per GPU in the systems tested:

[Screenshot: MosaicML's per-GPU LLM training throughput comparison, MI250 vs. A100]


The figures shouldn't come as a surprise, though. AMD's MI250 has a very high FP32/FP64 FMA rate, significantly better than Nvidia's A100. But that's not getting used here, as it's mostly mixed precision matrix math, and the latter's tensor cores generate a higher throughput than the MI250, especially when using the BF16 format.
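To make that concrete: LLM training typically runs its matmuls under autocast in BF16, so it's the tensor-core/matrix-core path doing the work rather than the plain FMA pipelines. A rough sketch of that path in PyTorch (matrix sizes are arbitrary):

```python
import torch

# Rough sketch: under autocast, the matmul below executes in bfloat16 on the
# tensor cores (A100) or matrix cores (MI250), bypassing the plain FP32/FP64
# FMA units where the MI250's peak rate is higher. Sizes are arbitrary.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    c = a @ b

print(c.dtype)  # torch.bfloat16: the mixed-precision path did the work
```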

(the articles themselves also mention that with further software optimizations to ROCm, it's likely performance will equalize completely or shift to AMD's side).
Mosaic state that:

"We predict that AMD performance will get better as the ROCm FlashAttention kernel is improved or replaced with a Triton-based one: when comparing a proxy MPT model with `n_heads=1` across systems, we found a substantial lift that brings MI250 performance within 94% of A100-40GB and 85% of A100-80GB."

So it's not saying that the performance will equalize or shift to AMD, but it doesn't need to. If AMD's prices are more enticing than Nvidia's, and it can get ROCm fully sorted, then a 6 to 15% performance difference won't matter.
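For anyone who wants to poke at the layer in question: PyTorch 2.x exposes fused FlashAttention-style kernels through scaled_dot_product_attention, which is roughly where the ROCm improvements Mosaic predicts would land. A hedged sketch, with shapes loosely echoing the `n_heads=1` proxy from the quote (they're illustrative, not Mosaic's benchmark setup):

```python
import torch
import torch.nn.functional as F

# Hedged sketch: scaled_dot_product_attention dispatches to a fused
# FlashAttention-style kernel when the backend provides one, which is the
# layer Mosaic's ROCm prediction is about. Shapes loosely follow the
# quote's n_heads=1 proxy; they are illustrative, not Mosaic's benchmark.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

batch, n_heads, seq_len, head_dim = 8, 1, 2048, 128
q = torch.randn(batch, n_heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([8, 1, 2048, 128])
```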
 
That is correct, but Mosaic reported its findings per GPU in the systems tested:

[Screenshot: MosaicML's per-GPU LLM training throughput comparison, MI250 vs. A100]


The figures shouldn't come as a surprise, though. AMD's MI250 has a very high FP32/FP64 FMA rate, significantly better than Nvidia's A100. But that's not getting used here, as it's mostly mixed precision matrix math, and the latter's tensor cores generate a higher throughput than the MI250, especially when using the BF16 format.


Mosaic state that:

"We predict that AMD performance will get better as the ROCm FlashAttention kernel is improved or replaced with a Triton-based one: when comparing a proxy MPT model with `n_heads=1` across systems, we found a substantial lift that brings MI250 performance within 94% of A100-40GB and 85% of A100-80GB."

So it's not saying that the performance will equalize or shift to AMD, but it doesn't need to. If AMD's prices are more enticing than Nvidia's, and it can get ROCm fully sorted, then a 6 to 15% performance difference won't matter.
I stand corrected.
 