AMD's poor software optimization is letting Nvidia maintain an iron grip over AI chips

zohaibahd

It's the Software, Stupid

The year is coming to a close, and AMD had been hoping its powerful new MI300X AI chips would finally help it gain ground on Nvidia. But an extensive investigation by SemiAnalysis suggests the company's software challenges are letting Nvidia maintain its comfortable lead.

SemiAnalysis pitted AMD's Instinct MI300X against Nvidia's H100 and H200 and observed several differences between the chips. For the uninitiated, the MI300X is a GPU accelerator based on AMD's CDNA 3 architecture, designed for high-performance computing and AI workloads in particular.

On paper, the performance figures appear excellent for AMD: the chip offers 1,307 TeraFLOPS of FP16 compute power and a massive 192GB of HBM3 memory, outclassing both of Nvidia's rival offerings. AMD's solutions also promise lower overall ownership costs compared to Nvidia's pricey chips and InfiniBand networks.

However, as the SemiAnalysis crew discovered over five months of rigorous testing, raw specs are not the entire story. Despite the MI300X's impressive silicon, AMD's software ecosystem required significant effort to use effectively: SemiAnalysis had to lean heavily on AMD engineers to continuously fix bugs and issues throughout its benchmarking and testing.
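For a sense of what "raw specs versus delivered performance" means in practice, here is a minimal, hypothetical sketch (in PyTorch) of the kind of GEMM throughput microbenchmark typically used for such comparisons. The matrix size, dtype, and iteration count are illustrative assumptions, not SemiAnalysis' actual harness; on PyTorch's ROCm builds the same torch.cuda calls are routed to HIP, so a script like this can run on either vendor's hardware.

# Hypothetical sketch: time a large matmul and report achieved TFLOPS,
# which can then be compared against a chip's datasheet peak.
import time
import torch

def measured_tflops(n=8192, dtype=torch.float16, iters=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):                 # warm-up so setup cost isn't counted
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()           # wait for all queued kernels to finish
    elapsed = time.perf_counter() - start
    return (2 * n**3 * iters) / elapsed / 1e12   # ~2*n^3 FLOPs per matmul

if __name__ == "__main__":
    print(f"FP16 matmul: ~{measured_tflops():.0f} TFLOPS achieved")

Swapping in dtype=torch.bfloat16 would exercise the BF16 path mentioned later in the article; the ratio of the measured number to the advertised peak (1,307 TFLOPS FP16 for the MI300X, per the figures above) is the utilization that ultimately matters more than the headline spec.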

This is a far cry from Nvidia's hardware and software, which, the analysts noted, tend to work smoothly out of the box with no handholding needed from Nvidia staff.

Moreover, the software woes weren't just limited to SemiAnalysis' testing – AMD's customers were feeling the pain too. For instance, AMD's largest cloud provider, Tensorwave, had to give AMD engineers access to the very MI300X chips it had purchased, just so AMD could debug its own software.

Also read: Not just the hardware: How deep is Nvidia's software moat?

The troubles don't end there. From integration problems with PyTorch to subpar scaling across multiple chips, AMD's software consistently fell short of Nvidia's proven CUDA ecosystem. SemiAnalysis also noted that many of AMD's AI libraries are essentially forks of Nvidia's AI libraries, which leads to suboptimal results and compatibility issues.
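To make the "fork" point a bit more concrete, here is a small, hypothetical sketch showing how a stock PyTorch install reports which backend it is actually driving; on ROCm wheels the HIP runtime is surfaced through the very same torch.cuda namespace that was designed around Nvidia's stack (nothing below is taken from the SemiAnalysis report).

# Hypothetical sketch: report whether this PyTorch build targets CUDA or ROCm.
# ROCm builds expose HIP through the torch.cuda API, illustrating how much of
# AMD's stack mirrors interfaces designed around Nvidia's ecosystem.
import torch

def report_backend():
    if not torch.cuda.is_available():
        print("No GPU backend visible to PyTorch")
        return
    name = torch.cuda.get_device_name(0)
    hip = getattr(torch.version, "hip", None)   # a version string on ROCm builds, None on CUDA builds
    if hip:
        print(f"ROCm/HIP {hip} driving {name}")
    else:
        print(f"CUDA {torch.version.cuda} driving {name}")

report_backend()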

"The CUDA moat has yet to be crossed by AMD due to AMD's weaker-than-expected software Quality Assurance (QA) culture and its challenging out-of-the-box experience. As fast as AMD tries to fill in the CUDA moat, Nvidia engineers are working overtime to deepen said moat with new features, libraries, and performance updates," reads an excerpt from the analysis.

The analysts did find a glimmer of hope in the pre-release BF16 development branches for the MI300X software, which showed much better performance. But by the time that code hits production, Nvidia will likely have its next-gen Blackwell chips available (though Nvidia is reportedly having some growing pains with that rollout).

Taking these issues into account, SemiAnalysis listed a series of recommendations for AMD, starting with giving Team Red's engineers more compute and engineering resources to fix and improve the ecosystem.

SemiAnalysis founder Dylan Patel even met with AMD CEO Lisa Su. He posted on X that she understands the work needed to improve AMD's software stack, adding that many changes are already in development.

However, it's an uphill climb after years of apparently neglecting this critical component. As much as the analysts want AMD to legitimately compete with Nvidia, the "CUDA moat" looks set to keep Nvidia firmly in the lead for now.

*patiently waiting for comments*

I can tell right now that this thread is gonna blow up, with many people at least having these next few days off.
 
You're saying the 3-trillion-dollar, basically AI-only company (with a tiny bit of gaming and some other GPU-related chips) offers a better experience than the 204-billion-dollar company that does CPUs and GPUs and has focused on the compute stuff for about a decade less?
Shocker.

AMD still got sales to hyperscalers because once you're buying tens of thousands of chips, the hardware price difference can be compensated for by spending more on software development yourself.
If I were Lisa Su, though, I'd quadruple down on software development. This must be a problem that can be overcome by hiring the right dozen or so people, even if you have to pay them silly wages. Steal them from NVIDIA if you have to.

In the end, it's a rounding error for a company that big to pay a couple of people high salaries. But those people could send the stock value soaring. I'd offer them high wages, low hours (with generously compensated extra hours if they want them), and all the remote working they want. People with that type of knowledge are rare; give them great working conditions to attract them.
 
You're saying the 3-trillion-dollar, basically AI-only company (with a tiny bit of gaming and some other GPU-related chips) offers a better experience than the 204-billion-dollar company that does CPUs and GPUs and has focused on the compute stuff for about a decade less?
Shocker.

AMD still got sales to hyperscalers because once you're buying tens of thousands of chips, the hardware price difference can be compensated for by spending more on software development yourself.
If I were Lisa Su, though, I'd quadruple down on software development. This must be a problem that can be overcome by hiring the right dozen or so people, even if you have to pay them silly wages. Steal them from NVIDIA if you have to.

In the end, it's a rounding error for a company that big to spend on people who could send the stock value soaring.
Well, the funny thing about nVidia is that a lot of their top talent got stock options years ago and have gotten so absurdly rich that they're not doing their jobs anymore. So nVidia is hiring tons of talent right now at stupid wages because they can afford it.
 
AMD is doing well with inference, though. It's not just a software problem: to scale to that massive number of GPUs, Nvidia's expensive hardware ecosystem works better.
 
AMD is doing well with inference, though. It's not just a software problem: to scale to that massive number of GPUs, Nvidia's expensive hardware ecosystem works better.
The only issue is that all the money is going to training right now. That is what the massive (in size and $$$) GPU clusters are used for: training, not inference. That is where nVidia is making bank, on training models. Inference is the low-compute part. There are many accelerators out there that can do inference once a model is created, so AMD being good at that is not that special.

I am disappointed in AMD on the software side; it has always been a weak point for them (both graphics and AI). I hope that one of these days they figure it out. No matter how great the hardware they build, it only matters if the software effectively supports it.
 
Amazing how many errors my RX 6800 XT generates when doing generative AI stuff with Topaz software. Performance is way behind Nvidia cards for AI. The 2080 Super has much higher TOPS than the 6800 XT.

Hate to admit it, but I'm thinking a 5070 Ti for my next-gen card, as I have no confidence in RDNA4 being much better for AI work than RDNA3, since AMD will only focus on Instinct cards. Hopefully I'm wrong, but I doubt it will even support ROCm.
 
I guess the problem stems from the fact that Nvidia had a significant first-mover advantage here and has a deeply entrenched ecosystem that cannot easily be replicated or replaced. So as hard as AMD and Intel are trying to gain further traction, it's not something that can happen fast enough.
 