AMD AI chips are nearly as fast as Nvidia's, MosaicML says

Software is key in LLM training

By Alfonso Maruccia, July 3, 2023, 11:16

Why it matters: MosaicML is an AI startup that was recently acquired by Databricks for $1.3 billion. Both companies advocate for a DIY approach to AI systems and LLM training platforms, enabling companies to maintain control over their AI applications. Regarding hardware, MosaicML claims AMD chips can deliver nearly equivalent performance to Nvidia chips.

As Nvidia's recent surge in market capitalization clearly demonstrates, the AI industry is in desperate need of new hardware to train large language models (LLMs) and other AI-based algorithms. While server and HPC GPUs may be worthless for gaming, they serve as the foundation for data centers and supercomputers that perform highly parallelized computations necessary for these systems.

When it comes to AI training, Nvidia's GPUs have been the most desirable to date. In recent weeks, the company briefly achieved an unprecedented $1 trillion market capitalization for this very reason. However, MosaicML now emphasizes that Nvidia is just one choice in a multifaceted hardware market, suggesting that companies investing in AI should not blindly spend a fortune on Team Green's highly sought-after chips.

The AI startup tested AMD MI250 and Nvidia A100 cards, both of which are one generation behind each company's current flagship HPC GPUs. For testing, it used its own software tools along with the Meta-backed open-source framework PyTorch and AMD's own software stack.

MosaicML trained an LLM algorithm without making any changes to the underlying software code, and found that AMD chips performed nearly as well as those from Nvidia.

In real workload-based tests, MosaicML reports that the LLM training stack remained stable and performed well without any additional configuration. AMD MI250 GPUs were "competitive," the company stated, delivering 80 percent of the per-GPU data throughput offered by Nvidia's A100 40 GB model and 73 percent of that offered by the A100 80 GB model.
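To put those percentages in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes throughput scales perfectly linearly with GPU count, which real clusters rarely manage, and the function name and the eight-GPU example are illustrative choices rather than anything MosaicML published.

```python
# Relative per-GPU training throughput of the MI250 as reported in the
# article, expressed as a whole-number percentage of each A100 variant.
RELATIVE_THROUGHPUT_PCT = {
    "A100 40 GB": 80,
    "A100 (larger-memory model)": 73,
}


def mi250_equivalent(a100_count: int, pct: int) -> int:
    """Smallest number of MI250s matching the throughput of a100_count A100s.

    Hypothetical helper: assumes perfectly linear scaling across GPUs,
    so treat the result as an optimistic lower bound.
    """
    if a100_count < 0:
        raise ValueError("a100_count must be non-negative")
    if not 0 < pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (a100_count * 100 + pct - 1) // pct


if __name__ == "__main__":
    for model, pct in RELATIVE_THROUGHPUT_PCT.items():
        needed = mi250_equivalent(8, pct)
        print(f"8x {model} is roughly matched by {needed}x MI250")
```

Under that simplifying assumption, the gap amounts to a couple of extra AMD cards per eight-GPU Nvidia node, so the comparison ultimately hinges on price and availability rather than raw speed alone.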

Hanlin Tang, chief technology officer at MosaicML, states that the major weakness for most companies manufacturing chips for ML algorithm acceleration lies in their software. AMD excelled in this area, and the company is expecting even better performance on new HPC GPUs as software tools continue to improve. It should be mentioned, however, that CUDA, Nvidia's low-level chip programming framework, has become a sort of standard in the industry, at least for now. CUDA is not perfect, or elegant, or especially easy, but it is familiar and it is Nvidia-only.

AMD is understandably pleased with MosaicML's findings, which seemingly validate the company's strategy of supporting an "open and easy to implement software ecosystem" for AI training and inference on its chips. Nvidia, meanwhile, declined to comment.
