GPT-4 loses its position as "best" LLM to Claude-3 in LMSYS benchmark

Cal Jeffrey

In context: It seems as if everyone who is anyone has thrown their hats and their money into developing large language models. This AI explosion prompted a need to benchmark them for comparison. So, researchers from UC Berkeley, UC San Diego, and Carnegie Mellon University formed the Large Model Systems Organization (LMSYS Org, or just LMSYS).

Grading large language models and the chatbots that use them is difficult. Beyond counting factual mistakes and grammatical errors or measuring processing speed, there are no globally accepted objective metrics. For now, we are stuck with subjective measurements.

Enter LMSYS's Chatbot Arena, a crowd-sourced leaderboard for ranking LLMs "in the wild." It employs the Elo rating system, which is widely used to rank players in zero-sum games like chess. Two LLMs compete in random head-to-head matches, with humans blind-judging which bot they prefer based on its performance.
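For readers curious about the mechanics, the sketch below shows the core Elo update that a single human vote would trigger. It is illustrative only: the K-factor, starting ratings, and function names are assumptions for this example, not LMSYS's actual parameters or code.

def expected_score(rating_a, rating_b):
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, a_won, k=32.0):
    # Return both models' new ratings after one head-to-head vote.
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    return (rating_a + k * (s_a - e_a),
            rating_b + k * ((1.0 - s_a) - (1.0 - e_a)))

# Example: two closely rated models; the judge prefers model A.
print(elo_update(1251.0, 1253.0, a_won=True))

Because the expected score for two near-equal ratings is close to 0.5, a single vote moves each rating by roughly half the K-factor, and a model's rank emerges only over thousands of such votes.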

GPT-4 has held the Chatbot Arena's number one position since the leaderboard launched last year. It has even become the gold standard, with the highest-ranking systems described as "GPT-4-class" models. However, OpenAI's LLM was nudged off the top spot yesterday when Anthropic's Claude 3 Opus edged out GPT-4 by a slim margin, 1253 to 1251. The result was so close that the margin of error puts Claude 3 Opus, GPT-4, and another preview build of GPT-4 in a three-way tie for first.
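How can 1253 beat 1251 and still be a tie? Leaderboards like this typically attach error bars to each rating, commonly by bootstrap resampling of the recorded votes: recompute the ratings many times over resampled copies of the battle log, and treat models whose intervals overlap as statistically indistinguishable. The sketch below illustrates that idea under assumed parameters (sequential Elo, K=32, base rating 1000); it is not LMSYS's actual methodology.

import random

def run_elo(battles, k=32.0, base=1000.0):
    # Sequential Elo over a list of (winner, loser) battles.
    ratings = {}
    for winner, loser in battles:
        ra = ratings.setdefault(winner, base)
        rb = ratings.setdefault(loser, base)
        e_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
        ratings[winner] = ra + k * (1.0 - e_a)
        ratings[loser] = rb - k * (1.0 - e_a)
    return ratings

def bootstrap_interval(battles, model, n_boot=1000, alpha=0.05):
    # 95% percentile interval for one model's rating, computed by
    # re-running Elo over resampled copies of the battle log.
    scores = sorted(
        run_elo(random.choices(battles, k=len(battles))).get(model, 1000.0)
        for _ in range(n_boot)
    )
    return scores[int(alpha / 2 * n_boot)], scores[int((1 - alpha / 2) * n_boot) - 1]

If the intervals for two models overlap, a two-point gap in the headline numbers is statistical noise rather than a meaningful ranking.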

Perhaps even more impressive is Claude 3 Haiku's break into the top ten. Haiku is Anthropic's "local size" model, comparable to Google's Gemini Nano. It is orders of magnitude smaller than Opus, which has trillions of parameters, making it much faster by comparison. According to LMSYS, coming in at number seven on the leaderboard graduates Haiku to GPT-4 class.

Anthropic probably won't hold the top spot for long. Last week, OpenAI insiders leaked that GPT-5 is almost ready for its public debut and should launch "mid-year." The new model is said to be leaps and bounds better than GPT-4. Sources say it employs multiple "external AI agents" to perform specific tasks, meaning it should be capable of reliably solving complex problems much faster.

Image credit: Mike MacKenzie


 
Moving from which CPU renders Cinebench faster and which GPU has the best FPS to which LLM now ranks higher. What’s next?
 
Because the workings of an AI really can be quantified with a number, oh wait...
And especially one generated by people, whose various biases come into play, and where different people weigh different factors in what makes a response "better": is it being more concise, more "accurate", more responsive, more personalised and tailored output, etc.? It can be a blind test all it wants (not even double-blind), but it's still people responding without any clear criteria.
 
Thank you, Techspot, for apparently moving away from using ‘AI’ as a description of LLM transformer networks, describing them as what they are instead, and putting company ‘AI’ messaging in air quotes. This is a big, big step forward in the messaging surrounding these algorithms.

We need less media parroting the messaging of companies reliant on ‘AI’ press releases to boost their stock prices. You are helping to buck this trend and communicate the facts of this technology rather than its marketing fiction.
 
Translation: Claude-3 produces less crap than GPT-4, but Claude-3 still produces crap and more quickly, too. 🤣

My apologies, but I could not resist the temptation.
 
People need to understand it is a tool, not a magic source of truth that has answers to everything. Ask it anything too technical and it buckles. I use it effectively as a quick guide for basic answers, or akin to a search engine so I know where to look next, but actually using it to do your dirty work does not end well, especially if you try longer conversations with it. The media needs to stop parading it about as 'AI' and point out it is just a model, like Techspot have thankfully done. It's not magic, just really advanced word association.
 
I agree. However, I think people themselves need to develop better BS detectors. People, IMO, like to look for quick fixes to their problems. If anything comes along that supposedly offers the promise of a quick fix, people should remember that old adage, "If it sounds like it is too good to be true, it probably is," especially where AI is concerned.

Personally, I avoid using AI at all. If the results are only marginally better than a standard search engine's, I don't think it's worth my effort, given that AI results need to be verified just as search engine results do. If I cannot count on the results to be accurate, I have no use for it.
 