In context: It seems as if everyone who is anyone has thrown their hats and their money into developing large language models. This AI explosion prompted a need to benchmark them for comparison. So, UC Berkeley, UC San Diego, and Carnegie Mellon University researchers formed the Large Model Systems Organization (LMSYS Org or just LMSYS).

Grading large language models and the chatbots that use them is difficult. Other than counting instances of factual mistakes, grammatical errors, or processing speed, there are no globally accepted objective metrics. For now, we are stuck with subjective measurements.

Enter LMSYS's Chatbot Arena, a crowd-sourced leaderboard for ranking LLMs "in the wild." It employs the Elo rating system, which is widely used to rank players in zero-sum games like chess. Two LLMs compete in random head-to-head matches, with humans blind-judging which bot they prefer based on its performance.
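The Elo mechanics behind those head-to-head matches can be sketched in a few lines. This is the classic Elo update formula, not LMSYS's exact computation (the Arena's published ratings involve additional statistical machinery); the K-factor of 32 here is an illustrative assumption.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B implied by the ratings (standard Elo curve)."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    """Return both players' new ratings after one match.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k (the K-factor) controls how much a single match moves the ratings.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# Two evenly rated bots: the winner gains exactly what the loser sheds.
a, b = update_elo(1200, 1200, score_a=1.0)  # -> (1216.0, 1184.0)
```

Because the expected-score term shrinks the reward for beating a weaker opponent, ratings converge toward each model's true strength as thousands of crowd-sourced votes accumulate.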

Since the Arena launched last year, GPT-4 has held its number one position. It has even become the gold standard, with the highest-ranking systems described as "GPT-4-class" models. However, OpenAI's LLM was nudged off the top spot yesterday when Anthropic's Claude 3 Opus beat GPT-4 by a slim margin, 1253 to 1251. The result was so close that the margin of error puts Claude 3 and GPT-4 in a three-way tie for first, alongside another preview build of GPT-4.

Perhaps even more impressive is Claude 3 Haiku's break into the top ten. Haiku is Anthropic's "local size" model, comparable to Google's Gemini Nano. It is orders of magnitude smaller than Opus, which has trillions of parameters, making it much faster by comparison. According to LMSYS, coming in at number seven on the leaderboard graduates Haiku to GPT-4 class.

Anthropic probably won't hold the top spot for long. Last week, OpenAI insiders leaked that GPT-5 is almost ready for its public debut and should launch "mid-year." The new model is reportedly leaps and bounds better than GPT-4. Sources say it employs multiple "external AI agents" to perform specific tasks, meaning it should be capable of reliably solving complex problems much faster.

Image credit: Mike MacKenzie