On Tuesday, startup Anthropic introduced a suite of generative AI models that it claims deliver top-tier performance. Shortly after, competitor Inflection AI launched a model that it argues comes close to rivaling some of the leading models, including OpenAI’s GPT-4, in capability and quality.
Anthropic and Inflection are not the first AI companies to assert that their models surpass the competition based on certain benchmarks. Google made similar claims with its Gemini models at launch, while OpenAI promoted GPT-4 and its predecessors—GPT-3, GPT-2, and GPT-1—as high achievers in the field.
But what do these assertions about "state-of-the-art performance" really mean? More importantly: Will a model that statistically "performs" better than another actually provide a noticeable improvement for users?
In short, the answer is likely no.
The core issue lies in the benchmarks that AI firms use to assess their models’ strengths and weaknesses.
Inadequate Metrics
Many of the benchmarks currently utilized for AI models—particularly those for chatbot engines like OpenAI’s ChatGPT and Anthropic’s Claude—fail to reflect how typical users engage with these technologies. For instance, one benchmark mentioned by Anthropic, GPQA (“A Graduate-Level Google-Proof Q&A Benchmark”), includes hundreds of Ph.D.-level questions in biology, physics, and chemistry. In contrast, most users turn to chatbots for tasks like handling emails, writing cover letters, or discussing personal matters.
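For context, a static benchmark like GPQA is scored in a fairly mechanical way: the model answers a fixed set of multiple-choice questions, its picks are compared against a gold answer key, and the result is a single accuracy figure. The Python sketch below illustrates that general pattern only; the ask_model stub and the two sample items are hypothetical placeholders, not drawn from GPQA or from any vendor's evaluation code.

```python
# Minimal sketch of how a static multiple-choice benchmark is typically scored.
# ask_model() is a stand-in for a real chatbot API call; the sample items are
# illustrative and not taken from GPQA.

def ask_model(question: str, choices: list[str]) -> str:
    """Placeholder: a real harness would prompt the model and parse its answer."""
    return "A"  # stub so the sketch runs end to end

def benchmark_accuracy(items: list[dict]) -> float:
    correct = 0
    for item in items:
        prediction = ask_model(item["question"], item["choices"])
        if prediction.strip().upper() == item["answer"]:
            correct += 1
    # The "state-of-the-art" claim ultimately rests on this one number.
    return correct / len(items)

sample_items = [
    {"question": "Which quantum number determines an orbital's shape?",
     "choices": ["A) n", "B) l", "C) m_l", "D) m_s"],
     "answer": "B"},
    {"question": "Which organelle carries out oxidative phosphorylation?",
     "choices": ["A) Ribosome", "B) Golgi apparatus", "C) Mitochondrion", "D) Lysosome"],
     "answer": "C"},
]

print(f"accuracy: {benchmark_accuracy(sample_items):.0%}")
```

A leaderboard score of this kind says how often a model picks the right letter on exam-style questions; it says nothing about how the model handles an email thread or a cover letter.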
Jesse Dodge, a scientist at the Allen Institute for AI, describes this situation as an "evaluation crisis."
"Benchmarks tend to be static and focused on assessing a single aspect, such as a model’s accuracy in a specific domain, or its ability to tackle mathematical reasoning problems," Dodge shared in an interview. "Many of these benchmarks are over three years old, created when AI systems were primarily used for research and had far fewer real-world applications. Moreover, generative AI is used creatively in countless ways."
Outdated Standards
The most popular benchmarks aren't entirely useless; they might matter to the rare user who asks a chatbot to work through advanced math problems, for example. But as generative AI models are increasingly pitched as versatile, all-encompassing tools, these established metrics become less and less relevant.
David Widder, a postdoctoral researcher at Cornell focused on AI and ethics, points out that many common benchmarks assess skills that are unrelated to most users’ needs, such as solving elementary math problems or identifying anachronisms in text.
"Earlier AI systems often targeted specific problems in defined contexts, allowing for a more nuanced understanding of what good performance meant in those settings," Widder explained. "As systems are increasingly viewed as ‘general-purpose,’ achieving that understanding becomes more difficult, leading to a reliance on various benchmarks across different fields."
Flaws in Assessment
Beyond the misalignment with real-world use cases, there are concerns regarding whether existing benchmarks accurately measure what they claim to.
An analysis of HellaSwag, a benchmark designed to evaluate commonsense reasoning, found that more than a third of its questions contained errors or confusing phrasing. Meanwhile, MMLU (Massive Multitask Language Understanding), a benchmark that companies including Google, OpenAI, and Anthropic point to as evidence of their models' reasoning abilities, contains questions that can be answered through rote memorization rather than genuine understanding.
"[Benchmarks like MMLU focus more on memorization than true comprehension]," Widder stated. "I can quickly retrieve a relevant article to answer a question, but that doesn’t mean I grasp the underlying principle or could apply that understanding to solve complex problems in new contexts. A model struggles with that as well."
Repairing the System
So, what’s the solution to this benchmarking issue?
Dodge believes fixes are possible through increased human involvement. "The way forward combines evaluation benchmarks with human feedback," he suggested. "This involves prompting a model with real user queries and then hiring evaluators to assess the quality of the responses."
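As a rough illustration of what Dodge is describing, the sketch below pairs prompts drawn from real usage with human ratings rather than a fixed answer key. The generate_response and collect_rating functions are hypothetical stand-ins for a model API and an evaluator-facing rating tool; this is a sketch of the general idea, not the Allen Institute's actual process.

```python
# Rough sketch of the human-feedback evaluation Dodge describes: the model answers
# prompts drawn from real usage, and hired evaluators grade each response.
# generate_response() and collect_rating() are hypothetical placeholders.

from statistics import mean

def generate_response(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    return f"(model response to: {prompt})"

def collect_rating(prompt: str, response: str) -> int:
    """Placeholder: in practice a human evaluator scores the response, e.g. 1-5."""
    return 3  # stub value so the sketch runs end to end

def human_eval(prompts: list[str]) -> float:
    ratings = [collect_rating(p, generate_response(p)) for p in prompts]
    # The average human rating is reported alongside, not instead of, benchmark scores.
    return mean(ratings)

real_user_prompts = [
    "Draft a polite follow-up email to a recruiter.",
    "Help me outline a cover letter for a marketing role.",
]

print(f"mean evaluator rating: {human_eval(real_user_prompts):.1f}")
```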
Conversely, Widder is less optimistic about the potential for current benchmarks—even with adjustments for obvious errors, like typos—to become genuinely useful for most generative AI users. Instead, he suggests that evaluations should focus on the broader impacts of models and whether those impacts are deemed desirable by users.
"I would recommend identifying specific contextual goals for AI models and determining whether they can achieve these objectives," he advised. "It’s also crucial to consider whether we should even utilize AI in these contexts in the first place."