Unlocking the Truth: Why Most AI Benchmarks Provide Minimal Insight

On Tuesday, startup Anthropic introduced a suite of generative AI models that it claims deliver top-tier performance. Shortly after, competitor Inflection AI launched a model that it argues comes close to matching some of the leading models, including OpenAI’s GPT-4, in capability and quality.

Anthropic and Inflection are not the first AI companies to assert that their models surpass the competition based on certain benchmarks. Google made similar claims with its Gemini models at launch, while OpenAI promoted GPT-4 and its predecessors—GPT-3, GPT-2, and GPT-1—as high achievers in the field.

But what do these assertions about "state-of-the-art performance" really mean? More importantly: Will a model that statistically "performs" better than another actually provide a noticeable improvement for users?

In short, the answer is likely no.

The core issue lies in the benchmarks that AI firms use to assess their models’ strengths and weaknesses.

Inadequate Metrics

Many of the benchmarks currently utilized for AI models—particularly those for chatbot engines like OpenAI’s ChatGPT and Anthropic’s Claude—fail to reflect how typical users engage with these technologies. For instance, one benchmark mentioned by Anthropic, GPQA (“A Graduate-Level Google-Proof Q&A Benchmark”), includes hundreds of Ph.D.-level questions in biology, physics, and chemistry. In contrast, most users turn to chatbots for tasks like handling emails, writing cover letters, or discussing personal matters.

Jesse Dodge, a scientist at the Allen Institute for AI, describes this situation as an "evaluation crisis."

"Benchmarks tend to be static and focused on assessing a single aspect, such as a model’s accuracy in a specific domain, or its ability to tackle mathematical reasoning problems," Dodge shared in an interview. "Many of these benchmarks are over three years old, created when AI systems were primarily used for research and had far fewer real-world applications. Moreover, generative AI is used creatively in countless ways."

Outdated Standards

While the most popular benchmarks are not entirely ineffective—they may be useful for answering advanced math questions in some cases—the growing push for generative AI models to serve as versatile, all-encompassing tools renders these established metrics less relevant.

David Widder, a postdoctoral researcher at Cornell focused on AI and ethics, points out that many common benchmarks assess skills that are unrelated to most users’ needs, such as solving elementary math problems or identifying anachronisms in text.

"Earlier AI systems often targeted specific problems in defined contexts, allowing for a more nuanced understanding of what good performance meant in those settings," Widder explained. "As systems are increasingly viewed as ‘general-purpose,’ achieving that understanding becomes more difficult, leading to a reliance on various benchmarks across different fields."

Flaws in Assessment

Beyond the misalignment with real-world use cases, there are concerns regarding whether existing benchmarks accurately measure what they claim to.

An analysis of HellaSwag, a benchmark designed to evaluate commonsense reasoning, revealed that over a third of its questions contained errors or confusing phrasing. Additionally, MMLU (Massive Multitask Language Understanding), a benchmark often cited by Google, OpenAI, and Anthropic as evidence of their models’ reasoning abilities, features questions that can be answered through rote memorization rather than genuine understanding.

"[Benchmarks like MMLU focus more on memorization than true comprehension]," Widder stated. "I can quickly retrieve a relevant article to answer a question, but that doesn’t mean I grasp the underlying principle or could apply that understanding to solve complex problems in new contexts. A model struggles with that as well."

Repairing the System

So, what’s the solution to this benchmarking issue?

Dodge believes fixes are possible through increased human involvement. "The way forward combines evaluation benchmarks with human feedback," he suggested. "This involves prompting a model with real user queries and then hiring evaluators to assess the quality of the responses."
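
In practice, the workflow Dodge describes boils down to a simple loop: gather real user prompts, generate model responses, and have paid human evaluators rate them. The Python sketch below is a hypothetical outline of that loop; the sample prompts, the 1-to-5 rating scale, and the placeholder functions are assumptions for illustration, not a published evaluation protocol.

```python
# Minimal sketch of a human-in-the-loop evaluation: real user prompts go to
# the model, and human evaluators rate each response. All names and values
# here are illustrative placeholders.

import statistics

def generate_response(prompt: str) -> str:
    """Placeholder for a call to whatever model is being evaluated."""
    return f"(model output for: {prompt})"

def rate_response(prompt: str, response: str) -> int:
    """Placeholder for a human evaluator's 1-5 quality rating."""
    return 4  # in practice, collected from trained raters, not hard-coded

# Hypothetical sample of everyday user queries, unlike Ph.D.-level benchmark items.
user_prompts = [
    "Reply politely to this email declining the meeting.",
    "Draft a cover letter for a junior data analyst role.",
]

ratings = [rate_response(p, generate_response(p)) for p in user_prompts]
print(f"mean human rating: {statistics.mean(ratings):.1f} / 5")
```

The point of the exercise is what gets measured: responses to the tasks people actually bring to chatbots, judged by people, rather than accuracy on a frozen exam.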

Conversely, Widder is less optimistic about the potential for current benchmarks—even with adjustments for obvious errors, like typos—to become genuinely useful for most generative AI users. Instead, he suggests that evaluations should focus on the broader impacts of models and whether those impacts are deemed desirable by users.

"I would recommend identifying specific contextual goals for AI models and determining whether they can achieve these objectives," he advised. "It’s also crucial to consider whether we should even utilize AI in these contexts in the first place."
