A collaborative team from Abacus.AI, New York University, Nvidia, the University of Maryland, and the University of Southern California has introduced LiveBench, a benchmark designed to overcome significant limitations of existing industry standards. LiveBench serves as a general-purpose evaluation tool for large language models (LLMs), providing test questions free of the contamination that prior benchmarks often suffer from once their questions circulate widely and end up in model training data.
What is a Benchmark?
A benchmark is a standardized test that assesses the performance of AI models through a series of tasks or metrics. It allows researchers and developers to compare results, track advancements, and understand the capabilities of different models.
LiveBench is particularly noteworthy as it includes contributions from AI luminary Yann LeCun, Meta's chief AI scientist, alongside Colin White, Head of Research at Abacus.AI, and several other leading researchers. Micah Goldblum, a key contributor, emphasized the need for better LLM benchmarks, explaining that the initiative was driven by the need for freshly generated, diverse questions that eliminate test-set contamination.
LiveBench: Key Highlights
The rise of LLMs has underscored the inadequacy of traditional machine learning benchmarks. Most benchmarks are publicly available, and many modern LLMs incorporate vast portions of internet data during training. Consequently, if an LLM encounters benchmark questions during training, its performance can appear artificially high, raising concerns about the reliability of such evaluations.
LiveBench aims to address these shortcomings by releasing updated questions each month sourced from a variety of recent datasets, math competitions, arXiv papers, news stories, and IMDb movie synopses. Currently, there are 960 questions available, each with a verifiable, objective answer that permits accurate scoring without LLM judges.
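Because every question ships with a verifiable, objective answer, scoring can often be reduced to normalizing and comparing strings rather than asking another LLM to judge. The sketch below illustrates that idea; the function names and the assumption that an exact match suffices are illustrative, not LiveBench's actual scoring code.

```python
# Illustrative sketch of judge-free scoring: each question carries a ground-truth
# answer, so a normalized exact-match check is enough for many task types.
# Function names and the exact-match assumption are hypothetical, not LiveBench's code.

def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace so formatting noise doesn't matter."""
    return " ".join(text.strip().lower().split())

def score_exact_match(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0."""
    return 1.0 if normalize(model_output) == normalize(ground_truth) else 0.0

def success_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (model_output, ground_truth) pairs scored as correct."""
    scores = [score_exact_match(out, ans) for out, ans in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```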
Task Categories
LiveBench features 18 tasks across six categories, utilizing continuously updated information sources to enhance question diversity and challenge. Below are the task categories:
- Math: Questions sourced from high school math competitions and advanced AMPS problems.
- Coding: Includes code generation and a novel code completion task.
- Reasoning: Challenging scenarios drawn from Big-Bench Hard’s Web of Lies and positional reasoning.
- Language Comprehension: Tasks involving word puzzles, typo removal, and movie synopsis unscrambling.
- Instruction Following: Four tasks focused on paraphrasing, summarizing, and story generation based on recent articles.
- Data Analysis: Tasks that reformat tables, identify joinable columns, and predict column types using recent datasets.
Models are scored on their per-task success rates, and tasks are calibrated so that even top models land roughly between 30% and 70%, keeping the benchmark challenging without being impossible.
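One plausible way to roll those per-task success rates up into the category and overall averages shown on the leaderboard is a simple mean of means. The snippet below is a sketch under that assumption, with made-up placeholder scores; it is not LiveBench's published aggregation code.

```python
# Hypothetical aggregation sketch: average task success rates within each category,
# then average the category means for an overall score. The task names and scores
# below are placeholder values for illustration only.
from statistics import mean

task_scores = {
    "math":      {"competition": 0.41, "AMPS_hard": 0.55},
    "reasoning": {"web_of_lies": 0.38, "spatial": 0.47},
}

category_scores = {cat: mean(scores.values()) for cat, scores in task_scores.items()}
overall = mean(category_scores.values())

print(category_scores)              # per-category averages
print(round(overall * 100, 2))      # overall score on a 0-100 scale
```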
LiveBench LLM Leaderboard
As of June 12, 2024, LiveBench has evaluated numerous prominent LLMs, revealing that top models have achieved less than 60% accuracy. For instance, OpenAI's GPT-4o leads with an average score of 53.79, followed closely by GPT-4 Turbo at 53.34.
Implications for Business Leaders
Navigating the AI landscape presents challenges for business leaders, particularly in selecting the right LLM. Benchmarks offer reassurance regarding model performance but often fail to provide a complete picture. Goldblum highlights that LiveBench simplifies model comparison, mitigating concerns around data contamination and bias in human evaluations.
Comparison with Existing Benchmarks
The LiveBench team has conducted analyses alongside established benchmarks like LMSYS's Chatbot Arena and Arena-Hard. While LiveBench's rankings generally align with those benchmarks, specific discrepancies point to potential biases inherent in LLM-based judging.
LiveBench is designed as an open-source tool, allowing anyone to use, contribute to, and expand its capabilities. As White notes, effective benchmarks are essential for developing high-performing LLMs, which in turn accelerates model innovation.
Developers can access LiveBench's code via GitHub and utilize its datasets on Hugging Face.
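As a quick start, the question sets can be pulled with the Hugging Face `datasets` library. The dataset path below ("livebench/reasoning") is assumed from the project's Hugging Face organization and should be checked against the GitHub README, which lists the authoritative dataset names and splits.

```python
# Minimal sketch: pull one LiveBench category from Hugging Face and inspect it.
# The dataset path and split are assumptions; verify them in the project's README.
from datasets import load_dataset

reasoning = load_dataset("livebench/reasoning", split="test")

print(len(reasoning))          # number of questions in the category
print(reasoning.column_names)  # inspect the available fields rather than assuming them
print(reasoning[0])            # look at a single question record
```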