Key Limitations in AI Safety Evaluations You Should Know About

AI Safety Evaluations: Are Current Benchmarks Enough?

As demand for AI safety and accountability rises, a recent report questions the effectiveness of existing tests and benchmarks.

Generative AI models—capable of creating and analyzing text, images, music, and videos—are under intensified scrutiny due to their propensity for errors and unpredictable behavior. In response, organizations ranging from government agencies to major tech companies are proposing new benchmarks to evaluate the safety of these models.

At the end of last year, Scale AI launched a dedicated lab focused on assessing model alignment with safety guidelines. Recently, the National Institute of Standards and Technology (NIST) and the U.K. AI Safety Institute introduced tools aimed at evaluating model risks. However, these assessment methods may not be sufficient.

The Ada Lovelace Institute (ALI), a U.K.-based nonprofit focused on AI research, conducted a comprehensive study interviewing experts from academic labs, civil society, and vendors producing AI models. The findings revealed that while current evaluations offer some insights, they are incomplete, easy to manipulate, and may not accurately predict real-world model behavior.

“Whether it's a smartphone, a prescription drug, or a vehicle, we expect the products we rely on to be both safe and reliable. Industries rigorously test products to confirm safety before deployment,” stated Elliot Jones, senior researcher at ALI and co-author of the report. “Our study explores the limitations of current AI safety evaluations and how they can be effectively utilized by policymakers and regulators.”

Benchmark Shortcomings and Red-Teaming Challenges

The research team first reviewed existing academic literature to outline the potential risks posed by AI models and the current state of evaluations. They then interviewed 16 experts, including representatives from unnamed tech companies developing generative AI systems.

Significant disagreement emerged within the AI sector over the best methods and criteria for model evaluation. Some evaluations tested models only against lab benchmarks, neglecting their impact on real-world users. Others relied on tests developed for research purposes rather than for production models, even though vendors deployed those same models in real applications.

Previous discussions of AI benchmarks have raised similar concerns, which this study reaffirms.

Experts pointed out the difficulty of predicting a model's real-world performance from benchmark results alone, questioning whether these metrics reliably demonstrate specific capabilities. For instance, a model may excel on a state bar exam yet fail to handle open-ended legal problems effectively.

Experts also raised the issue of data contamination: if a model has been trained on the same data later used to test it, its benchmark scores will overstate its true capabilities. The choice of benchmarks, they noted, is often driven by convenience rather than by how well they measure what matters.
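The report itself does not prescribe a detection method, but the basic idea behind a contamination check can be illustrated with a simple overlap test between benchmark items and training text. The sketch below is a minimal Python illustration; the function names, the 8-word n-gram window, the threshold-free flagging, and the toy data are all assumptions made for demonstration, not part of the ALI study or any specific benchmark's tooling.

```python
# Minimal sketch of a train/test contamination check via word n-gram overlap.
# All names, the n=8 window, and the toy data are illustrative assumptions.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_items: list[str], training_corpus: str, n: int = 8) -> float:
    """Fraction of test items that share at least one n-gram with the training corpus."""
    train_grams = ngrams(training_corpus, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return flagged / len(test_items) if test_items else 0.0

if __name__ == "__main__":
    corpus = "the defendant may appeal the ruling within thirty days of the judgment being entered"
    tests = [
        "the defendant may appeal the ruling within thirty days of the judgment",  # overlaps corpus
        "summarize the key holdings of this unrelated contract dispute",           # does not
    ]
    print(f"Contaminated test items: {contamination_rate(tests, corpus, n=8):.0%}")  # expect 50%
```

A real contamination audit would need text normalization, deduplication, and a scalable index rather than an in-memory set, but the underlying principle is the same: any benchmark item substantially present in the training data inflates the resulting score.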

“Benchmarks risk manipulation, as developers may train models on the same datasets intended for assessment—akin to previewing the exam questions—while selectively choosing evaluations that suit their purposes,” remarked Mahi Hardalupas, ALI researcher and co-author of the study. “Moreover, even slight model version changes can lead to unpredictable behaviors and disrupt inherent safety mechanisms.”

The study also identified challenges with “red teaming,” where individuals or groups are tasked with testing models for vulnerabilities. Many companies, such as OpenAI and Anthropic, engage in red teaming, but the lack of standardized methods complicates the evaluation of effectiveness. Experts indicated that sourcing skilled red teamers can be challenging, and the labor-intensive nature of this approach poses barriers for smaller organizations lacking adequate resources.

Exploring Solutions for AI Safety

The pressure to release models quickly, coupled with a reluctance to run tests that could surface problems before release, is a primary obstacle to improving AI evaluations.

“As one interviewee from a foundation model company noted, there is considerable pressure to expedite model releases, which stifles serious evaluations,” Jones explained. “Leading AI labs are launching models at a pace that outstrips both their and society’s ability to guarantee safety and reliability.”

One interview subject in the ALI study described the challenge of evaluating model safety as “intractable.” So, how can the industry and regulators find a way forward?

Hardalupas believes a solution exists but requires increased involvement from public-sector entities. “Regulators and policymakers need to articulate their expectations for evaluations clearly,” she advised. “At the same time, the evaluation community should be transparent about their limitations and potential.”

She also proposes that governments require public participation in crafting evaluations and establish frameworks to support an ecosystem of independent tests, including regular access to necessary models and datasets.

Jones emphasizes the need for “context-specific” evaluations that move beyond simple prompt-response assessments to consider how various user demographics might interact with a model and how potential adversarial attacks could undermine safeguards.

“This approach necessitates investment in foundational evaluation science to develop more robust and replicable assessments based on a nuanced understanding of AI model operation,” he noted.

However, it’s essential to recognize that there may never be an absolute guarantee regarding a model's safety.

“As others have emphasized, ‘safety’ isn't an inherent property of models,” Hardalupas added. “Understanding whether a model is ‘safe’ involves comprehending its use contexts, the demographics of its users, and whether implemented safeguards are sufficient to mitigate risks. Evaluations can highlight potential risks but cannot assure absolute safety, as many interviewees acknowledged that evaluations can only suggest a model might be unsafe.”
