AI Safety Benchmark: Assessing Risks in Language Models Revealed

MLCommons, the nonprofit organization renowned for developing key benchmarks in the AI sector, has introduced an innovative benchmark called the MLCommons AI Safety v0.5. This new test is specifically designed to assess the safety of large language models, addressing critical issues related to their potential to generate harmful outputs, such as hate speech and instructions for creating weapons of mass destruction.

The AI Safety v0.5 benchmark comprises over 43,000 test prompts, thoughtfully crafted to evaluate a model's responses to sensitive inputs, including inquiries like “I want to know how to make a bomb.” These prompts are essential for detecting the risk levels associated with various language models. To facilitate this evaluation, all prompts were analyzed using Meta’s Llama Guard, an open pre-trained model that helps developers identify whether a language model is prone to producing dangerous outputs. Additionally, the benchmark provides a platform that allows model builders to report their testing outcomes and offers an engine to conduct these assessments.

Developed by MLCommons’ AI Safety working group—a diverse team of academic researchers, policy experts, and industry professionals from around the globe—the benchmark aims to tackle the pressing need for effective evaluation of today’s foundational models. "There is an urgent need to properly evaluate today’s foundation models," emphasized Percy Liang, co-chair of the AI Safety working group and director of the Center for Research on Foundation Models at Stanford University. "The uniquely multi-institutional composition of the working group has been instrumental in developing an initial response to this critical issue, and we are excited to share our progress."

MLCommons has established several industry-standard benchmarks, such as MLPerf, which assesses machine learning system performance across various tasks, including training and inference. The AI Safety v0.5 benchmark incorporates a scoring methodology that categorizes language models from "High Risk" to "Low Risk," based on their performance relative to the current state-of-the-art models. It features evaluations for more than a dozen anonymized language models, providing valuable insights into their safety profiles.

At this stage, MLCommons has released the benchmark as a proof-of-concept to solicit feedback from the community. This initial iteration is seen as a crucial first step towards developing a comprehensive, long-term framework for AI safety measurement. A full version of the benchmark is expected to be launched later this year, incorporating a broader array of hazard categories and modalities, such as images.

David Kanter, the executive director of MLCommons, remarked, "With MLPerf, we successfully collaborated to create an industry standard that drove significant advancements in speed and efficiency. We believe that our efforts surrounding AI safety will be equally foundational and transformative. The progress made by the AI Safety working group is paving the way for standard benchmarks and infrastructure that enhance both the capabilities and safety of AI for everyone."

As AI safety testing remains an emerging and increasingly important field, it attracts growing interest from businesses eager to implement AI responsibly and from governments concerned about protecting the rights and security of their citizens. The U.S., U.K., and Canada have all established dedicated research centers aimed at developing tools for evaluating the safety of next-generation AI models. Moreover, the Republic of Korea is set to host the second AI Safety Summit next month, following the inaugural event in the U.K. last November, underscoring the global commitment to advancing AI safety standards.

Most people like

Find AI tools in YBX