In a surprising revelation, AI systems may not be as secure as their developers claim. The UK government's AI Safety Institute (AISI) recently reported that four undisclosed large language models (LLMs) tested were "highly vulnerable to basic jailbreaks." Notably, some unjailbroken models produced "harmful outputs" even without intentional manipulation by researchers.
While most publicly available LLMs come equipped with safeguards to prevent harmful or illegal responses, jailbreaking refers to the act of tricking the model into bypassing these protections. AISI employed prompts from a standardized evaluation framework, as well as proprietary prompts, revealing that the models generated harmful responses to several questions, even without attempts to jailbreak. After conducting "relatively simple attacks," AISI found that the models answered between 98% and 100% of harmful queries.
UK Prime Minister Rishi Sunak unveiled plans for the AISI in late October 2023, with its official launch on November 2. The institute aims to "carefully test new types of frontier AI both before and after their release" to investigate the potentially harmful capabilities of AI models. This includes assessing risks ranging from social issues like bias and misinformation to extreme scenarios, such as humanity losing control over AI.
The AISI's report emphasizes that existing safety measures for these LLMs are inadequate. The Institute intends to conduct further testing on additional AI models and develop enhanced evaluations and metrics to address each area of concern effectively.