A new algorithm from the University of Pennsylvania, named Prompt Automatic Iterative Refinement (PAIR), aims to close safety loopholes in large language models (LLMs).
What PAIR Does
PAIR identifies "jailbreak" prompts that can deceive LLMs, enabling them to bypass safeguards designed to prevent harmful content generation. This algorithm distinguishes itself by effectively interacting with black-box models like ChatGPT and generating jailbreak prompts with fewer attempts. Additionally, the prompts produced by PAIR are interpretable and transferable across various LLMs, making it a valuable tool for enterprises looking to identify and rectify vulnerabilities promptly and cost-effectively.
Types of Jailbreaks
Jailbreaks generally fall into two categories: prompt-level and token-level.
- Prompt-level jailbreaks utilize meaningful deception and social engineering to manipulate LLM output. While they are interpretable, their design often requires significant human effort, limiting scalability.
- Token-level jailbreaks modify outputs by adding arbitrary tokens to optimize prompts. This approach can be automated, but it usually requires extensive querying, resulting in less interpretable outputs due to the added complexities.
PAIR aims to merge the interpretability of prompt-level jailbreaks with the automation efficiency of token-level techniques.
The PAIR Methodology
PAIR operates with two black-box LLMs: an attacker model and a target model. The attacker searches for prompts that can jailbreak the target model without needing human intervention. The researchers explain that both LLMs can creatively collaborate to identify effective jailbreak prompts.
Importantly, PAIR can function without direct access to model weights or gradients. It works with models accessed through APIs, including OpenAI’s ChatGPT, Google’s PaLM 2, and Anthropic’s Claude 2.
The process unfolds in four steps:
1. The attacker model receives instructions and generates a candidate prompt for a specific task, like drafting a phishing email.
2. This prompt is sent to the target model to generate a response.
3. A "judge" function, such as GPT-4, evaluates the relevance of the response against the prompt.
4. If the response is unsatisfactory, feedback is provided to the attacker, prompting a new attempt.
This loop continues until a successful jailbreak is discovered or a maximum number of attempts is reached, with the ability to process multiple candidate prompts simultaneously for enhanced efficiency.
Results and Effectiveness
In trials, the researchers used the open-source Vicuna model as the attacker against various targets, including ChatGPT, GPT-4, Claude 2, and more. The results showed PAIR successfully jailbroke GPT-3.5 and GPT-4 in 60% of cases and achieved total success with Vicuna-13B-v1.5. However, Claude models demonstrated high resilience, resisting jailbreak attempts.
One notable advantage of PAIR is its efficiency, achieving successful jailbreaks in as few as twenty queries with an average runtime of about five minutes. This is a remarkable improvement compared to traditional methods, which can require thousands of queries and significant time investment.
Moreover, the interpretable design of PAIR’s attacks enhances their transferability to other LLMs. For instance, prompts generated for Vicuna successfully transferred to other models, highlighting their shared vulnerability due to similar training processes.
Future Directions
Looking ahead, the researchers suggest refining PAIR to systematically create datasets for red teaming, allowing enterprises to fine-tune attacker models for improved speed and efficiency in securing their LLM systems.
Optimizing LLM Performance
PAIR is part of a broader trend that leverages LLMs as optimizers. Traditionally, users had to fine-tune prompts manually for optimal results. However, by reframing the prompting process into a structured challenge, algorithms can facilitate ongoing optimization of model outputs.
DeepMind recently introduced a similar approach called Optimization by PROmpting (OPRO), which uses LLMs to optimize problem-solving through natural language instructions. As language models evolve to optimize their own outputs more effectively, advancements in the LLM field could accelerate, paving the way for significant breakthroughs.