Understanding the Vulnerability of LLMs to the 'Butterfly Effect'

Prompting is how we engage with generative AI and large language models (LLMs) to elicit responses. It’s an art form aimed at obtaining ‘accurate’ answers.

But do variations in prompts actually change a model's decisions and its accuracy?


Research from the University of Southern California Information Sciences Institute indicates a resounding yes.

Even minor adjustments—like adding a space at the beginning of a prompt or phrasing a statement as a directive instead of a question—can significantly alter an LLM's output. More concerning, using specific commands or jailbreak techniques may lead to “cataclysmic effects” on the data these models generate.

Researchers liken this sensitivity to the butterfly effect in chaos theory, where small changes, like a butterfly flapping its wings, can eventually trigger a tornado.

In prompting, “each step requires a series of decisions from the person designing the prompt,” the researchers note, yet “little attention has been paid to how sensitive LLMs are to variations in these decisions.”

Exploring ChatGPT with Different Prompting Techniques

In research sponsored by the Defense Advanced Research Projects Agency (DARPA), the researchers focused on ChatGPT and tested four distinct prompting methods (a rough code sketch of how such variants might be generated follows the list).

1. Specified Output Formats: The LLM was prompted to respond in formats such as Python List, ChatGPT's JSON Checkbox, CSV, XML, or YAML.

2. Minor Variations: This method involved slight changes to prompts, such as:

- Adding a space at the beginning or end.

- Starting with greetings like “Hello” or “Howdy.”

- Ending with phrases like “Thank you.”

- Rephrasing questions as commands, e.g., “Which label is best?” to “Select the best label.”

3. Jailbreak Techniques: Prompts included:

- AIM: A jailbreak that leads to immoral or harmful responses by simulating conversations with notorious figures.

- Dev Mode v2: A command to generate unrestricted content.

- Evil Confidant: This prompts the model to deliver unethical responses.

- Refusal Suppression: A strategy that manipulates the model to avoid certain words and constructs.

4. Financial Tipping: Researchers tested if mentioning tips (e.g., “I won’t tip, by the way” vs. offering tips of $1, $10, $100, or $1,000) influenced output.
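For illustration, here is a minimal Python sketch, under the assumption of a simple classification prompt, of how output-format, minor-variation, and tipping variants like those above could be generated. BASE_PROMPT, build_variants, and the exact wording are hypothetical and are not the researchers' actual prompts; the jailbreak variants are deliberately omitted.

```python
# Hypothetical sketch of how prompt variants like those in the study
# could be constructed. Names and wording are illustrative only.

BASE_PROMPT = "Which label is best for the following text?\nText: {text}"

# 1. Specified output formats
FORMAT_SUFFIXES = {
    "python_list": "Respond only with a Python list of labels.",
    "json":        "Respond only with a JSON object containing the label.",
    "csv":         "Respond only in CSV format.",
    "xml":         "Respond only in XML format.",
    "yaml":        "Respond only in YAML format.",
}

# 2. Minor variations
def minor_variants(prompt: str) -> dict:
    return {
        "leading_space":  " " + prompt,
        "trailing_space": prompt + " ",
        "greeting":       "Hello. " + prompt,
        "thank_you":      prompt + " Thank you.",
        # Rephrase the question as a command.
        "as_command":     prompt.replace(
            "Which label is best for the following text?",
            "Select the best label for the following text."),
    }

# 4. Financial tipping statements appended to the prompt
TIP_SUFFIXES = [
    "I won't tip, by the way.",
    "I'm going to tip $1 for a perfect response!",
    "I'm going to tip $10 for a perfect response!",
    "I'm going to tip $100 for a perfect response!",
    "I'm going to tip $1,000 for a perfect response!",
]

def build_variants(text: str) -> dict:
    base = BASE_PROMPT.format(text=text)
    variants = {"baseline": base}
    variants.update({f"format_{k}": f"{base} {v}" for k, v in FORMAT_SUFFIXES.items()})
    variants.update(minor_variants(base))
    variants.update({f"tip_{i}": f"{base} {s}" for i, s in enumerate(TIP_SUFFIXES)})
    return variants  # 3. jailbreak variants intentionally omitted

if __name__ == "__main__":
    for name, prompt in build_variants("The movie was 'great'... I walked out halfway.").items():
        print(f"{name}: {prompt!r}")
```

Each variant would then be sent to the model and the responses compared against a baseline run, which is how the study quantified the effect of each perturbation.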

Effects on Accuracy and Predictions

Across 11 classification tasks—ranging from true-false questions to sarcasm detection—the researchers observed how variations impacted prediction accuracy.

Key findings revealed that simply specifying an output format caused at least 10% of predictions to change. Using ChatGPT's JSON Checkbox feature produced even more prediction changes than specifying JSON in the prompt alone.

Furthermore, selecting YAML, XML, or CSV resulted in a 3-6% drop in accuracy compared to Python List, with CSV performing the poorest.

Minor perturbations were particularly impactful, with simple changes like adding a space leading to over 500 prediction changes. Greeting additions or thank-yous similarly influenced outputs.

“While the impact of our perturbations is less than altering the entire output format, many predictions still change,” researchers concluded.
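To make numbers like these concrete, here is a minimal sketch, assuming simple lists of baseline predictions, perturbed predictions, and gold labels, of how prediction changes and accuracy deltas could be tallied. The function name and toy labels are illustrative, not the study's code.

```python
# Minimal sketch (not the study's code) of counting prediction flips and
# accuracy shifts between a baseline prompt and a perturbed prompt.

def compare_runs(baseline_preds, perturbed_preds, gold_labels):
    """Count how many predictions flip and how accuracy shifts."""
    assert len(baseline_preds) == len(perturbed_preds) == len(gold_labels)
    n = len(gold_labels)
    changed = sum(b != p for b, p in zip(baseline_preds, perturbed_preds))
    base_acc = sum(b == g for b, g in zip(baseline_preds, gold_labels)) / n
    pert_acc = sum(p == g for p, g in zip(perturbed_preds, gold_labels)) / n
    return {
        "prediction_changes": changed,
        "baseline_accuracy": base_acc,
        "perturbed_accuracy": pert_acc,
        "accuracy_delta": pert_acc - base_acc,
    }

# Example with toy sarcasm-detection labels
print(compare_runs(
    baseline_preds=["sarcastic", "literal", "sarcastic", "literal"],
    perturbed_preds=["literal", "literal", "sarcastic", "sarcastic"],
    gold_labels=["sarcastic", "literal", "sarcastic", "literal"],
))
```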

Concerns with Jailbreaks

The experiment also highlighted significant performance drops associated with specific jailbreaks. AIM and Dev Mode v2 resulted in invalid responses for about 90% of predictions, primarily due to the model's common rejection phrase: “I’m sorry, I cannot comply with that request.”

Refusal Suppression and Evil Confidant caused over 2,500 prediction changes, with Evil Confidant yielding low accuracy and Refusal Suppression leading to a 10% accuracy decline, underscoring the instability in seemingly harmless jailbreak methods.

Notably, the study found little effect from financial incentives. “There were minimal performance changes between specifying a tip versus stating that no tip would be given,” the researchers noted.

The Need for Consistency in LLMs

The researchers are still investigating why slight prompt changes cause significant output fluctuations, asking whether the instances whose answers change the most are the ones that confuse the model.

By focusing on tasks with human annotations, they explored how confusion relates to answer changes, finding it only partly explained the shifts.
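As a rough illustration of that check, the sketch below, assuming per-instance human annotations are available, compares average annotator agreement on instances whose predictions flip versus those that stay stable. All names here (annotator_agreement, confusion_vs_flips) are hypothetical and do not reflect the paper's actual analysis.

```python
# Illustrative sketch (an assumption, not the paper's analysis) of checking
# whether instances that flip under a perturbation are also the ones
# human annotators disagreed on.

from collections import Counter

def annotator_agreement(annotations):
    """Fraction of annotators who chose the majority label for one instance."""
    counts = Counter(annotations)
    return counts.most_common(1)[0][1] / len(annotations)

def mean(xs):
    return sum(xs) / len(xs) if xs else float("nan")

def confusion_vs_flips(per_instance_annotations, baseline_preds, perturbed_preds):
    flipped_agreement, stable_agreement = [], []
    for anns, b, p in zip(per_instance_annotations, baseline_preds, perturbed_preds):
        (flipped_agreement if b != p else stable_agreement).append(annotator_agreement(anns))
    # If flipped instances show markedly lower agreement, human "confusion"
    # would explain the shifts; the study found it explains them only in part.
    return {"flipped_mean_agreement": mean(flipped_agreement),
            "stable_mean_agreement": mean(stable_agreement)}

print(confusion_vs_flips(
    per_instance_annotations=[["sarcastic", "sarcastic", "literal"],
                              ["literal", "literal", "literal"]],
    baseline_preds=["sarcastic", "literal"],
    perturbed_preds=["literal", "literal"],
))
```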

As the researchers pointed out, an essential next step lies in developing LLMs that resist variations to deliver consistent answers. This requires a deeper understanding of why minor tweaks lead to unpredictable responses and discovering ways to anticipate them.

In their words, “This analysis becomes increasingly crucial as ChatGPT and other large language models are integrated into systems at scale.”
