While large language models (LLMs) are becoming more capable of handling complex tasks, they often struggle to provide accurate answers on the first attempt. This has led to increased interest in enhancing their ability to identify and correct mistakes—a process known as "self-correction." However, existing self-correction methods are limited and frequently fail to meet real-world requirements.
In a groundbreaking paper, researchers at Google DeepMind introduce Self-Correction via Reinforcement Learning (SCoRe), a novel approach that significantly boosts the self-correction abilities of LLMs using only self-generated data. SCoRe is poised to enhance the reliability and robustness of LLMs, opening new avenues for improving their reasoning and problem-solving skills.
"Self-correction greatly enhances human thinking," says Aviral Kumar, a research scientist at Google DeepMind. "Humans often take time to contemplate multiple ideas and rectify their errors, ultimately arriving at the correct solution. We want LLMs to do the same."
An ideal LLM with strong self-correction capabilities should be able to assess and refine its own responses until it arrives at the correct answer. This is crucial because, although LLMs often have the necessary knowledge to solve problems, they may struggle to utilize it effectively in their initial responses.
"From a fundamental machine learning perspective, we do not expect LLMs to solve difficult problems in a single attempt," Kumar explains. "Therefore, we want LLMs to invest more computational effort into thinking and self-correcting to succeed with challenging problems."
Previous attempts at enabling self-correction in LLMs have relied on prompt engineering or fine-tuning models, typically requiring external feedback or guidance from an "oracle." These existing techniques often neglect the model’s intrinsic self-correction abilities. Supervised fine-tuning (SFT) methods, for instance, rely heavily on feedback from human annotators or stronger models, limiting their applicability in real-world scenarios. Additionally, SFT methods sometimes necessitate multiple models during inference for verification, complicating deployment.
DeepMind's research indicates that, while SFT can enhance a model's initial outputs, it falls short when the model must revise answers over several steps—a common requirement for complex problems. "By the end of training, a model may learn to rectify the base model's mistakes but still lack the capacity to detect its own errors," Kumar notes.
Another drawback of SFT is the potential for unintended behavior, whereby the model learns to produce its best guess on the first attempt and then make only minimal edits, even when that answer is incorrect. "SFT-trained models tend to default to a 'direct' strategy, rather than learning the self-correction process," he adds.
Advancements Through Reinforcement Learning
To address these limitations, the DeepMind researchers turned to reinforcement learning (RL).
"Current LLMs do not effectively perform self-correction," Kumar states. "They are not trained to reflect on past mistakes; instead, they aim to produce the best response to questions. Thus, we developed methods for self-correction."
SCoRe teaches a single model to both generate responses and correct its errors independently, without requiring external feedback. This is achieved through training solely on self-generated data, thereby eliminating dependence on outside information.
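In code, that multi-turn setup looks roughly like the sketch below: sample a first attempt, append an instruction to revise it, and sample a second attempt from the same model. This is an illustrative outline only; generate, collect_trace, and the wording of the correction instruction are placeholders, not APIs or prompts from the paper.

# Minimal sketch of collecting two-attempt, self-generated traces for
# self-correction training. `generate` stands in for a call to the LLM
# being trained; in practice it would sample from the model.

def generate(prompt: str) -> str:
    # Placeholder model call for illustration only.
    return "x = 42"

CORRECTION_INSTRUCTION = (
    "There may be an error in your previous answer. "
    "Review it and provide a corrected solution."
)

def collect_trace(question: str) -> dict:
    first_attempt = generate(question)
    correction_prompt = f"{question}\n{first_attempt}\n{CORRECTION_INSTRUCTION}"
    second_attempt = generate(correction_prompt)
    return {
        "question": question,
        "first_attempt": first_attempt,
        "second_attempt": second_attempt,
    }

if __name__ == "__main__":
    print(collect_trace("Solve for x: 2x = 84"))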
Prior RL approaches for self-correction primarily relied on single-turn interactions, resulting in behavior collapse, where the model overlooked self-correction commands in favor of providing a best-guess answer directly from memory. "Naive RL methods led to models ignoring the self-correction prompt, focusing only on producing a zero-shot response," Kumar says.
To combat behavior collapse, SCoRe employs a two-stage training process enhanced by regularization techniques. The first stage optimizes correction performance while ensuring the model's initial responses remain aligned with the base model's outputs. The second stage utilizes multi-turn RL to improve performance at both initial and subsequent attempts, incorporating a reward system that motivates the model to enhance its answers over multiple iterations.
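In rough Python, the two stages might be expressed as the sketch below, assuming an automatic checker that scores each attempt as correct (1.0) or incorrect (0.0). The coefficient values and function names are assumptions for illustration, not the paper's exact formulation.

# Sketch of SCoRe-style training objectives (illustrative coefficients).
# r1, r2: correctness of the first and second attempts (0.0 or 1.0).
# kl_first_turn: divergence between the trained policy's first-attempt
# distribution and the base model's, which keeps attempt one anchored
# during stage one.

def stage_one_objective(r2: float, kl_first_turn: float, beta: float = 0.1) -> float:
    # Maximize second-attempt correctness while constraining the first
    # attempt to stay close to the base model's behavior.
    return r2 - beta * kl_first_turn

def stage_two_reward(r1: float, r2: float, alpha: float = 0.5) -> float:
    # Reward both attempts, plus a shaping bonus proportional to the
    # improvement from the first to the second attempt, which discourages
    # producing a "best guess" up front and leaving it untouched.
    progress_bonus = alpha * (r2 - r1)
    return r1 + r2 + progress_bonus

if __name__ == "__main__":
    print(stage_one_objective(r2=1.0, kl_first_turn=2.0))  # 0.8
    # A wrong first attempt that gets corrected earns the bonus...
    print(stage_two_reward(r1=0.0, r2=1.0))  # 1.5
    # ...while flipping a correct answer to a wrong one is penalized.
    print(stage_two_reward(r1=1.0, r2=0.0))  # 0.5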
"This dual approach ensures that the model does not merely learn to yield the best first response and minimally adjust it," the researchers explain. "Overall, SCoRe effectively harnesses the base model's knowledge for positive self-correction."
SCoRe in Action
The DeepMind researchers evaluated SCoRe against existing self-correction methods using self-generated data, emphasizing math and coding tasks with benchmarks like MATH, MBPP, and HumanEval.
SCoRe demonstrated significant improvements in the self-correction capabilities of the Gemini 1.0 Pro and 1.5 Flash models, achieving a 15.6% absolute gain on the MATH benchmark and a 9.1% gain on HumanEval compared to the base model, surpassing other self-correction techniques.
The most notable improvement was the model's ability to refine its mistakes from the first to the second attempt while minimizing incorrect alterations to correct answers. SCoRe also proved highly efficient when combined with inference-time scaling strategies, further enhancing performance by distributing the same inference budget across multiple correction rounds.
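To make "distributing the budget across correction rounds" concrete, the loop below revises an answer up to a fixed number of times instead of drawing that many independent samples. It is a hedged sketch: generate, looks_correct, and the revision prompt are hypothetical stand-ins, not the paper's implementation.

# Sketch: spend a fixed budget of model calls on sequential correction
# rounds rather than independent parallel samples.

def generate(prompt: str) -> str:
    # Placeholder model call for illustration only.
    return "candidate answer"

def looks_correct(answer: str) -> bool:
    # A cheap self-check, e.g. running unit tests for generated code.
    # Always False here so the toy example exercises every round.
    return False

def self_correct(question: str, budget: int = 4) -> str:
    answer = generate(question)
    for _ in range(budget - 1):
        if looks_correct(answer):
            break
        prompt = (f"{question}\nPrevious answer:\n{answer}\n"
                  "There may be a mistake. Provide an improved answer.")
        answer = generate(prompt)
    return answer

if __name__ == "__main__":
    print(self_correct("Write a function that reverses a string."))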
While the research primarily focuses on coding and reasoning tasks, the team believes SCoRe can have wider applications. "Imagine models capable of recognizing potentially unsafe outputs and independently improving them before user visibility," Kumar suggests.
This work underscores the importance of teaching LLMs how to reason and self-correct, rather than merely mapping inputs to outputs, paving the way for more capable and reliable AI systems.