DeepMind's GenRM Enhances LLM Accuracy Through Self-Verification of Outputs

Large language models (LLMs) frequently make factual and logical errors, particularly on complex reasoning tasks. To mitigate this, researchers often use verifiers or reward models to assess a pool of candidate responses generated by an LLM and select the most accurate one.

A recent paper from researchers at Google DeepMind, the University of Toronto, Mila, and the University of California, Los Angeles introduces GenRM, an innovative approach that harnesses the generative capabilities of LLMs to enhance verification processes. GenRM serves as a valuable tool for LLM-based applications where traditional verification methods fall short.

Limitations of Classic Verifiers and Reward Models

A prevalent method to boost LLM accuracy involves generating multiple candidate answers and utilizing a distinct component to identify the best one. This necessitates a dependable verifier or reward model. Typically, LLM-based verifiers are trained as discriminative reward models (RMs) that assign numerical scores to assess candidate solutions as correct or incorrect. However, these RMs do not fully capitalize on LLMs' inherent strengths in generating and processing responses.
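For context, a discriminative reward model of this kind typically bolts a scalar scoring head onto an LLM backbone and trains it with a binary classification loss. The sketch below is a minimal illustration under that assumption; the class, interface, and loss helper are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminativeRM(nn.Module):
    """Illustrative discriminative reward model: an LLM backbone plus a scalar head."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                 # any module mapping token ids -> hidden states
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)        # (batch, seq_len, hidden_size)
        pooled = hidden[:, -1, :]                # score from the final token position
        return self.score_head(pooled).squeeze(-1)  # raw correctness logits, shape (batch,)

def rm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy against correct (1) / incorrect (0) labels."""
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```

Nothing in this setup asks the model to produce text, which is the gap the quote below points at.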

"Although classic reward models (RMs) / verifiers are trained by fine-tuning LLMs, they do not utilize the text generation capabilities that LLMs are fundamentally designed for," explains Rishabh Agarwal, a co-author of the paper and Senior Research Scientist at DeepMind.

Another common technique, LLM-as-a-Judge, uses advanced prompting to evaluate responses. While flexible, this approach lacks the task-specific capabilities that reward models acquire through training.

Generative Reward Models

DeepMind’s GenRM presents an alternative by training verifiers through next-token prediction, tapping into the generative strengths of LLMs. "Training RMs via next-token prediction allows them to leverage the myriad benefits of generative LLMs," says Agarwal. "We demonstrated that the same model can verify and generate solutions, employing chain-of-thought reasoning before verification to enhance accuracy."
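In practice, that unification means both solution generation and verification can be formatted as ordinary text-to-text training examples and optimized with the same next-token-prediction loss. The examples below are a minimal illustration; the prompt wording and data layout are assumptions, not the paper's exact templates.

```python
# Illustrative training examples for a generative verifier (formats are assumptions).
generation_example = {
    "input":  "Q: Natalia sold 48 clips in April and half as many in May. "
              "How many clips did she sell in total?\nA:",
    "target": " She sold 48 + 24 = 72 clips. The answer is 72.",
}

verification_example = {
    "input":  "Q: Natalia sold 48 clips in April and half as many in May. "
              "How many clips did she sell in total?\n"
              "Candidate solution: 48 + 48 = 96. The answer is 96.\n"
              "Is the answer correct? ",
    "target": "No",  # a CoT verifier would instead target "critique ... Is the answer correct? No"
}

# Both kinds of examples are trained with the same next-token-prediction
# (cross-entropy) loss used in standard supervised fine-tuning.
```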

In GenRM, the verification decision is expressed as a token. For example, to score a solution, the verifier is prompted with a question such as "Is the answer correct?" and the score is the probability the model assigns to a text token (e.g., "Yes" or "No") given that context.
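Concretely, a direct (non-CoT) generative verifier can be scored in a few lines. The sketch below assumes a Hugging Face-style causal LM; the checkpoint path, prompt wording, and function name are placeholders rather than anything from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: in practice this would be the fine-tuned verifier checkpoint.
tokenizer = AutoTokenizer.from_pretrained("path/to/genrm-checkpoint")
model = AutoModelForCausalLM.from_pretrained("path/to/genrm-checkpoint")

def direct_genrm_score(question: str, solution: str) -> float:
    """Score a candidate solution as the probability of the next token being 'Yes'."""
    prompt = f"{question}\n{solution}\nIs the answer correct? "
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits over the vocabulary
    probs = torch.softmax(next_token_logits, dim=-1)
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    return probs[yes_id].item()                            # correctness score in [0, 1]
```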

Given that verification often entails complex reasoning, generative verifiers can significantly benefit from advanced prompting techniques like chain-of-thought (CoT) reasoning, which encourages the model to lay out its thought process prior to arriving at an answer.

"Specifically, we can generate intermediate reasoning steps or critiques (CoT) before deciding on the solution's correctness, potentially uncovering subtle errors overlooked by direct verifiers," the researchers state.

The CoT rationales for training the GenRM model can be derived from either human input or another LLM. During inference, GenRM first produces a CoT rationale and then employs the probability of the "Yes" token to determine a correctness score.
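A minimal sketch of that two-step inference, reusing the tokenizer and model from the snippet above (again, the prompt wording is an assumption):

```python
def cot_genrm_score(question: str, solution: str) -> float:
    """Generate a CoT critique first, then read off P('Yes') as the score."""
    # Step 1: sample a chain-of-thought rationale critiquing the candidate solution.
    critique_prompt = f"{question}\n{solution}\nLet's verify the solution step by step.\n"
    inputs = tokenizer(critique_prompt, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=256, do_sample=True)
    rationale = tokenizer.decode(generated[0], skip_special_tokens=True)

    # Step 2: condition on the rationale and score P('Yes') for the final decision.
    verify_prompt = rationale + "\nIs the answer correct? "
    inputs = tokenizer(verify_prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    return probs[yes_id].item()
```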

To further improve the accuracy of CoT verifiers, the researchers used majority voting: they sampled multiple CoT chains and averaged the "Yes" score across all samples, making more effective use of test-time compute.
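On top of the CoT scorer sketched above, the averaging step is essentially a one-liner; because the rationales are sampled, each vote can differ.

```python
def genrm_cot_majority_score(question: str, solution: str, num_votes: int = 8) -> float:
    """Average P('Yes') over several independently sampled CoT rationales."""
    scores = [cot_genrm_score(question, solution) for _ in range(num_votes)]
    return sum(scores) / len(scores)
    # Raising num_votes trades extra test-time compute for a more reliable score.
```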

“GenRM can be conceptualized as merging LLM-as-a-Judge with classic verifiers; it represents a trained LLM-as-a-Judge on domain-specific verification data,” Agarwal explains. “Thus, GenRM is suitable for any area where off-the-shelf prompted LLMs are insufficient.”

GenRM in Action

To assess GenRM’s effectiveness, DeepMind researchers tested it across various reasoning tasks, including last-letter concatenation, word sorting, and word-math problems. They compared GenRM against standard methods, including discriminative reward models, LLM-as-a-Judge, and “self-consistency,” where the model generates multiple answers and selects the most frequent one.
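For reference, here is a minimal sketch of how verifier-based Best-of-N selection differs from the self-consistency baseline; the helper functions are illustrative, not from the paper's code.

```python
from collections import Counter
from typing import Callable, List

def best_of_n(question: str, candidates: List[str],
              scorer: Callable[[str, str], float]) -> str:
    """Verifier-based selection: keep the candidate the reward model scores highest."""
    return max(candidates, key=lambda solution: scorer(question, solution))

def self_consistency(final_answers: List[str]) -> str:
    """Self-consistency baseline: return the most frequent final answer."""
    return Counter(final_answers).most_common(1)[0][0]

# e.g. best_of_n(question, candidates, genrm_cot_majority_score)
```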

Across all tasks, GenRM with CoT consistently outperformed the alternative methods by several percentage points, including specially trained discriminative reward models. On the GSM8K math reasoning benchmark, a Gemma-9B model trained as a GenRM verifier achieved a 92.8% problem-solving rate, exceeding the performance of GPT-4 and Gemini 1.5 Pro.

"By unifying solution generation with verification through the next-token-prediction objective, GenRM consistently enhances verification performance across all tasks," the researchers note. "This improvement is evident for both direct and CoT-based generative verifiers, indicating that teaching the verifier to imitate correct solutions generally proves beneficial."

The experiments also revealed that GenRM scales favorably with increasing dataset size and model capacity. Additionally, GenRM with CoT continues to show improvement when sampling a larger number of responses, offering LLM application developers increased flexibility to balance accuracy and computational costs.

"Compared to classic verifiers, GenRM can outperform them using the same data by jointly training on generation and verification, and GenRM training merely involves standard fine-tuning," Agarwal states. "However, to fully leverage GenRM capabilities, we require critiques or verification rationales that clarify the reward label. For high-quality data, this can involve human input, but a more scalable solution would involve synthetic LLM-generated rationales."

Future directions for GenRM could encompass scaling synthetic verification rationales for open-ended generation tasks, integrating GenRM into reinforcement learning pipelines, and utilizing advanced LLM capabilities such as few-shot learning, retrieval-augmented generation, ReAct, and code generation and execution to further enhance verification.
