DeepMind's GenRM Enhances LLM Accuracy Through Self-Verification of Outputs

Large language models (LLMs) frequently make factual and logical errors, particularly on complex reasoning tasks. To mitigate this, researchers often use verifiers or reward models to assess a pool of candidate responses generated by an LLM and select the most accurate one.

A recent paper from researchers at Google DeepMind, the University of Toronto, Mila, and the University of California, Los Angeles introduces GenRM, an innovative approach that harnesses the generative capabilities of LLMs to enhance verification processes. GenRM serves as a valuable tool for LLM-based applications where traditional verification methods fall short.

Limitations of Classic Verifiers and Reward Models

A prevalent method to boost LLM accuracy involves generating multiple candidate answers and utilizing a distinct component to identify the best one. This necessitates a dependable verifier or reward model. Typically, LLM-based verifiers are trained as discriminative reward models (RMs) that assign numerical scores to assess candidate solutions as correct or incorrect. However, these RMs do not fully capitalize on LLMs' inherent strengths in generating and processing responses.
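For context, a discriminative reward model of this kind typically bolts a scalar scoring head onto an LLM backbone and trains it with a binary classification loss. The sketch below is a minimal illustration under that assumption; the class, interface, and loss helper are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminativeRM(nn.Module):
    """Illustrative discriminative reward model: an LLM backbone plus a scalar head."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                 # any module mapping token ids -> hidden states
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)        # (batch, seq_len, hidden_size)
        pooled = hidden[:, -1, :]                # score from the final token position
        return self.score_head(pooled).squeeze(-1)  # raw correctness logits, shape (batch,)

def rm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy against correct (1) / incorrect (0) labels."""
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```

Nothing in this setup asks the model to produce text, which is the gap the quote below points at.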

"Although classic reward models (RMs) / verifiers are trained by fine-tuning LLMs, they do not utilize the text generation capabilities that LLMs are fundamentally designed for," explains Rishabh Agarwal, a co-author of the paper and Senior Research Scientist at DeepMind.

Another common technique, LLM-as-a-Judge, uses advanced prompting to evaluate responses. While flexible, this approach lacks the task-specific capabilities that reward models acquire through training.

Generative Reward Models

DeepMind’s GenRM presents an alternative by training verifiers through next-token prediction, tapping into the generative strengths of LLMs. "Training RMs via next-token prediction allows them to leverage the myriad benefits of generative LLMs," says Agarwal. "We demonstrated that the same model can verify and generate solutions, employing chain-of-thought reasoning before verification to enhance accuracy."
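In practice, that unification means both solution generation and verification can be formatted as ordinary text-to-text training examples and optimized with the same next-token-prediction loss. The examples below are a minimal illustration; the prompt wording and data layout are assumptions, not the paper's exact templates.

```python
# Illustrative training examples for a generative verifier (formats are assumptions).
generation_example = {
    "input":  "Q: Natalia sold 48 clips in April and half as many in May. "
              "How many clips did she sell in total?\nA:",
    "target": " She sold 48 + 24 = 72 clips. The answer is 72.",
}

verification_example = {
    "input":  "Q: Natalia sold 48 clips in April and half as many in May. "
              "How many clips did she sell in total?\n"
              "Candidate solution: 48 + 48 = 96. The answer is 96.\n"
              "Is the answer correct? ",
    "target": "No",  # a CoT verifier would instead target "critique ... Is the answer correct? No"
}

# Both kinds of examples are trained with the same next-token-prediction
# (cross-entropy) loss used in standard supervised fine-tuning.
```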

In GenRM, the verification decision is expressed as a token. For example, to score a solution, the verifier is prompted with a question such as "Is the answer correct?" and the score is the probability the model assigns to a text token (e.g., "Yes" or "No") given that context.
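Concretely, a direct (non-CoT) generative verifier can be scored in a few lines. The sketch below assumes a Hugging Face-style causal LM; the checkpoint path, prompt wording, and function name are placeholders rather than anything from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: in practice this would be the fine-tuned verifier checkpoint.
tokenizer = AutoTokenizer.from_pretrained("path/to/genrm-checkpoint")
model = AutoModelForCausalLM.from_pretrained("path/to/genrm-checkpoint")

def direct_genrm_score(question: str, solution: str) -> float:
    """Score a candidate solution as the probability of the next token being 'Yes'."""
    prompt = f"{question}\n{solution}\nIs the answer correct? "
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits over the vocabulary
    probs = torch.softmax(next_token_logits, dim=-1)
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    return probs[yes_id].item()                            # correctness score in [0, 1]
```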

Given that verification often entails complex reasoning, generative verifiers can significantly benefit from advanced prompting techniques like chain-of-thought (CoT) reasoning, which encourages the model to lay out its thought process prior to arriving at an answer.

"Specifically, we can generate intermediate reasoning steps or critiques (CoT) before deciding on the solution's correctness, potentially uncovering subtle errors overlooked by direct verifiers," the researchers state.

The CoT rationales for training the GenRM model can be derived from either human input or another LLM. During inference, GenRM first produces a CoT rationale and then employs the probability of the "Yes" token to determine a correctness score.
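A minimal sketch of that two-step inference, reusing the tokenizer and model from the snippet above (again, the prompt wording is an assumption):

```python
def cot_genrm_score(question: str, solution: str) -> float:
    """Generate a CoT critique first, then read off P('Yes') as the score."""
    # Step 1: sample a chain-of-thought rationale critiquing the candidate solution.
    critique_prompt = f"{question}\n{solution}\nLet's verify the solution step by step.\n"
    inputs = tokenizer(critique_prompt, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=256, do_sample=True)
    rationale = tokenizer.decode(generated[0], skip_special_tokens=True)

    # Step 2: condition on the rationale and score P('Yes') for the final decision.
    verify_prompt = rationale + "\nIs the answer correct? "
    inputs = tokenizer(verify_prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    return probs[yes_id].item()
```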

To further improve the accuracy of CoT verifiers, the researchers used majority voting: they sampled multiple CoT chains and averaged the "Yes" score across all samples, making more effective use of test-time compute.
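On top of the CoT scorer sketched above, the averaging step is essentially a one-liner; because the rationales are sampled, each vote can differ.

```python
def genrm_cot_majority_score(question: str, solution: str, num_votes: int = 8) -> float:
    """Average P('Yes') over several independently sampled CoT rationales."""
    scores = [cot_genrm_score(question, solution) for _ in range(num_votes)]
    return sum(scores) / len(scores)
    # Raising num_votes trades extra test-time compute for a more reliable score.
```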

“GenRM can be conceptualized as merging LLM-as-a-Judge with classic verifiers; it represents a trained LLM-as-a-Judge on domain-specific verification data,” Agarwal explains. “Thus, GenRM is suitable for any area where off-the-shelf prompted LLMs are insufficient.”

GenRM in Action

To assess GenRM’s effectiveness, DeepMind researchers tested it across various reasoning tasks, including last-letter concatenation, word sorting, and word-math problems. They compared GenRM against standard methods, including discriminative reward models, LLM-as-a-Judge, and “self-consistency,” where the model generates multiple answers and selects the most frequent one.
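For reference, here is a minimal sketch of how verifier-based Best-of-N selection differs from the self-consistency baseline; the helper functions are illustrative, not from the paper's code.

```python
from collections import Counter
from typing import Callable, List

def best_of_n(question: str, candidates: List[str],
              scorer: Callable[[str, str], float]) -> str:
    """Verifier-based selection: keep the candidate the reward model scores highest."""
    return max(candidates, key=lambda solution: scorer(question, solution))

def self_consistency(final_answers: List[str]) -> str:
    """Self-consistency baseline: return the most frequent final answer."""
    return Counter(final_answers).most_common(1)[0][0]

# e.g. best_of_n(question, candidates, genrm_cot_majority_score)
```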

Across all tasks, GenRM with CoT consistently outperformed the alternative methods by several percentage points, including specially trained discriminative reward models. On the GSM8K math reasoning benchmark, a Gemma-9B model trained as a GenRM verifier achieved a 92.8% problem-solving rate, exceeding the performance of GPT-4 and Gemini 1.5 Pro.

"By unifying solution generation with verification through the next-token-prediction objective, GenRM consistently enhances verification performance across all tasks," the researchers note. "This improvement is evident for both direct and CoT-based generative verifiers, indicating that teaching the verifier to imitate correct solutions generally proves beneficial."

The experiments also revealed that GenRM scales favorably with increasing dataset size and model capacity. Additionally, GenRM with CoT continues to show improvement when sampling a larger number of responses, offering LLM application developers increased flexibility to balance accuracy and computational costs.

"Compared to classic verifiers, GenRM can outperform them using the same data by jointly training on generation and verification, and GenRM training merely involves standard fine-tuning," Agarwal states. "However, to fully leverage GenRM capabilities, we require critiques or verification rationales that clarify the reward label. For high-quality data, this can involve human input, but a more scalable solution would involve synthetic LLM-generated rationales."

Future directions for GenRM could encompass scaling synthetic verification rationales for open-ended generation tasks, integrating GenRM into reinforcement learning pipelines, and utilizing advanced LLM capabilities such as few-shot learning, retrieval-augmented generation, ReAct, and code generation and execution to further enhance verification.
