Maximizing LLM Inference-Time Compute: Insights from DeepMind and UC Berkeley

Given the significant cost and time involved in training large language models (LLMs), a continuing question is whether additional compute spent at inference time can improve LLM performance without the need for retraining.

A study by researchers at DeepMind and the University of California, Berkeley investigates this concept. Their findings, presented in a recent research paper, indicate that optimizing inference-time computation can lead to considerable performance improvements for LLMs without necessitating larger models or extensive pre-training.

Exploring Inference Strategies for LLMs

Traditionally, enhancing LLM performance has relied on enlarging model size and pre-training compute. However, this method has drawbacks; larger models are costly to develop and resource-intensive to deploy, making them less practical for various applications, especially in resource-constrained environments.

A promising alternative is to increase computational resources during inference to boost the accuracy of LLM responses for complex prompts. This strategy allows for the deployment of smaller LLMs while achieving performance levels comparable to larger, more computationally demanding models.

The key question is: given a fixed amount of inference-time compute, how can different inference methods be used to maximize an LLM's performance, and how does the result compare to simply using a larger pre-trained model?

The most common approach to scaling test-time computation is best-of-N sampling: the model generates N candidate outputs in parallel, and the most accurate one is selected as the final answer. However, other effective methods exist. For instance, instead of generating multiple responses in parallel, the model can refine a single output through sequential corrections. Changing how candidate responses are verified, or combining parallel and sequential sampling with different verification techniques, can also yield better inference strategies.
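
As a concrete illustration, here is a minimal best-of-N sketch in Python. The `generate` and `score` callables are hypothetical stand-ins for an LLM sampling call and a learned verifier; neither is an API from the paper.

```python
# Minimal best-of-N sampling sketch. `generate` and `score` are
# hypothetical stand-ins for an LLM sampling call and a learned
# verifier; they are not APIs from the paper.
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # samples one candidate response
    score: Callable[[str, str], float],  # verifier: higher means more likely correct
    n: int = 16,
) -> str:
    """Draw n independent samples and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: score(prompt, resp))
```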

Defining Optimal Inference-Time Strategies

To identify the best inference-time strategy, the researchers define the "test-time compute-optimal scaling strategy" as the strategy that selects hyperparameters to maximize performance on a given prompt at test time.
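
One way to formalize this, as a simplified sketch with notation chosen here for illustration rather than taken verbatim from the paper: given a prompt q and a compute budget N, pick the strategy hyperparameters θ that maximize the expected probability of producing the correct answer y*(q):

```latex
\theta^{*}_{q}(N) = \arg\max_{\theta}\;
  \mathbb{E}_{y \sim \mathrm{Target}(\theta, N, q)}
  \bigl[ \mathbf{1}\{\, y = y^{*}(q) \,\} \bigr]
```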

“Ideally, test-time compute should enhance output quality beyond simple sampling from the LLM,” the researchers note.

Strategies for Utilizing Inference-Time Compute

The researchers analyzed two main strategies for improving LLM performance with inference-time compute. The first modifies the proposal distribution, that is, how the LLM generates candidate responses, for example by fine-tuning the model to iteratively revise its own answers on complex problems.
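
For illustration, a revision loop might look like the sketch below, where `generate` and `revise` are hypothetical calls to a base model and to a model fine-tuned to improve its previous draft:

```python
# Minimal sequential-revision sketch. `generate` and `revise` are
# hypothetical model calls, not APIs from the paper.
from typing import Callable

def sequential_revisions(
    prompt: str,
    generate: Callable[[str], str],     # produce an initial draft
    revise: Callable[[str, str], str],  # improve a prior draft, conditioned on it
    n_revisions: int = 4,
) -> str:
    """Spend the compute budget on successive corrections of one answer."""
    answer = generate(prompt)
    for _ in range(n_revisions):
        answer = revise(prompt, answer)
    return answer
```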

The second strategy optimizes the verifier, the mechanism that selects the best answer from the generated candidates, for example with a process-based reward model (PRM) that scores the correctness of each step of a response.
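
A hedged sketch of verifier-based selection with a PRM follows. `generate_steps` and `prm_step_scores` are hypothetical stand-ins, and aggregating per-step scores with min() is one common choice, not necessarily the paper's:

```python
# Sketch of verifier-based selection with a process reward model (PRM).
# `generate_steps` and `prm_step_scores` are hypothetical stand-ins,
# not functions from the paper.
from typing import Callable, List

def prm_select(
    prompt: str,
    generate_steps: Callable[[str], List[str]],                # one solution as a list of steps
    prm_step_scores: Callable[[str, List[str]], List[float]],  # per-step correctness scores
    n: int = 16,
) -> List[str]:
    """Sample n step-by-step solutions; keep the one whose weakest step scores best."""
    candidates = [generate_steps(prompt) for _ in range(n)]
    # Aggregating per-step scores with min() penalizes any single flawed
    # step; this is one common choice, not necessarily the paper's.
    return max(candidates, key=lambda steps: min(prm_step_scores(prompt, steps)))
```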

To evaluate their methods, the researchers conducted experiments on the challenging MATH benchmark using PaLM-2 models.

“Efficacy of a specific test-time compute strategy depends critically on the nature of the problem and the base LLM utilized,” they explain.

For simpler tasks, where the base LLM already produces reasonable outputs, iteratively refining an initial answer was more effective than sampling in parallel. Conversely, for harder problems that require exploring diverse solution strategies, resampling multiple responses in parallel or running a tree search guided by a process-based reward model proved more effective.
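
This points toward a difficulty-adaptive policy. The sketch below routes between the two strategies from the earlier snippets; `estimate_difficulty`, the 0.5 threshold, and the reuse of the hypothetical `generate`, `revise`, and `score` callables are all illustrative assumptions (the paper estimates difficulty from the base model's own success rate on a question):

```python
# Illustrative difficulty-adaptive routing between the two strategies
# sketched earlier. `estimate_difficulty` and the threshold are
# assumptions for illustration, not the paper's method.
def compute_optimal_answer(prompt: str, budget: int = 16) -> str:
    difficulty = estimate_difficulty(prompt)  # hypothetical predictor in [0, 1]
    if difficulty < 0.5:
        # Easy/medium: the base model is on the right track, so refine
        # one answer sequentially.
        return sequential_revisions(prompt, generate, revise, n_revisions=budget)
    # Hard: cast a wide net, sampling in parallel and letting the
    # verifier pick the winner.
    return best_of_n(prompt, generate, score, n=budget)
```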

Balancing Test-Time and Pre-Training Compute

The researchers also examined how much test-time computation could substitute for additional pre-training. They compared a smaller model enhanced by increased test-time compute to a model that was 14 times larger with extensive pre-training.

For easier and medium-difficulty questions, the smaller model with additional test-time compute matched the performance of the larger pre-trained model. This suggests a strategic advantage in pre-training smaller models with less compute and relying on test-time compute to improve their outputs.
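
To make the trade-off concrete, here is a rough back-of-the-envelope comparison using the standard approximations of about 6 FLOPs per parameter per training token for pre-training and about 2 FLOPs per parameter per generated token for inference. All specific numbers below are illustrative assumptions, not figures from the paper:

```python
# Back-of-the-envelope FLOPs comparison. Uses the standard approximations
# of ~6*N*D FLOPs for pre-training and ~2*N FLOPs per generated token for
# inference (N parameters, D training tokens). All numbers below are
# illustrative assumptions, not figures from the paper.
small_params = 1e9                 # hypothetical 1B-parameter model
large_params = 14 * small_params   # the paper's comparison model is ~14x larger
train_tokens = 1e12                # hypothetical training-set size

extra_pretrain_flops = 6 * (large_params - small_params) * train_tokens

tokens_per_query = 1024            # hypothetical response length
samples_per_query = 64             # hypothetical test-time sampling budget
inference_flops_per_query = 2 * small_params * tokens_per_query * samples_per_query

breakeven_queries = extra_pretrain_flops / inference_flops_per_query
print(f"Extra pre-training pays for itself after ~{breakeven_queries:.1e} queries")
```

Under these toy numbers, the extra pre-training budget equals roughly 6 x 10^8 queries' worth of heavy test-time sampling, which is why the right choice depends on expected deployment volume as well as accuracy.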

However, for the most difficult questions, increased pre-training compute demonstrated superior effectiveness, indicating that current test-time compute strategies cannot wholly replace pre-training in all instances.

The researchers suggest future work on more sophisticated strategies that combine different revision and search techniques, as well as more efficient methods for estimating question difficulty.

“Overall, our study indicates that scaling test-time computation can prove more beneficial than merely increasing pre-training, with potential for significant improvements as test-time strategies evolve,” they conclude. “In the long run, this leads to a future where fewer FLOPs are consumed during pre-training and more during inference.”
