In a recent study, researchers from Meta, Ecole des Ponts ParisTech, and Université Paris-Saclay propose a new approach to improving the accuracy and speed of large language models (LLMs) by enabling them to predict multiple tokens simultaneously. This innovation challenges the traditional auto-regressive design, in which models predict one token at a time.
The Benefits of Multi-Token Prediction
While multi-token prediction is not suitable for every LLM or language task, it offers significant advantages in specific scenarios, such as making inference on generative tasks up to three times faster than conventional methods. Although there is still room for refinement, the technique could become a powerful tool for certain LLM applications.
Challenges of Next-Token Prediction
The traditional method of training LLMs is called "next-token prediction." This self-supervised technique presents the model with a sequence of tokens and trains it to predict the next token at each position; during generation, each predicted token is appended to the input and the process repeats. Applied to extensive text corpora, this teaches the model to generate coherent text.
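For readers who want to see the objective concretely, here is a minimal sketch of how the next-token loss is typically computed in PyTorch. This is not the authors' code; the function name, tensor shapes, and toy data are purely illustrative.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    """Standard next-token objective: the logits at position t are scored
    against the token that actually appears at position t + 1."""
    pred = logits[:, :-1, :]     # predictions for positions 0 .. T-2
    target = tokens[:, 1:]       # the "next" token at each of those positions
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

# Toy usage: vocabulary of 100 tokens, batch of 2, sequences of length 8.
logits = torch.randn(2, 8, 100)
tokens = torch.randint(0, 100, (2, 8))
print(next_token_loss(logits, tokens))
```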
However, researchers have identified limitations of next-token prediction for acquiring language, world knowledge, and reasoning abilities. By concentrating on one token at a time, models risk becoming overly sensitive to local patterns and may neglect predictions that require reasoning over a broader context. Next-token prediction also demands vast amounts of data to reach levels of fluency that humans attain with far less text.
Meta's recent study posits that "training language models to predict multiple future tokens at once results in higher sample efficiency."
Exploring Multi-Token Prediction
In contrast, multi-token prediction instructs the LLM to predict several future tokens from each position in the training data simultaneously. The researchers introduce a straightforward multi-token prediction architecture that adds no training time or memory overhead.
The model builds on the Transformer architecture that underlies most LLMs, with one modification: instead of a single output head that predicts the next token, it has multiple independent output heads, one for each future token to be predicted.
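The paper does not publish its implementation alongside this description, but a rough sketch of the idea in PyTorch might look like the following. The MultiTokenHead class, the tiny embedding stand-in for the Transformer trunk, and all dimensions are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Shared trunk plus n independent output heads: head i predicts the
    token i + 1 positions ahead from the same hidden state."""

    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.trunk = trunk                                   # shared Transformer trunk
        self.heads = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_future)
        )                                                    # one lightweight head per future offset
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)  # shared unembedding

    def forward(self, tokens):
        h = self.trunk(tokens)                               # (batch, seq, d_model)
        # One logits tensor per future offset: heads[i] targets token t + 1 + i.
        return [self.unembed(head(h)) for head in self.heads]

# Toy usage with a minimal embedding-only stand-in for the trunk.
trunk = nn.Embedding(100, 32)                                # vocab 100, d_model 32
model = MultiTokenHead(trunk, d_model=32, vocab_size=100, n_future=4)
outputs = model(torch.randint(0, 100, (2, 8)))               # batch 2, sequence length 8
print([o.shape for o in outputs])                            # four tensors of shape (2, 8, 100)
```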
Implementation of Multi-Token Prediction
During inference, the main output head performs ordinary next-token prediction, while the extra heads can be used to speed up decoding through self-speculative decoding, a scheme that builds on prior work on speculative decoding.
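As a rough illustration of how the extra heads could accelerate decoding, here is a simplified, greedy self-speculative step built on the model sketch above. The interface (a model returning one logits tensor per head, with head 0 as the ordinary next-token head) and the acceptance rule are assumptions for illustration, not the paper's exact algorithm.

```python
import torch

@torch.no_grad()
def self_speculative_step(model, tokens):
    """One simplified, greedy self-speculative decoding step.
    `tokens` is a (1, seq_len) tensor of token ids, and `model(tokens)` is
    assumed to return one logits tensor per head (as in the MultiTokenHead
    sketch above), with head 0 acting as the ordinary next-token head."""
    # Draft: a single forward pass in which head i proposes the token
    # i + 1 positions past the end of the current context.
    logits_per_head = model(tokens)
    draft = [l[0, -1].argmax().item() for l in logits_per_head]

    # Verify: one more forward pass over the drafted tokens. At each drafted
    # position we check whether the next-token head would have produced the
    # same token, and keep the longest agreeing prefix plus one verified token.
    extended = torch.cat([tokens, torch.tensor([draft[:-1]], dtype=tokens.dtype)], dim=1)
    verify = model(extended)[0]          # next-token logits over the extended sequence

    n_ctx = tokens.size(1)
    accepted = [draft[0]]                # head 0 is the true next-token prediction
    for i in range(1, len(draft)):
        check = verify[0, n_ctx + i - 1].argmax().item()
        accepted.append(check)           # the verified token is always safe to keep
        if check != draft[i]:
            break                        # the draft diverged; stop accepting further drafts
    return accepted
```

Because several drafted tokens can be accepted per verification pass, the model produces more than one token per forward pass on average, which is where the reported speedups come from.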
"While cost-effective and simple, multi-token prediction significantly enhances the training of faster, more powerful Transformer models," the researchers state.
Results and Observations
The team tested their multi-token prediction strategy with models ranging from 300 million to 13 billion parameters. Their findings reveal a notable pattern: smaller models benefit less from multi-token prediction, and the technique becomes increasingly effective as model size grows. For instance, models trained for 4-token prediction showed improvements of several percentage points over single-token models on the MBPP coding benchmark.
The researchers conclude, "It is possible, using the same computational resources, to achieve greater performance from large language models when employing multi-token prediction."
Moreover, multi-token prediction enhances inference speeds, making models up to three times faster across varying batch sizes. "Pretraining with multi-token prediction enhances the accuracy of additional heads compared to merely fine-tuning a next-token prediction model, unlocking the full potential of self-speculative decoding," they explain.
The study also highlights that multi-token prediction encourages the model to learn longer-term patterns, particularly in experiments with "byte-level tokenization," where each byte is treated as a single token. In these cases, multi-byte prediction significantly outperformed the baseline single-byte models, which is crucial for applications lacking a predefined vocabulary.
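To make the byte-level setting concrete, here is a small illustrative example of treating each UTF-8 byte as a token:

```python
# Byte-level tokenization: every UTF-8 byte becomes one token, so no
# predefined vocabulary is needed, at the cost of much longer sequences.
text = "Hello, é!"
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)       # [72, 101, 108, 108, 111, 44, 32, 195, 169, 33]
print(len(byte_tokens))  # 10 byte tokens for a 9-character string ("é" spans two bytes)
```

The longer sequences make it especially valuable for a model to predict several bytes per step, which is consistent with the strong byte-level results the authors report.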
Future Directions for Research
Despite its advantages, multi-token prediction is not without challenges. The optimal number of tokens to predict varies by task and model size, and choosing it is not yet straightforward. The researchers point to future research avenues, including automated techniques for identifying the best number of tokens to predict and for studying the interplay between vocabulary size and multi-token strategies.
This research holds promise for enterprise applications, potentially delivering faster inference and improved accuracy on generative tasks like code completion. Because it requires no major alterations to the existing LLM architecture, it remains compatible with other optimization techniques in the Transformer framework.