In recent years, large language models (LLMs) have evolved from processing a few hundred words to managing content equivalent to several books simultaneously. This expanded input capacity, known as the “context window,” is unlocking new applications and use cases that previously required significant engineering effort.
A recent study by researchers at Google DeepMind investigates the “many-shot” in-context learning (ICL) capabilities of LLMs with extended context windows. The findings indicate that including hundreds or even thousands of training examples in a single prompt can significantly improve a model’s performance, gains that previously required fine-tuning.
Few-shot vs. Many-shot ICL
ICL allows LLMs to learn new tasks using examples presented during inference. It involves providing the model with a prompt that contains several solved examples along with the problem to be addressed. Traditionally, this type of learning has been referred to as “few-shot learning.”
Unlike fine-tuning, which adjusts the model’s parameters, ICL is user-friendly and more accessible; however, it has been limited by the model's context window. For instance, GPT-3 supported a context window of approximately 2,000 tokens, restricting the number of examples that could fit into a prompt.
Current models, however, can handle over 100,000 tokens, and models like Gemini 1.5 Pro can process more than a million tokens, allowing for hundreds or thousands of examples in each prompt.
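To make the mechanics concrete, here is a minimal sketch of how an ICL prompt can be assembled; the Q/A template and the toy example data are illustrative assumptions rather than the exact format used in the study. Few-shot and many-shot prompts share the same structure, and only the number of solved examples changes.

```python
# A minimal sketch of assembling an in-context learning (ICL) prompt.
# The Q/A format and the `examples` data below are illustrative assumptions,
# not the exact prompt template from the DeepMind study.

def build_icl_prompt(examples, question, n_shots):
    """Concatenate n_shots solved examples followed by the new question.

    Few-shot and many-shot ICL use the same structure; the only difference
    is how many examples the context window allows you to include.
    """
    parts = []
    for ex in examples[:n_shots]:
        parts.append(f"Q: {ex['question']}\nA: {ex['answer']}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)


examples = [
    {"question": "What is 7 * 8?", "answer": "56"},
    {"question": "What is 12 + 30?", "answer": "42"},
    # ...hundreds or thousands more pairs in the many-shot regime
]

few_shot_prompt = build_icl_prompt(examples, "What is 15 * 4?", n_shots=5)
many_shot_prompt = build_icl_prompt(examples, "What is 15 * 4?", n_shots=2000)
```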
In their study, DeepMind researchers examined the impact of many-shot ICL on LLM performance across a range of tasks, including math problem-solving, question-answering, outcome reward modeling, translation of low-resource languages, planning, and sentiment analysis. Some prompts contained up to 8,192 ICL examples, and performance generally improved as more examples were added. On translation tasks, many-shot ICL with Gemini Pro achieved record results in Kurdish and Tamil. On summarization, many-shot ICL matched the performance of specialized fine-tuned models, though it reached its full effectiveness only when the in-context examples grew to hundreds of thousands of tokens.
Reinforced and Unsupervised ICL
A primary challenge of many-shot ICL is the need for large volumes of high-quality human-generated examples, particularly in reasoning tasks. The researchers propose two strategies to mitigate reliance on human-generated data.
The first technique, “reinforced ICL,” substitutes human-crafted examples with model-generated rationales. The LLM produces multiple rationales for a given problem using a zero-shot or few-shot chain-of-thought prompt. Rationales whose final answers can be verified as correct are kept and assembled into an ICL dataset of problem/rationale pairs.
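The following is a hedged sketch of what such a reinforced ICL pipeline could look like; `generate_rationales` and `extract_final_answer` are hypothetical helpers standing in for whatever model API and answer parser an implementation would actually use.

```python
# A sketch of "reinforced ICL": rationales are sampled from the model itself
# and kept only when their final answer matches a known ground truth.
# `generate_rationales` and `extract_final_answer` are hypothetical callables,
# not part of any specific library.

def build_reinforced_icl_dataset(problems, generate_rationales, extract_final_answer,
                                 samples_per_problem=8):
    """Return (problem, rationale) pairs whose answers check out."""
    dataset = []
    for problem in problems:
        # Sample several chain-of-thought rationales for the same problem.
        candidates = generate_rationales(problem["question"], n=samples_per_problem)
        for rationale in candidates:
            # Keep a rationale only if its final answer matches the reference answer.
            if extract_final_answer(rationale) == problem["answer"]:
                dataset.append({"question": problem["question"], "rationale": rationale})
                break  # one verified rationale per problem is enough for this sketch
    return dataset
```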
The second method, “unsupervised ICL,” taps into the model's innate knowledge of the problem. This approach involves a prompt containing a list of unsolved problems along with a zero-shot or few-shot prompt for a target problem, eliminating the need for human-crafted answers. The researchers hypothesize that when the LLM has the necessary knowledge to solve a task, providing relevant context helps it focus on the internal concepts necessary for problem-solving.
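Based on that description, a minimal sketch of an unsupervised ICL prompt might look like the following; the preamble wording and formatting are assumptions for illustration, not the paper's exact template.

```python
# A minimal sketch of an "unsupervised ICL" prompt: unsolved problems only,
# with no human-written solutions. The preamble text is an assumption.

def build_unsupervised_icl_prompt(unsolved_problems, target_problem):
    """List unlabeled problems from the task domain, then pose the target problem."""
    lines = ["You will be asked to solve a problem. Here are some example problems from the same domain:"]
    for i, problem in enumerate(unsolved_problems, start=1):
        lines.append(f"Problem {i}: {problem}")
    lines.append("Now solve the following problem, reasoning step by step.")
    lines.append(f"Problem: {target_problem}")
    return "\n".join(lines)
```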
The researchers confirm that both model-generated rationales and problem-only prompts can lessen the dependency on human-generated examples.
Adapting Model Behavior
The study also revealed that many-shot ICL can overcome pre-training biases and learn non-natural-language prediction tasks where few-shot ICL tends to struggle. For example, the researchers flipped the labels of a sentiment analysis dataset so that they contradicted the sentiment biases the LLM had acquired during training. As more ICL examples were added, performance improved dramatically, nearly matching that achieved with the default labels.
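A rough sketch of that flipped-label setup, using toy placeholder data, could look like this; the label names and review texts are illustrative assumptions.

```python
# A sketch of the flipped-label setup: in-context examples carry labels that
# contradict the model's pre-trained notion of sentiment, and the model is
# evaluated on how well it follows the flipped convention as the number of
# examples grows. The dataset rows here are toy placeholders.

FLIP = {"positive": "negative", "negative": "positive"}

def flip_labels(examples):
    """Replace each example's label with its opposite."""
    return [{"text": ex["text"], "label": FLIP[ex["label"]]} for ex in examples]

reviews = [
    {"text": "A delightful, moving film.", "label": "positive"},
    {"text": "Tedious and badly acted.", "label": "negative"},
]
flipped = flip_labels(reviews)
# With only a few flipped examples, the model tends to fall back on its
# pre-training bias; with hundreds or thousands, it learns the new mapping.
```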
Moreover, many-shot ICL was successfully used to repurpose the model for linear classification and sequential parity, tasks that are typically difficult without targeted training. This highlights the potential of many-shot learning to adapt models to new tasks and domains that may not align with an LLM’s training data.
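For illustration, here is one way in-context examples for the sequential parity task could be generated; the bit-string formatting is an assumption, and the resulting pairs would then be fed into the same kind of many-shot prompt construction shown earlier.

```python
# Generating in-context examples for the sequential parity task (label = whether
# a bit string contains an odd or even number of 1s), one of the
# non-natural-language tasks probed with many-shot ICL. The exact string
# format is an illustrative assumption.

import random

def make_parity_examples(n_examples, length=20, seed=0):
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        bits = [rng.randint(0, 1) for _ in range(length)]
        label = "odd" if sum(bits) % 2 else "even"
        examples.append({"question": " ".join(map(str, bits)), "answer": label})
    return examples

# These pairs can be passed to a many-shot prompt builder like the one above.
parity_examples = make_parity_examples(n_examples=1000)
```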
Implications for Enterprises
As AI labs work to extend the context windows of LLMs, some experts argue that fine-tuning and other techniques, such as retrieval-augmented generation (RAG), may no longer be necessary. Enterprises could simply craft prompts with pertinent information, examples, and task instructions.
However, many-shot ICL is not yet scalable. For LLM applications serving tens of millions of requests per day, extending every prompt by a few hundred examples would significantly increase latency and inference costs.
Thus, many-shot ICL can serve as a valuable tool during the exploratory and prototyping phases of LLM applications, allowing developers to experiment with various prompt engineering techniques without the constraints of the context window. Nonetheless, efficient scaling of products will depend on minimizing token consumption and utilizing smaller, faster, and more cost-effective models.