In a recent partnership, AI startup Gradient and cloud computing platform Crusoe have extended the context window of Llama 3 models to an impressive 1 million tokens. The context window is the number of input and output tokens a large language model (LLM) can process at once, and it is a crucial constraint for many applications.
Tech companies and leading AI labs are engaged in a fierce competition to extend the context windows of their LLMs. Within a few months, supported context lengths have surged from a few thousand tokens to more than a million. However, the models with the largest context windows, such as Anthropic Claude (200k tokens), OpenAI GPT-4 (128k tokens), and Google Gemini (1 million tokens), are available only through private APIs.
The Need for Open-Source Long-Context LLMs
Gradient works with enterprise clients that want to integrate LLMs into their operations. Even before the release of Llama 3, the company was running into context limitations in its customer projects. For example, coding copilots typically generate short snippets of code; businesses now want to extend these tools to generate entire code modules.
"In order to achieve this, the language model must reference an entire codebase or multiple GitHub repositories," explained Leo Pekelis, Chief Scientist at Gradient AI. Providing the complete codebase piece by piece would be slow and prone to inaccuracies, as the model wouldn't access the entirety at once.
“Having the ability to input entire codebases into a language model context resolves many issues, enabling more accurate and efficient solutions,” Pekelis added.
Due to restrictions on sending data to third parties, many companies can't utilize private models like Gemini or Claude. This motivated the Gradient team to develop their own open-source model with a 1 million token context.
Open Research Contributions
The commercialization of LLMs has diminished the willingness of AI labs to share discoveries and research. While companies continue to extend context windows, they are less inclined to disclose code, data, or strategies used to optimize their models. Nonetheless, the open research community remains committed to sharing knowledge and advancing models. Gradient drew heavily from research contributions from global universities and institutes.
Using the 8-billion- and 70-billion-parameter versions of Meta’s Llama 3, which ships with a default context window of 8,192 tokens, the team implemented techniques from Berkeley AI Research that enable longer contexts without overwhelming memory and compute resources. The initial code came from an open-source project in Singapore, while key mathematical formulas were sourced from a lab in Shanghai. They evaluated performance against benchmarks from Nvidia to compare their models with other long-context LLMs such as Gemini.
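For a rough sense of how such extensions work, one common ingredient in open long-context recipes for Llama-style models is raising the base frequency of the rotary position embeddings (rope_theta) and then continuing training on longer sequences. The sketch below uses Hugging Face Transformers; the scaling factor and target length are illustrative assumptions, not Gradient’s published settings.

```python
# Illustrative sketch: extending Llama 3's context by raising RoPE's base
# frequency (rope_theta). The factor and target length are assumptions for
# demonstration, not Gradient's actual recipe.
from transformers import AutoConfig, AutoModelForCausalLM

MODEL = "meta-llama/Meta-Llama-3-8B"

config = AutoConfig.from_pretrained(MODEL)
config.rope_theta *= 4.0                 # stretch the positional frequencies
config.max_position_embeddings = 32_768  # new target context length

model = AutoModelForCausalLM.from_pretrained(MODEL, config=config)
# Continued training on long sequences is then needed so attention adapts
# to the rescaled positions; memory-efficient (blockwise) attention keeps
# that training within GPU memory budgets.
```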
“A lot of this progress wouldn’t have been feasible without the open research community,” Pekelis noted. “Open research significantly influences our work across the board.”
Overcoming Compute Challenges
Access to computing resources is a primary challenge in LLM research. Most AI labs depend on large GPU clusters for training and testing. Gradient partnered with Crusoe to investigate long-context LLMs, leveraging Crusoe's specialized AI cloud to explore cost-effective model development.
“The timing was remarkable as we were launching an [Nvidia] L40S cluster,” said Ethan Petersen, Senior Developer Advocate at Crusoe. “We aimed to demonstrate that these chips facilitate extensive training, not just inference.”
Big tech firms are vying for high-end GPUs like the A100, H100, and the upcoming B100, each of which costs tens of thousands of dollars, with full server clusters running into the millions. Crusoe offers these GPUs and customizes solutions for its clients. Working closely with Gradient, it tailored the L40S cluster to the project, significantly reducing training costs.
"Our approach with partners like Gradient focuses on delivering the most efficient computing solutions based on their needs, and in this instance, the L40S was ideal," stated Patrick McGregor, Chief Product Officer at Crusoe. “We provide tremendous value by customizing compute offerings.”
Pekelis remarked that network optimizations on the L40S cluster enabled them to train the models quickly and release them shortly after Llama 3’s launch. Other cloud providers don’t offer the same level of collaborative flexibility, which complicates custom configurations.
Model Evaluation Techniques
One crucial benchmark for assessing long context windows is the “needle in a haystack” test, in which a specific piece of information (the needle) is inserted at an arbitrary position in a long sequence of text (the haystack) and the model is asked to retrieve it.
“Our models achieve near-perfect performance on this test, effective up to a 2 million context length, comparable only to what I’ve seen with Gemini 1.5 Pro,” Pekelis said.
Yet “needle in a haystack” tests may not fully capture a model’s performance across its context window. The team therefore also used more complex evaluations, such as multiple “needles in the haystack” or adversarial needles, where conflicting pieces of information are inserted into the text.
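As a rough illustration of how such a harness is built (the query_model serving call and filler corpus below are assumed placeholders, not Gradient’s evaluation code):

```python
from typing import Callable, Dict, Iterable, Tuple

# Toy "needle in a haystack" harness. `query_model` is an assumed
# model-serving call; any long filler corpus serves as the haystack.
NEEDLE = "The secret code for the vault is 7241."
QUESTION = "What is the secret code for the vault?"

def build_haystack(filler: str, n_words: int, depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end)."""
    words = filler.split()[:n_words]  # word count as a crude token proxy
    pos = int(len(words) * depth)
    return " ".join(words[:pos] + [NEEDLE] + words[pos:])

def run_niah(
    query_model: Callable[[str], str],
    filler: str,
    lengths: Iterable[int],
    depths: Iterable[float],
) -> Dict[Tuple[int, float], bool]:
    """Sweep context lengths and needle depths, recording pass/fail."""
    results = {}
    for n in lengths:
        for d in depths:
            prompt = f"{build_haystack(filler, n, d)}\n\n{QUESTION}"
            results[(n, d)] = "7241" in query_model(prompt)
    return results

# Multi-needle and adversarial variants insert several needles, some with
# conflicting values, and check which one the model reports.
```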
They assessed their model using Nvidia's RULER benchmark, which includes 13 tasks tailored for evaluating long-context language models with variable sequence lengths and complexities. The team is also enhancing the models' capabilities for many-shot in-context learning, enabling them to adapt to new tasks dynamically by including hundreds or thousands of examples in the prompt.
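Many-shot in-context learning amounts to packing labeled examples straight into the prompt instead of fine-tuning. A minimal sketch, with query_model standing in for an assumed serving call:

```python
from typing import Callable, Sequence, Tuple

def many_shot_prompt(examples: Sequence[Tuple[str, str]], new_input: str) -> str:
    """Pack labeled demonstrations directly into the prompt."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{shots}\n\nInput: {new_input}\nOutput:"

def predict(query_model: Callable[[str], str],
            examples: Sequence[Tuple[str, str]],
            new_input: str) -> str:
    # With a 1M-token window, hundreds or thousands of demonstrations can
    # fit in a single prompt, standing in for a round of fine-tuning.
    return query_model(many_shot_prompt(examples, new_input))
```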
Enterprise Applications of Long-Context LLMs
Pekelis believes that long-context open models will bridge the gap for companies and developers looking to build LLM-based applications.
“Currently, there’s a noticeable disparity between individual AI applications and enterprise solutions, which are lagging,” he noted. “Enabling language models to handle more information in their context windows opens up new possibilities.”
Longer contexts can empower agentic systems, in which multiple language models work together, by letting each model take in more information with fewer requests. Long-context LLMs can also simplify the data pipelines behind tasks such as style imitation.
“Instead of gathering and preprocessing data from various sources to train a model to mimic my writing style, you can simply input all my past emails, and the model learns to write like me,” Pekelis explained.
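A toy sketch of that workflow; query_model and the email history are assumed placeholders:

```python
from typing import Sequence

def style_prompt(past_emails: Sequence[str], instruction: str) -> str:
    """Put raw writing history in context instead of building a
    fine-tuning dataset."""
    history = "\n---\n".join(past_emails)
    return (
        "Here are emails I have written:\n\n"
        f"{history}\n\n"
        f"Now, writing in the same style: {instruction}"
    )

# draft = query_model(style_prompt(all_my_emails, "Decline the meeting invite."))
```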
Furthermore, LLMs with extensive context windows could reduce reliance on retrieval-augmented generation (RAG), where an application must fetch the documents relevant to every prompt. Hypothetically, an LLM with unlimited context could take in all available documents and select the most relevant sections per query, though in practice the documents would still need to be loaded into the context for each new chat session.
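The trade-off, sketched with assumed placeholder functions (retrieve_top_k and query_model are not from any specific library):

```python
from typing import Callable, Sequence

def answer_with_rag(query_model: Callable[[str], str],
                    retrieve_top_k: Callable[[str, int], Sequence[str]],
                    question: str) -> str:
    # RAG: fetch only the chunks judged relevant to this prompt.
    chunks = retrieve_top_k(question, 5)
    return query_model("\n\n".join(chunks) + "\n\n" + question)

def answer_with_long_context(query_model: Callable[[str], str],
                             documents: Sequence[str],
                             question: str) -> str:
    # Long context: send the whole corpus and let the model select what
    # matters, at the cost of re-sending it each session.
    return query_model("\n\n".join(documents) + "\n\n" + question)
```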
Enhanced context windows also lower the barriers for creating prototypes and proofs of concept, aiding product teams in grasping the potential of language models.
“Often, educating customers about what’s possible is a critical initial step,” Pekelis concluded. “Developing prototypes or initial examples illustrates the transformative potential for enterprises.”