Title: Google Gemini 1.5 Pro and Flash: Analyzing the Hype Around Long Context Capabilities
One of the standout features of Google's advanced generative AI models, Gemini 1.5 Pro and 1.5 Flash, is their advertised ability to process and analyze vast amounts of data. In various press briefings and demonstrations, Google has touted these models' “long context” capabilities, claiming they can tackle complex tasks like summarizing lengthy documents or searching through film footage. However, recent studies challenge these assertions, revealing that the models may not perform as well as claimed.
Two independent studies explored how effectively Google’s Gemini models, among others, handle extensive datasets—akin to reading “War and Peace.” The results indicated that Gemini 1.5 Pro and 1.5 Flash struggled significantly with accurately answering questions based on large volumes of text, achieving correct responses only 40%-50% of the time in document tests. “Although models like Gemini 1.5 Pro can technically process long contexts, there are numerous instances where they fail to truly ‘understand’ the material,” said Marzena Karpinska, a postdoctoral researcher at UMass Amherst and co-author of one study.
Understanding Context in AI Models
A model’s context, or context window, encompasses the input data it considers before generating responses. This can include straightforward questions like “Who won the 2020 U.S. presidential election?” or more complex data like movie scripts and audio clips. As context windows expand, so does the volume of information these models can interpret.
The latest iterations of Gemini can take in upwards of 2 million tokens as context. To put that in perspective, “tokens” are small chunks of raw data, such as the syllables that make up a word; 2 million tokens works out to roughly 1.4 million words, two hours of video, or 22 hours of audio. That is the largest context of any commercially available model.
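For readers who want to sanity-check those figures, here is a minimal back-of-the-envelope sketch in Python. The ratios are rough assumptions back-solved from the numbers above; actual counts depend on the tokenizer and media encoding a given model uses.

```python
# Rough conversion of a 2-million-token budget into words, video, and audio.
# The ratios below are assumptions derived from the figures quoted above;
# real token counts depend on the model's tokenizer and media encoding.
TOKEN_BUDGET = 2_000_000

WORDS_PER_TOKEN = 0.7                   # ~1.4M words per 2M tokens
VIDEO_TOKENS_PER_HOUR = 1_000_000       # ~2 hours of video per 2M tokens
AUDIO_TOKENS_PER_HOUR = 2_000_000 / 22  # ~22 hours of audio per 2M tokens

print(f"~{TOKEN_BUDGET * WORDS_PER_TOKEN:,.0f} words")
print(f"~{TOKEN_BUDGET / VIDEO_TOKENS_PER_HOUR:.1f} hours of video")
print(f"~{TOKEN_BUDGET / AUDIO_TOKENS_PER_HOUR:.0f} hours of audio")
```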
In a briefing earlier this year, Google showcased several pre-recorded demonstrations intended to highlight Gemini’s long-context capabilities. One demonstration featured Gemini 1.5 Pro searching through the transcript of the Apollo 11 moon landing telecast—about 402 pages—looking for humorous quotes and identifying scenes resembling a pencil sketch.
Google DeepMind's VP of Research, Oriol Vinyals, touted the model’s abilities, proclaiming that “[1.5 Pro] performs these sorts of reasoning tasks across every single page, every single word.” However, this claim seems to be an overstatement.
Benchmarking Gemini's Capabilities
In one of the mentioned studies, Karpinska and fellow researchers from the Allen Institute for AI and Princeton assessed how well the models evaluated true/false statements based on contemporary English fiction. They selected recent publications to prevent the models from using prior knowledge, embedding specific details and plot points that necessitated full comprehension of the texts.
For example, in evaluating a statement like “By using her skills as an Apoth, Nusis is able to reverse engineer the type of portal opened by the reagents key found in Rona’s wooden chest,” Gemini 1.5 Pro and 1.5 Flash were tasked with determining its truthfulness and justifying their conclusions. Testing on a novel containing around 260,000 words (approximately 520 pages), researchers found that 1.5 Pro answered correctly just 46.7% of the time, while Flash managed only a 20% accuracy rate. Overall, neither model significantly exceeded random guessing in question-answering accuracy.
Karpinska observed, “The models encountered greater challenges when verifying claims that required synthesizing larger portions of the text compared to those that could be confirmed with sentence-level evidence. They also struggled with implicit information that humans would easily grasp but isn't explicitly articulated.”
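To make the setup concrete, here is a minimal sketch of how a true/false claim benchmark of this kind might be scored. It is not the researchers’ actual code; the `ask_model` function is a hypothetical stand-in for a call to a long-context model.

```python
import random

def score_claims(book_text, claims, ask_model):
    """Score a model on true/false claims about a long document.

    `claims` is a list of (statement, gold_label) pairs; `ask_model` is a
    caller-supplied function (hypothetical here) that sends the prompt to a
    long-context model and returns True or False.
    """
    correct = 0
    for statement, gold in claims:
        prompt = (
            f"{book_text}\n\n"
            "Based only on the text above, is the following claim true or false?\n"
            f"Claim: {statement}\n"
            "Answer 'true' or 'false' and briefly justify your answer."
        )
        correct += int(ask_model(prompt) == gold)
    # On a balanced set of true/false claims, random guessing scores ~50%.
    return correct / len(claims)

# Example run with a stand-in "model" that guesses at random:
toy_claims = [("Nusis reverse engineers the portal.", True),
              ("The chest in Rona's room is made of iron.", False)]
print(score_claims("<full novel text>", toy_claims,
                   lambda prompt: random.choice([True, False])))
```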
The second study, involving researchers from UC Santa Barbara, assessed the reasoning abilities of Gemini 1.5 Flash. This study focused on the model’s ability to interpret video content—searching through materials and answering related questions.
The researchers created a dataset of images (a birthday cake, for instance) paired with questions about them (e.g., “What cartoon character is on this cake?”). To build test footage, they selected target images at random and inserted “distractor” images around them, producing slideshow-like sequences. In one test that asked the model to transcribe six handwritten digits from a sequence of 25 images, Flash achieved roughly 50% accuracy. That fell to about 30% when eight digits were involved.
Michael Saxon, a PhD student at UC Santa Barbara and co-author of the study, stated, “In real question-answering tasks involving images, it appears that all models we analyzed struggled significantly. This small reasoning step, identifying a number and reading it, may be where the model breaks down.”
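For illustration, below is a rough sketch of how such a digits-among-distractors test could be assembled and scored. The image lists and the `ask_vlm` call are hypothetical placeholders, not the study’s actual pipeline.

```python
import random

def build_slideshow(digit_frames, distractor_frames, length=25, n_digits=6):
    """Mix `n_digits` handwritten-digit frames into a `length`-frame sequence.

    `digit_frames` is a list of (image, digit_label) pairs and
    `distractor_frames` is a list of unrelated images (placeholders here).
    Returns the shuffled frame list and the ground-truth digit string
    in frame order.
    """
    digits = random.sample(digit_frames, n_digits)
    frames = random.sample(distractor_frames, length - n_digits)
    frames += [img for img, _ in digits]
    random.shuffle(frames)
    truth = "".join(label for img, label in
                    sorted(digits, key=lambda pair: frames.index(pair[0])))
    return frames, truth

def transcription_correct(frames, truth, ask_vlm):
    """`ask_vlm` is a hypothetical function that sends the frame sequence and
    a prompt to a vision-language model and returns its transcription."""
    prediction = ask_vlm(frames, "Transcribe the handwritten digits in order.")
    return prediction.strip() == truth
```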
Is Google Overpromising with Gemini?
Neither study has undergone peer review, and neither examined the 2-million-token context releases of Gemini (both focused on the 1-million-token versions). It is also worth noting that Flash is positioned as a lower-performing alternative to Pro.
Nonetheless, these findings reinforce concerns that Google has overstated Gemini's capabilities since its launch. Across both studies, which also tested OpenAI’s GPT-4 and Anthropic’s Claude 3.5 Sonnet, none of the models performed impressively. Yet Google is the only provider that has made context window size a centerpiece of its promotional efforts.
Saxon remarked, “There’s nothing inherently misleading about stating, ‘Our model can handle X number of tokens,’ based on technical specifications. The real question is: What can be accomplished with that capability?”
As generative AI faces growing scrutiny, many businesses and investors are voicing frustration with its limitations. In recent Boston Consulting Group surveys, around half of C-suite executives said they doubt generative AI will deliver significant productivity gains, citing concerns about errors and data integrity risks. Meanwhile, PitchBook reported that early-stage generative AI dealmaking has declined for two consecutive quarters, falling 76% from its Q3 2023 peak.
With unreliable meeting summaries and AI search tools that often produce misleading information, customers are eager for genuine advances. Google, which has at times stumbled in its race to keep pace with generative AI rivals, sought to make Gemini’s long context a key differentiator.
Conclusion: The Need for Realistic Benchmarking
However, this bet appears to have been premature. Karpinska pointed out, “We haven't yet determined how to effectively demonstrate that ‘reasoning’ or ‘understanding’ over long documents is genuinely occurring. Virtually every research group releasing these models employs its own ad hoc evaluations to substantiate these claims.”
Given the lack of transparency regarding long-context processing implementations, assessing the authenticity of these claims remains challenging.
Both Saxon and Karpinska emphasize that the remedy to inflated generative AI claims lies in improved benchmarking and greater attention to third-party evaluation. Saxon highlighted that one commonly cited long-context test, “needle in the haystack,” merely measures a model's ability to retrieve specific information, such as names and numbers, rather than answering nuanced questions.
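As a point of contrast, here is a minimal sketch of what a single “needle in the haystack” trial looks like, assuming a hypothetical `ask_model` call. It only checks whether a planted string can be retrieved from a long context, which is exactly the limitation Saxon describes.

```python
import random

def needle_in_haystack_trial(filler_paragraphs, ask_model,
                             needle="The secret code is 48151623."):
    """Bury one 'needle' sentence at a random depth in long filler text and
    check whether the model retrieves it. This probes retrieval of a specific
    string, not reasoning over the document as a whole."""
    depth = random.randint(0, len(filler_paragraphs))
    haystack = filler_paragraphs[:depth] + [needle] + filler_paragraphs[depth:]
    prompt = "\n\n".join(haystack) + "\n\nWhat is the secret code mentioned above?"
    answer = ask_model(prompt)  # hypothetical long-context model call
    return "48151623" in answer
```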
“All scientists and most engineers using these models largely agree our current benchmarking practices are flawed,” Saxon summarized, advising the public to approach sweeping claims of “general intelligence across benchmarks” with skepticism.
Update: A previous version incorrectly stated that Gemini 1.5 Pro and 1.5 Flash's accuracy was below random chance in reasoning over long text; their accuracy is actually above random chance. Google has also shared studies suggesting stronger long-context performance than discussed here.