Google’s New Gemini Model Can Analyze Hour-Long Videos—Yet Few Users Are Able to Access It

Google Unveils Gemini 1.5 Pro: A Major Leap in GenAI Capabilities

Last October, a research paper co-authored by a Google data scientist, Databricks CTO Matei Zaharia, and UC Berkeley professor Pieter Abbeel proposed a technique for dramatically expanding what GenAI models, such as OpenAI's GPT-4 (the model behind ChatGPT), can take in. The study showed that by working around a significant memory bottleneck, these models can process millions of words at a time rather than the hundreds of thousands that was previously the practical ceiling, marking a substantial advancement in AI research.

Gemini 1.5 Pro: Enhanced Data Processing

Today, Google announced the launch of Gemini 1.5 Pro, the latest addition to its Gemini family of GenAI models. This model serves as a direct replacement for Gemini 1.0 Pro (previously known as "Gemini Pro 1.0" due to Google's intricate marketing terminology). Compared to its predecessor, Gemini 1.5 Pro shows significant improvements, particularly in data-processing capabilities.

Gemini 1.5 Pro can take in approximately 700,000 words or 30,000 lines of code, 35 times the capacity of its predecessor. And because the model is multimodal, it is not limited to text: it can ingest up to 11 hours of audio or an hour of video in a variety of languages.

Important Clarification on Availability

It's crucial to note that this upper limit pertains to experimental versions. The Gemini 1.5 Pro model being made available to most developers and customers today—through a limited preview—can process only ~100,000 words at a time. Google refers to the high-capacity variant of Gemini 1.5 Pro as "experimental," accessible solely to selected developers participating in a private preview via the company's GenAI development tool, AI Studio. Some customers using Google’s Vertex AI platform also have limited access to this advanced model.

Oriol Vinyals, VP of Research at Google DeepMind, praised these developments, stating, "When interacting with GenAI models, the context produced by input and output is vital. The more complex your queries, the more extensive context the model needs to handle." Vinyals emphasized, "We’ve significantly unlocked long context capabilities."

Understanding Context Windows

The term context window describes the volume of input data (such as text) that a model considers before generating its output (such as responses or additional text). Simple queries like “Who won the 2020 U.S. presidential election?” can fit within this context, as can larger datasets like a movie script or an e-book.

Models with limited context windows often struggle to retain details from recent interactions, which can lead to off-topic responses. In contrast, models with larger context windows are thought to maintain better narrative coherence and deliver more nuanced responses.
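The mechanics of a finite context window can be illustrated with a minimal sketch: keep only as much recent conversation as fits the token budget, dropping the oldest turns first. The four-characters-per-token heuristic, the budget value, and the helper names below are illustrative assumptions, not Gemini's actual tokenizer or API.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def fit_to_context(turns: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent turns that fit within the token budget."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget_tokens:
            break  # everything older is "forgotten"
        kept.append(turn)
        used += cost
    return list(reversed(kept))

history = [
    "Who won the 2020 U.S. presidential election?",
    "Joe Biden won the 2020 election.",
    "How many electoral votes did he receive?",
]
# With a tight budget, the earliest question falls out of the window.
print(fit_to_context(history, budget_tokens=20))
```

A model with a small window behaves like the tight-budget case here, losing track of what was said first; enlarging the budget is what a bigger context window buys you.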

Competitive Landscape

Other organizations have raced to build models with extensive context windows. For instance, AI startup Magic claimed to have built a large language model (LLM) featuring a 5-million-token context window, and concurrent research from Meta, MIT, and Carnegie Mellon detailed models that could scale to 1 million tokens or more. Google, however, is the first to make a model with a context window of this scale commercially available, surpassing the previous record of 200,000 tokens held by Anthropic.

Gemini 1.5 Pro features a maximum context window of 1 million tokens, while the more readily available version has a 128,000-token context window, matching that of OpenAI’s GPT-4 Turbo.

Practical Applications of Gemini 1.5 Pro

So, what does a 1 million-token context window enable? Google asserts numerous practical applications, including the ability to analyze entire code libraries, “reason through” complex documents such as contracts, maintain lengthy conversations with chatbots, and analyze video content.

During the press briefing, Google showcased two demonstrations of Gemini 1.5 Pro utilizing the 1 million-token context window. In the first demo, the model successfully sifted through the Apollo 11 moon landing transcript to extract jokes and identify a scene resembling a pencil sketch. In the second demo, it identified scenes from "Sherlock Jr." based on provided descriptions and sketches.

Performance Speed and Optimization Steps

Although Gemini 1.5 Pro completed the tasks, processing took anywhere from 20 seconds to a minute per demo, noticeably slower than the average ChatGPT query. Vinyals acknowledged this latency, asserting that improvements are forthcoming. "We're continuously working to optimize latency. This model is still in its experimental phase," he stated, revealing that testing for a 10 million-token context window is already underway.

However, such latency might deter potential users, as waiting minutes to search through video content seems inefficient and unlikely to attract widespread adoption. Concerns also arise around how latency might affect other applications like chatbot interactions or code analysis.

My optimistic colleague noted that despite the delays, overall time efficiency might render the wait worthwhile, depending on the specific use case. For extracting plot points from shows, the delays might be off-putting. Yet, they could prove acceptable when identifying specific screengrabs from vague memories.

Additional Improvements

Beyond its enhanced context window, Gemini 1.5 Pro introduces several quality-of-life enhancements. Google claims that the overall quality of Gemini 1.5 Pro is "comparable" to its flagship Gemini Ultra model, thanks to a new architecture composed of smaller "expert" models. Gemini 1.5 Pro breaks tasks into subtasks and routes each one to the expert model it predicts is most relevant.
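The routing idea behind a Mixture of Experts can be sketched in a few lines: a small gating function scores each expert for a given input, and only the top-scoring expert (or a few) actually runs. The experts, gate parameters, and dimensions here are toy stand-ins, not Gemini's architecture.

```python
import math
import random

random.seed(0)

N_EXPERTS, D_MODEL = 4, 8

# Toy parameters: one gating vector and one weight matrix per expert.
gate = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_EXPERTS)]
experts = [
    [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_MODEL)]
    for _ in range(N_EXPERTS)
]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def moe_forward(x, top_k=1):
    """Route x to the top_k experts chosen by the gate; the rest stay idle."""
    scores = [dot(g, x) for g in gate]                       # one score per expert
    top = sorted(range(N_EXPERTS), key=scores.__getitem__)[-top_k:]
    weights = [math.exp(scores[i]) for i in top]             # softmax over selected
    total = sum(weights)
    out = [0.0] * D_MODEL
    for i, w in zip(top, weights):
        y = [dot(row, x) for row in experts[i]]              # only selected experts run
        for j in range(D_MODEL):
            out[j] += (w / total) * y[j]
    return out

x = [random.gauss(0, 1) for _ in range(D_MODEL)]
print(len(moe_forward(x)))  # 8
```

The efficiency win is that compute scales with the number of experts consulted per input (top_k), not with the total number of experts in the model.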

While the use of Mixture of Experts (MoE) is not new, its increasing efficiency has made it a popular choice among model developers, including the teams behind Microsoft's language translation services. The "comparable quality" claim is hard to assess, however, given the difficulty of benchmarking multimodal GenAI models and the fact that many remain in private preview, out of reach of broad evaluation.

Pricing and Future Implications

During its limited preview, Gemini 1.5 Pro featuring the 1 million-token context window will be free to use. However, Google plans to roll out pricing tiers soon, starting at the standard 128,000-token context window and extending to 1 million tokens.

Given current pricing trends, this larger context window could come with higher costs. While Google did not disclose specific pricing during the briefing, if aligned with Anthropic's model, it might reach $8 per million prompt tokens and $24 per million generated tokens. However, there remains hope for lower costs.
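Per-token pricing makes the cost of a full-window prompt easy to estimate. The sketch below uses Anthropic's published rates purely as a reference point, since Gemini 1.5 Pro's actual prices were not announced:

```python
def request_cost(prompt_tokens: int, output_tokens: int,
                 prompt_rate: float, output_rate: float) -> float:
    """Cost in dollars, with rates expressed per million tokens."""
    return (prompt_tokens * prompt_rate + output_tokens * output_rate) / 1_000_000

# Anthropic-style rates ($8 / $24 per million tokens), as a stand-in.
cost = request_cost(prompt_tokens=1_000_000, output_tokens=1_000,
                    prompt_rate=8.00, output_rate=24.00)
print(f"${cost:.2f}")  # a single full 1M-token prompt: $8.02
```

At those rates, filling the entire window on every request adds up quickly, which is one reason pricing tiers tied to context size matter.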

Moreover, the ramifications for other models in the Gemini family, especially Gemini Ultra, are unknown. Will Ultra upgrades keep pace with Pro's advancements, or will there be prolonged stretches during which Pro outperforms Ultra even as Ultra continues to be marketed as the premium option in the Gemini portfolio? It's a scenario that invites both confusion and curiosity.

Conclusion

As Google continues to develop its Gemini lineup, Gemini 1.5 Pro emerges as a groundbreaking step forward for Generative AI, pushing the boundaries of data processing and context management. Whether these advancements will translate into a favorable user experience remains to be seen, and the landscape of AI will undoubtedly continue to evolve.
