Google Cloud has introduced two versions of its flagship AI model: Gemini 1.5 Flash and Gemini 1.5 Pro. Gemini 1.5 Flash is a compact multimodal model with a 1 million token context window, designed for high-frequency tasks; first unveiled at Google I/O in May, it is now available to developers. The more powerful Gemini 1.5 Pro, introduced in February, offers a 2 million token context window, making it the most advanced version of Google's large language model (LLM) to date.
The launch of these Gemini variants demonstrates how Google's AI technology can help businesses build new AI agents and solutions. During a recent press briefing, Google Cloud CEO Thomas Kurian highlighted the "incredible momentum" in generative AI adoption, noting that major organizations such as Accenture, Airbus, and Goldman Sachs are building on Google's platform. Kurian attributes this surge to the combination of Google's models and its Vertex AI platform, and promised rapid advancements in both areas.
Gemini 1.5 Flash
Gemini 1.5 Flash gives developers lower latency, cost-efficient pricing, and a context window suited to applications such as retail chat agents and document processing. Google claims that Gemini 1.5 Flash is, on average, 40% faster than GPT-3.5 Turbo when given a 10,000-character input, and that its input price is up to four times lower than OpenAI's model, with context caching enabled for inputs larger than 32,000 characters.
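To make the low-latency use case concrete, here is a minimal chat sketch against Gemini 1.5 Flash using the Vertex AI Python SDK (google-cloud-aiplatform); the project ID and model version string are placeholders, not details from the announcement.

```python
# Minimal sketch: a latency-sensitive chat exchange on Gemini 1.5 Flash.
# "your-project-id" and the model version are placeholder assumptions.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

# Flash targets high-frequency, latency-sensitive workloads such as
# retail chat agents.
model = GenerativeModel("gemini-1.5-flash-001")

chat = model.start_chat()
response = chat.send_message("Where is my order #12345?")
print(response.text)
```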
Gemini 1.5 Pro
Gemini 1.5 Pro offers a 2 million token context window, allowing it to analyze far more content and generate comprehensive responses. Kurian explains that users can feed in extensive inputs, such as two hours of high-definition video or more than 60,000 lines of code, without breaking them into smaller segments. Many companies are already finding significant value in this expanded processing power.
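As a sketch of that single-request workflow, the example below hands a long video stored in Cloud Storage to Gemini 1.5 Pro in one call through the Vertex AI Python SDK; the bucket path and model version are illustrative assumptions.

```python
# Sketch: analyzing a two-hour video in a single request to Gemini 1.5 Pro.
# The gs:// path is a placeholder; the file must be readable by the project.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel("gemini-1.5-pro-001")

# The 2 million token window lets the whole video travel in one prompt,
# with no need to segment it first.
video = Part.from_uri("gs://your-bucket/two_hour_recording.mp4",
                      mime_type="video/mp4")
response = model.generate_content([video, "Summarize the key moments."])
print(response.text)
```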
Kurian distinguishes the two models by user need: Gemini 1.5 Pro suits processing very long content, while Flash is the better fit for low-latency applications.
Context Caching for Gemini 1.5
To help developers get the most out of Gemini's context windows, Google is introducing context caching, now in public preview for both models. The feature lets models store and reuse input they have already processed, cutting costs by as much as 75% for long conversations or documents, since the same tokens no longer have to be resent and reprocessed with every request.
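In practice this means paying to process a long input once, then running follow-up queries against the cache. The sketch below uses the preview caching module the Vertex AI Python SDK exposed at launch (vertexai.preview.caching); since this was a preview API, the names may have changed, and the document URI, system instruction, and TTL are illustrative assumptions.

```python
# Sketch of context caching on Gemini 1.5 Pro via the preview SDK surface.
# The report path, instruction text, and one-hour TTL are placeholders.
import datetime

import vertexai
from vertexai.preview import caching
from vertexai.preview.generative_models import Content, GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")

# Process the large document once and keep the resulting tokens cached.
cached = caching.CachedContent.create(
    model_name="gemini-1.5-pro-001",
    system_instruction="Answer questions using only the attached report.",
    contents=[
        Content(
            role="user",
            parts=[Part.from_uri("gs://your-bucket/annual_report.pdf",
                                 mime_type="application/pdf")],
        )
    ],
    ttl=datetime.timedelta(hours=1),
)

# Later queries reuse the cached tokens instead of resending the document.
model = GenerativeModel.from_cached_content(cached_content=cached)
print(model.generate_content("What were total revenues last year?").text)
```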
Provisioned Throughput for Gemini
The newly available provisioned throughput feature lets developers scale their use of Gemini models by reserving the query capacity a model needs to handle over time. This option offers greater predictability and reliability than the default pay-as-you-go model. Kurian noted that provisioned throughput lets customers reserve inference capacity, ensuring consistent performance even during spikes in demand, such as those social media platforms experience during large events.
Provisioned throughput is now generally available, offering developers greater control over their production workloads and service-level assurances regarding response times and uptime.
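The announcement doesn't spell out the opt-in mechanics, but as a rough sketch of how an individual request might be pinned to reserved capacity, the REST call below sets the X-Vertex-AI-LLM-Request-Type header; treat the header name and its "dedicated" and "shared" values as assumptions to check against current Vertex AI documentation.

```python
# Rough sketch, not from the article: routing one generateContent request to
# reserved (provisioned) capacity via a request-type header. The header name
# and values are assumptions to verify; project ID and model are placeholders.
import google.auth
import google.auth.transport.requests
import requests

PROJECT = "your-project-id"
REGION = "us-central1"
MODEL = "gemini-1.5-pro-001"

# Obtain an access token through Application Default Credentials.
creds, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
creds.refresh(google.auth.transport.requests.Request())

url = (f"https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT}"
       f"/locations/{REGION}/publishers/google/models/{MODEL}:generateContent")

resp = requests.post(
    url,
    headers={
        "Authorization": f"Bearer {creds.token}",
        # Assumed opt-in: "dedicated" = reserved capacity only,
        # "shared" = pay-as-you-go only.
        "X-Vertex-AI-LLM-Request-Type": "dedicated",
    },
    json={"contents": [{"role": "user", "parts": [{"text": "Hello"}]}]},
)
print(resp.json())
```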