Google's Generative AI Can Now Analyze Hours of Video

Gemini, Google’s family of generative AI models, can now analyze longer documents, codebases, videos, and audio recordings than before.

During a keynote at the Google I/O 2024 developer conference on Tuesday, Google announced a private preview of a new version of Gemini 1.5 Pro, its flagship model, that can take in up to 2 million tokens, double the previous limit.

With the ability to process 2 million tokens, Gemini 1.5 Pro now supports the largest input of any generative AI model available commercially. Anthropic’s Claude 3 comes in second, with a maximum of 1 million tokens. In the context of AI, "tokens" refer to segments of data, such as the syllables “fan,” “tas,” and “tic” in the word “fantastic.” To illustrate, 2 million tokens equate to approximately 1.4 million words, two hours of video, or 22 hours of audio.
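The equivalences quoted above can be scaled to other context sizes with simple proportions. The following is a rough sketch that assumes the article's figures (2 million tokens for about 1.4 million words, two hours of video, or 22 hours of audio) scale linearly; the function name `estimate_capacity` is illustrative and not part of any Google tool.

```python
# Reference figures taken from the article for a 2-million-token window.
REFERENCE_TOKENS = 2_000_000
REFERENCE_WORDS = 1_400_000
REFERENCE_VIDEO_HOURS = 2
REFERENCE_AUDIO_HOURS = 22


def estimate_capacity(context_tokens: int) -> dict:
    """Scale the article's figures linearly to another context size.

    Assumption: capacity grows in direct proportion to the token limit.
    Real tokenizers and media encoders will deviate from this.
    """
    scale = context_tokens / REFERENCE_TOKENS
    return {
        "words": round(REFERENCE_WORDS * scale),
        "video_hours": REFERENCE_VIDEO_HOURS * scale,
        "audio_hours": REFERENCE_AUDIO_HOURS * scale,
    }


if __name__ == "__main__":
    for tokens in (1_000_000, 2_000_000):
        print(tokens, estimate_capacity(tokens))
```

Under this linear assumption, the 1-million-token version would hold roughly half of each quantity.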

Beyond handling larger files, models that accept more tokens can sometimes perform better. Unlike models with a small maximum token input (also known as the context window), the 2-million-token Gemini 1.5 Pro is less likely to "forget" the content of very recent conversations and veer off topic. Large-context models can also, at least in theory, better grasp the flow of the data they take in and generate richer, more relevant responses.

Developers eager to try out Gemini 1.5 Pro's enhanced 2-million-token context can join a waitlist via Google AI Studio, the platform for Google’s generative AI development tools. (A version with a 1-million-token context is expected to be broadly available across Google's developer services in the coming month.)

In addition to the expanded context window, Google says algorithmic enhancements have improved Gemini 1.5 Pro in areas such as code generation, logical reasoning, multi-turn conversation, and audio and image understanding. The model can now also reason across audio as well as images and video, and developers can steer its behavior through a feature Google calls system instructions.

For developers with less demanding needs, Google is introducing Gemini 1.5 Flash, a streamlined model designed for narrow, high-frequency generative AI tasks. Available in public preview with a context window of up to 2 million tokens, Flash is multimodal like Gemini 1.5 Pro: it can analyze audio, video, and images as well as text, but it generates only text.

“While Gemini Pro is tailored for complex, multi-step reasoning tasks, Flash is ideal for situations where rapid model output is essential,” explained Josh Woodward, VP of Google Labs, during a media briefing. He added that Flash is particularly beneficial for summarizing, chat applications, captioning images and videos, and extracting data from extensive documents and tables.

Flash appears to be Google's answer to smaller, budget-friendly models such as Anthropic’s Claude 3 Haiku. Both Gemini 1.5 Pro and Flash are now available in over 200 countries and territories, including the European Economic Area, the U.K., and Switzerland. The versions with a 2-million-token context, however, are available only through a waitlist.

In another move aimed at cost-conscious developers, all Gemini models, not just Flash, will soon be able to use a feature called context caching. This will let developers store large amounts of information (such as a knowledge base or a database of research papers) in a cache that Gemini models can access quickly and relatively cheaply.
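To see why caching a large context can cut costs, it helps to count the input tokens a developer sends with and without it. The sketch below is a toy accounting model only: it does not call the Gemini API, the function name `input_tokens_sent` is hypothetical, and it assumes a cached context is uploaded once and not re-sent, which may not match how Google actually meters cached content.

```python
def input_tokens_sent(context_tokens: int,
                      question_tokens: list[int],
                      cached: bool) -> int:
    """Count input tokens transmitted across a series of questions.

    Without caching, the full context accompanies every question.
    With caching (toy assumption), the context is uploaded once and
    each later request carries only the question itself.
    """
    if cached:
        return context_tokens + sum(question_tokens)
    return sum(context_tokens + q for q in question_tokens)


if __name__ == "__main__":
    # Hypothetical workload: a 1,000,000-token knowledge base, 50 short questions.
    questions = [200] * 50
    print("resend each time:", input_tokens_sent(1_000_000, questions, cached=False))
    print("cache once:      ", input_tokens_sent(1_000_000, questions, cached=True))
```

The gap between the two totals grows with the number of questions, which is why caching matters most for long contexts that are queried repeatedly.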

A complementary Batch API, currently in public preview on Vertex AI, Google’s enterprise-focused generative AI development platform, offers a more cost-effective way to handle workloads such as classification, sentiment analysis, data extraction, and description generation by allowing multiple prompts to be sent to Gemini models in a single request.

Another feature set to launch later this month in preview on Vertex is controlled generation, which could provide additional cost savings by allowing users to specify output formats or schemas (such as JSON or XML) for the Gemini models.
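One way to picture the potential savings: without a guaranteed output format, an application has to check each response and retry when the model replies in free-form text. The sketch below shows that client-side check using only Python's standard library. It is not the Vertex AI API; the schema `SENTIMENT_SCHEMA` and the function `conforms` are hypothetical names chosen for illustration.

```python
import json

# Hypothetical schema for a sentiment task: field name -> required type.
SENTIMENT_SCHEMA = {"label": str, "score": float}


def conforms(raw_output: str, schema: dict) -> bool:
    """Return True if raw_output is a JSON object matching the schema exactly."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # free-form text, not JSON
    if not isinstance(data, dict) or set(data) != set(schema):
        return False  # missing or unexpected fields
    return all(isinstance(data[key], kind) for key, kind in schema.items())


if __name__ == "__main__":
    print(conforms('{"label": "positive", "score": 0.9}', SENTIMENT_SCHEMA))
    print(conforms("Sure! The sentiment is positive.", SENTIMENT_SCHEMA))
```

If the model's output is constrained to the schema up front, checks like this rarely fail, so fewer requests need to be repeated.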

“You’ll be able to send all your files to the model at once, eliminating the need to resend them repeatedly,” Woodward noted. “This will enhance the utility of the long context while also making it more affordable.”
