Google Gemini Updates: Unveiling Project Astra and Its Role in I/O's Major Announcements

Google is enhancing Gemini, its AI-powered chatbot, so that it better understands both the world around it and the people conversing with it. At the Google I/O 2024 developer conference on Tuesday, the tech giant introduced Gemini Live, a feature that lets users hold “in-depth” voice conversations with the chatbot directly from their smartphones. Users can interrupt Gemini mid-response to ask for clarification, and the chatbot will adapt to their speech patterns in real time. Gemini can also perceive and respond to users' surroundings by processing photos or video captured by their phone cameras.

“With Live, Gemini can better understand you,” stated Sissie Hsiao, GM for Gemini experiences at Google, during a press briefing. “It’s finely tuned for intuitive interactions and provides a dynamic conversation experience with [the core AI] model.”

Gemini Live represents an evolution of both Google Lens — the company's long-standing computer vision tool — and Google Assistant, its AI-driven voice assistant operating across phones, smart speakers, and televisions.

At first glance, Live may not appear significantly different from existing technologies. However, Google says it draws on newer generative AI techniques to deliver more accurate image analysis, and pairs those capabilities with an improved speech engine for engaging, emotionally resonant multi-turn conversations.

“It functions as a real-time voice interface with exceptional multimodal features and extensive context,” shared Oriol Vinyals, principal scientist at DeepMind, Google's AI research unit, in an interview. “Imagine how powerful this combination will feel.”

The innovations behind Live partly originate from Project Astra, a new initiative within DeepMind aimed at developing AI applications and agents for real-time, multimodal comprehension.

“We have always aimed to create a universal agent that is useful in everyday situations,” remarked DeepMind CEO Demis Hassabis during the briefing. “Picture agents that can observe and hear our actions, grasping the context we are in and responding swiftly in conversation, which creates a more natural flow in interactions.”

Gemini Live — scheduled for release later this year — will be capable of answering queries about what’s in view (or what has recently been viewed) through a smartphone camera. For instance, it can identify a neighborhood or the name of a part on a broken bicycle. When pointed at a portion of code, Live can explain its functionality, or if asked about a misplaced pair of glasses, it can reveal where it last “saw” them.

Additionally, Live will function as a virtual coach, assisting users in preparing for events, brainstorming ideas, and more. For example, it can recommend skills to highlight in an upcoming job or internship interview or provide tips for public speaking.

“Gemini Live offers more concise responses and engages in a more conversational manner compared to text-based interactions,” noted Hsiao. “We believe an AI assistant should not only tackle complex issues but also feel incredibly natural and fluid during engagement.”

The model's enhanced “memory” is made possible by Gemini 1.5 Pro, the flagship model in Google's Gemini family (supplemented, to a lesser extent, by other “task-specific” generative models). Its longer context window allows it to take in and reason over large volumes of data, up to an hour of video, before formulating a response.

“That means you can interact with the model for hours, and it will retain all preceding information,” Vinyals stated.

Live bears some resemblance to the generative AI behind Meta's Ray-Ban smart glasses, which can interpret images captured by a built-in camera in near real time. And judging from the pre-recorded demos Google presented during the briefing, it is also strikingly similar to OpenAI's recently updated ChatGPT.

However, a notable distinction is that Gemini Live will not be available for free. Upon launch, it will be accessible exclusively through Gemini Advanced, a more sophisticated version of Gemini, which requires a Google One AI Premium Plan subscription priced at $20 per month.

In a potential nod to Meta, one of Google's demos featured a person wearing AR glasses running a Gemini Live-like application. Google, keen to avoid a repeat of its earlier misstep in the eyewear market with Google Glass, would not confirm whether glasses powered by its generative AI will come to market anytime soon.

Vinyals, however, did not completely dismiss the possibility: “We’re still in the prototyping phase and showcasing [Astra and Gemini Live]. We are gathering feedback from early testers, which will guide our next steps.”

Additional Gemini Updates

In addition to Live, Gemini is set to receive a variety of upgrades aimed at enhancing its daily utility.

Gemini Advanced users in more than 150 countries, across over 35 languages, can now use Gemini 1.5 Pro's expanded context to analyze, summarize, and answer questions about lengthy documents (up to 1,500 pages). Although Live will not arrive until later this year, Gemini Advanced users can start interacting with Gemini 1.5 Pro today. Documents can be imported from Google Drive or uploaded directly from a mobile device.

Later this year, users of Gemini Advanced will see their context window expand further—to 2 million tokens—and gain the ability to upload videos (up to two hours long) for analysis, as well as the capacity to work with extensive codebases (over 30,000 lines of code).

Google believes the expanded context window will also improve Gemini's image understanding. For instance, presented with a photo of a fish dish, Gemini could suggest a comparable recipe; shown a math problem, it could walk through the solution step by step.

Moreover, it will facilitate travel planning.

In the coming months, Gemini Advanced will introduce a novel “planning experience” designed to create tailored travel itineraries based on user prompts. By considering factors such as flight times (extracted from emails in the user’s Gmail), meal preferences, local attractions (via Google Search and Maps), and the distances between these sites, Gemini will generate an itinerary that updates in real time to reflect any changes.

In the near term, Gemini Advanced users will also have the ability to create “Gems,” custom chatbots powered by Google’s Gemini models. Similar to OpenAI’s GPTs, these Gems can be generated from natural language descriptions — for instance, “You’re my running coach. Provide a daily running plan” — and can be shared or kept private. No updates have been shared regarding the possible launch of a storefront for Gems akin to OpenAI’s GPT Store; more information may emerge as Google I/O continues.

Soon, both Gems and Gemini will benefit from a broader range of integrations with Google services, including Google Calendar, Tasks, Keep, and YouTube Music, to facilitate various time-saving tasks.

“As an example, imagine you receive a flier from your child’s school listing multiple events to add to your personal calendar,” Hsiao explained. “You could take a picture of the flier and instruct the Gemini app to directly create those calendar entries. This will save a significant amount of time.”

Given the occasional inaccuracies in generative AI summaries and Gemini’s mixed initial reviews, it’s advisable to approach Google’s assertions with some skepticism. However, if the enhanced Gemini and Gemini Advanced truly perform as Hsiao describes — and that is a considerable “if” — they could prove to be invaluable time savers.

