Today, at its annual I/O developer conference in Mountain View, Google unveiled a host of announcements centered on artificial intelligence, including Project Astra—an ambitious initiative aimed at developing a universal AI agent for the future.
During the conference, an initial version of the agent was showcased. The goal is to create a multimodal AI assistant that perceives and comprehends its environment, responding in real time to assist with everyday tasks and questions. This concept aligns closely with the recent unveiling of OpenAI's GPT-4o-powered ChatGPT.
Are You Ready for AI Agents?
As OpenAI prepares to roll out GPT-4o for ChatGPT Plus subscribers over the coming weeks, Google is taking a more measured approach with Astra. While Google continues to refine this project, it has not announced a timeline for when the fully operational AI agent will be available. However, some features from Project Astra are expected to be integrated into its Gemini assistant later this year.
What to Expect from Project Astra?
Project Astra (short for Advanced Seeing and Talking Responsive Agent) builds on advances made with Gemini 1.5 Pro and other task-specific models. It allows users to interact with the assistant while sharing live video and audio of their surroundings. The assistant is designed to comprehend what it sees and hears, providing accurate answers in real time.
“To be truly useful, an agent needs to understand and respond to the complex and dynamic world just like people do,” said Demis Hassabis, CEO of Google DeepMind. “It must take in and remember what it sees and hears to grasp context and take action. Additionally, it should be proactive, teachable, and personal, enabling natural conversations without delays.”
In one demo video, a prototype Project Astra agent running on a Pixel smartphone identified objects, described their components, and interpreted code written on a whiteboard. The agent even recognized the neighborhood through the camera and recalled where the user had placed their glasses.
Google Project Astra in Action
A second demo highlighted similar functionalities, such as an agent proposing enhancements to a system architecture, augmented by real-time overlays visible through glasses.
Hassabis acknowledged the significant engineering challenge of getting the agents to respond with human-like speed. The agents continuously encode video frames, combine the video and speech input into a single timeline of events, and cache that information for efficient recall.
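To make the idea concrete, here is a minimal, purely illustrative sketch of such a timeline: encoded video frames and speech segments are appended to one time-ordered buffer that can later be queried for recent context (for instance, to answer where the user last set down their glasses). Every name and structure below is a hypothetical stand-in; Google has not published Astra's internals.

```python
# Illustrative sketch only: a simplified multimodal timeline of the kind
# Hassabis describes. All names are hypothetical, not Google's implementation.
import time
from dataclasses import dataclass, field


@dataclass
class TimelineEvent:
    timestamp: float       # when the frame or utterance was captured
    modality: str          # "video" or "speech"
    encoding: list[float]  # embedding from a (hypothetical) encoder


@dataclass
class Timeline:
    events: list[TimelineEvent] = field(default_factory=list)

    def add_frame(self, frame_embedding: list[float]) -> None:
        """Append an encoded video frame as it arrives from the camera."""
        self.events.append(TimelineEvent(time.time(), "video", frame_embedding))

    def add_speech(self, speech_embedding: list[float]) -> None:
        """Append an encoded speech segment from the microphone stream."""
        self.events.append(TimelineEvent(time.time(), "speech", speech_embedding))

    def recall(self, since_seconds: float) -> list[TimelineEvent]:
        """Return all events from the last `since_seconds`, e.g. to re-examine
        recent frames when asked 'where did I leave my glasses?'"""
        cutoff = time.time() - since_seconds
        return [e for e in self.events if e.timestamp >= cutoff]
```

In this toy version, recall is a simple time-window filter; a production agent would presumably search the buffer by content as well, but the basic design choice is the same: merge both modalities into one chronological log so context can be retrieved quickly.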
“By leveraging our advanced speech models, we improved the agents' vocal abilities, enabling a richer range of intonations. This enhancement allows agents to better understand their context and respond swiftly,” he added.
In contrast, OpenAI's GPT-4o processes all inputs and outputs in a single unified model, achieving an average response time of 320 milliseconds. Google has yet to disclose specific response times for Astra, though latency is expected to improve as development continues. It also remains unclear whether Astra agents will match the emotional expressiveness OpenAI demonstrated with GPT-4o.
Availability
Currently, Astra represents Google's initial efforts toward a comprehensive AI agent designed to assist with daily tasks, both personal and professional, while maintaining contextual awareness and memory. The company has not specified when this vision will become a tangible product but confirmed that the ability to understand and interact with the real world will be integrated into the Gemini app across Android, iOS, and web platforms.
Initially, the Gemini Live feature will enable two-way conversations with the chatbot. Later this year, updates are expected to incorporate the visual capabilities demonstrated, allowing users to engage with their surroundings through their cameras. Notably, users will also be able to interrupt Gemini mid-conversation, mirroring functionality OpenAI has shown with ChatGPT.
“With technology like this, it’s easy to envision a future where individuals have an expert AI assistant at their side, whether through a smartphone or glasses,” Hassabis concluded.