Watch a Robot Skillfully Navigate Google DeepMind Offices with Gemini Technology

Generative AI is making significant strides in the field of robotics, showcasing a variety of applications such as natural language processing, robotic learning, no-code programming, and design innovation. This week, Google's DeepMind Robotics team is highlighting an exciting intersection of these domains: navigation.

In their latest research paper titled “Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs,” the team illustrates how they have implemented Google Gemini 1.5 Pro to enable a robot to understand commands and navigate effectively around an office setting. Notably, DeepMind has utilized some of the Everyday Robots that were part of a project halted last year amid broader layoffs.

In a series of engaging demonstration videos, DeepMind employees initiate interactions with a smart assistant-like prompt: “OK, Robot.” They then request the robot to carry out various tasks within a spacious 9,000-square-foot office environment.

In one instance, a Googler instructs the robot to take them to a place where they can draw. “OK,” the robot replies, sporting a cheerful yellow bowtie, “give me a minute. Thinking with Gemini …” The robot promptly guides the user to a wall-sized whiteboard. In another example, a different individual directs the robot to follow instructions displayed on the whiteboard. A straightforward map directs the robot to the “Blue Area,” and after a brief moment of contemplation, the robot opts for a longer route, ultimately arriving at a robotics testing area. “I’ve successfully followed the directions on the whiteboard,” it declares with a confidence that many humans would envy.

Before these demonstrations, the robots became acclimated to their environment through a process called “Multimodal Instruction Navigation with demonstration Tours (MINT).” This involves guiding the robot through the office while verbally identifying various landmarks. The team then applies hierarchical Vision-Language-Action (VLA) techniques, which merge environmental awareness with common-sense reasoning. By integrating these approaches, the robot gains the ability to respond to written and drawn commands, along with hand gestures.
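At a high level, the system pairs a long-context VLM stage (which matches the user's instruction against frames from the demonstration tour) with classical planning over a topological graph of those frames. A minimal sketch of that two-stage idea, with hypothetical location labels and a keyword stub standing in for the Gemini call:

```python
from collections import deque

# Topological graph built from a demonstration tour: nodes are tour
# frames, edges connect physically adjacent locations. The labels here
# are hypothetical stand-ins for frames of an office walkthrough.
TOUR_GRAPH = {
    "lobby":      ["hallway"],
    "hallway":    ["lobby", "whiteboard", "kitchen"],
    "whiteboard": ["hallway", "blue_area"],
    "kitchen":    ["hallway"],
    "blue_area":  ["whiteboard"],
}

def pick_goal_frame(instruction: str) -> str:
    """Stub for the VLM stage: in Mobility VLA, Gemini 1.5 Pro picks the
    goal frame from the tour video given a multimodal instruction. This
    keyword lookup is purely illustrative."""
    keywords = {"draw": "whiteboard", "blue": "blue_area", "coffee": "kitchen"}
    for word, frame in keywords.items():
        if word in instruction.lower():
            return frame
    return "lobby"

def shortest_path(graph, start, goal):
    """Breadth-first search over the topological graph; a low-level
    policy would then drive the robot waypoint to waypoint."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

goal = pick_goal_frame("Take me somewhere I can draw")
print(shortest_path(TOUR_GRAPH, "lobby", goal))
# → ['lobby', 'hallway', 'whiteboard']
```

The real system operates on video frames rather than named rooms, but the split is the same: the expensive multimodal reasoning only has to choose a goal node, while the navigation itself is cheap graph search.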

Google reports that the robot achieved a success rate of roughly 90% across interactions with more than 50 employees, underscoring the effectiveness of this navigation system.