Google is using Gemini AI to make its robots better at navigating spaces and completing tasks. The DeepMind robotics team published a research paper detailing how Gemini 1.5 Pro's long context window, which lets the model take in large amounts of video and text in a single prompt, allows users to interact with its RT-2 robots using natural language commands.
The process involves filming a video tour of a space, such as a home or office, which the robot "watches" to learn about the environment. It can then carry out commands based on what it has observed, such as guiding a user to a power outlet when shown a phone and asked, "Where can I charge this?" DeepMind reports that the Gemini-powered robot succeeded on 90 percent of more than 50 user instructions given across a 9,000-plus-square-foot operating area.
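The paper doesn't include the robot's control code, but the core idea (feeding a long video tour plus a user request into Gemini 1.5 Pro's large context window) can be sketched with Google's public google-generativeai Python SDK. Everything here is illustrative: the "tour.mp4" file name, the prompt wording, and printing a destination instead of driving a robot are assumptions, not details from DeepMind's system.

```python
import time

import google.generativeai as genai

# A minimal sketch of long-context multimodal prompting, not DeepMind's
# actual robot stack. "tour.mp4" and the prompt text are assumptions.
genai.configure(api_key="YOUR_API_KEY")

# Upload the walkthrough video and wait for server-side processing.
tour = genai.upload_file(path="tour.mp4")
while tour.state.name == "PROCESSING":
    time.sleep(5)
    tour = genai.get_file(tour.name)

model = genai.GenerativeModel("gemini-1.5-pro")

# Ask the kind of question a user would ask the robot; a real system
# would turn the answer into a navigation goal rather than print it.
response = model.generate_content([
    tour,
    "You have watched a tour of this office. A user holds up a phone and "
    "asks: 'Where can I charge this?' Describe where they should go.",
])
print(response.text)
```

The hard part in the real system is grounding that textual answer in the robot's map of the space so it can actually drive there; the sketch stops at the language step.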
Researchers also found “preliminary evidence” that Gemini 1.5 Pro lets robots plan how to carry out tasks that go beyond simple navigation. For instance, if a user with a desk full of Coke cans asks whether their favorite drink is available, Gemini can direct the robot to navigate to the fridge, check for Cokes, and then return to report the answer. DeepMind says it plans to investigate these results further.
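The paper doesn't spell out the planning interface, but the same long-context setup can be prompted for an ordered plan rather than a single destination. This continues the sketch above; the file ID, the numbered-step format, and the fridge scenario framing are all assumptions for illustration.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Reuse the previously uploaded tour video; this file ID is hypothetical.
tour = genai.get_file("files/office-tour")

# Ask for a plan instead of a destination. The numbered-step convention
# is illustrative, not an interface described in the paper.
plan = model.generate_content([
    tour,
    "A user whose desk is covered in Coke cans asks: 'Is my favorite "
    "drink available?' Reply with a numbered list of steps the robot "
    "should take: where to navigate, what to check, and what to report.",
])
print(plan.text)
```

In DeepMind's demos the robot executes plans like this end to end; the sketch only produces the text of the plan.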
While Google's video demonstrations are impressive, the research notes that the robot takes 10 to 30 seconds to process each instruction. We may not be sharing our homes with advanced environment-mapping robots just yet, but these machines could soon help us find our missing keys or wallets.