2024 is shaping up to be a transformative year at the intersection of generative AI, large foundation models, and robotics, with potential applications ranging from new learning techniques to novel product design. Among the pioneers in this field are researchers on Google DeepMind's robotics team, who have shared insights into their ongoing efforts to improve how robots understand and respond to human needs.
Historically, robots have been engineered to perform a single task repeatedly for their entire operational lifespan. While these single-purpose robots excel at their designated functions, they often stumble when confronted with unexpected changes or errors in their environment.
The recently unveiled AutoRT aims to harness large foundation models toward a number of different ends. In an illustrative example from the DeepMind team, the system uses a visual language model (VLM) for improved situational awareness: AutoRT can coordinate a fleet of camera-equipped robots, mapping their environment and identifying the objects within it. A large language model then suggests tasks each robot can feasibly carry out with its hardware, including its end effector. LLMs are increasingly seen as key to letting robots interpret natural-language commands, greatly reducing the need for hard-coded skills.
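AutoRT's internals have not been released, but the perceive-propose-filter-act pattern described above can be sketched in outline. In the following sketch, every name (Robot, vlm_describe_scene, llm_propose_tasks, passes_safety_filter) is a hypothetical placeholder standing in for the corresponding VLM, LLM, and control components, not DeepMind's actual API:

```python
# A minimal sketch of an AutoRT-style orchestration loop. All components
# here are hypothetical placeholders, not DeepMind's implementation.

from dataclasses import dataclass, field


@dataclass
class Robot:
    """One camera-equipped robot in the fleet."""
    robot_id: str
    log: list[str] = field(default_factory=list)

    def capture_frame(self) -> bytes:
        return b""  # placeholder: read the onboard camera

    def execute(self, task: str) -> None:
        self.log.append(task)  # placeholder: run the low-level control policy


def vlm_describe_scene(frame: bytes) -> list[str]:
    """Placeholder for the VLM call that lists objects visible in the frame."""
    return ["sponge", "countertop"]


def llm_propose_tasks(objects: list[str]) -> list[str]:
    """Placeholder for the LLM call that suggests natural-language tasks."""
    return [f"pick up the {obj}" for obj in objects]


def passes_safety_filter(task: str) -> bool:
    """Placeholder for the feasibility/safety check run before execution."""
    return "countertop" not in task  # e.g. reject tasks the gripper can't do


def autort_step(fleet: list[Robot]) -> None:
    """One orchestration cycle: perceive, propose, filter, act, per robot."""
    for robot in fleet:
        objects = vlm_describe_scene(robot.capture_frame())
        candidates = llm_propose_tasks(objects)
        viable = [t for t in candidates if passes_safety_filter(t)]
        if viable:
            robot.execute(viable[0])


fleet = [Robot(f"robot-{i}") for i in range(20)]
autort_step(fleet)
```

The key design point is that the LLM only proposes tasks in natural language; a separate filtering stage decides what the hardware can safely attempt before anything is executed.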
The AutoRT system has been tested over the past seven months, during which it orchestrated as many as 20 robots simultaneously and 52 distinct robots in total. Over that period, DeepMind collected roughly 77,000 trials spanning more than 6,000 tasks.
Also noteworthy is RT-Trajectory, a new tool that uses video input to improve robotic learning. While many teams are exploring YouTube videos as a way to train robots at scale, RT-Trajectory adds a novel twist: it overlays a two-dimensional sketch of the arm's trajectory onto each training video. The team explains that “these trajectories, represented as RGB images, furnish practical visual cues to the model as it develops its robot-control strategies.”
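RT-Trajectory itself is not open source, but the core conditioning idea, making the trajectory sketch literally part of the RGB observation, is easy to illustrate. The sketch below uses Pillow and NumPy with made-up coordinates and a blank stand-in frame; nothing in it reflects DeepMind's actual pipeline:

```python
# Rough illustration of trajectory conditioning: rasterize a 2D gripper path
# onto an RGB frame so the sketch becomes part of the image the model sees.
# Coordinates and frame are made up; this is not DeepMind's implementation.

import numpy as np
from PIL import Image, ImageDraw

# A video frame (a blank 256x256 RGB image standing in for camera input).
frame = Image.fromarray(np.zeros((256, 256, 3), dtype=np.uint8))

# A made-up 2D trajectory: (x, y) pixel positions of the gripper over time.
trajectory = [(40, 200), (80, 150), (140, 120), (200, 110)]

# Overlay the trajectory as a polyline drawn directly into the RGB pixels.
draw = ImageDraw.Draw(frame)
draw.line(trajectory, fill=(0, 255, 0), width=3)

conditioned_observation = np.asarray(frame)  # HxWx3 array fed to the policy
```

Because the hint is just pixels, the same policy architecture can consume it without any new input modality, which is presumably part of the technique's appeal.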
According to DeepMind, an arm controlled by RT-Trajectory achieved a 63% success rate across 41 evaluated tasks unseen in its training data, well above the 29% achieved by RT-2. The researchers emphasize that RT-Trajectory taps into the rich motion information already present in existing robot datasets, which has so far gone largely underutilized: “This advancement is not just another step toward crafting robots that can navigate new environments with precision and efficiency, but it also unlocks valuable insights from the wealth of data we already possess.”