How LLMs are Pioneering a New Era in Robotics Innovation

Recent months have witnessed a surge in projects leveraging large language models (LLMs) to develop innovative robotics applications previously thought impossible. The power of LLMs and multi-modal models is enabling researchers to create robots capable of processing natural language commands and executing complex tasks that require advanced reasoning.

This rising interest at the intersection of LLMs and robotics has revitalized the robotics startup landscape, with numerous companies securing substantial funding and showcasing impressive demonstrations.

With remarkable advancements in LLMs making their way into real-world applications, we may be on the brink of a new era in robotics.

Language Models for Perception and Reasoning

Traditionally, building robotic systems necessitated intricate engineering efforts to develop planning and reasoning modules, making it challenging to create user-friendly interfaces that accommodate the diverse ways people issue commands.

The emergence of LLMs and vision-language models (VLMs) has empowered robotics engineers to enhance existing systems in groundbreaking ways. A pivotal project in this area was SayCan, developed by Google Research. SayCan utilized the semantic knowledge embedded in an LLM to assist robots in reasoning about tasks and determining appropriate action sequences.
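
To make the idea concrete, the sketch below mimics SayCan's core scoring loop: each candidate skill is ranked by multiplying an LLM-derived usefulness score (task grounding) with an affordance score reflecting what the robot can actually do from its current state (world grounding). The skill list, hand-coded scores, and toy state are illustrative stand-ins, not SayCan's actual models.

```python
# A minimal, self-contained sketch of the SayCan idea: score each candidate
# skill by multiplying an LLM "usefulness" score with an affordance score,
# then pick the best-scoring skill at every step. All scores and state
# updates below are hand-coded stand-ins for the learned components.

SKILLS = ["find a sponge", "pick up the sponge", "go to the table",
          "wipe the table", "done"]

def llm_usefulness(instruction, history, skill):
    # In SayCan this comes from an LLM's token likelihoods; here we fake it
    # with a fixed preference for the natural step order.
    next_idx = min(len(history), len(SKILLS) - 1)
    return 1.0 if skill == SKILLS[next_idx] else 0.1

def affordance(skill, state):
    # In SayCan this is a learned value function over the robot's state;
    # here we just check simple preconditions on a toy state dict.
    if skill == "pick up the sponge":
        return 0.9 if state["sponge_visible"] else 0.05
    if skill == "wipe the table":
        return 0.9 if state["holding_sponge"] and state["at_table"] else 0.05
    return 0.8

def plan(instruction, state, max_steps=10):
    history = []
    for _ in range(max_steps):
        scores = {s: llm_usefulness(instruction, history, s) * affordance(s, state)
                  for s in SKILLS}
        best = max(scores, key=scores.get)
        if best == "done":
            break
        history.append(best)
        # A real system would execute the skill and observe the new state;
        # here we update the toy state by hand.
        if best == "find a sponge":
            state["sponge_visible"] = True
        elif best == "pick up the sponge":
            state["holding_sponge"] = True
        elif best == "go to the table":
            state["at_table"] = True
    return history

if __name__ == "__main__":
    init = {"sponge_visible": False, "holding_sponge": False, "at_table": False}
    print(plan("clean up the spill on the table", init))
```

Running the toy example produces the sequence find, pick, go, wipe: the LLM supplies the "what would be useful" ordering while the affordance scores keep the plan grounded in what the robot can currently do.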

“SayCan was one of the most influential papers on robotics,” said AI and robotics research scientist Chris Paxton. “Its modular design allows for the integration of different components to create systems capable of compelling demonstrations.”

Following SayCan, researchers have begun exploring the application of language and vision models in diverse ways within robotics, resulting in significant progress. Some projects employ general-purpose LLMs and VLMs, while others focus on tailoring existing models for specific robotic tasks.

“Using large language models and vision models has made aspects like perception and reasoning significantly more accessible,” Paxton observed. “This has rendered many robotic tasks more achievable than ever.”

Combining Existing Capabilities

A major limitation of traditional robotics systems lies in their control mechanisms. Teams can train robots for individual skills, such as opening doors or manipulating objects, but combining these skills for complex tasks can be a challenge, leading to rigid systems requiring explicit instructions.

LLMs and VLMs allow robots to interpret loosely defined instructions and map them to specific task sequences aligned with their capabilities. Interestingly, many advanced models can achieve this without extensive retraining.
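
One common pattern is to hand the model a fixed skill library and ask it to rewrite a fuzzy request as a sequence of those skills. The sketch below assumes the official OpenAI Python client and an API key in the environment; the model name, skill library, and prompt wording are illustrative choices rather than any particular system described here.

```python
# A hedged sketch of using an off-the-shelf LLM to map a loosely phrased
# command onto a fixed library of robot skills, with no retraining.

import json
from openai import OpenAI

SKILL_LIBRARY = [
    "navigate_to(location)",
    "pick(object)",
    "place(object, location)",
    "open(container)",
    "wipe(surface)",
]

def instruction_to_skills(instruction: str) -> list[str]:
    """Ask the LLM to rewrite a fuzzy request as calls to known skills."""
    prompt = (
        "You control a robot that can only execute these skills:\n"
        + "\n".join(f"- {s}" for s in SKILL_LIBRARY)
        + "\n\nRewrite the user's request as JSON of the form "
          '{"steps": ["skill_call", ...]} using only the skills above, '
          "in execution order.\n\n"
        f"Request: {instruction}"
    )
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["steps"]

# The returned plan depends on the model, but might look like:
#   instruction_to_skills("I spilled coffee by the sink, can you deal with it?")
#   -> ["navigate_to(sink)", "pick(sponge)", "wipe(counter)", "place(sponge, sink)"]
```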

“With large language models, I can seamlessly connect different skills and reason about their application,” Paxton explained. “Newer visual language models like GPT-4V illustrate how these systems can collaborate effectively across a variety of applications.”

For instance, GenEM, a technique created by the University of Toronto, Google DeepMind, and Hoku Labs, utilizes the comprehensive social context captured in LLMs to generate expressive robot behaviors. By leveraging GPT-4, GenEM enables robots to understand contexts—like nodding to acknowledge someone’s presence—and execute relevant actions, as informed by its vast training data and in-context learning capabilities.
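
The underlying pattern is a chain of LLM calls: first reason about how a person would express the behavior socially, then ground that description in the robot's limited motion primitives. The sketch below captures that chaining idea with hypothetical prompts, primitive names, and a pluggable `llm` callable; it is not GenEM's published prompt set or robot API.

```python
# A minimal sketch of prompt chaining for expressive robot behavior:
# step 1 asks the LLM for the human body language, step 2 asks it to
# approximate that body language with the robot's available primitives.

from typing import Callable

PRIMITIVES = ["nod_head", "shake_head", "tilt_head", "look_at(person)",
              "raise_arm", "play_chime"]

def expressive_behavior(instruction: str, llm: Callable[[str], str]) -> str:
    # Step 1: social reasoning -- how would a person express this?
    human_response = llm(
        f"Instruction to a robot: '{instruction}'.\n"
        "In one sentence, describe how a person would express this "
        "through body language."
    )
    # Step 2: ground the description in the robot's motion primitives.
    return llm(
        "Available robot primitives: " + ", ".join(PRIMITIVES) + ".\n"
        f"Human body language to imitate: {human_response}\n"
        "Return a comma-separated sequence of primitives that approximates it."
    )

# Usage with any wrapper that maps a prompt string to generated text:
#   expressive_behavior("Acknowledge the visitor who just walked in", call_gpt4)
#   -> e.g. "look_at(person), nod_head"
```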

Another example is OK-Robot, developed by Meta and New York University, which merges VLMs with movement-planning and object-manipulation modules to perform pick-and-drop tasks in unfamiliar environments.
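
Conceptually, that kind of system composes three off-the-shelf pieces: an open-vocabulary VLM that locates whatever object the user names, a navigation module, and a grasping module. The schematic below stubs out each piece to show the composition; the interfaces and printed actions are placeholders, not OK-Robot's actual code.

```python
# A schematic of a modular open-vocabulary pick-and-drop pipeline:
# a VLM locates the queried object in the mapped scene, and separate
# navigation and manipulation modules carry out the motion.

from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    position: tuple[float, float, float]  # x, y, z in the map frame

def vlm_locate(query: str) -> Detection:
    """Stand-in for open-vocabulary detection over the scanned scene."""
    return Detection(label=query, position=(1.2, 0.4, 0.8))

def navigate_to(position) -> None:
    print(f"navigating to {position}")

def grasp(detection: Detection) -> None:
    print(f"grasping {detection.label}")

def release() -> None:
    print("releasing object")

def pick_and_drop(pick_query: str, drop_query: str) -> None:
    target = vlm_locate(pick_query)       # e.g. "the blue mug"
    navigate_to(target.position)
    grasp(target)
    destination = vlm_locate(drop_query)  # e.g. "the kitchen sink"
    navigate_to(destination.position)
    release()

if __name__ == "__main__":
    pick_and_drop("the blue mug", "the kitchen sink")
```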

Some robotics startups are thriving amid these advancements. Figure, a California-based robotics company, recently raised $675 million to develop humanoid robots utilizing vision and language models. Their robots leverage OpenAI models to process instructions and strategically plan actions.

However, while LLMs and VLMs address significant challenges, robotics teams must still engineer systems for fundamental skills, such as grasping objects, navigating obstacles, and maneuvering in diverse environments.

“There’s substantial work occurring at the foundational level that these models don’t yet handle,” Paxton said. “This complexity underscores the need for data, which many companies are now working to generate.”

Specialized Foundation Models

Another promising approach involves creating specialized foundation models for robotics that build upon the vast knowledge embedded in pre-trained models while customizing their architectures for robotic tasks.

A major endeavor in this area is Google's RT-2, a vision-language-action (VLA) model that takes in perception data and language instructions and directly generates action commands for robots.
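
A key design choice in VLA models like RT-2 is representing robot actions as discrete tokens in the model's output vocabulary, so that a fixed de-tokenizer can map each predicted token back to a continuous command. The sketch below walks through that de-tokenization step; the 256-bin discretization follows the RT line of work, while the dimension names and value ranges are illustrative assumptions.

```python
# A worked sketch of action de-tokenization: the VLA emits one discrete
# token per action dimension, and this function maps each bin index back
# to a continuous command in real units.

NUM_BINS = 256
# (dimension name, minimum value, maximum value) -- illustrative ranges.
ACTION_SPACE = [
    ("delta_x_m", -0.05, 0.05),
    ("delta_y_m", -0.05, 0.05),
    ("delta_z_m", -0.05, 0.05),
    ("delta_roll_rad", -0.25, 0.25),
    ("delta_pitch_rad", -0.25, 0.25),
    ("delta_yaw_rad", -0.25, 0.25),
    ("gripper_closedness", 0.0, 1.0),
]

def detokenize(action_tokens: list[int]) -> dict[str, float]:
    """Map per-dimension bin indices back to continuous robot commands."""
    assert len(action_tokens) == len(ACTION_SPACE)
    command = {}
    for token, (name, lo, hi) in zip(action_tokens, ACTION_SPACE):
        fraction = token / (NUM_BINS - 1)          # 0.0 .. 1.0
        command[name] = lo + fraction * (hi - lo)  # back to real units
    return command

if __name__ == "__main__":
    # Pretend the model produced these tokens for one control step.
    print(detokenize([128, 96, 200, 127, 127, 64, 255]))
```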

Recently, Google DeepMind unveiled RT-2-X, an enhanced version of RT-2 that can adapt to different robot morphologies and perform tasks that were not in its original training data. Additionally, RT-Sketch, a collaboration between DeepMind and Stanford University, conditions robot policies on rough, hand-drawn sketches of the desired goal, translating them into executable action plans.

“These models represent a new approach, serving as an expansive policy capable of handling multiple tasks,” Paxton remarked. “This is an exciting direction driven by end-to-end learning, where a robot can derive its actions from a camera feed.”

Foundation models for robotics are increasingly entering the commercial arena as well. Covariant recently introduced RFM-1, an 8-billion-parameter transformer model trained on diverse inputs, including text, images, videos, and robot actions, geared towards creating a versatile foundation model for various robotic applications.

Meanwhile, Project GR00T, showcased at Nvidia GTC, aims to enable humanoid robots to process inputs such as text, speech, and videos, translating them into specific actions.

The full potential of language models remains largely untapped and will continue to propel robotics research forward. As LLMs evolve further, we can anticipate groundbreaking innovations in the field of robotics.
