How LLMs are Pioneering a New Era in Robotics Innovation

Recent months have witnessed a surge in projects leveraging large language models (LLMs) to develop innovative robotics applications previously thought impossible. The power of LLMs and multi-modal models is enabling researchers to create robots capable of processing natural language commands and executing complex tasks that require advanced reasoning.

This rising interest at the intersection of LLMs and robotics has revitalized the robotics startup landscape, with numerous companies securing substantial funding and showcasing impressive demonstrations.

With remarkable advancements in LLMs making their way into real-world applications, we may be on the brink of a new era in robotics.

Language Models for Perception and Reasoning

Traditionally, building robotic systems necessitated intricate engineering efforts to develop planning and reasoning modules, making it challenging to create user-friendly interfaces that accommodate the diverse ways people issue commands.

The emergence of LLMs and vision-language models (VLMs) has empowered robotics engineers to enhance existing systems in groundbreaking ways. A pivotal project in this area was SayCan, developed by Google Research. SayCan utilized the semantic knowledge embedded in an LLM to assist robots in reasoning about tasks and determining appropriate action sequences.
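
To make the idea concrete, the sketch below mimics SayCan's core scoring loop: each candidate skill is ranked by multiplying an LLM-derived usefulness score (task grounding) with an affordance score reflecting what the robot can actually do from its current state (world grounding). The skill list, hand-coded scores, and toy state are illustrative stand-ins, not SayCan's actual models.

```python
# A minimal, self-contained sketch of the SayCan idea: score each candidate
# skill by multiplying an LLM "usefulness" score with an affordance score,
# then pick the best-scoring skill at every step. All scores and state
# updates below are hand-coded stand-ins for the learned components.

SKILLS = ["find a sponge", "pick up the sponge", "go to the table",
          "wipe the table", "done"]

def llm_usefulness(instruction, history, skill):
    # In SayCan this comes from an LLM's token likelihoods; here we fake it
    # with a fixed preference for the natural step order.
    next_idx = min(len(history), len(SKILLS) - 1)
    return 1.0 if skill == SKILLS[next_idx] else 0.1

def affordance(skill, state):
    # In SayCan this is a learned value function over the robot's state;
    # here we just check simple preconditions on a toy state dict.
    if skill == "pick up the sponge":
        return 0.9 if state["sponge_visible"] else 0.05
    if skill == "wipe the table":
        return 0.9 if state["holding_sponge"] and state["at_table"] else 0.05
    return 0.8

def plan(instruction, state, max_steps=10):
    history = []
    for _ in range(max_steps):
        scores = {s: llm_usefulness(instruction, history, s) * affordance(s, state)
                  for s in SKILLS}
        best = max(scores, key=scores.get)
        if best == "done":
            break
        history.append(best)
        # A real system would execute the skill and observe the new state;
        # here we update the toy state by hand.
        if best == "find a sponge":
            state["sponge_visible"] = True
        elif best == "pick up the sponge":
            state["holding_sponge"] = True
        elif best == "go to the table":
            state["at_table"] = True
    return history

if __name__ == "__main__":
    init = {"sponge_visible": False, "holding_sponge": False, "at_table": False}
    print(plan("clean up the spill on the table", init))
```

Running the toy example produces the sequence find, pick, go, wipe: the LLM supplies the "what would be useful" ordering while the affordance scores keep the plan grounded in what the robot can currently do.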

“SayCan was one of the most influential papers on robotics,” said AI and robotics research scientist Chris Paxton. “Its modular design allows for the integration of different components to create systems capable of compelling demonstrations.”

Following SayCan, researchers have begun exploring the application of language and vision models in diverse ways within robotics, resulting in significant progress. Some projects employ general-purpose LLMs and VLMs, while others focus on tailoring existing models for specific robotic tasks.

“Using large language models and vision models has made aspects like perception and reasoning significantly more accessible,” Paxton observed. “This has rendered many robotic tasks more achievable than ever.”

Combining Existing Capabilities

A major limitation of traditional robotics systems lies in their control mechanisms. Teams can train robots for individual skills, such as opening doors or manipulating objects, but combining these skills for complex tasks can be a challenge, leading to rigid systems requiring explicit instructions.

LLMs and VLMs allow robots to interpret loosely defined instructions and map them to specific task sequences aligned with their capabilities. Interestingly, many advanced models can achieve this without extensive retraining.
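
One common pattern is to hand the model a fixed skill library and ask it to rewrite a fuzzy request as a sequence of those skills. The sketch below assumes the official OpenAI Python client and an API key in the environment; the model name, skill library, and prompt wording are illustrative choices rather than any particular system described here.

```python
# A hedged sketch of using an off-the-shelf LLM to map a loosely phrased
# command onto a fixed library of robot skills, with no retraining.

import json
from openai import OpenAI

SKILL_LIBRARY = [
    "navigate_to(location)",
    "pick(object)",
    "place(object, location)",
    "open(container)",
    "wipe(surface)",
]

def instruction_to_skills(instruction: str) -> list[str]:
    """Ask the LLM to rewrite a fuzzy request as calls to known skills."""
    prompt = (
        "You control a robot that can only execute these skills:\n"
        + "\n".join(f"- {s}" for s in SKILL_LIBRARY)
        + "\n\nRewrite the user's request as JSON of the form "
          '{"steps": ["skill_call", ...]} using only the skills above, '
          "in execution order.\n\n"
        f"Request: {instruction}"
    )
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["steps"]

# The returned plan depends on the model, but might look like:
#   instruction_to_skills("I spilled coffee by the sink, can you deal with it?")
#   -> ["navigate_to(sink)", "pick(sponge)", "wipe(counter)", "place(sponge, sink)"]
```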

“With large language models, I can seamlessly connect different skills and reason about their application,” Paxton explained. “Newer visual language models like GPT-4V illustrate how these systems can collaborate effectively across a variety of applications.”

For instance, GenEM, a technique created by the University of Toronto, Google DeepMind, and Hoku Labs, utilizes the comprehensive social context captured in LLMs to generate expressive robot behaviors. By leveraging GPT-4, GenEM enables robots to understand contexts—like nodding to acknowledge someone’s presence—and execute relevant actions, as informed by its vast training data and in-context learning capabilities.
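
The underlying pattern is a chain of LLM calls: first reason about how a person would express the behavior socially, then ground that description in the robot's limited motion primitives. The sketch below captures that chaining idea with hypothetical prompts, primitive names, and a pluggable `llm` callable; it is not GenEM's published prompt set or robot API.

```python
# A minimal sketch of prompt chaining for expressive robot behavior:
# step 1 asks the LLM for the human body language, step 2 asks it to
# approximate that body language with the robot's available primitives.

from typing import Callable

PRIMITIVES = ["nod_head", "shake_head", "tilt_head", "look_at(person)",
              "raise_arm", "play_chime"]

def expressive_behavior(instruction: str, llm: Callable[[str], str]) -> str:
    # Step 1: social reasoning -- how would a person express this?
    human_response = llm(
        f"Instruction to a robot: '{instruction}'.\n"
        "In one sentence, describe how a person would express this "
        "through body language."
    )
    # Step 2: ground the description in the robot's motion primitives.
    return llm(
        "Available robot primitives: " + ", ".join(PRIMITIVES) + ".\n"
        f"Human body language to imitate: {human_response}\n"
        "Return a comma-separated sequence of primitives that approximates it."
    )

# Usage with any wrapper that maps a prompt string to generated text:
#   expressive_behavior("Acknowledge the visitor who just walked in", call_gpt4)
#   -> e.g. "look_at(person), nod_head"
```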

Another example is OK-Robot, developed by Meta and New York University, which merges VLMs with movement-planning and object-manipulation modules to perform pick-and-drop tasks in unfamiliar environments.
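
Conceptually, that kind of system composes three off-the-shelf pieces: an open-vocabulary VLM that locates whatever object the user names, a navigation module, and a grasping module. The schematic below stubs out each piece to show the composition; the interfaces and printed actions are placeholders, not OK-Robot's actual code.

```python
# A schematic of a modular open-vocabulary pick-and-drop pipeline:
# a VLM locates the queried object in the mapped scene, and separate
# navigation and manipulation modules carry out the motion.

from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    position: tuple[float, float, float]  # x, y, z in the map frame

def vlm_locate(query: str) -> Detection:
    """Stand-in for open-vocabulary detection over the scanned scene."""
    return Detection(label=query, position=(1.2, 0.4, 0.8))

def navigate_to(position) -> None:
    print(f"navigating to {position}")

def grasp(detection: Detection) -> None:
    print(f"grasping {detection.label}")

def release() -> None:
    print("releasing object")

def pick_and_drop(pick_query: str, drop_query: str) -> None:
    target = vlm_locate(pick_query)       # e.g. "the blue mug"
    navigate_to(target.position)
    grasp(target)
    destination = vlm_locate(drop_query)  # e.g. "the kitchen sink"
    navigate_to(destination.position)
    release()

if __name__ == "__main__":
    pick_and_drop("the blue mug", "the kitchen sink")
```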

Some robotics startups are thriving amid these advancements. Figure, a California-based robotics company, recently raised $675 million to develop humanoid robots utilizing vision and language models. Their robots leverage OpenAI models to process instructions and strategically plan actions.

However, while LLMs and VLMs address significant challenges, robotics teams must still engineer systems for fundamental skills, such as grasping objects, navigating obstacles, and maneuvering in diverse environments.

“There’s substantial work occurring at the foundational level that these models don’t yet handle,” Paxton said. “This complexity underscores the need for data, which many companies are now working to generate.”

Specialized Foundation Models

Another promising approach involves creating specialized foundation models for robotics that build upon the vast knowledge embedded in pre-trained models while customizing their architectures for robotic tasks.

A major endeavor in this area is Google's RT-2, a vision-language-action (VLA) model that takes in perception data and language instructions and directly generates action commands for robots.
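
A key design choice in VLA models like RT-2 is representing robot actions as discrete tokens in the model's output vocabulary, so that a fixed de-tokenizer can map each predicted token back to a continuous command. The sketch below walks through that de-tokenization step; the 256-bin discretization follows the RT line of work, while the dimension names and value ranges are illustrative assumptions.

```python
# A worked sketch of action de-tokenization: the VLA emits one discrete
# token per action dimension, and this function maps each bin index back
# to a continuous command in real units.

NUM_BINS = 256
# (dimension name, minimum value, maximum value) -- illustrative ranges.
ACTION_SPACE = [
    ("delta_x_m", -0.05, 0.05),
    ("delta_y_m", -0.05, 0.05),
    ("delta_z_m", -0.05, 0.05),
    ("delta_roll_rad", -0.25, 0.25),
    ("delta_pitch_rad", -0.25, 0.25),
    ("delta_yaw_rad", -0.25, 0.25),
    ("gripper_closedness", 0.0, 1.0),
]

def detokenize(action_tokens: list[int]) -> dict[str, float]:
    """Map per-dimension bin indices back to continuous robot commands."""
    assert len(action_tokens) == len(ACTION_SPACE)
    command = {}
    for token, (name, lo, hi) in zip(action_tokens, ACTION_SPACE):
        fraction = token / (NUM_BINS - 1)          # 0.0 .. 1.0
        command[name] = lo + fraction * (hi - lo)  # back to real units
    return command

if __name__ == "__main__":
    # Pretend the model produced these tokens for one control step.
    print(detokenize([128, 96, 200, 127, 127, 64, 255]))
```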

Recently, Google DeepMind unveiled RT-2-X, an enhanced version of RT-2 that can adapt to different robot morphologies and perform tasks that were not in its original training data. Additionally, RT-Sketch, a collaboration between DeepMind and Stanford University, conditions robot policies on rough, hand-drawn sketches of the desired goal, translating them into executable action plans.

“These models represent a new approach, serving as an expansive policy capable of handling multiple tasks,” Paxton remarked. “This is an exciting direction driven by end-to-end learning, where a robot can derive its actions from a camera feed.”

Foundation models for robotics are increasingly entering the commercial arena as well. Covariant recently introduced RFM-1, an 8-billion-parameter transformer model trained on diverse inputs, including text, images, videos, and robot actions, geared towards creating a versatile foundation model for various robotic applications.

Meanwhile, Project GR00T, showcased at Nvidia GTC, aims to enable humanoid robots to process inputs such as text, speech, and videos, translating them into specific actions.

The full potential of language models remains largely untapped and will continue to propel robotics research forward. As LLMs evolve further, we can anticipate groundbreaking innovations in the field of robotics.
