Since the emergence of large AI models, humanoid robots have advanced rapidly and become a focal point of industrial competition. Key developments include Nvidia's Project GR00T, a foundation model for humanoid robots; Figure AI's $675 million funding round backed by investors including Microsoft and Jeff Bezos; and Tesla's ongoing improvements to its humanoid robot, Optimus. Together, these moves signal a profound transformation in the field of embodied intelligence.
What shifts have large AI models brought to the technology and industry of embodied intelligence? At a recent seminar, Wang Yongcai, an associate professor at Renmin University of China, shared his perspective on the question.
Wang noted that embodied intelligence refers to robots with physical bodies that can sense and interact with their environment. A traditional robotic task pipeline comprises five essential stages: localization and mapping, path planning, target detection and positioning, robotic-arm motion planning, and task execution such as grasping and organizing. Each stage is engineered around precisely specified spatial targets, which complicates human-robot interaction: the human operator must supply exact target coordinates, something people find difficult to do.
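To make the division of labor concrete, here is a minimal sketch of such a five-stage pipeline. All names are hypothetical illustrations rather than any real robotics framework's API, and each stage returns placeholder values so the control flow runs end to end:

```python
# A minimal sketch of the five-stage pipeline described above.
# All names are hypothetical, not a real framework's API; each
# stage returns dummy values so the control flow runs end to end.

from dataclasses import dataclass


@dataclass
class Pose:
    x: float
    y: float
    theta: float = 0.0  # heading in radians


def localize_and_map(scan) -> Pose:
    # Stage 1: SLAM -- estimate the robot's pose while building a map.
    return Pose(0.0, 0.0)


def plan_path(start: Pose, goal: Pose) -> list[Pose]:
    # Stage 2: path planning -- waypoints from the current pose to the goal.
    return [start, goal]


def detect_and_locate(frame) -> Pose:
    # Stage 3: target detection and positioning in map coordinates.
    return Pose(2.0, 1.5)


def plan_arm_motion(target: Pose) -> list[str]:
    # Stage 4: arm motion planning toward the target pose.
    return [f"reach({target.x}, {target.y})", "close_gripper"]


def execute(commands: list[str]) -> None:
    # Stage 5: task execution -- grasping and organizing.
    for cmd in commands:
        print("executing:", cmd)


# The pain point Wang describes: the goal must be supplied as
# explicit coordinates, which humans find hard to specify.
goal = Pose(2.0, 1.5)
pose = localize_and_map(scan=None)
waypoints = plan_path(pose, goal)
target = detect_and_locate(frame=None)
execute(plan_arm_motion(target))
```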
The advent of large models has made interaction between humans and machines far more natural. Robots can now understand human language, devise spatial plans on their own, and move autonomously, decomposing a high-level instruction into manageable sub-tasks. When told to fetch a glass of water, for instance, a robot can break the command down, complete each sub-task in turn, and ultimately deliver the water.
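The sketch below illustrates what such language-driven decomposition might look like in code. The `call_llm` function stands in for any hosted large-model API, and the skill vocabulary (`navigate_to`, `pick`, and so on) is an assumption made for this example:

```python
# A minimal sketch of LLM-driven task decomposition. `call_llm`
# stands in for any large-model API, and the skill names are
# hypothetical; here it returns a canned answer for illustration.

import json

PROMPT = """Decompose the instruction into an ordered list of
sub-tasks, each one of: navigate_to(place), pick(object),
fill(object, substance), hand_over(object).
Answer as a JSON list of strings.
Instruction: {instruction}"""


def call_llm(prompt: str) -> str:
    # Placeholder: in practice this calls a hosted large model.
    return json.dumps([
        "navigate_to(kitchen)",
        "pick(glass)",
        "fill(glass, water)",
        "navigate_to(user)",
        "hand_over(glass)",
    ])


def decompose(instruction: str) -> list[str]:
    return json.loads(call_llm(PROMPT.format(instruction=instruction)))


for step in decompose("Fetch me a glass of water"):
    print("sub-task:", step)  # each step maps to a low-level skill
```

Constraining the model to a fixed skill vocabulary and a JSON output format is a common design choice, since each returned sub-task can then be dispatched directly to a low-level controller.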
Wang emphasized that traditional embodied intelligence heavily relied on tasks defined by humans. With the integration of large models, these technologies now operate in a more intuitive manner. In essence, large models infuse "life" into embodied intelligence, allowing robots to understand human language, autonomously reason and plan, and effectively break down tasks.
Wang pointed out that progress in embodied intelligence has outpaced earlier expectations. A notable example is a navigation model that integrates visual perception with natural language understanding, enabling a robot to learn by observing human actions and to adapt while following commands.
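In rough terms, such a navigation model scores a discrete set of movement actions against fused language and vision features. The toy encoders and additive fusion below are illustrative placeholders, not any published model's architecture:

```python
# A toy sketch of a navigation policy that fuses language and
# vision. The random-feature "encoders" and additive fusion are
# placeholders for illustration only.

import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["forward", "turn_left", "turn_right", "stop"]


def encode_text(instruction: str) -> np.ndarray:
    # Placeholder for a language encoder (e.g., a frozen LLM head).
    return rng.standard_normal(16)


def encode_image(frame) -> np.ndarray:
    # Placeholder for a visual encoder (e.g., a pretrained backbone).
    return rng.standard_normal(16)


def next_action(instruction: str, frame) -> str:
    fused = encode_text(instruction) + encode_image(frame)  # toy fusion
    scores = fused[: len(ACTIONS)]  # toy action head
    return ACTIONS[int(np.argmax(scores))]


print(next_action("walk to the kitchen and stop by the sink", frame=None))
```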
He highlighted Figure AI, founded in 2022, which introduced its bipedal humanoid robot, Figure 01. The robot demonstrated walking and, remarkably, could brew coffee after just ten hours of observational learning. Moreover, just 14 days after integrating OpenAI's GPT-4 model, Figure 01 achieved a natural level of interaction with humans.
Wang explained that GPT-4 enhances embodied intelligence by combining natural language and visual instructions, allowing robots to understand and learn from their surroundings. This understanding aids in breaking down tasks into a series of adaptive actions that respond to environmental conditions. These learned actions serve as training data for the ongoing development of embodied intelligence. "While earlier methodologies prioritized computation, the emphasis has now shifted to training large models," Wang noted.
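A minimal sketch of that closed loop, assuming a generic multimodal model call (`query_vlm`) and a toy action set, might look like this; note how the logged observation-action pairs double as training data, echoing the shift from computation to training that Wang describes:

```python
# A sketch of the closed loop described above: a multimodal model
# chooses the next action from the current camera frame, and each
# (observation, action) pair is logged so it can later serve as
# training data. `query_vlm` and the action set are hypothetical
# placeholders, not GPT-4's or Figure's actual interface.

ACTIONS = ["move_forward", "turn_left", "turn_right", "grasp", "done"]


def query_vlm(frame, instruction: str) -> str:
    # Placeholder for a vision-language model call that returns
    # one action name from ACTIONS, conditioned on frame + text.
    return "done"


def run_episode(instruction: str, get_frame, max_steps: int = 20):
    dataset = []  # logged pairs double as future training data
    for _ in range(max_steps):
        frame = get_frame()
        action = query_vlm(frame, instruction)
        dataset.append((frame, action))
        if action == "done":
            break
    return dataset


data = run_episode("put the cup on the shelf", get_frame=lambda: "frame")
print(len(data), "state-action pairs collected")
```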
Recently, a team led by Stanford professor Fei-Fei Li released BEHAVIOR-1K, a benchmark for embodied intelligence that defines the tasks we want robots to perform. It simulates 1,000 everyday activities across 50 scenes, with the long-term aim of enabling robots to carry out household service tasks in a human-like way.
Driven by large models, humanoid robots and related technologies in embodied intelligence are on the brink of significant breakthroughs, setting the stage for rapid industry growth.