Researchers at the University of Tokyo and Alternative Machine have developed a humanoid robot system named Alter3, capable of translating natural language commands directly into robotic actions. Leveraging the extensive knowledge embedded in large language models (LLMs) like GPT-4, Alter3 can perform complex tasks such as taking selfies or simulating being a ghost.
This innovation marks a significant advancement in integrating foundation models with robotic systems. While a scalable commercial solution remains on the horizon, recent progress has energized robotics research and holds considerable promise.
Transforming Language into Robot Actions
Alter3 uses GPT-4 as its core model, processing natural language instructions that describe actions or scenarios for the robot to respond to. The model first acts as a planner, employing an "agentic framework" to devise the sequence of action steps required to achieve the specified goal.
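The paper does not publish the exact prompts, but the planning step can be illustrated with a minimal sketch: a single GPT-4 call that decomposes an instruction into discrete motion steps. The prompt wording and the plan_actions helper below are assumptions for illustration, not the authors' code.

```python
# Illustrative planning step: ask GPT-4 to break a natural-language
# instruction into discrete physical action steps.
# Prompt text and function names are hypothetical.
from openai import OpenAI

client = OpenAI()

def plan_actions(instruction: str) -> list[str]:
    """Decompose an instruction into a short list of motion steps."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You control a humanoid robot. Break the user's "
                        "instruction into a short numbered list of physical "
                        "action steps, one movement per line."},
            {"role": "user", "content": instruction},
        ],
    )
    text = response.choices[0].message.content
    # Return the non-empty lines as individual steps.
    return [line.strip() for line in text.splitlines() if line.strip()]

steps = plan_actions("Take a selfie with your phone.")
```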
Alter3 employs various GPT-4 prompt formats to analyze instructions and map them to robot commands. Since GPT-4 has not been trained on Alter3's programming commands, the researchers rely on its in-context learning to adapt its output to the robot's API: the prompt includes a list of available commands and illustrative examples of their usage, allowing the model to translate each action step into executable API commands for the robot.
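A rough sketch of this in-context learning setup is shown below. The command reference and the set_axis call are placeholders standing in for Alter3's real control API, which is not reproduced here; only the overall pattern (list the commands, show worked examples, ask for a translation) reflects the description above.

```python
# Sketch of translating one action step into robot API calls via
# in-context examples. Command names and axis IDs are hypothetical.
from openai import OpenAI

client = OpenAI()

COMMAND_REFERENCE = """
Available command (placeholder for the real robot API):
  set_axis(axis_id: int, value: float)  # drive one axis, value in [0.0, 1.0]

Example:
  Step: "Raise the right arm to shoulder height"
  Commands:
    set_axis(17, 0.6)
    set_axis(18, 0.4)
"""

def step_to_commands(step: str) -> str:
    """Translate a plain-English step into robot commands."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Translate each action step into robot commands, "
                        "using only the commands listed below.\n"
                        + COMMAND_REFERENCE},
            {"role": "user", "content": f'Step: "{step}"\nCommands:'},
        ],
    )
    return response.choices[0].message.content
```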
“Previously, we manually controlled all 43 axes in a specific order to replicate human poses or simulate actions like serving tea or playing chess,” the researchers note. “With LLMs, we are liberated from this labor-intensive process.”
Incorporating Human Feedback
Given that language can be imprecise for describing physical movements, the action sequences generated by the model may not always produce the intended robotic behavior. To address this, the researchers integrated a feedback mechanism that lets users issue corrections such as “Raise your arm a bit more.” These corrections are processed by another GPT-4 agent, which adjusts the code and returns the revised action sequence for the robot to execute. The refined plans and code are then stored in memory for future use.
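A minimal sketch of such a refinement loop appears below, assuming a simple dictionary as the memory store and a hypothetical refine_commands helper; the actual memory format and prompts used for Alter3 may differ.

```python
# Hedged sketch of the feedback loop: a second GPT-4 call revises the
# command sequence based on a verbal correction, and the result is
# cached for reuse. Memory structure and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()
motion_memory: dict[str, str] = {}  # instruction -> refined command sequence

def refine_commands(instruction: str, commands: str, correction: str) -> str:
    """Apply a user's correction to an existing command sequence."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You revise robot command sequences. Apply the "
                        "user's correction and return the full updated "
                        "sequence."},
            {"role": "user",
             "content": f"Current commands:\n{commands}\n\n"
                        f"Correction: {correction}"},
        ],
    )
    revised = response.choices[0].message.content
    motion_memory[instruction] = revised  # keep the refined plan for later
    return revised
```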
The incorporation of human feedback and memory significantly boosts Alter3's performance. Researchers have evaluated the robot across various tasks, from simple actions like taking selfies and sipping tea to more complex imitations such as acting like a ghost or a snake. The model has also demonstrated its ability to manage scenarios that necessitate intricate planning.
“The training of the LLM encompasses diverse linguistic representations of movements. GPT-4 accurately translates these into commands for Alter3,” the team explains.
Thanks to GPT-4's vast understanding of human behavior, the system can generate realistic behavior plans for humanoid robots. In experiments, the team also managed to imbue Alter3 with emotional expressions such as embarrassment and joy.
“Even from texts that don’t explicitly mention emotional cues, the LLM can deduce appropriate emotions, reflecting them in Alter3’s physical responses,” the researchers highlight.
Advancements in Robotics Models
The adoption of foundation models in robotics research is rapidly gaining traction. For instance, Figure, valued at $2.6 billion, employs OpenAI models to interpret human commands and execute corresponding real-world actions. With the rise of multi-modal capabilities in foundational models, robotics systems are poised to enhance their environmental reasoning and decision-making.
Alter3 exemplifies a trend where off-the-shelf foundation models serve as reasoning and planning modules within robotic control systems. Importantly, it does not rely on a fine-tuned version of GPT-4, allowing its code to be applicable to other humanoid robots.
Projects such as RT-2-X and OpenVLA use specialized foundation models designed to produce robotics commands directly. While these models often yield more stable results and generalize across diverse tasks and environments, they demand greater technical expertise and higher development costs.
Nonetheless, one critical aspect often overlooked in these initiatives is the foundational challenge of enabling robots to perform basic tasks, including grasping objects, maintaining balance, and navigating environments. "A significant amount of work occurs at a level below what these models address," remarked AI and robotics scientist Chris Paxton in a recent interview. "That’s some of the challenging work, largely due to the lack of existing data."