Humans frequently use expressive behaviors to convey goals and intentions. For example, we nod to greet a coworker, shake our heads to indicate disapproval, or say "excuse me" to navigate through a crowd. To facilitate smoother interactions with humans, mobile robots must also exhibit similar expressive behaviors. However, generating such behaviors remains a significant challenge in robotics, and existing solutions often lack flexibility and adaptability.
In a groundbreaking study, researchers from the University of Toronto, Google DeepMind, and Hoku Labs introduce GenEM, a novel approach harnessing the extensive social context embedded in large language models (LLMs) to enable robots to perform expressive behaviors. By utilizing various prompting methods, GenEM allows robots to interpret their environment and replicate human-like expressions effectively.
Expressive Behaviors in Robotics
Traditionally, creating expressive robot behavior relied on rule- or template-based systems, which demand considerable manual input for each robot and environment. This rigidity means that any changes or adaptations necessitate extensive reprogramming. More modern techniques have leaned towards data-driven approaches that offer greater flexibility, yet these often require specialized datasets tailored to each robot's interactions.
GenEM reshapes this approach by leveraging the rich knowledge within LLMs to generate expressive behaviors dynamically, eliminating the need for traditional model training or convoluted rule sets. For instance, LLMs can recognize the importance of eye contact or nodding in various social contexts.
"Our key insight is to utilize the rich social context from LLMs to create adaptable and composable expressive behaviors,” the researchers explain.
Generative Expressive Motion (GenEM)
GenEM employs a sequence of LLM agents that autonomously generate expressive robot behaviors based on natural language commands. Each agent contributes by reasoning about social contexts and translating these behaviors into actionable API calls for the robot.
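As a minimal sketch of what such an agent chain could look like in practice, the snippet below chains three prompts, each consuming the previous agent's output. It assumes the OpenAI Python client (the evaluation reportedly used GPT-4); the prompt wording and the robot primitive names (tilt_head, set_lights, say) are illustrative assumptions, not the authors' actual prompts or API.

```python
# Minimal sketch of GenEM-style agent chaining: each "agent" is a separate
# prompt applied to the previous agent's output. Uses the OpenAI Python
# client; prompts and robot primitive names are illustrative only.
from openai import OpenAI

client = OpenAI()

def ask(system_prompt: str, user_prompt: str) -> str:
    """Run one agent: a system prompt applied to the previous agent's output."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

instruction = "A person walking by waves at you."

# Agent 1: reason about how a human would react in this social situation.
human_response = ask(
    "Describe, step by step, how a person would expressively react to the situation below.",
    instruction,
)

# Agent 2: map that reaction onto the robot's available capabilities.
robot_procedure = ask(
    "Rewrite these steps using only these robot functions: tilt_head(deg), set_lights(pattern), say(text).",
    human_response,
)

# Agent 3: emit executable calls against the robot's API.
robot_code = ask(
    "Convert this procedure into Python calls to the robot API, one call per line, no commentary.",
    robot_procedure,
)
print(robot_code)
```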
“GenEM can produce multimodal behaviors utilizing the robot's capabilities—such as speech and body movement—to clearly express intent,” the researchers note. "One of the standout features of GenEM is its ability to adapt to live human feedback, allowing for iterative improvements and the generation of new expressive behaviors."
The GenEM workflow begins with a natural language instruction, either specifying an expressive action like “Nod your head” or establishing a social scenario, such as “A person walking by waves at you.” Initially, an LLM employs chain-of-thought reasoning to outline a human's potential response. Another LLM agent then translates this into a step-by-step guide reflective of the robot's available functions, guiding actions such as head tilting or triggering specific light patterns.
Next, the procedural instructions are converted into executable code, relying on the robot's API commands. Optional human feedback can be incorporated to refine the behavior further, all without training the LLMs—only prompt-engineering adjustments are required based on robot specifications.
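To make the output of that last step concrete, here is a hypothetical robot API stub and the kind of script the code-generation stage might produce for "A person walking by waves at you." The primitive names are invented for illustration and stand in for whatever API a given robot exposes.

```python
# Illustrative only: a stub robot API and a behavior of the kind the final
# GenEM stage might emit. The primitives tilt_head/set_lights/say are
# hypothetical stand-ins for a real robot's API.
import time

def tilt_head(degrees: float) -> None:
    print(f"[robot] tilting head to {degrees} degrees")

def set_lights(pattern: str) -> None:
    print(f"[robot] light pattern -> {pattern}")

def say(text: str) -> None:
    print(f"[robot] saying: {text!r}")

def acknowledge_wave() -> None:
    """Generated behavior: acknowledge the wave with a nod, a light pulse, and speech."""
    tilt_head(20)             # dip the head to start a nod
    time.sleep(0.3)
    tilt_head(0)              # return to neutral
    set_lights("soft_pulse")  # brief light pulse as a visual acknowledgment
    say("Hello!")

acknowledge_wave()
```

Human feedback such as "nod twice, more slowly" would simply be appended to the prompt and the script regenerated; the LLM itself is never fine-tuned.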
Testing GenEM
The researchers evaluated behaviors generated by two variations of GenEM—one incorporating user feedback and the other not—against scripted behaviors crafted by a professional animator. Utilizing OpenAI’s GPT-4 for context reasoning and expressive behavior generation, they surveyed user responses on the outcomes. The results indicated that users generally found GenEM-generated behaviors as understandable as those crafted by a professional animator. Furthermore, GenEM's modular, multi-step prompting clearly outperformed a baseline that generated behaviors with a single LLM prompt.
Crucially, GenEM's prompt-based design is adaptable to any robot type without necessitating specialized datasets for training. It effectively employs LLM reasoning to create complex expressive behaviors from simple robotic actions.
“Our framework rapidly generates expressive behaviors through in-context learning and few-shot prompting, significantly reducing the need for curated datasets or elaborate rule-making as seen in earlier methods,” the researchers write.
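As a rough illustration of what few-shot prompting can look like here (an assumed format, not the authors' actual prompt), a single worked example is enough to show the model how to map social steps onto one particular robot's primitives; adapting to a different robot mainly means swapping the listed functions and the example.

```python
# Sketch of a few-shot prompt for translating an instruction into robot
# behavior. The robot primitives and the worked example are hypothetical.
FEW_SHOT_PROMPT = """\
You control a mobile robot with these functions: tilt_head(deg), set_lights(pattern), say(text).

Example
Instruction: Nod your head.
Steps: Dip the head briefly, then return it to neutral.
Code:
tilt_head(20)
tilt_head(0)

Instruction: {instruction}
Steps:"""

prompt = FEW_SHOT_PROMPT.format(instruction="A person walking by waves at you.")
print(prompt)  # this string is what would be sent to the LLM
```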
Still in its early stages, GenEM has so far been tested mainly on single-turn interactions and a limited set of robot actions. Robots with richer repertoires of primitive actions remain to be explored, and continued advances in large language models could extend these capabilities further.
“We believe our approach offers a flexible framework for generating adaptable and composable expressive motion, harnessing the power of large language models,” the researchers conclude.