DeepMind and Stanford’s Innovative Robot Control Model Executes Tasks from Sketch Instructions

Recent advances in language and vision models have significantly enhanced robotic systems' ability to follow instructions given as text or images. However, both forms of instruction have limitations.

A new study by researchers from Stanford University and Google DeepMind suggests using sketches as robot instructions. Sketches provide rich spatial information that helps robots perform tasks without the confusion that can arise from the clutter of realistic images or the ambiguity of natural language.

Introducing RT-Sketch

The researchers developed RT-Sketch, a model that utilizes sketches to control robots. This model performs comparably to language- and image-conditioned agents in standard conditions and surpasses them where language and image instructions fall short.

Why Choose Sketches?

While language offers a straightforward means to convey goals, it can be inconvenient for tasks requiring precise manipulations, such as arranging objects. Images depict desired goals in detail, but obtaining a goal image in advance is often impractical. Additionally, pre-recorded images may contain excessive detail, which can lead to overfitting and poor generalization to new environments.

“We initially brainstormed about enabling robots to interpret assembly manuals, like IKEA schematics, and carry out necessary manipulations,” said Priya Sundaresan, Ph.D. student at Stanford University and lead author of the study. “Language is often too ambiguous for such spatial tasks, and pre-existing images may not be available.”

The team opted for sketches because they are minimal, easy to produce, and informative. Sketches communicate spatial arrangements effectively without the need for pixel-level detail, allowing models to identify task-relevant objects and enhancing their generalization capabilities.

“We view sketches as a crucial step towards more convenient and expressive ways for humans to instruct robots,” Sundaresan explained.

The RT-Sketch Model

RT-Sketch builds on Robotics Transformer 1 (RT-1), a model that translates language instructions into robot commands. The researchers adapted this architecture to use visual goals, including sketches and images.
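To make the conditioning concrete, the snippet below sketches how a policy can consume a visual goal instead of a language instruction: the current camera image and the goal sketch are stacked along the channel dimension and decoded into discretized action tokens. This is an illustrative PyTorch toy under assumed dimensions (a 7-dimensional action space with 256 bins per dimension and a small convolutional encoder); it is not the RT-1 transformer architecture that RT-Sketch actually adapts.

```python
# Illustrative goal-conditioned policy in PyTorch. The real RT-Sketch model
# adapts the RT-1 transformer; this toy version only demonstrates the core
# idea of conditioning on a visual goal: the current observation and the
# goal sketch are concatenated channel-wise, encoded, and decoded by one
# classification head per (assumed) discretized action dimension.
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, action_dims=7, bins_per_dim=256):  # assumed action space
        super().__init__()
        # 6 input channels: RGB observation (3) + RGB goal sketch (3).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One head per action dimension, each predicting a discrete bin.
        self.heads = nn.ModuleList(
            [nn.Linear(128, bins_per_dim) for _ in range(action_dims)]
        )

    def forward(self, observation, goal_sketch):
        x = torch.cat([observation, goal_sketch], dim=1)
        features = self.encoder(x)
        return [head(features) for head in self.heads]

# Usage with dummy 128x128 inputs.
policy = GoalConditionedPolicy()
obs = torch.rand(1, 3, 128, 128)
sketch = torch.rand(1, 3, 128, 128)
action_logits = policy(obs, sketch)  # list of [1, 256] logit tensors
```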

To train RT-Sketch, they utilized the RT-1 dataset, which features 80,000 recordings of VR-teleoperated tasks such as object manipulation and cabinet operations. Initially, they created sketches from these demonstrations by selecting 500 examples and producing hand-drawn representations from the final video frames. These sketches, along with corresponding video frames, were used to train a generative adversarial network (GAN) that converts images into sketches.
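The article does not detail the translation network, but a pix2pix-style setup is one common way to learn such an image-to-sketch mapping. The skeleton below is an assumption-laden sketch of that idea in PyTorch: a generator maps a video frame to a sketch, a discriminator judges (frame, sketch) pairs, and an L1 term pulls generated sketches toward the hand-drawn ones. The network definitions and loss weights are placeholders, not the authors' implementation.

```python
# Hypothetical pix2pix-style training step for image-to-sketch translation.
# G maps a video frame to a sketch; D scores (frame, sketch) pairs.
# Assumes images are normalized to [-1, 1]. Placeholder networks only.
import torch
import torch.nn as nn

def make_generator():
    # Toy encoder-decoder; a real translator would likely be a U-Net.
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, kernel_size=3, padding=1), nn.Tanh(),
    )

def make_discriminator():
    # Patch-style critic over concatenated (frame, sketch) pairs.
    return nn.Sequential(
        nn.Conv2d(6, 64, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 1, kernel_size=4, stride=2, padding=1),
    )

def train_step(G, D, opt_g, opt_d, frame, real_sketch, l1_weight=100.0):
    bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

    # Discriminator: real (frame, hand-drawn sketch) vs. generated pairs.
    fake_sketch = G(frame).detach()
    real_score = D(torch.cat([frame, real_sketch], dim=1))
    fake_score = D(torch.cat([frame, fake_sketch], dim=1))
    d_loss = (bce(real_score, torch.ones_like(real_score))
              + bce(fake_score, torch.zeros_like(fake_score)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool D while staying close to the hand-drawn sketch (L1).
    fake_sketch = G(frame)
    fake_score = D(torch.cat([frame, fake_sketch], dim=1))
    g_loss = (bce(fake_score, torch.ones_like(fake_score))
              + l1_weight * l1(fake_sketch, real_sketch))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```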

Training and Functionality

The GAN-generated sketches were used to train the RT-Sketch model, and they were further augmented with variations to mimic different hand-drawn styles. During operation, the model accepts an image of the scene and a rough sketch of the desired object arrangement, then generates a sequence of commands that move the robot toward the specified goal.
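As one plausible way to mimic hand-drawn variation (line thickness, slight misalignment, smudging), the sketches could be perturbed with standard image augmentations. The example below uses torchvision transforms purely for illustration; the specific transforms and parameters are assumptions, not the augmentations reported by the researchers.

```python
# Illustrative sketch-style augmentation with torchvision; the transforms
# and parameters are assumed, meant only to suggest how hand-drawn
# variation could be simulated.
from torchvision import transforms

sketch_augment = transforms.Compose([
    # Small rotations, shifts, and rescaling approximate imprecise drawing.
    transforms.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    # Brightness/contrast jitter approximates different pens and pressure.
    transforms.ColorJitter(brightness=0.2, contrast=0.3),
    # Mild blur approximates smudged or thicker strokes.
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
])

# Apply to a PIL image or image tensor of the goal sketch:
# augmented_sketch = sketch_augment(goal_sketch)
```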

“RT-Sketch is beneficial for spatial tasks where detailed verbal instructions would be cumbersome or when an image isn't available,” said Sundaresan.

For instance, setting a dinner table might lead to ambiguity with language like "put the utensils next to the plate." This could result in multiple interactions to clarify the model's understanding. In contrast, a simple sketch can clearly indicate the desired arrangement.

“RT-Sketch could also assist in tasks like unpacking items or arranging furniture in a new space, as well as in complex, multi-step tasks such as folding laundry,” Sundaresan added.

Evaluating RT-Sketch

The researchers tested RT-Sketch across various scenarios, evaluating six manipulation skills such as moving objects, knocking cans, and opening drawers. The model performed comparably to existing image- and language-conditioned models for basic manipulation tasks and outperformed language-based models in scenarios where goals were difficult to articulate.

“This indicates that sketches strike an effective balance; they are concise enough to avoid confusion from visual distractions while still preserving necessary semantic and spatial context,” Sundaresan noted.

Future Directions

Looking ahead, researchers plan to explore broader applications for sketches, potentially integrating them with other modalities such as language, images, and human gestures. DeepMind has several robotics models using multi-modal approaches, and the findings from RT-Sketch could enhance these systems. They are also excited about the diverse potential of sketches beyond visual representation.

“Sketches can convey motion with arrows, represent subgoals with partial sketches, and indicate constraints with scribbles, providing valuable information for manipulation tasks we have yet to investigate,” concluded Sundaresan.
