Recent advances in vision-language models (VLMs) enable the matching of natural language queries to objects in visual scenes. Researchers are exploring how to integrate these models into robotics systems, which typically struggle to generalize beyond the environments and objects they were trained on.
A groundbreaking paper by researchers from Meta AI and New York University presents an open-knowledge-based framework called OK-Robot. The system combines pre-trained machine learning (ML) models to perform pick-and-drop tasks in unfamiliar environments without any additional training.
The Challenges of Current Robotics Systems
Most robotic systems are designed for environments they have previously encountered, which limits their ability to adapt to new settings, particularly unstructured spaces like homes. Individual components have advanced significantly: VLMs excel at linking language prompts to visual objects, and robotics primitives for navigation and grasping keep improving. Yet integrating these technologies into a single system still results in suboptimal performance.
The researchers note, "Advancing this problem requires a careful and nuanced framework that integrates VLMs and robotics primitives while remaining flexible enough to incorporate new models from the VLM and robotics communities."
Overview of OK-Robot
OK-Robot integrates cutting-edge VLMs with robust robotics mechanisms to execute pick-and-drop tasks in unseen environments. It employs models trained on extensive publicly available datasets.
The framework consists of three main subsystems: an open-vocabulary object navigation module, an RGB-D grasping module, and a dropping heuristic. When entering a new space, OK-Robot requires a manual scan, which can be easily conducted using an iPhone app that captures a series of RGB-D images as the user moves through the area. These images, combined with the camera pose recorded for each frame, are used to build a 3D map of the environment.
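To illustrate this mapping step, the sketch below back-projects each depth frame into the world frame using the camera intrinsics and pose, then fuses all frames into a single point cloud. The frame format, variable names, and fusion strategy are assumptions chosen for clarity, not the authors' actual pipeline.

```python
# Minimal sketch: fusing posed RGB-D frames into a world-frame point cloud.
# The intrinsics matrix K, the 4x4 pose T_world_cam, and the frame list are
# placeholder assumptions, not OK-Robot's actual data structures.
import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray, T_world_cam: np.ndarray) -> np.ndarray:
    """depth: HxW in metres, K: 3x3 intrinsics, T_world_cam: 4x4 camera-to-world pose."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]         # pinhole back-projection
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)])  # 4xN homogeneous points
    pts_world = (T_world_cam @ pts_cam)[:3].T       # Nx3 points in the world frame
    return pts_world[z > 0]                         # drop pixels with no depth reading

# Accumulating every scanned frame yields the 3D map the robot later plans against:
# cloud = np.concatenate([backproject(d, K, T) for d, K, T in frames], axis=0)
```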
Each image is processed using a vision transformer (ViT) model to extract object information. This data, alongside environmental context, feeds into a semantic object memory module, allowing the system to respond to natural language queries for object retrieval. The memory computes embeddings of voice prompts and matches them to the closest semantic representation. Navigation algorithms then plot the most efficient path to the object, ensuring that the robot has adequate space to grasp the object safely.
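A minimal sketch of this kind of open-vocabulary matching is shown below, using CLIP embeddings and cosine similarity to pick the stored object that best fits a language query. The model choice, the `object_crops` dictionary, and the helper names are illustrative assumptions rather than OK-Robot's actual memory implementation.

```python
# Hypothetical sketch of open-vocabulary retrieval: embed the language query and
# the image crops stored in memory, then return the object with the highest
# cosine similarity to the query.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(query: str) -> torch.Tensor:
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_image(crop: Image.Image) -> torch.Tensor:
    inputs = processor(images=crop, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def best_match(query: str, object_crops: dict[str, Image.Image]) -> str:
    """object_crops maps an object id to an image crop saved during the scan."""
    q = embed_text(query)
    scores = {oid: float(q @ embed_image(crop).T) for oid, crop in object_crops.items()}
    return max(scores, key=scores.get)  # id of the object whose crop best matches
```

Because both text and images land in the same embedding space, the query never has to match a fixed label set, which is what gives the system its open-vocabulary behavior.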
Finally, the robot employs an RGB-D camera with an object segmentation model and a pre-trained grasping model to pick up the item. A similar method is applied for navigating to the drop-off point. This system allows the robot to determine the most suitable grasp for varying object types and manage destination locations that may not be level.
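To give a flavor of how segmentation and grasp prediction can be combined, the sketch below filters scored grasp candidates from a pre-trained grasp model so that only grasps lying on the segmented target object survive, then keeps the most confident one. The array shapes, distance threshold, and function name are hypothetical simplifications, not the system's actual interfaces.

```python
# Hypothetical sketch: combine an object segmentation result with scored grasp
# proposals from a pre-trained grasp model. Shapes and the distance threshold
# are illustrative assumptions.
import numpy as np

def select_grasp(grasp_centers: np.ndarray, scores: np.ndarray,
                 object_points: np.ndarray, max_dist: float = 0.03) -> np.ndarray:
    """grasp_centers: Nx3 candidate grasp positions (m), scores: N confidences,
    object_points: Mx3 points belonging to the segmented target object."""
    # Distance from each grasp centre to the nearest point on the object.
    dists = np.linalg.norm(
        grasp_centers[:, None, :] - object_points[None, :, :], axis=-1
    ).min(axis=1)
    on_object = dists < max_dist        # keep grasps that land on the object
    if not on_object.any():
        raise ValueError("no grasp candidate lies on the target object")
    best = scores[on_object].argmax()   # highest-confidence surviving grasp
    return grasp_centers[on_object][best]
```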
"From entry into a completely novel environment to beginning autonomous operations, our system averages under 10 minutes to complete its first pick-and-drop task," the researchers report.
Testing and Results
The researchers evaluated OK-Robot in ten homes, conducting 171 pick-and-drop experiments. The system completed the full operation 58% of the time, demonstrating its zero-shot capabilities: the models were never explicitly trained on these environments. By refining input queries, decluttering spaces, and minimizing adversarial objects, the success rate can exceed 82%.
Despite its potential, OK-Robot has limitations. It occasionally misaligns natural language prompts with the correct objects, struggles with certain grasps, and has hardware constraints. Moreover, the object memory module remains static post-scanning, preventing the robot from adapting to changes in object positioning or availability.
Even so, the OK-Robot project offers vital insights. First, it demonstrates that current open-vocabulary VLMs excel at identifying diverse real-world objects and navigating to them zero-shot. Second, it confirms that specialized robotic models pre-trained on vast datasets can support open-vocabulary grasping in novel settings out of the box. Finally, it highlights the potential of combining pre-trained models to accomplish zero-shot tasks without further training, paving the way for future advancements in this emerging field.