Foundation Models and Robotics: The Rise of OpenVLA
Foundation models have significantly advanced robotics by enabling vision-language-action (VLA) models: policies that can generalize to objects, scenes, and tasks beyond their training data. However, adoption of these models has been limited by their closed nature and by the lack of established practices for deploying and adapting them to new environments.
Introducing OpenVLA
To tackle these challenges, researchers from Stanford University, UC Berkeley, Toyota Research Institute, Google DeepMind, and other institutions have released OpenVLA, an open-source VLA model trained on a diverse collection of real-world robot demonstrations. OpenVLA not only outperforms comparable models on robot manipulation tasks but can also be fine-tuned easily to improve performance in multi-task settings involving many objects. Designed for efficiency, it uses optimization techniques that let it run on consumer-grade GPUs and keep fine-tuning costs low.
The Importance of Vision-Language-Action Models
Traditional robotic manipulation methods often fail to generalize beyond their training scenarios. They are typically brittle in the presence of distractors or unseen objects and struggle to adapt to even slightly altered task instructions. In contrast, large language models (LLMs) and vision-language models (VLMs) generalize well thanks to their internet-scale pretraining datasets. Recently, research labs have started using LLMs and VLMs as foundational components for building robotic policies.
Two prominent approaches are leveraging pre-trained LLMs and VLMs within modular systems for task planning and execution, and training VLAs end to end to generate robot control actions directly. Notable examples, such as RT-2 and RT-2-X, have set new benchmarks for generalist robot policies.
However, current VLAs face two major challenges: they are closed, with limited visibility into their architectures, training procedures, and data mixtures; and there are no established practices for deploying and adapting them to new robots, environments, and tasks. The researchers argue that open-source, generalist VLAs are needed to support effective adaptation, mirroring the existing open-source ecosystem around language models.
The Architecture of OpenVLA
OpenVLA is a 7-billion-parameter model built on the Prismatic-7B vision-language model. It pairs a two-part visual encoder, which fuses pretrained SigLIP and DINOv2 features for image representation, with a LLaMA-2 7B language model for processing instructions. Fine-tuned on 970,000 robot manipulation trajectories from the Open X-Embodiment dataset, OpenVLA covers a wide spectrum of robotic tasks and environments and outputs discrete action tokens that are mapped back to continuous robot control commands.
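The action-token mapping can be made concrete with a small sketch. The model discretizes each dimension of the continuous robot action into one of 256 bins, each associated with a token; the snippet below, in plain NumPy, illustrates that binning and its inverse. The bin count follows the paper, but the normalization bounds, the 7-DoF action layout, and the helper names are illustrative assumptions rather than OpenVLA's exact implementation.

```python
import numpy as np

N_BINS = 256  # OpenVLA discretizes each action dimension into 256 uniform bins

def actions_to_tokens(action, low, high, n_bins=N_BINS):
    """Map a continuous action vector to discrete bin indices.

    `low`/`high` are per-dimension normalization bounds; here they are
    illustrative, while the real statistics come from the training data.
    """
    action = np.clip(action, low, high)
    normalized = (action - low) / (high - low)            # scale to [0, 1]
    return np.clip((normalized * n_bins).astype(int), 0, n_bins - 1)

def tokens_to_actions(bins, low, high, n_bins=N_BINS):
    """Invert the discretization: bin index -> bin-center continuous value."""
    centers = (bins + 0.5) / n_bins                       # bin centers in [0, 1]
    return low + centers * (high - low)

# Example: a hypothetical 7-DoF action (delta position, delta rotation, gripper)
low, high = np.full(7, -1.0), np.full(7, 1.0)
a = np.array([0.05, -0.02, 0.10, 0.0, 0.0, 0.1, 1.0])
tokens = actions_to_tokens(a, low, high)
recovered = tokens_to_actions(tokens, low, high)          # close to the original action
```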
OpenVLA receives a natural language instruction alongside camera images and reasons over both to determine the sequence of actions needed to complete tasks such as "wipe the table." Remarkably, it outperforms the 55-billion-parameter RT-2-X model, previously considered state of the art, on the WidowX and Google Robot embodiments despite having roughly one-eighth as many parameters.
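For readers who want to try the released checkpoint, the sketch below shows what inference along these lines can look like with Hugging Face Transformers. It follows the usage pattern published with the model, but the checkpoint name, prompt template, and the `predict_action`/`unnorm_key` arguments should be treated as assumptions to verify against the official OpenVLA documentation.

```python
# Minimal inference sketch (assumes the openvla/openvla-7b checkpoint and its
# custom predict_action helper exposed via trust_remote_code; verify against
# the official OpenVLA docs before use).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("camera_frame.png")                  # current robot camera view
prompt = "In: What action should the robot take to wipe the table?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# `action` is a 7-D end-effector command (delta pose + gripper) to send to the robot.
```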
Fine-Tuning and Efficiency
The researchers explored efficient fine-tuning strategies across seven manipulation tasks, showing that fine-tuned OpenVLA policies outperform pre-trained alternatives, particularly when translating language instructions into multi-task behaviors involving several objects. OpenVLA is the only model tested that achieves a success rate above 50% on every task, positioning it as a strong default for imitation learning across diverse scenarios.
In pursuit of accessibility and efficiency, the team used low-rank adaptation (LoRA) for fine-tuning, completing task-specific adaptation in 10-15 hours on a single A100 GPU, a significant reduction in computational demands. Model quantization further shrank the model's memory footprint, enabling deployment on consumer-grade GPUs without sacrificing task performance.
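As a rough illustration of what such a parameter-efficient setup can look like, the sketch below loads the model in 4-bit precision with bitsandbytes and attaches LoRA adapters via the peft library. The rank, target modules, and quantization settings are illustrative choices, not necessarily the authors' exact recipe.

```python
# Sketch: LoRA fine-tuning of the OpenVLA backbone with the `peft` library.
# Rank, alpha, target modules, and the quantization config are illustrative
# choices rather than the authors' exact setup.
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(          # optional 4-bit loading to fit smaller GPUs
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=bnb_config,
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=32,                                 # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",          # adapt all linear projections
    task_type="CAUSAL_LM",
)
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()          # only a small fraction of the 7B weights train
```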
Open-Sourcing OpenVLA
The researchers have open-sourced the complete OpenVLA model, along with notebooks for deployment and fine-tuning and a codebase for scalable VLA training. They anticipate that these resources will spur further exploration and adaptation of VLAs in robotics. The library supports fine-tuning on individual GPUs and can orchestrate billion-parameter VLA training across multi-node GPU clusters, drawing on modern optimization and parallelization techniques.
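As a rough sketch of the multi-GPU pattern such a codebase typically builds on, the snippet below shards a model with PyTorch's FullyShardedDataParallel (FSDP). The wrapped module and hyperparameters are stand-ins, not the repository's actual training setup.

```python
# Sketch: sharding a large model across GPUs with PyTorch FSDP, the general
# pattern behind multi-node VLA training. The wrapped module here is a small
# stand-in; real training code would wrap the full 7B-parameter VLA.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

dist.init_process_group("nccl")                          # launched via torchrun on each node
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Transformer(d_model=512).cuda()         # stand-in for the VLA backbone
model = FSDP(
    model,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    use_orig_params=True,                                # keep original parameter names for the optimizer
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# ...standard training loop over robot-demonstration batches goes here...
```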
Future developments for OpenVLA aim to incorporate multiple image and proprioceptive inputs, alongside observation history. Furthermore, leveraging VLMs pre-trained on interleaved image and text data may enhance the flexibility of VLA fine-tuning.
With OpenVLA, the robotics community stands at the brink of remarkable advancements, making VLA models more accessible and adaptable for diverse applications.