Robotics startup 1X Technologies has introduced a generative model designed to make training robotic systems in simulated environments more efficient. As described in the company's latest blog post, the model tackles a key challenge in robotics: building "world models" that accurately predict how an environment changes in response to a robot's actions.
Training robots directly in physical spaces is fraught with costs and risks, prompting roboticists to rely on simulated environments for model development before real-world deployment. However, discrepancies between simulations and actual physical settings can pose significant challenges.
"Roboticists often create manually designed scenes that serve as 'digital twins' of the real world, using rigid-body simulators like MuJoCo, Bullet, and Isaac for dynamics simulation," explained Eric Jang, VP of AI at 1X Technologies. "Unfortunately, these digital twins can contain inaccuracies in physics and geometry, leading to the 'sim2real gap.' For instance, a door model downloaded online may not replicate the same spring stiffness in the handle as the door used during testing."
Generative World Models
To overcome this gap, 1X’s innovative model learns to simulate real-world dynamics by training on raw sensor data collected directly from robots. It analyzes thousands of hours of video and actuator data from the company's humanoid robots, which perform various mobile manipulation tasks in domestic and office environments.
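The core idea, learning dynamics from logged robot interactions rather than hand-built physics, can be illustrated with a toy sketch. Everything below is an assumption for illustration: the flattened-frame and action dimensions, the `LinearWorldModel` class, and the linear predictor itself are stand-ins, not 1X's actual architecture, which is a large generative video model.

```python
# Toy sketch of an action-conditioned world model trained on logged
# (frame, action, next_frame) triples. A single linear map stands in
# for the generative video model; all names and shapes are assumptions.
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 64    # flattened pixels of a (tiny) observation
ACTION_DIM = 8    # actuator commands per timestep


class LinearWorldModel:
    """Predicts the next frame from (current frame, action)."""

    def __init__(self):
        self.W = rng.normal(0, 0.01, size=(FRAME_DIM, FRAME_DIM + ACTION_DIM))

    def predict(self, frame, action):
        x = np.concatenate([frame, action])
        return self.W @ x

    def train_step(self, frame, action, next_frame, lr=1e-2):
        x = np.concatenate([frame, action])
        err = self.W @ x - next_frame      # prediction error on real data
        self.W -= lr * np.outer(err, x)    # gradient step on squared error
        return float(np.mean(err ** 2))


# Train on interaction logs (here: synthetic stand-in data generated
# by a hidden "true" dynamics matrix).
model = LinearWorldModel()
true_W = rng.normal(0, 0.1, size=(FRAME_DIM, FRAME_DIM + ACTION_DIM))
losses = []
for _ in range(2000):
    frame = rng.normal(size=FRAME_DIM)
    action = rng.normal(size=ACTION_DIM)
    next_frame = true_W @ np.concatenate([frame, action])
    losses.append(model.train_step(frame, action, next_frame))

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The point of the sketch is the data flow, not the model class: prediction error against real sensor logs is the training signal, so accuracy improves as the pool of interaction data grows, which is the property Jang describes below.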
"We gathered data from our 1X offices, supported by a team of Android Operators for annotation and filtering," Jang said. "By building a simulator directly from real-world interactions, we can achieve dynamics that align more closely with actual scenarios as the pool of interaction data grows."
The world model excels at simulating object interactions. Videos shared by the company show it accurately predicting scenarios such as a robot grasping boxes and handling diverse objects, from rigid bodies to deformable items like curtains and laundry, while also capturing complex dynamics such as avoiding obstacles and maintaining safe distances from people.
Challenges of Generative Models
Despite its advancements, the model faces ongoing challenges due to environmental changes. Like any simulator, it requires updates as the operational environment evolves. However, the researchers believe that the model's learning approach facilitates easier updates.
"The generative model may experience a sim2real gap if its training data is outdated," Jang acknowledged. "The goal is to create a learned simulator that can be continuously refined with fresh real-world data without the need for manual adjustments."
1X's approach draws on lessons from generative video systems such as OpenAI's Sora and Runway's models, which demonstrate that, given the right training data, generative models can learn to stay consistent over time.
While most such models generate videos from text prompts, 1X focuses on generative systems that respond to actions as each frame is generated. Google researchers have used a similar technique to train a generative model that simulates the interactive environment of the game DOOM.
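What distinguishes this from text-to-video generation is the closed loop: each frame is produced one step at a time, conditioned on an action that can itself react to the frames generated so far. A minimal sketch, where `predict_next_frame` is a hypothetical stand-in for the learned model and the toy dynamics are assumptions:

```python
# Sketch of interactive (action-conditioned) generation: the model emits
# one frame per step, and the action at each step may depend on the
# frames generated so far. predict_next_frame is a hypothetical stub.
import numpy as np


def predict_next_frame(frame, action):
    """Stand-in world-model step: toy dynamics, not a real model."""
    return 0.9 * frame + 0.1 * np.tanh(action)


def rollout(initial_frame, action_stream, steps):
    """Closed-loop generation, unlike text-to-video where the whole
    clip is produced from a fixed prompt up front."""
    frame = initial_frame
    frames = [frame]
    for t in range(steps):
        action = action_stream(frame, t)  # e.g. a policy reacting live
        frame = predict_next_frame(frame, action)
        frames.append(frame)
    return frames


frames = rollout(
    initial_frame=np.zeros(4),
    action_stream=lambda frame, t: -frame + 1.0,  # simple reactive policy
    steps=50,
)
print(len(frames))  # initial observation plus one frame per action
```

Because the policy sees each generated frame before choosing the next action, the rollout can be used as a stand-in environment for evaluating robot behavior, which is what makes this kind of model useful as a learned simulator.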
Despite these advancements, challenges remain. The absence of a clearly defined world simulator can sometimes result in unrealistic scenarios—for example, the model may mistakenly predict that a suspended object won't fall or might cause an object to vanish between frames. Addressing these issues will require ongoing effort.
A potential solution lies in continuously accumulating more data to enhance model training. "Recent advancements in generative video modeling have been remarkable, and results from OpenAI Sora illustrate that scaling data and computational power can lead to significant improvements," Jang noted.
1X is actively engaging the community in this initiative by releasing its models and weights while planning competitions that offer monetary prizes to participants who contribute to refining the models.
"We're exploring various methods for world modeling and video generation," Jang concluded, emphasizing the company’s commitment to continuous innovation.