Stability AI is enhancing its lineup of generative AI models with the introduction of Stable Video 4D, a significant advancement in video generation technology.
While numerous generative AI tools for video creation exist, such as OpenAI's Sora, Runway, Haiper, and Luma AI, Stable Video 4D distinguishes itself by building upon Stability AI's existing Stable Video Diffusion model, which transforms images into videos. The new model goes further: it accepts a video as input and generates multiple novel-view videos of its subject from eight different perspectives.
Varun Jampani, team lead for 3D Research at Stability AI, shared, “We envision Stable Video 4D being utilized in movie production, gaming, AR/VR, and various applications where there’s a need to view dynamically moving 3D objects from different camera angles.”
Advancing Beyond 2D: From Stable Video 3D to Stable Video 4D
Stable Video 4D represents a leap beyond Stability AI’s previous offering, Stable Video 3D, introduced in March, which allowed users to create short 3D videos from an image or text prompt. Stable Video 4D extends that capability with an additional dimension: time.
Jampani clarified that the four dimensions consist of width (x), height (y), depth (z), and time (t), enabling Stable Video 4D to render a moving 3D object from various angles and within different timeframes.
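To make the fourth dimension concrete, the model's output can be pictured as a grid of frames indexed by camera angle and time. The sketch below is illustrative only: the eight views come from the article, while the frame count and resolution are assumptions chosen for the example.

```python
# Illustrative only: a hypothetical layout for SV4D-style output.
# The real tensor shapes are not described in the article; the frame
# count and toy resolution here are assumptions.
import numpy as np

num_views, num_frames = 8, 21          # 8 camera angles (from the article); frame count assumed
height, width, channels = 64, 64, 3    # toy resolution for the sketch

# A "4D" result can be thought of as a view-by-time grid of 2D frames:
# axis 0 indexes the camera angle (a path through x, y, z space),
# axis 1 indexes time (t), and the rest are ordinary image axes.
video_grid = np.zeros((num_views, num_frames, height, width, channels),
                      dtype=np.uint8)

# Fixing a view gives an ordinary video; fixing a timestep gives a turntable:
video_from_view_3 = video_grid[3]      # shape (21, 64, 64, 3)
orbit_at_frame_10 = video_grid[:, 10]  # shape (8, 64, 64, 3)
```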
“Our innovation stems from merging the capabilities of our Stable Video Diffusion and Stable Video 3D models, fine-tuned with a meticulously curated dynamic 3D object dataset,” Jampani explained.
Unlike existing approaches that use separate networks for novel-view synthesis and video generation, Stable Video 4D performs both within a single unified model. It also employs enhanced attention mechanisms that let each video frame attend to neighboring frames across different camera angles and timestamps, resulting in improved 3D coherence and temporal smoothness.
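The attention idea can be sketched in a few lines of PyTorch: tokens from every frame are flattened into one sequence so that attention spans camera angles and timestamps jointly. This is not Stability AI's implementation; the module, shapes, and sizes below are assumptions made purely to illustrate the mechanism described above.

```python
# A minimal sketch of attention across both views and timestamps.
# NOT Stability AI's implementation; names and shapes are assumptions.
import torch
import torch.nn as nn

class ViewTimeAttention(nn.Module):
    """Lets every frame token attend to tokens from frames at other
    camera angles and other timestamps in a single pass."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, time, tokens, dim) — one token set per frame
        b, v, t, n, d = x.shape
        # Flatten views, time, and spatial tokens into one sequence so
        # attention spans camera angles and timestamps jointly.
        seq = x.reshape(b, v * t * n, d)
        out, _ = self.attn(seq, seq, seq)
        return out.reshape(b, v, t, n, d)

frames = torch.randn(1, 8, 5, 16, 64)  # 8 views, 5 timesteps, toy sizes
fused = ViewTimeAttention(dim=64)(frames)
print(fused.shape)  # torch.Size([1, 8, 5, 16, 64])
```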
Differentiating Generative AI Techniques
While generative AI for 2D images often relies on inpainting and outpainting techniques to complete images, Stable Video 4D operates differently. Rather than filling in around partial input data, it synthesizes all eight novel-view videos from scratch, using the initial video input only as a guide.
“Stable Video 4D synthesizes these videos from scratch without transferring explicit pixel data from input to output. Instead, the network relies on implicit information flow,” Jampani stated.
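The distinction can be sketched in code: inpainting copies known pixels through unchanged and synthesizes only the gaps, whereas SV4D-style generation produces every output pixel conditioned on, but never copied from, the input. Every function below is a hypothetical stand-in, not a real model or API.

```python
# Purely schematic contrast between the two approaches described above.
import numpy as np

rng = np.random.default_rng(0)

def inpaint_2d(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """2D inpainting: known pixels pass through unchanged; only the
    masked region is synthesized (explicit pixel transfer)."""
    synthesized = rng.random(image.shape)       # stand-in for a model
    return np.where(mask, synthesized, image)   # input pixels copied over

def synthesize_views(input_video: np.ndarray, num_views: int = 8) -> np.ndarray:
    """SV4D-style: the input video only conditions generation; every
    output pixel of every novel view is produced from scratch."""
    conditioning = input_video.mean()           # stand-in encoder signal
    out_shape = (num_views, *input_video.shape)
    return rng.random(out_shape) + conditioning  # no pixels copied over
```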
Stable Video 4D is currently accessible for research evaluation on Hugging Face, with commercial offerings yet to be announced.
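For readers who want to try the research release, the checkpoint could be fetched with the standard huggingface_hub client. The repository and file names below are assumptions made for illustration; verify them on Stability AI's Hugging Face organization page.

```python
# Hypothetical sketch using the real `huggingface_hub` client.
# The repo id and filename are assumptions, not confirmed names.
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="stabilityai/sv4d",     # assumed repository name
    filename="sv4d.safetensors",    # assumed checkpoint filename
)
print(f"Checkpoint downloaded to {checkpoint_path}")
```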
According to Jampani, “Stable Video 4D can handle single-object videos several seconds long against plain backgrounds. We aim to extend its capabilities to longer videos and more complex scenes.”