Stability AI expands its generative AI model offerings with the launch of Stable Video 3D (SV3D).
As the name suggests, SV3D is a generative AI video tool designed to render 3D videos. Building on the foundational Stable Video technology, which allows users to create short videos from images or text prompts, SV3D enhances video capabilities for novel view synthesis and 3D generation, making substantial improvements over the previous Stable Video Diffusion model.
With SV3D, Stability AI adds significant depth to its video generation technology, enabling the generation of multi-view videos and 3D meshes from a single input image. The model is now available for commercial use with a Stability AI Professional Membership, priced at $20 per month for creators and developers earning less than $1 million annually. For non-commercial purposes, users can download the model weights from Hugging Face.
Here’s a quick video demonstration I generated. While there may be slight distortions, the object forms in the video remain coherent and stable as the camera rotates.
Target Use Cases: Game Creation and E-Commerce
“By adapting our Stable Video Diffusion image-to-video model with camera path conditioning, Stable Video 3D generates multi-view videos of an object,” the company noted in a blog post about the new model.
“Stable Video 3D is particularly valuable for generating 3D assets in the gaming sector,” said Varun Jampani, lead researcher at Stability AI. “It also produces 360-degree orbital videos that enhance the immersive shopping experience in e-commerce.”
From Stable Zero123 to SV3D
Stability AI is well-known for its Stable Diffusion text-to-image generative AI models, including SDXL and Stable Diffusion 3.0, the latter currently in early research preview. The open-source Stable Diffusion 1.5 model underpins many AI image generation and video platforms, such as Runway and Leonardo AI.
In December 2023, Stability AI released the Stable Zero123 model, which introduced new capabilities for 3D image creation. Emad Mostaque, founder and CEO of Stability AI, stated that this model was the first in a series focusing on 3D technologies.
SV3D adopts a different approach to 3D generation compared to Stable Zero123.
“Stable Video 3D serves as both a successor and an enhancement of our earlier model, Stable Zero123,” Jampani explained. “This new model employs a novel view synthesis network that generates multiple novel view images from a single input.”
In contrast to Stable Zero123, which relies on Stable Diffusion to output one image at a time, SV3D leverages Stable Video Diffusion models to produce multiple novel views simultaneously, resulting in superior quality and more effective 3D mesh generation from a single image.
Consistent Views from Any Angle
A research paper by Stability AI discusses techniques for generating 3D visuals from a single image through latent video diffusion.
“Recent advancements in 3D generation adapt 2D generative models for novel view synthesis (NVS) and 3D optimization,” the paper states. However, many existing methods are limited in the views they can produce and yield inconsistent outputs.
SV3D's primary strength lies in its ability to provide consistent multi-view images of an object, offering coherent perspectives from various angles. The research paper emphasizes this advancement, stating, “Unlike prior approaches that struggle with restricted views and inconsistencies, Stable Video 3D provides coherent views from any angle with effective generalization.”
In addition to enhancing view synthesis, SV3D aims to optimize 3D meshes. Its multi-view consistency allows for high-quality 3D mesh generation directly from the outputs produced.
“Stable Video 3D utilizes its multi-view consistency to optimize 3D Neural Radiance Fields (NeRF) and mesh representations, significantly improving the quality of generated 3D meshes,” Stability AI stated in its announcement.
Two Varieties: SV3D_u and SV3D_p
SV3D is available in two variants, each catering to distinct use cases.
SV3D_u generates orbital videos from single image inputs without requiring camera conditioning. Camera conditioning means giving the model additional input that describes the camera viewpoints, often an image or camera parameters, to guide the generation process.
Conversely, SV3D_p supports both single images and orbital views, empowering users to create 3D videos along specified camera paths.
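To make the idea of a "specified camera path" concrete, here is a minimal Python sketch of how an orbital path could be described as one azimuth and elevation angle per frame. This is an illustration only, not SV3D's actual interface: the function names, the 21-frame default, the 10-degree elevation, and the orbit radius are all assumptions made for the example.

```python
import math
from typing import List, Tuple


def orbital_camera_path(num_frames: int = 21,
                        elevation_deg: float = 10.0) -> List[Tuple[float, float]]:
    """Return one (azimuth_deg, elevation_deg) pair per frame.

    Hypothetical helper: azimuths are spaced evenly over a full 360-degree
    orbit and the elevation is held constant.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    step = 360.0 / num_frames
    return [(i * step, elevation_deg) for i in range(num_frames)]


def camera_position(azimuth_deg: float, elevation_deg: float,
                    radius: float = 2.0) -> Tuple[float, float, float]:
    """Convert spherical camera angles to an (x, y, z) position.

    Assumes the object sits at the origin and z points up; the radius
    is an arbitrary illustrative value.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = radius * math.cos(el) * math.cos(az)
    y = radius * math.cos(el) * math.sin(az)
    z = radius * math.sin(el)
    return (x, y, z)


if __name__ == "__main__":
    # Print the camera position for each frame of a short orbit.
    for azimuth, elevation in orbital_camera_path(num_frames=4, elevation_deg=0.0):
        print(azimuth, elevation, camera_position(azimuth, elevation))
```

In this sketch, a model without camera conditioning would not need such a list at all, while a camera-conditioned model would receive it alongside the input image. Varying the elevation from frame to frame, rather than holding it fixed, would describe a more general path than a flat orbit.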