As OpenAI welcomes back Sam Altman, its competitors are ramping up their efforts in the artificial intelligence (AI) arena. Following the release of Anthropic's Claude 2.1 and Adobe's acquisition of Rephrase.ai, Stability AI has announced Stable Video Diffusion, marking its entry into the increasingly popular video generation domain.
Introducing Stable Video Diffusion
Stable Video Diffusion (SVD), available for research only, comprises two advanced AI models—SVD and SVD-XT—that generate short video clips from still images. Stability AI claims these models produce high-quality outputs that can compete with or even surpass existing AI video generators.
Both models are open-sourced as part of the research preview, with plans to incorporate user feedback to enhance functionality for future commercial applications.
Understanding Stable Video Diffusion
According to Stability AI's blog post, SVD and SVD-XT are latent diffusion models that take a single still image as input and generate a 576 x 1024 video clip. The output frame rate can be set anywhere from three to 30 frames per second, though clips are limited to about four seconds. The SVD model generates 14 frames from a still image, while SVD-XT extends this to 25 frames.
To develop Stable Video Diffusion, Stability AI trained its base model on approximately 600 million samples from a curated video dataset, then fine-tuned it on a smaller, high-quality dataset of up to one million clips. This training enables the models to perform tasks such as text-to-video and image-to-video generation.
While the training data was sourced from publicly available research datasets, the exact origins remain unspecified.
Importantly, the whitepaper on SVD indicates that this model can be further fine-tuned to support multi-view synthesis, allowing for consistent views of an object from a single image.
The potential applications for Stable Video Diffusion span various sectors, including advertising, education, and entertainment.
Output Quality and Limitations
In external evaluations, SVD outputs have demonstrated high quality, outperforming leading closed text-to-video models from Runway and Pika Labs. However, Stability AI acknowledges that these models are still in their early stages: they frequently struggle with photorealism, may produce videos with little or no motion, and often fail to render faces and people accurately.
Moving forward, the company aims to refine both models, address current limitations, and introduce new features like text prompt support and text rendering for commercial use. They emphasize that this release serves as an invitation for open investigation to identify and resolve issues, including potential biases, to ensure safe deployment.
Stability AI envisions a variety of models built on this foundation, akin to the ecosystem that grew up around Stable Diffusion. The company is also inviting users to sign up for an upcoming web experience that will enable text-to-video generation, although the exact timeline for its availability remains unclear.
How to Use the Models
To explore the Stable Video Diffusion models, users can access the code in Stability AI's GitHub repository and the weights required to run the models locally on its Hugging Face page. Usage is permitted only upon acceptance of terms that outline allowed and excluded applications.
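For readers who want a sense of what local execution looks like, below is a minimal sketch using the Hugging Face diffusers library's StableVideoDiffusionPipeline. It assumes the SVD-XT weights are published under the model ID stabilityai/stable-video-diffusion-img2vid-xt, that a CUDA-capable GPU is available, and that the input image URL is a placeholder you would replace with your own still image; exact model IDs and parameters may differ from Stability AI's official instructions.

```python
# Illustrative sketch only: model ID, image URL, and settings are assumptions,
# not Stability AI's official usage instructions.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the (assumed) SVD-XT weights in half precision to fit on consumer GPUs.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # assumed Hugging Face model ID
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Load a still image and resize it to the 1024 x 576 resolution the model expects.
image = load_image("https://example.com/input_still.png")  # placeholder URL
image = image.resize((1024, 576))

# Generate a short clip (SVD-XT produces up to 25 frames from a single image).
result = pipe(image, decode_chunk_size=8, generator=torch.manual_seed(42))
frames = result.frames[0]

# Write the frames out as an MP4 at 7 frames per second.
export_to_video(frames, "generated.mp4", fps=7)
```

Running a sketch like this on a single still image would yield a roughly three-to-four-second clip, consistent with the frame counts and frame rates described above.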
Currently, permissible use cases include generating artwork for design and educational or creative tools. However, generating factual representations of people or events is outside the scope of this project, according to Stability AI.