Open-Sora 1.0: The World’s First Open-Source Video Generation Model Inspired by Sora, Just Launched in China

OpenAI's Sora has recently drawn worldwide attention for its impressive video generation capabilities. The newly launched Open-Sora 1.0 answers it with a fully open-source video generation project that provides complete access to training details and model weights, and the cost of reproducing its results on 64 GPUs has dropped to roughly $10,000, a 46% reduction.

With the more affordable training and inference protocol introduced by the Colossal-AI team, Open-Sora 1.0 stands as the first open-source video generation model built on a Sora-like architecture. The project open-sources the entire training pipeline, from data preprocessing to model weights, aiming to inspire a new wave of video creation among AI enthusiasts worldwide.

Demonstrating Open-Sora 1.0's Capabilities

To showcase the power of Open-Sora, the Colossal-AI team has released an eye-catching demo video featuring dynamically generated urban landscapes, offering a glimpse into the potential of this Sora-style reproduction. The project also provides extensive resources, including architectural specifications, trained model weights, data preprocessing scripts, demo displays, and user-friendly tutorials, all freely accessible on GitHub.

Exploring the Sora Replication Strategy

In this section, we highlight key components of the Sora replication strategy, including model architecture, training processes, data preprocessing, and generation effectiveness.

Model Architecture

Open-Sora employs the Diffusion Transformer (DiT) architecture. Starting from PixArt-α, a high-quality open-source text-to-image model, the team inserted a temporal attention layer to extend it to video data. The full architecture comprises a pretrained Variational Autoencoder (VAE), a text encoder, and a Spatial-Temporal Diffusion Transformer (STDiT) that captures temporal relationships between frames.
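To make the structure concrete, here is a minimal sketch, assuming latents shaped (batch, frames, patches, dim), of how a temporal attention layer can slot into a DiT-style block. The class name STDiTBlockSketch and all layer choices are illustrative assumptions, not Open-Sora's actual implementation.

```python
# A minimal sketch, NOT Open-Sora's actual code: a DiT-style block extended
# with temporal attention. Assumes latents shaped (batch, frames, patches, dim),
# where patches are the spatial tokens produced by the VAE and patch embedding.
import torch
import torch.nn as nn

class STDiTBlockSketch(nn.Module):  # hypothetical name
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        b, f, p, d = x.shape
        # Spatial attention: each frame attends over its own patches.
        h = self.norms[0](x).reshape(b * f, p, d)
        x = x + self.spatial_attn(h, h, h)[0].reshape(b, f, p, d)
        # Temporal attention: each patch position attends across frames.
        h = self.norms[1](x).permute(0, 2, 1, 3).reshape(b * p, f, d)
        x = x + self.temporal_attn(h, h, h)[0].reshape(b, p, f, d).permute(0, 2, 1, 3)
        # Cross-attention injects the text-encoder embeddings (b, seq, dim).
        h = self.norms[2](x).reshape(b, f * p, d)
        x = x + self.cross_attn(h, text, text)[0].reshape(b, f, p, d)
        return x + self.mlp(self.norms[3](x))
```

Factoring attention this way, spatial within a frame and temporal across frames, keeps the cost far below full attention over every patch of every frame, which is the usual motivation for the STDiT design.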

Overview of the Training Process

The training and inference process is divided into several stages. During training, a pretrained VAE encoder first compresses the video data, and the STDiT diffusion model is then trained in the resulting latent space together with text embeddings. During inference, Gaussian noise sampled in the latent space is combined with the prompt embeddings and iteratively denoised by STDiT, and the denoised latents are decoded back into video frames.
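The inference flow just described can be summarized in a short schematic. Everything here is a placeholder: stdit, vae, and text_encoder stand in for the pretrained modules, and the crude update rule is only a stand-in for whatever sampler the project actually uses.

```python
# Schematic of the inference flow; `stdit`, `vae`, and `text_encoder` are
# placeholders for pretrained modules, and the simple update rule is
# illustrative, not the project's actual sampler.
import torch

@torch.no_grad()
def generate_video(prompt_ids, stdit, vae, text_encoder, steps=50,
                   shape=(1, 4, 16, 32, 32)):  # (batch, latent ch, frames, h, w)
    text_emb = text_encoder(prompt_ids)      # prompt -> text embeddings
    z = torch.randn(shape)                   # Gaussian noise in latent space
    for i in reversed(range(steps)):         # iterative denoising
        t = torch.full((shape[0],), i)
        noise_pred = stdit(z, t, text_emb)   # predict noise at step t
        z = z - noise_pred / steps           # crude denoising update (illustrative)
    return vae.decode(z)                     # latents -> video frames
```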

The replication strategy encompasses three primary phases (a configuration sketch follows the list):

1. Large-scale Image Pretraining: Leveraging existing text-to-image models to reduce video pretraining costs.

2. Large-scale Video Pretraining: Enhancing the model's generalization ability by learning temporal correlations in video data.

3. High-Quality Video Fine-tuning: Refining the model using lengthy, high-quality video datasets to significantly elevate output quality.
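As a rough illustration of how such a schedule might be organized; the stage names, data descriptions, and initializations below are assumptions, not the published configuration:

```python
# Hypothetical outline of the three-phase schedule; the stage names, data
# descriptions, and initializations are assumptions, not the published config.
TRAINING_PHASES = [
    {"stage": "image_pretrain", "init_from": "PixArt-alpha weights",
     "data": "large-scale text-image pairs"},
    {"stage": "video_pretrain", "init_from": "phase 1 + new temporal layers",
     "data": "large-scale public video clips"},
    {"stage": "video_finetune", "init_from": "phase 2",
     "data": "long, high-quality videos"},
]

for phase in TRAINING_PHASES:
    print(f"{phase['stage']}: init from {phase['init_from']}, data: {phase['data']}")
```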

For training, the team utilized 64 H800 GPUs. Because the first phase leverages existing pretrained image models, most of the expense falls in the later phases: approximately $7,000 for the second phase and $4,500 for the third, for a reported total of around $10,000.

Innovations in Data Preprocessing

To facilitate the Sora replication process, the Colossal-AI team has developed user-friendly preprocessing scripts. These tools enable seamless video pretraining, covering the download of public video datasets and the segmentation of longer videos into shorter clips. The scripts also use a large language model to generate video captions, significantly lowering the barrier to getting the project started.
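As a flavor of what such scripts automate, here is a minimal sketch that cuts a long video into fixed-length clips with ffmpeg's segment muxer. The paths and clip length are assumptions, and the real pipeline also handles dataset download, scene detection, and LLM-based captioning.

```python
# Minimal sketch of one preprocessing step: cutting a long video into
# fixed-length clips with ffmpeg. Paths and clip length are assumptions.
import subprocess
from pathlib import Path

def split_video(src: str, out_dir: str, clip_seconds: int = 10) -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", src,
        "-c", "copy",                       # cut at keyframes, no re-encoding
        "-f", "segment",                    # emit fixed-length segments
        "-segment_time", str(clip_seconds),
        "-reset_timestamps", "1",
        str(Path(out_dir) / "clip_%04d.mp4"),
    ], check=True)

split_video("raw/long_video.mp4", "clips")
```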

Practical Applications of Video Generation

Open-Sora has demonstrated its capabilities across a variety of generated scenarios, including aerial views of waves crashing against cliffs, majestic waterfalls, serene underwater scenes of turtles gliding through coral reefs, and breathtaking time-lapse footage of a star-studded sky. The model weights are freely available through the Open-Sora community for anyone who wants to explore further.

Future Enhancements and Efficiency Improvements

Open-Sora 1.0 was trained on only 400K samples, which still leads to minor inaccuracies such as a turtle generated with an extra limb, and the team is committed to further improving the model's performance and output quality.

Colossal-AI also provides an acceleration system featuring operator optimization and hybrid parallelization strategies to boost training efficiency. Notably, the team achieved a 1.55x training speedup on 64-frame 512x512 videos, underscoring the system's capacity for processing long video sequences.
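As a generic illustration of what operator-level optimization buys (this is standard PyTorch, not Colossal-AI's actual kernels), PyTorch 2's fused scaled_dot_product_attention avoids materializing the full attention-score matrix, which matters for the long token sequences produced by 64-frame 512x512 videos.

```python
# Generic illustration of operator-level optimization (standard PyTorch,
# not Colossal-AI's kernels): the fused scaled_dot_product_attention avoids
# materializing the full attention-score matrix. Requires a CUDA device.
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)

# Naive attention builds a 4096x4096 score matrix per head.
scores = (q @ k.transpose(-2, -1)) / 64 ** 0.5
naive = torch.softmax(scores.float(), dim=-1).half() @ v

# The fused kernel computes the same result with far less memory traffic.
fused = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(naive, fused, atol=1e-2))  # True (up to fp16 tolerance)
```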

For ongoing updates and advancements in the Open-Sora project, visit their GitHub page. The team intends to continually refine the model by integrating more diverse video data, enhancing output quality and supporting multiple resolutions, paving the way for AI applications in film, gaming, and advertising.
