How Diffusion Transformers Power OpenAI's Sora and Will Revolutionize Generative AI

OpenAI's Sora, which can generate videos and interactive 3D environments on the fly, is a striking demonstration of how far generative AI (GenAI) has come. Yet the architecture behind that breakthrough, known as the diffusion transformer, has been part of the AI research community for several years.

The diffusion transformer, which also powers Stability AI’s latest image generator, Stable Diffusion 3.0, is set to revolutionize the GenAI landscape by allowing models to scale to unprecedented levels. NYU computer science professor Saining Xie began the research project that produced the diffusion transformer in June 2022, working with his mentee William Peebles, who now co-leads Sora at OpenAI. Together, they fused two key machine learning concepts, diffusion and transformers, to form the diffusion transformer.

Modern AI-driven media generators, like OpenAI’s DALL-E 3, rely on a process called diffusion to produce a variety of outputs, including images, videos, audio, music, and 3D models. The idea is simpler than it sounds: noise is gradually added to a piece of media, such as an image, until it becomes unrecognizable, and this is repeated across many examples to build a dataset of noisy media. When a diffusion model trains on that data, it learns to remove the noise step by step, gradually working toward a target output, like a new image.
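
To make the noising idea concrete, here is a minimal sketch in Python with NumPy. The linear noise schedule, step count, and variable names are illustrative assumptions for a toy example, not the actual recipe used by DALL-E 3, Sora, or Stable Diffusion.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))              # stand-in for a training image, values in [0, 1]

num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)   # per-step noise variances (a common linear schedule)
signal_left = np.cumprod(1.0 - betas)        # fraction of the original signal surviving by step t

def add_noise(x0, t):
    """Return a noisy version of x0 at diffusion step t, plus the exact noise that was added."""
    noise = rng.standard_normal(x0.shape)
    a = signal_left[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise, noise

barely_noisy, _ = add_noise(image, t=10)        # still looks almost like the original
nearly_pure_noise, _ = add_noise(image, t=990)  # almost unrecognizable
```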

Typically, diffusion models rely on a “backbone,” often a U-Net, that learns to estimate the noise to be removed. U-Nets do this well, but their complexity can slow down the diffusion process. Fortunately, transformers can take the U-Net’s place as the backbone, yielding significant performance and efficiency gains.
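
As a rough illustration of what swapping the backbone means, the sketch below uses PyTorch’s built-in transformer encoder as a stand-in denoiser: it treats noisy image patches as tokens and is trained to predict the noise that was mixed in. The dimensions, the single mixing coefficient, and the one-step training loop are simplifying assumptions, not the architecture of Sora or Stable Diffusion 3.0.

```python
import torch
import torch.nn as nn

class ToyDiffusionTransformer(nn.Module):
    """A tiny transformer playing the role of the denoising backbone."""
    def __init__(self, patch_dim=48, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)            # image patches -> tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.unembed = nn.Linear(d_model, patch_dim)          # tokens -> predicted noise per patch

    def forward(self, noisy_patches):
        return self.unembed(self.backbone(self.embed(noisy_patches)))

model = ToyDiffusionTransformer()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One toy training step: mix noise into clean "images" and ask the model to recover that noise.
clean = torch.rand(8, 64, 48)                 # batch of 8 images, each split into 64 patches of 48 values
noise = torch.randn_like(clean)
alpha = 0.5                                   # how much signal survives at this diffusion step
noisy = alpha ** 0.5 * clean + (1 - alpha) ** 0.5 * noise

loss = nn.functional.mse_loss(model(noisy), noise)
loss.backward()
optimizer.step()
```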

Transformers excel at complex reasoning tasks and power advanced models such as GPT-4, Gemini, and ChatGPT. Their defining feature is the “attention mechanism”: for each piece of input data (like image noise), a transformer weighs the relevance of every other input (other noise components) and draws on them to generate an output (an estimate of the noise).
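
The sketch below shows the core of that attention computation in a few lines of NumPy: every input token scores its relevance to every other token, and each output is a weighted mix of all of them. The toy sizes and random projection matrices are assumptions for illustration, not any production model’s weights.

```python
import numpy as np

rng = np.random.default_rng(1)
tokens = rng.standard_normal((16, 32))          # 16 inputs (e.g., noisy image patches), 32 features each

# Random projection matrices standing in for learned query/key/value weights.
Wq, Wk, Wv = (rng.standard_normal((32, 32)) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

scores = Q @ K.T / np.sqrt(K.shape[-1])         # how relevant each input is to every other input
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows of relevance weights summing to 1
output = weights @ V                            # each output token mixes information from all inputs

print(output.shape)                             # (16, 32): same tokens, now context-aware
```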

This attention mechanism not only simplifies the structure of transformers compared to other models but also enables parallel processing. This means larger transformer models can be trained with manageable increases in computational resources.

“What transformers bring to the diffusion process is comparable to an engine upgrade,” Xie explained in an email interview. “The integration of transformers signifies a notable advancement in scalability and effectiveness. This is particularly visible in models like Sora, which benefits from training on large volumes of video data and extensive model parameters to fully exploit the transformative power of transformers at scale.”

So, why has it taken so long for diffusion transformers to gain traction in projects like Sora and Stable Diffusion? Xie believes that the significance of having a scalable backbone model has only recently become apparent. “The Sora team has truly excelled in demonstrating the vast possibilities that this scalable approach offers,” he said. “They’ve made it abundantly clear that U-Nets are a thing of the past, while transformers are the future for diffusion models.”

According to Xie, integrating diffusion transformers with existing models—whether for image, video, audio, or other media—is feasible. Though the current training process for diffusion transformers may present some inefficiencies and potential performance drawbacks, he believes these challenges can be overcome in time. “The essential takeaway is simple: move away from U-Nets and adopt transformers, as they are faster, more efficient, and highly scalable,” he noted. “I’m eager to explore the convergence of content understanding and creation within the diffusion transformer framework. Currently, these domains operate separately—one focused on understanding and the other on creation. I envision a future where these areas integrate, necessitating standardized architectures, with transformers being the ideal foundation for such integration.”

If Sora and Stable Diffusion 3.0 are any indication of the potential of diffusion transformers, we’re certainly in for an exciting future.
