Stability AI has released an early preview of its next-generation text-to-image generative AI model, Stable Diffusion 3.0. This update follows a year of iterative enhancements, showcasing increasing sophistication and quality in image generation. The previous SDXL release in July significantly upgraded the base model, and now the company aims for even greater advancements.
Stable Diffusion 3.0 focuses on enhanced image quality and performance, particularly in generating images from multi-subject prompts. One notable improvement is in typography, addressing a previous weakness by delivering accurate and consistent spelling within generated images. These improvements are crucial, as competitors like DALL-E 3, Ideogram, and Midjourney have also prioritized this feature in their recent updates. Stability AI is offering Stable Diffusion 3.0 in various model sizes, ranging from 800M to 8B parameters.
This update marks a significant shift: not merely an enhancement of previous models, but a complete overhaul based on a new architecture. “Stable Diffusion 3 is a diffusion transformer, a new architecture akin to that used in OpenAI’s recent Sora model,” stated Emad Mostaque, CEO of Stability AI. “It is the true successor to the original Stable Diffusion.”
The transition to diffusion transformers and flow matching marks a new direction for the company's image generation work. Stability AI has experimented with various techniques, recently previewing Stable Cascade, which uses the Würstchen architecture to boost performance and accuracy. Stable Diffusion 3.0 instead adopts diffusion transformers, a clear departure from its predecessors.
Mostaque explained, “Stable Diffusion did not have a transformer before.” This architecture, foundational to many generative AI advancements, has largely been reserved for text models, while diffusion models dominated image generation. The introduction of Diffusion Transformers (DiTs) optimizes the use of computational resources and enhances performance by replacing the traditional U-Net backbone with transformers operating on latent image patches.
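To make the architectural change concrete, the following is a minimal, illustrative PyTorch sketch of the DiT idea: the latent image is cut into patches, each patch becomes a token, and a plain transformer processes the token sequence where a U-Net backbone would otherwise sit. The class name, layer sizes, and conditioning scheme here are hypothetical simplifications, not Stability AI's actual model.

```python
# Illustrative sketch of a DiT-style model (placeholder, not Stability AI's code):
# latent patches become tokens, a standard transformer replaces the U-Net backbone.
import torch
import torch.nn as nn

class MinimalDiT(nn.Module):
    def __init__(self, latent_channels=4, patch_size=2, dim=512, depth=8, heads=8):
        super().__init__()
        # Patchify: each (patch_size x patch_size) latent patch becomes one token of width `dim`
        self.patchify = nn.Conv2d(latent_channels, dim, kernel_size=patch_size, stride=patch_size)
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Project each token back to a latent patch (e.g. a noise or velocity prediction)
        self.unpatchify = nn.Linear(dim, latent_channels * patch_size * patch_size)
        self.patch_size = patch_size

    def forward(self, latents, t):
        B, C, H, W = latents.shape
        p = self.patch_size
        tokens = self.patchify(latents).flatten(2).transpose(1, 2)    # (B, N, dim), N = (H/p)*(W/p)
        tokens = tokens + self.time_embed(t.view(B, 1)).unsqueeze(1)  # timestep conditioning
        tokens = self.transformer(tokens)
        patches = self.unpatchify(tokens)                             # (B, N, C*p*p)
        h, w = H // p, W // p
        out = patches.transpose(1, 2).reshape(B, C, p, p, h, w)
        out = out.permute(0, 1, 4, 2, 5, 3).reshape(B, C, H, W)       # fold patches back into a latent image
        return out

model = MinimalDiT()
latents, t = torch.randn(2, 4, 32, 32), torch.rand(2)  # toy VAE latents and timesteps in [0, 1]
prediction = model(latents, t)                          # same shape as the input latents
```

In practice, DiT-style models also take text embeddings as conditioning and use more elaborate mechanisms such as adaptive normalization, but the patch-token structure above is the essential difference from a U-Net.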
Additionally, Stable Diffusion 3.0 benefits from flow matching, a novel training method for Continuous Normalizing Flows (CNFs) that effectively models complex data distributions. Researchers indicate that employing Conditional Flow Matching (CFM) with optimal transport paths results in faster training, more efficient sampling, and improved performance compared to conventional diffusion methods.
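For a sense of what flow matching training involves, below is a brief, hypothetical sketch of a CFM training step using a straight-line (optimal-transport) conditional path, with the path's minimum noise scale taken as zero for brevity. The function name and model interface are placeholders rather than Stability AI's training code.

```python
# Illustrative conditional flow matching (CFM) step with a straight-line conditional path.
import torch

def cfm_training_step(model, x1, optimizer):
    """One CFM step: regress the model's velocity field onto the target velocity
    of a straight path from noise x0 to data x1."""
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)   # uniform timesteps in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))        # broadcast t over spatial dims
    xt = (1 - t_) * x0 + t_ * x1                    # point on the straight noise-to-data path
    target_velocity = x1 - x0                       # time derivative of the path
    pred_velocity = model(xt, t)                    # network predicts the velocity field
    loss = torch.mean((pred_velocity - target_velocity) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A network like the MinimalDiT sketch above could be passed in as model; the key point is that the network learns a velocity field along simple noise-to-data paths rather than reversing a fixed noising schedule step by step, which is what enables the faster training and more efficient sampling the researchers describe.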
The model demonstrates clear progress in typography, allowing for more coherent narratives and stylistic choices within generated images. “This improvement is due to both the transformer architecture and additional text encoders,” Mostaque noted. “Full sentences are now possible, as is coherent style.”
While Stable Diffusion 3.0 is initially showcased as a text-to-image AI, it serves as the foundation for future innovations. Stability AI plans to expand into 3D and video generation capabilities in the coming months. “We create open models that can be utilized and adapted for various needs,” Mostaque concluded. “This series of models across sizes will underpin the development of our next-generation visual solutions, including video, 3D, and more.”