Stability AI is advancing its vision for generative AI with the launch of the Stable Audio 2.0 model.
While the company is widely recognized for its text-to-image Stable Diffusion models, it’s expanding its portfolio. Stable Audio initially debuted in September 2023, allowing users to create short audio clips based on text prompts. With Stable Audio 2.0, users can now generate high-quality audio tracks of up to three minutes—double the length of the original 90 seconds.
In addition to text-to-audio generation, Stable Audio 2.0 introduces audio-to-audio capabilities, enabling users to upload samples and use them as prompts. The model is currently available for limited free use on the Stable Audio website, with API access coming soon for developers looking to build innovative services.
The release of Stable Audio 2.0 marks Stability AI's first major update since the abrupt resignation of former CEO and founder Emad Mostaque in March. The company has framed the launch as a signal that its business is operating as usual despite the leadership change.
Improvements from Stable Audio 1.0 to 2.0
The development of Stable Audio 2.0 has drawn valuable insights from its predecessor, Stable Audio 1.0. Zach Evans, head of audio research at Stability AI, noted that the focus during the initial release was to launch a groundbreaking model with superior audio fidelity and meaningful output duration.
“Since then, we’ve focused on enhancing musicality, extending output duration, and improving responsiveness to detailed prompts,” Evans stated. “These enhancements aim to make the technology more applicable in real-world scenarios.”
Stable Audio 2.0 can now produce full musical tracks featuring coherent structures. Utilizing latent diffusion technology, the model can generate compositions lasting up to three minutes, complete with distinct intro, development, and outro sections—a significant upgrade from its earlier ability to create only short loops or fragments.
The Technology Behind Stable Audio 2.0
Stable Audio 2.0 continues to leverage a latent diffusion model (LDM). Following the December 2023 beta release of Stable Audio 1.1, the model incorporated a transformer backbone, resulting in a “diffusion transformer” architecture.
“We enhanced the data compression applied to audio during training, allowing us to scale outputs up to three minutes and beyond while maintaining efficient inference times,” Evans added.
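The mechanics described above can be illustrated with a toy sketch: noise in a compressed latent space is iteratively denoised, and only afterward would a decoder turn the latents back into a waveform. This is purely conceptual; Stable Audio 2.0's weights are not public, so the denoiser, compression factor, and latent dimensions below are all illustrative assumptions, not the real model.

```python
import numpy as np

# Conceptual sketch of latent-diffusion sampling over compressed audio latents.
# All constants and the "denoiser" are hypothetical stand-ins, not Stable Audio's.

SAMPLE_RATE = 44_100
COMPRESSION = 2048            # assumed latent downsampling factor (illustrative)
SECONDS = 180                 # up to three minutes of audio
LATENT_LEN = SAMPLE_RATE * SECONDS // COMPRESSION
LATENT_DIM = 64               # assumed latent channel count (illustrative)

def toy_denoiser(z, t, prompt_embedding):
    """Stand-in for the transformer backbone: predicts noise to subtract.
    A real diffusion transformer would attend over the latent sequence
    and condition on the text-prompt embedding."""
    return z - 0.1 * prompt_embedding

def sample(prompt_embedding, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((LATENT_LEN, LATENT_DIM))  # start from pure noise
    for t in range(steps, 0, -1):
        eps = toy_denoiser(z, t, prompt_embedding)
        z = z - eps / steps        # one denoising step in latent space
    return z  # a decoder (not shown) would map latents back to a waveform

latents = sample(prompt_embedding=np.ones(LATENT_DIM))
print(latents.shape)
```

The key point the sketch captures is the efficiency claim in the quote: the model never diffuses over 7.9 million raw samples per channel, only over a latent sequence thousands of times shorter, which is what keeps three-minute generations tractable.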
Enhanced Creative Capabilities
With Stable Audio 2.0, users can generate audio not only from text prompts but also from uploaded audio samples. Natural language instructions can be used to creatively transform these sounds, enabling iterative refinement and editing.
The model also broadens the spectrum of sound effects and textures it can produce. Users can now prompt it to create immersive soundscapes, ambient noise, crowds, cityscapes, and more. Additionally, it allows modifications to the style and tone of both generated and uploaded audio.
Addressing Copyright Concerns in Generative AI Audio
Copyright considerations remain a significant issue in the generative AI space. Stability AI is committed to upholding intellectual property rights with its new audio model. To alleviate copyright concerns, Stable Audio 2.0 has been exclusively trained on licensed data from AudioSparx, and it respects opt-out requests. Content recognition technology monitors audio uploads to prevent the processing of copyrighted material.
Safeguarding copyright is essential for Stability AI to successfully commercialize Stable Audio and ensure safe usage for organizations. Currently, Stable Audio generates revenue through subscriptions to its web application, with an API set to launch soon.
However, Stable Audio is not an open model at this time. “The weights for Stable Audio 2.0 will not be available for download, but we are developing open audio models for release later this year,” Evans confirmed.