Meta Releases Open Source Framework for Sound and Music Generation

The day is nearing when generative AI will not only write text and create images in a convincingly human-like way, but also compose music and sounds that approach professional quality.

This morning, Meta unveiled AudioCraft, a framework for generating what it describes as “high-quality” and “realistic” audio and music from short text prompts. This isn’t Meta’s first venture into audio generation; it open-sourced the AI-driven music generator MusicGen in June. But the company claims to have made significant advances that improve the quality of generated sounds, such as barking dogs, honking cars, and footsteps on wooden floors.

In a blog post shared with the public, Meta explains that AudioCraft aims to streamline the use of generative audio models in comparison to previous technologies in the field (including Riffusion, Dance Diffusion, and OpenAI’s Jukebox). The open-source code for AudioCraft includes a suite of sound and music generators, along with compression algorithms that facilitate the creation and encoding of audio without the hassle of navigating multiple codebases.

AudioCraft encompasses three generative AI models: MusicGen, AudioGen, and EnCodec.

While MusicGen isn't new, Meta has now released its training code, allowing users to adapt the model using their own music datasets. This raises important ethical and legal considerations since MusicGen “learns” from existing music to produce similar styles, a fact that doesn't sit well with all artists or users of generative AI.

A growing number of homemade tracks that use generative AI to evoke familiar sounds are going viral. Music labels have been quick to flag these tracks to streaming platforms, citing intellectual property violations, and have generally prevailed. Still, there remains ambiguity around whether “deepfake” music infringes the copyrights of artists, labels, and other rights holders.

Meta clarifies that the pretrained, ready-to-use version of MusicGen was developed using “Meta-owned and specifically licensed music,” comprising 20,000 hours of audio—400,000 recordings paired with text descriptions and metadata—from the Meta Music Initiative Sound Collection, Shutterstock’s music library, and Pond5, a significant stock media resource. Additionally, vocals were removed from the training data to prevent the model from mimicking artists' voices. Although the terms of use for MusicGen advise against “out-of-scope” applications beyond research, Meta does not explicitly prohibit commercial use.
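For a sense of how the pretrained, ready-to-use model is driven, here is a minimal generation sketch based on the usage example in the AudioCraft repository; the checkpoint name ('facebook/musicgen-small'), the prompts, and the output file names are illustrative and may differ across library versions:

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a pretrained MusicGen checkpoint (the small variant keeps memory modest).
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Ask for roughly eight seconds of audio per prompt.
model.set_generation_params(duration=8)

descriptions = [
    'lo-fi hip hop beat with mellow piano',
    'upbeat acoustic folk with hand claps',
]
wav = model.generate(descriptions)  # one waveform tensor per text prompt

for idx, one_wav in enumerate(wav):
    # Write each clip to disk with loudness normalization.
    audio_write(f'musicgen_sample_{idx}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```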

The second model in AudioCraft, AudioGen, generates environmental sounds and sound effects rather than music. Like contemporary text-to-image generators (OpenAI’s DALL-E 2, Google’s Imagen, Stable Diffusion), it turns a written prompt into new content, but it doesn’t rely on diffusion: AudioGen compresses audio into discrete tokens with EnCodec, trains an autoregressive transformer language model over those tokens, and decodes the predicted tokens back into a waveform that matches the prompt.

Given a text description of an acoustic scene, AudioGen can, Meta claims, create environmental sounds with “realistic recording conditions” and “complex scene content.” However, we weren’t given the chance to test AudioGen or hear its samples before the model’s official launch. A whitepaper published alongside AudioGen indicates that it can also generate speech from prompts, a product of its diverse training data.
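To illustrate, here is a minimal sketch of generating sound effects with AudioGen, following the same calling convention the AudioCraft repository documents for its models; the 'facebook/audiogen-medium' checkpoint name and the prompts are assumptions that may not match every release:

```python
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

# Load a pretrained AudioGen checkpoint for environmental-sound generation.
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # seconds of audio per prompt

descriptions = [
    'a dog barking in the distance while light rain falls',
    'footsteps on a creaky wooden floor',
]
wav = model.generate(descriptions)  # one waveform tensor per description

for idx, one_wav in enumerate(wav):
    audio_write(f'audiogen_sample_{idx}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```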

In the whitepaper, Meta acknowledges that AudioGen could be misused to deepfake a person’s voice. As with MusicGen, the ethical concerns are pronounced, yet Meta isn’t imposing stringent restrictions on how AudioGen, or its training code, can be used, for better or worse.

The last model in AudioCraft, EnCodec, is an improvement over an earlier Meta release; it is the neural audio codec the generators decode through, and it yields higher-quality music with fewer artifacts. Meta says it models audio sequences more efficiently, capturing different levels of information from the training audio waveforms, which the generative models then use to build novel sounds.

“EnCodec is a lossy neural codec specifically trained to compress all types of audio while reconstructing the original signal with high fidelity,” Meta explains in the blog post. “By utilizing different streams to capture various levels of audio waveform information, it allows for high-fidelity reconstruction.”
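The codec can also be exercised on its own. The sketch below uses the standalone encodec package Meta published for this research to compress a clip into discrete codes and reconstruct it; 'input.wav' is a placeholder path, and exact helper names may vary between versions:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the 24 kHz EnCodec model and pick a target bitrate in kbps.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load a clip and convert it to the codec's expected sample rate and channel count.
wav, sr = torchaudio.load('input.wav')  # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # add a batch dimension

with torch.no_grad():
    encoded_frames = model.encode(wav)            # compress into streams of discrete codes
    reconstructed = model.decode(encoded_frames)  # reconstruct the waveform from the codes
```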

So, what conclusions can we draw regarding AudioCraft? Meta highlights its potential benefits, such as inspiring musicians and facilitating innovative composition processes. However, as we’ve seen with the emergence of image and text generators, potential drawbacks—and likely legal challenges—linger in the background.

Despite these risks, Meta intends to continue exploring improved controllability and performance enhancements for generative audio models, as well as addressing the limitations and biases inherent in such technologies. Notably, MusicGen reportedly underperforms on descriptions in languages other than English and cultural styles outside the Western canon, owing to clear biases in its training data.

“Rather than obscuring our work as a black box, we believe in transparency regarding the development of these models, ensuring that they're accessible for researchers and the music community alike. This approach helps users understand the capabilities and limitations of the technology so they can empower themselves to utilize it effectively,” Meta writes in the blog post. “With advancements in control mechanisms, we hope these models can be valuable for both amateur and professional musicians.”
