Voice Cloning: The Future of AI Audio Generation
Voice cloning is a rapidly developing field within generative AI. It involves replicating a person's vocal characteristics, such as pitch, timbre, rhythm, mannerisms, and unique pronunciations, using generative models. Startups like ElevenLabs have attracted significant funding for this purpose, while Meta Platforms, the parent company of Facebook, Instagram, WhatsApp, and Oculus VR, has introduced its own free voice cloning tool, Audiobox, albeit with some limitations.
Introducing Audiobox
Unveiled by researchers at the Facebook AI Research (FAIR) lab, Audiobox is described as a "foundation research model for audio generation," building on prior work with Voicebox. According to the Audiobox webpage, "It can generate voices and sound effects using a combination of voice inputs and natural language text prompts, making it easy to create custom audio for various use cases."
Users can simply type a sentence for a cloned voice to say or describe a sound they wish to generate. Alternatively, they can record their own voice and have it cloned by Audiobox.
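Audiobox itself is a demo with no public API, but the same record-then-speak workflow exists in open-source tools. The sketch below uses the Coqui TTS library and its XTTS v2 voice-cloning model rather than Audiobox, and the file names are placeholders:

```python
# Voice cloning workflow sketch using the open-source Coqui TTS
# library (XTTS v2), NOT Audiobox: record a short reference clip,
# then synthesize arbitrary text in the cloned voice.
# pip install TTS
from TTS.api import TTS

# Load a multilingual voice-cloning model (downloads on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# "my_voice.wav" is a placeholder for a short recording of the
# speaker to clone; a few seconds of clean speech is enough.
tts.tts_to_file(
    text="This sentence will be spoken in the cloned voice.",
    speaker_wav="my_voice.wav",
    language="en",
    file_path="cloned_output.wav",
)
```

As with Audiobox's demo, this style of zero-shot cloning conditions on the reference recording at generation time rather than retraining the model for each speaker.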
A Family of Audio-Generating Models
Meta has developed a “family of models,” including one for speech mimicry and another for ambient sound effects like dogs barking or sirens, all built on a shared self-supervised learning (SSL) model, Audiobox SSL.
Self-supervised learning is a deep learning technique where AI algorithms generate their own labels for unlabeled data, unlike supervised learning that relies on pre-labeled data. The researchers' paper explains their SSL approach, emphasizing that "labeled data are not always available or high quality; hence, our strategy is to train using audio without supervision, such as transcripts or captions."
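Audiobox SSL's specifics are Meta's own, but the underlying masked-prediction idea can be sketched generically: hide part of an audio feature sequence and train a model to reconstruct it, so the audio itself supplies the labels. The toy PyTorch model below illustrates that pattern; its architecture, dimensions, and 30% masking rate are illustrative assumptions, not Audiobox's actual design.

```python
# Toy illustration of self-supervised masked prediction on audio
# features. This is NOT Meta's Audiobox SSL code; the architecture
# and masking scheme here are simplified assumptions.
import torch
import torch.nn as nn

class MaskedAudioModel(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.proj_in = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.proj_out = nn.Linear(d_model, n_mels)

    def forward(self, mel: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Zero out masked frames; the model must reconstruct them
        # from the surrounding unmasked context.
        x = self.proj_in(mel * (~mask).unsqueeze(-1))
        return self.proj_out(self.encoder(x))

model = MaskedAudioModel()
mel = torch.randn(8, 200, 80)    # batch of mel-spectrogram clips
mask = torch.rand(8, 200) < 0.3  # mask roughly 30% of frames
pred = model(mel, mask)

# The "label" is the audio itself: loss is computed only on the
# masked frames, with no transcripts or captions involved.
loss = ((pred - mel) ** 2)[mask].mean()
loss.backward()
```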
Leading generative AI models, including Audiobox, often depend on human-generated data for training. In this case, the FAIR researchers used "160K hours of speech (primarily English), 20K hours of music, and 6K hours of sound samples." The speech data encompasses audiobooks, podcasts, conversations, and recordings made in a variety of acoustic environments, with speakers from over 150 countries and more than 200 primary languages.
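For a sense of scale, those figures imply a mix heavily weighted toward speech. The snippet below simply converts the paper's reported hours into rough proportions; how Meta actually sampled across the three categories is not disclosed.

```python
# Audiobox training data as reported in the paper, in hours.
hours = {"speech": 160_000, "music": 20_000, "sound": 6_000}

total = sum(hours.values())  # 186,000 hours in total
for kind, h in hours.items():
    print(f"{kind}: {h / total:.1%} of the mix")
# speech: 86.0%, music: 10.8%, sound: 3.2%
```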
The research paper doesn't specify the sources of this data, which touches on a contentious issue: content creators and rights holders have objected to AI companies training models on potentially copyrighted material without consent. Meta stated in an email that "Audiobox was trained on publicly available and licensed datasets," but did not disclose specific sources.
Try Audiobox Yourself
Meta provides interactive demos showcasing Audiobox's capabilities, allowing users to record their voice, generate a cloned voice, and then input text for that voice to speak. In my experience, the resulting audio was strikingly similar to my own voice—confirmed by family members who heard it without knowing its origin.
Users can also create entirely new voices based on text descriptions like "deep feminine voice" or "high-pitched masculine speaker from the U.S.," and generate various sounds, such as dog barks. I tested this feature with "dogs barking" and received two convincing results.
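Audiobox's sound-effects mode is likewise demo-only, but Meta has separately open-sourced comparable text-to-sound generation in its AudioCraft library. The sketch below uses AudioCraft's AudioGen model, not Audiobox, following the library's documented usage:

```python
# Text-to-sound sketch using Meta's open-source AudioCraft library
# (the AudioGen model), a separate project from Audiobox.
# pip install audiocraft
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5)  # seconds of audio per clip

# Generate clips from natural language descriptions, as in the demo.
wavs = model.generate(["dogs barking", "a siren passing by"])
for i, one_wav in enumerate(wavs):
    # Save each clip as a loudness-normalized WAV file.
    audio_write(f"sound_{i}", one_wav.cpu(), model.sample_rate,
                strategy="loudness")
```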
However, there is a significant catch: a disclaimer states that “this is a research demo and may not be used for any commercial purposes.” Moreover, the demo is unavailable to users in Illinois and Texas, whose biometric privacy laws restrict the collection of voice data.
Future of Audiobox and AI Audio Generation
Like Meta's recent Imagine by Meta AI image generation tool, Audiobox is not open source, a departure from the commitment to openness Meta established with its Llama 2 family of large language models (LLMs). A Meta spokesperson indicated that the company plans to invite researchers and academic institutions to apply for grants aimed at safety and responsibility research with Audiobox.
Currently, Audiobox cannot be used for commercial purposes, nor is it available to residents of two of the U.S.'s most populous states. However, as AI technology continues to evolve rapidly, we can anticipate the emergence of commercial versions—regardless of whether they come from Meta or other developers.