Today marks a significant step toward a future where celebrity voices could be preserved indefinitely in software. Meta has announced Voicebox, a generative text-to-speech model that aims to do for audio what ChatGPT and DALL-E did for text and images. Rather than producing text or pictures, Voicebox generates high-quality audio clips.
Meta describes Voicebox as “a non-autoregressive flow-matching model trained to infill speech, given audio context and text.” It was trained on more than 50,000 hours of unfiltered audio, specifically recordings and transcripts from public domain audiobooks in English, French, Spanish, German, Polish, and Portuguese. According to Meta, this diverse dataset lets the model produce more natural, conversational-sounding speech in each of those languages.
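To make the "flow-matching" part of that description more concrete, here is a minimal sketch of how a training example is typically built in the flow matching literature: a noise sample is blended with a data sample along a straight line, and the model is asked to predict the direction of travel. The function names, the `sigma_min` value, and the toy 80-value "frame" are illustrative assumptions, not details confirmed by Meta.

```python
import random

def flow_matching_pair(x0, x1, t, sigma_min=1e-5):
    """Build one training example for a straight-line probability path.

    x0: noise sample, x1: data sample (e.g. one frame of audio features),
    t: a time in [0, 1]. Returns the blended point x_t and the velocity
    the model should predict at that point.
    """
    keep = 1.0 - (1.0 - sigma_min) * t              # share of noise left at time t
    x_t = [keep * n + t * d for n, d in zip(x0, x1)]
    target = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return x_t, target

def flow_matching_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

random.seed(0)
x1 = [random.gauss(0.0, 1.0) for _ in range(80)]   # stand-in for real audio features
x0 = [random.gauss(0.0, 1.0) for _ in range(80)]   # Gaussian noise
x_t, target = flow_matching_pair(x0, x1, t=0.3)
```

At generation time, a model trained this way starts from noise and follows its predicted velocities toward a data sample, producing the whole segment at once rather than one step after another, which is what "non-autoregressive" refers to.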
Meta’s research indicates that speech recognition models trained on Voicebox-generated synthetic speech perform nearly as well as those trained on real speech. Their error rate degrades by only 1 percent, compared with the 45 to 70 percent degradation seen when training on synthetic speech from existing text-to-speech (TTS) models.
Voicebox’s core strength is infilling: predicting a missing stretch of speech from the surrounding audio and the transcript. That lets it regenerate a portion of an existing recording without the whole thing having to be re-recorded. It can also edit audio clips, removing background noise and correcting mispronounced words. Users identify and crop out the noisy segment, then instruct the model to regenerate just that part, much as photo-editing software retouches one area of an image.
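The editing workflow described above can be sketched in a few lines. Since Voicebox itself is not available, the example below only shows the bookkeeping around a model: hiding the flagged span, then splicing regenerated audio back in. The helper names `mask_for_infill` and `splice` are hypothetical, and the "regenerated" values are a placeholder standing in for real model output.

```python
def mask_for_infill(frames, start, end, fill=0.0):
    """Hide frames[start:end] so a model can regenerate them from context."""
    if not 0 <= start <= end <= len(frames):
        raise ValueError("span must lie inside the recording")
    mask = [start <= i < end for i in range(len(frames))]
    context = [fill if hidden else value for value, hidden in zip(frames, mask)]
    return context, mask

def splice(original, generated, mask):
    """Keep original frames outside the mask and generated frames inside it."""
    return [g if hidden else o for o, g, hidden in zip(original, generated, mask)]

frames = [0.1, 0.2, 0.9, 0.8, 0.3, 0.4]    # toy per-frame values; frames 2-3 are "noisy"
context, mask = mask_for_infill(frames, start=2, end=4)
regenerated = [0.25] * len(frames)          # placeholder for what a model would return
clean = splice(frames, regenerated, mask)
```

Only the masked frames change, so everything the user did not flag survives untouched.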
Text-to-speech generators have existed for some time and power things like GPS navigation voices, but modern tools such as Speechify and ElevenLabs’ Prime Voice AI typically demand extensive source material to mimic a voice accurately. Voicebox does not. Its zero-shot text-to-speech capability, built on a training method Meta calls Flow Matching, removes the need for a large dataset for each voice.
In Meta’s benchmarks, Voicebox outperforms the current state of the art in both intelligibility (a word error rate of 1.9 percent versus 5.9 percent) and audio similarity (a composite score of 0.681 versus 0.580). It also runs up to 20 times faster than today’s leading TTS systems.
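For readers unfamiliar with the intelligibility metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript of the generated audio into the reference text, divided by the number of words in the reference. The sketch below is a generic implementation of that definition, not Meta's evaluation code, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# One wrong word out of five gives a word error rate of 20 percent.
wer = word_error_rate("the quick brown fox jumps", "the quick brown box jumps")
```

By this measure, a 1.9 percent rate means that roughly two words in every hundred are transcribed incorrectly.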
However, neither the Voicebox app nor its source code is being released to the public at this time. Meta cites concerns over potential misuse, even as it acknowledges the promising applications of generative speech models.