Today, Dubai-based Camb AI, a startup specializing in AI-driven content localization technologies, unveiled Mars5, an advanced AI model for voice cloning.
While many models, such as those from ElevenLabs, can create digital voice replicas, Camb AI sets itself apart with Mars5’s unparalleled realism. According to initial samples from the company, Mars5 not only mimics the original voice but also captures intricate prosodic elements such as rhythm, emotion, and intonation.
Camb AI supports nearly four times as many languages as ElevenLabs, offering over 140 languages—including less commonly spoken ones like Icelandic and Swahili—compared to ElevenLabs’ 36. The open-source, English-only version of Mars5 is available on GitHub starting today, while the broader language support can be accessed through Camb’s paid Studio.
“The level of prosody and realism that Mars5 captures with just a few seconds of input is unprecedented. This marks a groundbreaking moment in speech technology,” said Akshat Prakash, co-founder and CTO.
Integrating Voice Cloning and Text-to-Speech
Traditionally, voice cloning and text-to-speech are separate processes: voice cloning creates a synthetic voice from audio samples, while text-to-speech uses that voice to read text. However, Mars5 integrates both capabilities into a single platform. Users simply upload an audio file—lasting between a few seconds and a minute—and provide the text to be synthesized. The model analyzes the audio to replicate the speaker’s voice, style, emotion, and meaning, transforming the text into natural-sounding speech.
Camb AI claims Mars5 adeptly captures a wide range of emotional tones, addressing complex speech situations such as frustration, command, calmness, or enthusiasm. This versatility makes Mars5 ideal for traditionally challenging content, such as sports commentary, films, and anime.
To achieve this level of prosody, Mars5 combines a Mistral-style ~750M-parameter autoregressive (AR) model with an innovative ~450M-parameter non-autoregressive (NAR) multinomial diffusion model, operating on 6 kbps EnCodec tokens.
“The AR model predicts the most basic codebook values for the encodec features, while the NAR model refines these predictions, ‘inpainting’ the remaining codebook entries. This approach employs a denoising diffusion process for enhanced accuracy,” Prakash elaborated.
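The two-stage decoding Prakash describes can be sketched with toy tensors. The dimensions below (8 codebooks of 1,024 entries at 75 frames per second, consistent with EnCodec at 6 kbps) are assumptions for illustration, and the random sampling is a placeholder for the learned models; only the control flow reflects the described design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions for a 6 kbps EnCodec token grid:
# 8 codebooks x 1,024 entries, 75 token frames per second of audio.
N_CODEBOOKS, CODEBOOK_SIZE, FRAMES = 8, 1024, 75

def ar_coarse_pass(n_frames: int) -> np.ndarray:
    """Stand-in for the ~750M AR model: predicts the first (coarsest)
    codebook one frame at a time, left to right."""
    coarse = np.empty(n_frames, dtype=np.int64)
    for t in range(n_frames):
        # A real model would condition on coarse[:t], the text, and the
        # voice prompt; here we sample uniformly as a placeholder.
        coarse[t] = rng.integers(CODEBOOK_SIZE)
    return coarse

def nar_refine_pass(coarse: np.ndarray, steps: int = 4) -> np.ndarray:
    """Stand-in for the ~450M NAR diffusion model: 'inpaints' the remaining
    codebooks in parallel, refining all frames over a few denoising steps."""
    tokens = np.full((N_CODEBOOKS, coarse.shape[0]), -1, dtype=np.int64)
    tokens[0] = coarse  # the AR-predicted coarse codebook stays fixed
    for _ in range(steps):
        # Each step re-predicts every remaining entry given the current
        # estimate; a uniform resample stands in for the learned denoiser.
        tokens[1:] = rng.integers(CODEBOOK_SIZE, size=tokens[1:].shape)
    return tokens

tokens = nar_refine_pass(ar_coarse_pass(FRAMES))
print(tokens.shape)  # (8, 75): a full token grid, ready for the audio decoder
```

The key design point is the split: the sequential AR pass fixes the coarse acoustic outline, while the parallel NAR pass fills in the fine-grained codebooks in a few iterations rather than frame by frame.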
Performance Compared to Other Models
While specific benchmark statistics are pending, early tests suggest Mars5 outperforms popular speech synthesis models, including Metavoice and ElevenLabs, often producing results that resemble the original voice more closely than its competitors.
“Although ElevenLabs has trained on a significantly larger dataset of over 500K hours, our model design captures the nuances of speech more effectively. As we expand our datasets and further train Mars5—releasing updates on GitHub—we anticipate even greater improvements,” added Prakash.
Camb AI is also preparing to release another open-source model called Boli, designed for translation that understands context, ensures grammatical accuracy, and captures colloquial nuances.
“Boli exceeds traditional translation tools like Google Translate in delivering nuanced, culturally relevant translations, particularly for low- to medium-resource languages. Feedback suggests Boli significantly outperforms mainstream tools, including cutting-edge generative models like ChatGPT,” Prakash stated.
Currently, both Mars5 and Boli support 140 languages on Camb’s proprietary platform, Camb Studio, and the company is offering these capabilities as APIs to enterprises, SMEs, and developers. Camb AI collaborates with Major League Soccer, Tennis Australia, and Maple Leaf Sports & Entertainment, as well as leading film and music studios and various government agencies.
Notably, Camb AI made history by live-dubbing a Major League Soccer game into four languages simultaneously for over two hours, as well as translating the Australian Open’s post-match conference into multiple languages and converting the psychological thriller “Three” from Arabic to Mandarin.