Startups such as ElevenLabs have invested millions in developing proprietary algorithms and AI software for voice cloning: audio programs that replicate a user's voice.
Now, researchers from the Massachusetts Institute of Technology (MIT), Tsinghua University in Beijing, and members of AI startup MyShell introduce OpenVoice—an open-source voice cloning solution that boasts nearly instantaneous results and offers granular controls not found in other platforms.
“Clone voices with unparalleled precision, adjusting tone, emotion, accent, rhythm, pauses, and intonation from just a small audio clip,” states MyShell in their recent post on X.
The company shared a link to their research paper detailing the development of OpenVoice, along with access points for users to try it: the MyShell web app (user account required) and HuggingFace (public access without an account).
In an email, lead researcher Zengyi Qin from MIT and MyShell emphasized the project's goal: “MyShell aims to benefit the research community. OpenVoice is just the beginning. In the future, we will provide grants, datasets, and computing power to support open-source research. Our core mission is ‘AI for All.’”
Regarding the motivation behind OpenVoice, Qin explained: “Language, vision, and voice are three key modalities for future Artificial General Intelligence (AGI). While there are various open-source models for language and vision, a powerful, instant voice cloning model for customization was lacking, which is why we undertook this project.”
Using OpenVoice
In informal tests on HuggingFace, I quickly generated a convincing, if somewhat robotic, replica of my voice from unscripted speech. Unlike other voice cloning applications, OpenVoice allowed me to speak freely without adhering to a specific script. In mere seconds, I had a voice clone that accurately read back my text prompt.
Additionally, I could adjust the "style" of the clone among different emotional presets, such as cheerful, sad, or angry, effectively changing the tone.
Here’s a sample of my voice clone using OpenVoice set to a "friendly" tone.
How OpenVoice Was Created
The creators of OpenVoice—Qin, Wenliang Zhao and Xumin Yu from Tsinghua University, and Xin Sun from MyShell—outlined their method in their research paper. OpenVoice consists of two key AI models: a text-to-speech (TTS) model and a tone converter.
The TTS model manages style parameters and languages. It was trained on 30,000 sentences from two English speakers (with American and British accents), one Chinese speaker, and one Japanese speaker, with each sentence labeled by emotion. From this data it learned nuances such as intonation, rhythm, and pauses.
The tone converter was trained on over 300,000 audio samples from more than 20,000 speakers. Audio from spoken language is converted into phonemes—distinct sounds that differentiate words—and represented as vector embeddings.
By utilizing a "base speaker" for the TTS model, in combination with tone information from user input, these models can replicate the user’s voice and adapt its emotional expression. The diagram in the OpenVoice research illustrates how these models integrate.
Although conceptually simple, the two-model method is efficient and requires significantly fewer computing resources than competitors such as Meta's Voicebox.
Qin shared, “We aimed to develop the most flexible instant voice cloning model. This flexibility means control over styles, emotions, accents, and adaptability to any language. Previously, such comprehensive functionality was unattainable due to its complexity. Through a decoupled pipeline process, we achieved effective outcomes with simplicity.”
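The decoupled, two-stage idea described above can be illustrated with a toy sketch. This is not OpenVoice's code or API: every name here (Utterance, BASE_TONE, base_speaker_tts, extract_tone, convert_tone, clone_voice) is hypothetical, and "audio" is faked as short lists of numbers. The only point is to show the division of labor: the first stage fixes the text and style in a base speaker's voice, and the second stage swaps in the tone taken from a reference clip while leaving the style alone.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class Utterance:
    style: str            # emotion/accent preset chosen at the TTS stage
    frames: List[Vector]  # toy "audio": one feature vector per frame


# Hypothetical tone embedding of the built-in base speaker (made-up values).
BASE_TONE: Vector = [0.5, -0.2, 0.1]


def base_speaker_tts(text: str, style: str) -> Utterance:
    """Stage 1 (toy): render the text in the base speaker's voice with a style."""
    frames = []
    for ch in text:
        # Fake content feature derived from the character; not real acoustics.
        content = [(ord(ch) % 7) / 10.0] * len(BASE_TONE)
        frames.append([c + t for c, t in zip(content, BASE_TONE)])
    return Utterance(style=style, frames=frames)


def extract_tone(reference_frames: List[Vector]) -> Vector:
    """Toy tone extractor: average the frames of the user's reference clip."""
    n = len(reference_frames)
    dims = len(reference_frames[0])
    return [sum(frame[i] for frame in reference_frames) / n for i in range(dims)]


def convert_tone(utt: Utterance, target_tone: Vector) -> Utterance:
    """Stage 2 (toy): remove the base tone, add the target tone, keep the style."""
    frames = [
        [x - b + t for x, b, t in zip(frame, BASE_TONE, target_tone)]
        for frame in utt.frames
    ]
    return Utterance(style=utt.style, frames=frames)


def clone_voice(text: str, style: str, reference_frames: List[Vector]) -> Utterance:
    """Run both stages: styled base speech first, then tone conversion."""
    return convert_tone(base_speaker_tts(text, style), extract_tone(reference_frames))


if __name__ == "__main__":
    reference_clip = [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]]  # stand-in for a user recording
    result = clone_voice("hi", "cheerful", reference_clip)
    print(result.style, len(result.frames))
```

Because style is set only in the first stage and tone only in the second, either can be changed without touching the other, which is the kind of flexibility Qin attributes to the decoupled pipeline.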
Behind OpenVoice
MyShell, established in 2023 with a $5.6 million seed round led by INCE Capital alongside contributions from Folius Ventures, Hashkey Capital, SevenX Ventures, TSVC, and OP Crypto, has already garnered over 400,000 users, as reported by The SaaS News. While researching, I observed over 61,000 users on their Discord server.
MyShell describes itself as a “decentralized and comprehensive platform for discovering, creating, and staking AI-native applications.” Besides OpenVoice, their web app features various text-based AI characters and bots with distinct personalities, akin to Character.AI, and includes tools such as an animated GIF maker and user-generated RPGs based on popular franchises.
As for monetization, MyShell charges a monthly subscription for web app users and for third-party bot creators wishing to promote their products within the app. They also charge for AI training data.
Correction: Thursday, January 4, 2024 – The piece was updated to clarify that MyShell is not based in Calgary, AB, Canada.