In its pursuit of creating AI that comprehends diverse dialects, Meta has developed SeamlessM4T, an AI model capable of translating and transcribing nearly 100 languages across text and speech formats. This groundbreaking model is available as open source alongside SeamlessAlign, a newly introduced translation dataset. Meta characterizes SeamlessM4T as a "significant breakthrough" in AI-enhanced speech-to-speech and speech-to-text translation.
“Our unified model offers on-demand translations, facilitating more effective communication among speakers of different languages,” Meta states in a recent blog post. “SeamlessM4T can automatically recognize source languages without requiring a separate language identification model.”
SeamlessM4T acts as a spiritual successor to Meta’s No Language Left Behind, a text-to-text machine translation initiative, and the Universal Speech Translator, which is among the few systems capable of direct speech-to-speech translation for the Hokkien language. It also builds upon Massively Multilingual Speech, Meta’s framework that encompasses speech recognition, language identification, and speech synthesis technology for over 1,100 languages.
Meta isn't alone in dedicating resources to advanced AI translation and transcription tools. Beyond the many commercial services and open-source models already offered by Amazon, Microsoft, OpenAI, and various startups, Google is working on its Universal Speech Model as part of a broader initiative to understand the world's 1,000 most-spoken languages. Mozilla, meanwhile, has launched Common Voice, one of the largest multilingual voice datasets for training speech recognition algorithms.
Amidst these efforts, SeamlessM4T stands out as one of the most ambitious attempts to integrate translation and transcription functions into a single model. Meta developed this model by analyzing publicly available text and audio data, reportedly amounting to “tens of billions” of sentences and 4 million hours of speech sourced from the internet. In an interview, Juan Pino, a research scientist from Meta’s AI division, refrained from disclosing specific data sources but emphasized the “variety” involved.
However, not all content creators agree with using public data to train models that serve commercial purposes. Several have filed lawsuits against companies that build AI tools on publicly available data, claiming these vendors should offer credit or compensation and provide clear opt-out mechanisms.
Meta asserts that the data it collected, which it acknowledges may include personally identifiable information, was not copyrighted and came primarily from open or licensed sources.
Regardless of the debate, Meta harnessed this scraped text and audio to create the training dataset for SeamlessM4T, named SeamlessAlign. Researchers aligned 443,000 hours of speech with corresponding texts and produced 29,000 hours of “speech-to-speech” alignments, effectively teaching SeamlessM4T to transcribe speech accurately, translate text, generate speech from written words, and translate spoken words across languages.
According to internal evaluations, SeamlessM4T outperformed current leading speech transcription models in handling background noise and speaker variations during speech-to-text tasks. Meta attributes this success to the rich diversity of speech and text data within the training set, suggesting that SeamlessM4T benefits from combining speech and text, unlike speech-only or text-only models.
“With cutting-edge performance, we believe SeamlessM4T marks a vital advancement in the AI community’s journey toward building universal multitasking systems,” Meta noted in the blog post.
Nevertheless, questions arise regarding potential biases ingrained in the model. An article in The Conversation highlights numerous flaws present in AI translation systems, including various forms of gender bias. For example, Google Translate historically assumed that doctors were male while nurses were female in specific languages. Similarly, Bing's translator rendered phrases like “the table is soft” with the feminine “die Tabelle” in German, which refers to a table of figures rather than a piece of furniture.
Speech recognition algorithms often reflect biases as well. A study published in the Proceedings of the National Academy of Sciences indicated that leading companies' speech recognition systems were twice as likely to incorrectly transcribe audio from Black speakers as from white speakers.
SeamlessM4T is not exempt from these issues. In a white paper accompanying the blog post, Meta acknowledges that the model tends to “overgeneralize to masculine forms when translating from neutral terms” and performs more effectively when drawing from masculine references (e.g., the English pronoun “he”) in many languages.
Additionally, when gender context is absent, SeamlessM4T opts for masculine translations about 10% of the time, which Meta speculates may be due to an “overrepresentation of masculine lexica” in the training data.
Meta maintains that SeamlessM4T does not introduce excessive instances of toxic text in its translations, a common issue with AI translation and generative models, but the model is not without flaws. In certain languages, such as Bengali and Kyrgyz, SeamlessM4T produces more toxic translations, particularly regarding socioeconomic status and culture. In general, it is also more likely to generate toxic translations related to sexual orientation and religion.
Meta indicates that the public demo of SeamlessM4T includes a filter to address toxicity in both inputted and generated speech; however, this filter is not included by default in the open-source version of the model.
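Because the open-source release ships without that filter, developers who build on the model would have to supply their own. What follows is a minimal sketch of one possible approach, not Meta's filter: it assumes simple per-language blocklists and a hypothetical `translate` callable standing in for whatever translation function the developer uses. The names `build_blocklist_pattern` and `filtered_translate` are illustrative only.

```python
import re
from typing import Callable, Iterable


def build_blocklist_pattern(terms: Iterable[str]) -> re.Pattern:
    """Compile a case-insensitive, whole-word pattern from blocklisted terms."""
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    if not escaped:
        raise ValueError("blocklist must contain at least one term")
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filtered_translate(
    text: str,
    translate: Callable[[str], str],  # hypothetical translation function
    source_pattern: re.Pattern,
    target_pattern: re.Pattern,
    mask: str = "[filtered]",
) -> str:
    """Mask blocklisted terms in the input, translate, then mask the output."""
    clean_input = source_pattern.sub(mask, text)
    translated = translate(clean_input)
    return target_pattern.sub(mask, translated)
```

A blocklist is a blunt instrument: it misses toxic phrasing built from innocuous words and can flag harmless uses of a listed term, so a production filter would likely need more than this.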
A critical concern not addressed in the white paper is the erosion of lexical richness that can result from excessive reliance on AI translators. Unlike AI, human translators make individual choices in their work. They might make implicit information explicit, normalize phrases, or condense meaning, leaving distinctive traces referred to informally as “translationese.” While AI may produce more “accurate” translations, this could come at the cost of varied and dynamic interpretations.
Given these considerations, Meta advises against using SeamlessM4T for long-form translation and for certified translations, such as those recognized by government agencies and translation authorities. It also recommends against employing the model in medical or legal contexts, presumably as a precaution against potential mistranslations.
This caution is warranted; there have been instances where machine mistranslations led to critical misunderstandings in law enforcement. In September 2012, police mistakenly confronted a Kurdish man for suspected terrorism due to a mistranslated text message. In 2017, a Kansas officer used Google Translate to ask a Spanish speaker for permission to search his car. Because the translation was inaccurate, the driver did not fully understand what he was agreeing to, and the case was eventually dismissed.
“This unified system approach minimizes errors and delays, enhancing the efficiency and quality of the translation process, and brings us closer to achieving seamless communication,” Pino remarked. “Looking ahead, we aim to explore how this foundational model can unlock new communication possibilities—ultimately fostering a world where everyone can be understood.”
Let’s hope that in this envisioned future, the human touch is not entirely overlooked.