Facebook's new machine translation (MT) system can translate directly between any pair of 100 languages without using English as an intermediary. The model, called M2M-100, can translate Chinese to French, for example, in a single step. Traditional systems often pivot through English, translating Chinese to English and then English to French, and each extra step can lose meaning or introduce errors.
Angela Fan, a Facebook AI research associate, emphasized the importance of catering to the diverse linguistic needs of users worldwide. With two-thirds of daily posts on Facebook’s platform originating in languages other than English, the demand for a more effective translation solution is clear.
The M2M-100 model has more than 15 billion parameters and was trained on a data set of 7.5 billion sentences across 100 languages. Because a single model handles every language, it can capture relationships between related languages, which improves translation quality. The initiative builds on years of research and data collection from various sources.
To gather data, Facebook used CommonCrawl, a repository of web crawl data, together with fastText, a text classifier, to identify the language of each piece of text. With large volumes of text sorted by language, the team could then search for pairs of sentences that are translations of each other. Relying on human translators is often impractical, because bilingual speakers are hard to find for less common language pairs.
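The sorting step can be illustrated with a minimal Python sketch. It is not Facebook's pipeline: the `detect_language` function and its tiny stopword lists are hypothetical stand-ins for a trained classifier such as fastText, used here only so the example runs without downloading a model.

```python
from collections import defaultdict

# Hypothetical stopword lists, standing in for a trained language classifier.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "fr": {"le", "la", "et", "est", "les"},
    "es": {"el", "y", "es", "los", "una"},
}

def detect_language(sentence):
    """Return the language whose stopwords overlap the sentence most, or 'unknown'."""
    tokens = set(sentence.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def bucket_by_language(sentences):
    """Group raw crawled sentences into one list per detected language."""
    buckets = defaultdict(list)
    for sentence in sentences:
        buckets[detect_language(sentence)].append(sentence)
    return dict(buckets)

# Toy "crawl" with mixed languages and one line of noise.
crawl = [
    "The cat is on the mat",
    "Le chat est sur le tapis",
    "El gato es negro y blanco",
    "12345",
]
buckets = bucket_by_language(crawl)
```

In a real pipeline the stand-in classifier would be replaced by a trained model, and text that cannot be identified with confidence would probably be discarded rather than kept in an "unknown" bucket.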
To create translation data at scale, Fan's team used the LASER system, which generates mathematical representations of sentences. Sentences with similar meanings in different languages receive similar representations, so matching them yields likely translation pairs.
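The alignment idea can be sketched in a few lines of Python. This is a simplified illustration, not the team's actual mining procedure: the three-dimensional vectors are hand-made stand-ins for real sentence representations, and the matching rule (mutual nearest neighbours by cosine similarity, above an assumed threshold of 0.9) is one simple choice among several.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(vec, pool):
    """Return the sentence in pool whose embedding is most similar to vec."""
    return max(pool, key=lambda sentence: cosine(vec, pool[sentence]))

def mine_pairs(src, tgt, threshold=0.9):
    """Keep (source, target) pairs that are each other's nearest neighbour
    and whose similarity reaches the threshold."""
    pairs = []
    for s, s_vec in src.items():
        t = nearest(s_vec, tgt)
        if nearest(tgt[t], src) == s and cosine(s_vec, tgt[t]) >= threshold:
            pairs.append((s, t))
    return pairs

# Toy embeddings (assumed values, not produced by any real encoder).
zh = {"你好": [0.9, 0.1, 0.0], "谢谢": [0.1, 0.9, 0.1], "今天下雨": [0.0, 0.2, 0.9]}
fr = {"bonjour": [0.88, 0.12, 0.02], "merci": [0.12, 0.85, 0.1],
      "le chat dort": [0.5, 0.5, 0.5]}

pairs = mine_pairs(zh, fr)
```

Here the greeting and the thank-you are matched, while the unrelated sentences on each side are left unpaired because neither is the other's closest match.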
For languages where translated text is scarce, the team also used monolingual data. To improve Chinese-to-French translation, for instance, they took high-quality French text and machine-translated it back into Chinese. The resulting synthetic sentence pairs supplement the training data and improve the model's accuracy.
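Back-translation can be sketched as follows. The example assumes a reverse (French-to-Chinese) translator is available; `toy_fr_to_zh` is a hypothetical dictionary lookup that stands in for such a model, and the function names are illustrative only.

```python
def back_translate(french_sentences, fr_to_zh):
    """Build synthetic (Chinese, French) training pairs.

    The French side is genuine text; the Chinese side is machine-generated
    by the reverse translator passed in as fr_to_zh.
    """
    pairs = []
    for fr in french_sentences:
        zh = fr_to_zh(fr)
        if zh:  # skip sentences the reverse translator cannot handle
            pairs.append((zh, fr))
    return pairs

# Hypothetical stand-in for a trained French-to-Chinese model.
TOY_FR_ZH = {"bonjour": "你好", "merci beaucoup": "非常感谢"}

def toy_fr_to_zh(sentence):
    return TOY_FR_ZH.get(sentence.lower())

monolingual_french = ["Bonjour", "Merci beaucoup", "Quelle heure est-il ?"]
synthetic = back_translate(monolingual_french, toy_fr_to_zh)
```

The point of the design is that the target side of each pair, the side the Chinese-to-French model learns to produce, is human-written text, even though the source side is synthetic.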
The M2M-100 model lays a strong foundation for language translation, but challenges remain for low-resource languages. Fan points out that although progress has been made with languages like Swahili and Afrikaans, more work is needed for languages such as Zulu.
Facebook plans to release the M2M-100 data set, model, training methodologies, and evaluation setups as open source to encourage further advances in translation technology. The company also aims to integrate the system into its daily operations, with the goal of a more connected global community.