Today, Cohere for AI, the nonprofit research lab founded by Cohere in 2022, introduced Aya, an open-source large language model (LLM) that supports 101 languages—more than twice the number offered by existing open-source models.
Accompanying this release is the Aya dataset, which features human annotations essential for training models in less common languages. Cohere for AI's researchers have also developed methods to enhance model performance with limited training data.
Launched in January 2023, the Aya project was a significant effort involving over 3,000 collaborators from 119 countries. Sara Hooker, VP of Research at Cohere and leader of Cohere for AI, remarked that the project turned out to be far more extensive than initially anticipated, yielding over 513 million instruction fine-tuning annotations. Such data is considered “gold dust,” vital for refining LLM training beyond the raw data scraped from the internet.
Ivan Zhang, co-founder and CTO of Cohere, shared on X that the team is releasing human demonstrations in more than 100 languages to broaden LLM accessibility, ensuring that the technology serves a global audience rather than just English speakers. He praised the release as a remarkable scientific and operational achievement by Hooker and the Cohere for AI team.
Unlocking LLM Potential for Underrepresented Languages and Cultures
According to a blog post from Cohere, the Aya model and dataset aim to help researchers tap into the potential of LLMs for the many languages and cultures largely overlooked by existing models. Cohere for AI's benchmarks show that Aya significantly outperforms the best open-source multilingual models, such as mT0 and Bloomz, while expanding coverage to more than 50 previously unserved languages, including Somali and Uzbek.
Hooker emphasized that models supporting more than six languages are considered “extreme” outliers, and that only a handful achieve truly “massively multilingual” performance, covering around 25 languages.
Addressing the Data Deficit Beyond English
Hooker explained that a data “cliff” exists outside of English fine-tuning data, which makes Aya's dataset exceptionally rare. She expects researchers to select languages from the dataset and build models for specific linguistic communities, a crucial need. The primary technical challenge, she noted, lies in precision: users around the world expect models tailored to their own languages.
Aleksa Gordic, a former Google DeepMind researcher and the creator of YugoGPT, which outperformed Mistral and Llama 2 for Serbian, Bosnian, Croatian, and Montenegrin, emphasized the importance of multilingual datasets like Aya. Abundant, high-quality data sources, he said, are essential for developing strong LLMs in non-English languages.
While he considers the effort a step in the right direction, Gordic noted that a global research community and government support will be necessary to create and maintain the large, high-quality datasets needed to preserve languages and cultures in the evolving AI landscape.
Cohere for AI's Aya model and datasets are now available on Hugging Face.
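For readers who want to experiment, the sketch below shows one way to load both from the Hugging Face Hub using the transformers and datasets libraries. The repository identifiers (CohereForAI/aya-101 and CohereForAI/aya_dataset), the example prompt, and the generation settings are illustrative assumptions; check the model and dataset cards on the Hub for the exact names and recommended usage.

```python
# Minimal sketch: loading the Aya model and dataset from the Hugging Face Hub.
# The repository IDs below are assumptions; verify them on the Hub before running.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset

checkpoint = "CohereForAI/aya-101"  # assumed model ID; Aya is an mT5-style seq2seq model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Prompt the model in a language beyond English, e.g. Uzbek ("Translate: Hello, world!").
inputs = tokenizer("Tarjima qiling: Salom, dunyo!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Load the human-annotated instruction data released alongside the model.
aya_data = load_dataset("CohereForAI/aya_dataset", split="train")
print(aya_data[0])  # inspect one annotated example
```

Note that the checkpoint is large (on the order of 13 billion parameters), so running it locally may require a GPU with substantial memory or a hosted inference endpoint.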