Cohere for AI Unveils Open Source LLM Supporting 101 Languages: Empowering Global AI Communication

Updated on October 30, 2024

Today, Cohere for AI, the nonprofit research lab founded by Cohere in 2022, introduced Aya, an open-source large language model (LLM) that supports 101 languages—more than twice the number offered by existing open-source models.

Accompanying this release is the Aya dataset, which features human annotations essential for training models in less common languages. Cohere for AI's researchers have also developed methods to enhance model performance with limited training data.

Launched in January 2023, the Aya project was a significant effort involving over 3,000 collaborators from 119 countries. Sara Hooker, VP of Research at Cohere and leader of Cohere for AI, remarked that the project turned out to be far more extensive than initially anticipated, boasting over 513 million instruction fine-tuned annotations. This crucial data is considered “gold dust,” vital for refining LLM training beyond the basic data scraped from the internet.

Ivan Zhang, co-founder and CTO of Cohere, shared on X that the team is releasing human demonstrations across 100+ languages to broaden LLM accessibility, so that the technology serves a global audience rather than just English speakers. He praised this as a remarkable scientific and operational achievement by Hooker and the Cohere for AI team.

Unlocking LLM Potential for Underrepresented Languages and Cultures

According to a blog post from Cohere, the Aya model and dataset aim to help researchers tap into the potential of LLMs for numerous languages and cultures that have been largely overlooked by existing models. Cohere for AI's benchmarks reveal that the Aya model significantly outperforms the best open-source multilingual models, such as mT0 and Bloomz, while also expanding coverage to over 50 previously unserved languages, including Somali and Uzbek.

Hooker emphasized that models supporting more than six languages are considered “extreme,” and only a handful achieve true “massively multilingual” performance with around 25 languages.

Addressing the Data Deficit Beyond English

Hooker explained that a data “cliff” exists outside the realm of English fine-tuning data, making Aya's dataset exceptionally rare. She believes that researchers will select languages from the dataset to develop models for specific linguistic communities—a crucial need. However, she noted that the primary technical challenge lies in precision, as users worldwide expect personalized models tailored to their languages.

Aleksa Gordic, a former researcher at Google DeepMind and creator of YugoGPT, which outperformed Mistral and Llama 2 for Serbian, Bosnian, Croatian, and Montenegrin, emphasized the importance of multilingual datasets like Aya. He stated that to develop high-quality LLMs for non-English languages, high-quality and abundant data sources are essential.

While he believes the effort is a step in the right direction, Gordic noted that a global research community and government support are necessary to create and maintain large, high-quality datasets to preserve languages and cultures in the evolving AI landscape.

Cohere for AI's Aya model and datasets are now available on Hugging Face.
