AI2 Releases Open Source Text-Generating AI Models Along with Training Data

The Allen Institute for AI (AI2), the nonprofit research organization established by the late Microsoft co-founder Paul Allen, is set to launch multiple GenAI language models that it claims are notably more “open” than existing options. These models are licensed in a way that allows developers to use them freely for training, experimentation, and even commercial applications.

Named OLMo, short for "Open Language Models," the models were developed alongside Dolma, one of the largest public datasets released for training text-generating AI. AI2 senior software engineer Dirk Groeneveld explains that these resources are intended to advance our understanding of the complex science behind text-generating AI.

“‘Open’ is an ambiguous term in the context of text-generating models,” Groeneveld shared in an email interview. “We anticipate that researchers and practitioners will embrace the OLMo framework, which includes a model trained on one of the most extensive public datasets released to date, along with all the necessary components for model building.”

With numerous organizations, from Meta to Mistral, releasing capable open source text-generating models for developers to use and fine-tune, Groeneveld argues that many of these offerings do not genuinely qualify as "open." He emphasizes that they were developed behind closed doors and trained on proprietary datasets.

The OLMo models, by contrast, were built in collaboration with partners including Harvard, AMD, and Databricks, and they ship with the code used to produce their training data, along with training and evaluation metrics.

When it comes to performance, the leading OLMo model, OLMo 7B, offers a “compelling and strong” alternative to Meta’s Llama 2, depending on the use case. In benchmarks focusing on reading comprehension, OLMo 7B surpasses Llama 2. However, in question-answering assessments, it falls slightly short.

The OLMo models do face limitations, such as subpar performance in languages other than English—since Dolma predominantly features English content—and modest code-generating capabilities. However, Groeneveld emphasizes that this is just the beginning.

“OLMo is not designed to be multilingual—yet,” he noted. “[Currently], the primary focus was not code generation; to help future projects centered on code fine-tuning, OLMo’s dataset does contain about 15% code.”

I asked Groeneveld whether he was concerned about potential misuse of the OLMo models, given that they can be deployed commercially and are small enough to run on consumer GPUs such as an Nvidia RTX 3090. A recent study by Democracy Reporting International's Disinfo Radar project found that two popular open text-generating models, Hugging Face's Zephyr and Databricks' Dolly, tend to generate harmful content when prompted with malicious requests.

Groeneveld maintains that the advantages of open models outweigh the potential dangers.

“Establishing this open platform will actually enhance research into the risks of these models and identify solutions,” he explained. “While there’s a risk that open models may be misused, this approach also encourages technological developments that foster more ethical models. It is crucial for verification and reproducibility, which rely on complete access to the models, and helps diminish the increasing concentration of power in AI, offering more equitable access.”

In the coming months, AI2 plans to release larger and more capable OLMo models, including multimodal models that handle forms of data beyond text, along with additional datasets for training and fine-tuning. As with the initial launch of OLMo and Dolma, all materials will be freely accessible on GitHub and the AI hosting platform Hugging Face.
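For readers who want to try the release, the checkpoints can be pulled straight from Hugging Face. The sketch below assumes the model ID allenai/OLMo-7B and standard `transformers` loading; the exact steps may differ, so consult the model card for current instructions.

```python
# Minimal sketch: generating text with an OLMo checkpoint from Hugging Face.
# The model ID "allenai/OLMo-7B" and native transformers support are
# assumptions; check the model card for the release's actual loading steps.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B")
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B")

# Encode a prompt and sample a short continuation.
inputs = tokenizer("Language modeling is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```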
