The Allen Institute for AI (AI2), the nonprofit research organization established by the late Microsoft co-founder Paul Allen, is set to launch multiple GenAI language models that it claims are notably more “open” than existing options. These models are licensed in a way that allows developers to use them freely for training, experimentation, and even commercial applications.
Named OLMo, short for “Open Language Models,” the models were developed alongside Dolma, one of the largest public datasets released for training text-generating AI. AI2 senior software engineer Dirk Groeneveld says the resources are intended to advance understanding of the complex science behind text-generating models.
“‘Open’ is an ambiguous term in the context of text-generating models,” Groeneveld shared in an email interview. “We anticipate that researchers and practitioners will embrace the OLMo framework, which includes a model trained on one of the most extensive public datasets released to date, along with all the necessary components for model building.”
With numerous organizations, from Meta to Mistral, releasing capable open source text-generating models for developers to use and fine-tune, Groeneveld argues that many of these offerings do not genuinely qualify as “open”: they were developed behind closed doors and trained on proprietary, opaque datasets.
By contrast, the OLMo models were built in collaboration with partners including Harvard, AMD, and Databricks, and they ship with the source code used to produce their training data as well as training and evaluation metrics.
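For a sense of what that openness looks like on the data side, here is a minimal, hedged sketch of sampling Dolma through the Hugging Face datasets library. The repository identifier "allenai/dolma" and the "text" field name are assumptions for illustration, not details confirmed by AI2; the exact hub name, any config or version argument, and access terms should be checked before use.

```python
# Hedged sketch: streaming a few documents from the Dolma corpus with the
# Hugging Face "datasets" library. The repo id "allenai/dolma" and the "text"
# field are assumptions; check the hub for the exact name and any config.
from itertools import islice

from datasets import load_dataset

# Streaming avoids downloading the full multi-terabyte corpus up front.
dolma = load_dataset("allenai/dolma", split="train", streaming=True)

# Peek at a handful of documents.
for doc in islice(dolma, 3):
    print(doc["text"][:200])  # print the first 200 characters of each document
```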
When it comes to performance, the leading OLMo model, OLMo 7B, offers a “compelling and strong” alternative to Meta’s Llama 2, depending on the use case. In benchmarks focusing on reading comprehension, OLMo 7B surpasses Llama 2. However, in question-answering assessments, it falls slightly short.
The OLMo models do face limitations, such as subpar performance in languages other than English—since Dolma predominantly features English content—and modest code-generating capabilities. However, Groeneveld emphasizes that this is just the beginning.
“OLMo is not designed to be multilingual—yet,” he noted. “[Currently], the primary focus was not code generation; to help future projects centered on code fine-tuning, OLMo’s dataset does contain about 15% code.”
I asked Groeneveld whether he was concerned about potential misuse of the OLMo models, given that they can be used commercially and are performant enough to run on consumer GPUs such as the Nvidia 3090. A recent study by Democracy Reporting International’s Disinfo Radar project found that two popular open text-generating models, Hugging Face’s Zephyr and Databricks’ Dolly, reliably generate harmful content in response to malicious prompts.
Groeneveld maintains that the advantages of open models outweigh the potential dangers.
“Establishing this open platform will actually enhance research into the risks of these models and identify solutions,” he explained. “While there’s a risk that open models may be misused, this approach also encourages technological developments that foster more ethical models. It is crucial for verification and reproducibility, which rely on complete access to the models, and helps diminish the increasing concentration of power in AI, offering more equitable access.”
In the coming months, AI2 plans to release larger and more capable OLMo models, including multimodal models that understand modalities beyond text, along with additional datasets for training and fine-tuning. As with the initial OLMo and Dolma release, all materials will be freely available on GitHub and the AI hosting platform Hugging Face.
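As a rough illustration of what that access could look like in practice, the sketch below loads a hypothetical OLMo checkpoint from Hugging Face with the transformers library in half precision, which is roughly what lets a 7-billion-parameter model fit on a 24 GB consumer card like the 3090. The model identifier "allenai/OLMo-7B" is an assumption based on AI2's naming, and depending on how the checkpoint is packaged, a companion package or a trust_remote_code flag may also be required.

```python
# Hedged sketch: loading an OLMo checkpoint with Hugging Face transformers.
# The model id "allenai/OLMo-7B" is an assumption; confirm the name on the hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Half precision keeps 7B parameters at roughly 14 GB of weights, which is why
# a 24 GB consumer GPU such as the Nvidia 3090 can hold the model.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Open language models matter because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```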