On Wednesday, the Allen Institute for AI (AI2) unveiled an update to its 7-billion-parameter model, OLMo 1.7-7B. This enhanced version leverages a larger and more varied Dolma dataset, along with an improved training process.
Originally introduced in February, OLMo is positioned as a “truly open-source, state-of-the-art large language model,” complete with comprehensive pretraining data, training code, model weights, and evaluation metrics.
The latest update extends OLMo 1.7-7B's context length from 2,048 to 4,096 tokens and improves performance through refined training techniques and architectural enhancements. The Dolma 1.7 dataset comprises 2.3 trillion tokens drawn from diverse sources, including Dolma CC, Refined Web, StarCoder, C4, Stack Exchange, OpenWebMath, Project Gutenberg, and Wikipedia.
The previous release relied on Dolma 1.5, which drew primarily on web data; Dolma 1.7 diversifies the data sources to improve the model's ability to handle tasks requiring specialized knowledge, intricate reasoning, and coding. AI2 also implemented better deduplication to ensure content quality: a document is removed when its duplication score, computed by aggregating paragraph-level duplication scores, exceeds a predetermined threshold.
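AI2 has not spelled out the exact threshold or scoring details here, but the idea can be illustrated with a short sketch: hash each paragraph, count a paragraph as duplicated if its hash has been seen before, and drop a document when the share of duplicated paragraphs exceeds a cutoff. The 0.3 threshold and in-memory hash set below are illustrative assumptions, not the values or machinery used for Dolma 1.7.

```python
import hashlib

seen_paragraphs: set[str] = set()

def doc_duplication_score(text: str) -> float:
    """Fraction of a document's paragraphs whose hash was already seen.

    A stand-in for paragraph-level duplication scoring; a production pipeline
    over a multi-terabyte corpus would use Bloom filters rather than a set.
    """
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    if not paragraphs:
        return 0.0
    duplicated = 0
    for p in paragraphs:
        digest = hashlib.sha1(p.encode("utf-8")).hexdigest()
        if digest in seen_paragraphs:
            duplicated += 1
        else:
            seen_paragraphs.add(digest)
    return duplicated / len(paragraphs)

DUPLICATION_THRESHOLD = 0.3  # illustrative cutoff, not AI2's actual value

def keep_document(text: str) -> bool:
    return doc_duplication_score(text) <= DUPLICATION_THRESHOLD
```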
Dolma 1.7 also introduces a refined quality filtering system. A FastText classifier evaluates documents based on their quality, distinguishing well-structured content from lower-quality material. High-quality sources include Wikipedia, Small Web RSS feeds, and Semantic Scholar, while low-quality examples come from adult-content and misinformation sites. The classifier was trained on approximately 25 GB of data.
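A minimal sketch of this kind of filter, using the fastText Python package, is shown below. The training file name, label names, and probability threshold are assumptions for illustration; AI2's actual classifier was trained on roughly 25 GB of labeled data with its own label scheme.

```python
import fasttext

# Training file in fastText's supervised format, one example per line:
#   __label__high_quality <document text>
#   __label__low_quality  <document text>
model = fasttext.train_supervised(input="quality_train.txt", epoch=5, wordNgrams=2)

def is_high_quality(document: str, threshold: float = 0.5) -> bool:
    # fastText expects a single line of text, so strip newlines first.
    labels, probs = model.predict(document.replace("\n", " "), k=1)
    return labels[0] == "__label__high_quality" and probs[0] >= threshold
```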
Additionally, OLMo 1.7 employs a two-stage training curriculum. In the first stage, the model is pretrained from scratch. In the second stage, it is further trained on a curated subset of Dolma 1.7 for an additional 50 billion tokens while the learning rate is gradually reduced to zero. The curated high-quality subset is formed by including all available Wikipedia, OpenWebMath, and Flan data, excluding certain sources, and rebalancing the proportions of the remaining datasets.
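The second stage amounts to continued pretraining on the curated mixture with the learning rate annealed linearly to zero. The sketch below shows that schedule in PyTorch; only the 50-billion-token budget comes from AI2's description, while the model stand-in, base learning rate, and tokens-per-step figure are placeholders.

```python
import torch

# Placeholder model and optimizer; stage-one pretraining is assumed already done.
model = torch.nn.Linear(4096, 4096)  # stands in for the pretrained OLMo weights
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

STAGE2_TOKENS = 50_000_000_000   # 50B tokens, per AI2's description
TOKENS_PER_STEP = 4_000_000      # illustrative global batch size in tokens
total_steps = STAGE2_TOKENS // TOKENS_PER_STEP

# Linear decay of the learning rate to zero over the second stage.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps)
)

for step in range(total_steps):
    # ... forward/backward pass over a batch from the curated Dolma 1.7 subset ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```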
AI2 reports that these enhancements allow OLMo 1.7-7B to surpass Llama 2-7B on the Massive Multitask Language Understanding (MMLU) benchmark and Llama 2-13B on the GSM8K dataset.
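Such claims can be checked independently with EleutherAI's lm-evaluation-harness. The sketch below assumes the checkpoint is published under a Hugging Face ID like allenai/OLMo-1.7-7B and that the installed harness version exposes simple_evaluate; the few-shot and batch-size settings are illustrative rather than AI2's reported evaluation setup.

```python
from lm_eval import simple_evaluate

# Model ID and few-shot settings are assumptions for illustration; adjust them
# to match the released checkpoint and the evaluation protocol being reproduced.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=allenai/OLMo-1.7-7B,trust_remote_code=True",
    tasks=["mmlu", "gsm8k"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```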
The updated OLMo model is licensed under Apache 2.0, while Dolma 1.7 is available under ODC-BY. Both are now available on Hugging Face.
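For reference, loading the updated model from Hugging Face with the transformers library might look like the following. The repository ID is an assumption based on AI2's naming convention, and trust_remote_code may be unnecessary once the architecture ships natively in your transformers release.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository ID assumed from AI2's naming; check the Hugging Face hub for the
# exact checkpoint name.
model_id = "allenai/OLMo-1.7-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Language models are"
inputs = tokenizer(prompt, return_tensors="pt")
# The updated model accepts contexts of up to 4,096 tokens.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```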