AI2 Enhances Open-Source OLMo Model with Diverse Dataset and Two-Stage Curriculum for Improved Performance

On Wednesday, the Allen Institute for AI (AI2) unveiled an update to its 7 billion-parameter model, OLMo 1.7-7B. This enhanced version leverages a more extensive and varied Dolma dataset, along with an advanced training process.

Originally introduced in February, OLMo is positioned as a “truly open-source, state-of-the-art large language model,” complete with comprehensive pretraining data, training code, model weights, and evaluation metrics.

The latest update extends OLMo 1.7-7B's context length from 2,048 to 4,096 tokens, and refined training techniques and architectural changes improve its performance. The Dolma 1.7 dataset comprises 2.3 trillion tokens drawn from diverse sources, including Dolma CC, Refined Web, StarCoder, C4, Stack Exchange, OpenWebMath, Project Gutenberg, and Wikipedia.

OLMo previously relied on Dolma 1.5, which drew primarily on web data; Dolma 1.7 diversifies the data sources to strengthen the model's handling of tasks that demand specialized knowledge, intricate reasoning, and coding. AI2 also applied improved deduplication to ensure content quality: each document receives a duplication score computed from its paragraph-level duplication scores, and documents whose score exceeds a set threshold are removed.
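The announcement does not spell out how the document-level score is derived, so the sketch below simply averages paragraph-level duplication scores and drops documents above a cutoff; the averaging rule, the 0.3 threshold, and the function names are illustrative assumptions rather than AI2's actual implementation.

```python
from typing import Iterable, List, Tuple


def document_duplication_score(paragraph_scores: List[float]) -> float:
    """Aggregate paragraph-level duplication scores into one document score.

    Here the document score is the mean of its paragraph scores; the exact
    aggregation used for Dolma 1.7 is not specified in the announcement.
    """
    if not paragraph_scores:
        return 0.0
    return sum(paragraph_scores) / len(paragraph_scores)


def filter_documents(docs: Iterable[Tuple[str, List[float]]],
                     threshold: float = 0.3) -> List[str]:
    """Keep only documents whose duplication score is at or below the threshold.

    `docs` yields (doc_id, paragraph_scores) pairs; the 0.3 default is a
    placeholder, not the threshold AI2 used.
    """
    kept = []
    for doc_id, paragraph_scores in docs:
        if document_duplication_score(paragraph_scores) <= threshold:
            kept.append(doc_id)
    return kept
```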

Dolma 1.7 also introduces a refined quality filtering system. A fastText classifier, trained on approximately 25 GB of data, scores documents to separate well-structured content from lower-quality material. High-quality examples come from sources such as Wikipedia, Small Web RSS feeds, and Semantic Scholar, while low-quality examples come from adult-content and misinformation sites.
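As a rough illustration of this kind of filter, the following Python sketch trains a fastText classifier on labeled documents and scores new text by its predicted probability of being high quality; the training-file name, label names, and hyperparameters are placeholders and not AI2's actual configuration.

```python
import fasttext  # pip install fasttext

# Training file format: one document per line, prefixed with its label, e.g.
#   __label__hq <text drawn from Wikipedia, Small Web RSS, Semantic Scholar, ...>
#   __label__lq <text drawn from adult-content or misinformation sites>
# The file name, labels, and hyperparameters below are illustrative placeholders.
model = fasttext.train_supervised(input="quality_train.txt", epoch=5, wordNgrams=2)


def quality_score(document: str) -> float:
    """Return the classifier's probability that the document is high quality."""
    # fastText expects single-line input, so strip newlines before predicting.
    labels, probs = model.predict(document.replace("\n", " "), k=2)
    scores = dict(zip(labels, probs))
    return scores.get("__label__hq", 0.0)
```

Documents scoring below a chosen cutoff would then be dropped from the pretraining mix.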

Additionally, OLMo 1.7 employs a two-stage training curriculum. In the first stage, the model is trained from scratch. In the second stage, it is trained further on a curated subset of Dolma 1.7, consuming an additional 50 billion tokens while the learning rate is gradually reduced to zero. The curated high-quality subset includes all available Wikipedia, OpenWebMath, and Flan data, excludes certain sources, and rebalances the proportions of the remaining datasets.
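A minimal sketch of the stage-two schedule might look like the following, assuming a simple linear decay of the learning rate to zero across the 50-billion-token budget; the linear shape and the function signature are assumptions for illustration.

```python
def stage_two_lr(tokens_seen: int, base_lr: float,
                 total_tokens: int = 50_000_000_000) -> float:
    """Anneal the learning rate to zero over the stage-two token budget.

    `base_lr` is the learning rate at the start of stage two; the 50B-token
    budget matches the announcement, while the linear decay shape is an
    assumption made for this sketch.
    """
    remaining = max(0.0, 1.0 - tokens_seen / total_tokens)
    return base_lr * remaining


# Example: halfway through stage two, the rate has fallen to half its value.
# stage_two_lr(25_000_000_000, 3e-4) -> 1.5e-4
```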

AI2 asserts that these enhancements allow OLMo 1.7-7B to surpass Llama 2-7B on the Massive Multitask Language Understanding (MMLU) benchmark and Llama 2-13B on the GSM8K dataset.

The updated OLMo model is licensed under Apache 2.0, while Dolma 1.7 is available under ODC-BY. Both are now available on Hugging Face.
