The Allen Institute for AI (AI2), in collaboration with Contextual AI, has released OLMoE, an open-source large language model (LLM) designed to balance strong performance with cost-effectiveness.
OLMoE features a sparse mixture-of-experts (MoE) architecture with 7 billion total parameters, of which only 1 billion are active for each input token. It comes in two versions: OLMoE-1B-7B for general use and OLMoE-1B-7B-Instruct, an instruction-tuned variant.
Unlike many other MoE models, OLMoE is fully open-source. AI2 highlights the challenges in accessing other MoE models, as they often lack transparency regarding training data, code, or construction methods. “Most MoE models are closed source, providing limited insights into their training data or methodologies, which hinders the development of cost-efficient open MoEs that can rival closed-source models,” AI2 stated in their paper. This lack of accessibility presents a significant barrier for researchers and academics.
Nathan Lambert, an AI2 research scientist, noted on X (formerly Twitter) that OLMoE could support policy development, serving as a foundational tool as academic H100 clusters become available. He emphasized AI2’s commitment to releasing competitive open-source models, stating, “We’ve improved our infrastructure and data without altering our core goals. This model is truly state-of-the-art, not just the best on a couple of evaluations.”
Building OLMoE
In developing OLMoE, AI2 adopted a fine-grained routing approach with 64 small experts, only eight of which are activated for each input token. This configuration delivered performance comparable to other models while significantly reducing inference cost and memory requirements.
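For readers curious how this kind of routing works mechanically, the following is a minimal sketch of a top-k mixture-of-experts layer in PyTorch. The class name, layer sizes, and gating details are illustrative assumptions for clarity, not OLMoE’s actual implementation, and the loops are written for readability rather than speed.

```python
# Minimal sketch of fine-grained top-k expert routing (illustrative only;
# dimensions, naming, and gating details are assumptions, not OLMoE's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_hidden=1024, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each token against every expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x)                              # (n_tokens, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)  # keep only 8 of the 64 experts per token
        weights = F.softmax(top_vals, dim=-1)                # normalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = top_idx[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# Each token only pays the compute cost of 8 expert MLPs, even though 64 exist.
layer = TopKMoELayer()
tokens = torch.randn(16, 1024)
print(layer(tokens).shape)  # torch.Size([16, 1024])
```

Because only the eight selected expert MLPs run for a given token, per-token compute scales with the active parameters rather than with the full 64-expert pool, which is the source of the inference savings described above.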
OLMoE builds upon AI2’s previous open-source model, OLMo 1.7-7B, which supported a 4,096-token context window and was trained on a dataset called Dolma 1.7. OLMoE’s own training drew on a diverse mix including subsets of Common Crawl, Dolma CC, RefinedWeb, StarCoder, C4, Stack Exchange, OpenWebMath, Project Gutenberg, and Wikipedia.
AI2 claims that OLMoE “outperforms all existing models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B.” Benchmark results indicate that OLMoE-1B-7B often competes closely with models of 7 billion parameters or more, such as Mistral-7B, Llama-3.1-8B, and Gemma 2. Against 1-billion-parameter models, OLMoE-1B-7B significantly outperformed other open-source models, including Pythia, TinyLlama, and even AI2’s own OLMo.
The Case for Open-Source MoE
AI2’s mission includes enhancing accessibility to fully open-source AI models, particularly within the increasingly popular MoE architecture. Many developers are turning to MoE systems, as seen in Mistral’s Mixtral 8x22B and xAI’s Grok, and there is speculation that GPT-4 uses MoE as well. However, AI2 and Contextual AI point out that many existing AI models lack comprehensive transparency regarding their training data and codebases.
AI2 underscores the necessity for openness in MoE models, which introduce design challenges of their own: choosing the ratio of total to active parameters, deciding between many small experts or fewer large ones, whether to use shared experts, and which routing algorithm to adopt.
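To make one of those trade-offs concrete, here is a back-of-the-envelope Python sketch comparing a fine-grained layout (many small experts, as in the 64-expert, top-8 configuration described above) with a coarser layout that has the same total and active parameter budget. The per-expert parameter counts are placeholder assumptions, not OLMoE’s real dimensions.

```python
# Back-of-the-envelope comparison of MoE expert configurations.
# All parameter counts are illustrative placeholders, not OLMoE's actual dimensions.
import math

def moe_stats(n_experts, top_k, params_per_expert):
    total = n_experts * params_per_expert        # parameters held in memory
    active = top_k * params_per_expert           # parameters applied to each token
    combos = math.comb(n_experts, top_k)         # distinct expert subsets a token can be routed to
    return total, active, combos

# Fine-grained: many small experts (64 experts, 8 active per token).
print(moe_stats(64, 8, 10_000_000))   # (640000000, 80000000, 4426165368)
# Coarser: a few large experts with the same total and active budget.
print(moe_stats(8, 1, 80_000_000))    # (640000000, 80000000, 8)
```

Holding memory footprint and per-token compute fixed, the fine-grained layout gives the router roughly 4.4 billion possible expert subsets per token instead of eight, which is one argument for the granular design AI2 chose.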
Furthermore, the Open Source Initiative is actively addressing what constitutes openness for AI models, highlighting the importance of transparency in advancing the field.