The Allen Institute for AI (Ai2) has officially launched Molmo, an open-source suite of cutting-edge multimodal AI models that surpass top proprietary competitors, including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5, on several third-party benchmarks.
As multimodal models, Molmo can analyze images and files, similar to leading proprietary foundation models. Notably, Ai2 claims that Molmo utilizes "1000x less data" than its proprietary counterparts, thanks to innovative training methods detailed in a newly published technical report by the Paul Allen-founded company, led by Ali Farhadi.
Ai2 also shared a demonstration video on YouTube, showcasing how Molmo operates on smartphones to efficiently analyze live scenes. Users can simply take a photo for immediate processing—examples include counting people, identifying vegan menu items, interpreting flyers, distinguishing electronic music bands, and converting handwritten notes from whiteboards into structured tables.
This release reflects Ai2's commitment to fostering open research by providing high-performance models, complete with accessible weights and data, for the broader community and enterprises seeking customizable solutions.
Molmo follows Ai2’s recent introduction of OLMoE, a cost-effective model utilizing a "mixture of experts" architecture.
Model Variants and Performance
Molmo comprises four primary models with varying parameter sizes and capabilities:
- Molmo-72B: The flagship model with 72 billion parameters, based on Alibaba Cloud’s Qwen2-72B.
- Molmo-7B-D: A demonstration model derived from Alibaba’s Qwen2-7B.
- Molmo-7B-O: Based on Ai2’s OLMo-7B.
- MolmoE-1B: An efficiency-focused mixture-of-experts model built on Ai2's OLMoE, nearly matching GPT-4V's performance on both academic benchmarks and user preference.
These models showcase impressive capabilities across various third-party benchmarks, consistently outperforming many proprietary alternatives. All models are available under permissive Apache 2.0 licenses, allowing for extensive research and commercial use.
Molmo-72B stands out in academic evaluations, achieving the best average score across 11 key benchmarks and ranking second in user preference, just behind GPT-4o.
Machine learning developer advocate Vaibhav Srivastav from Hugging Face emphasized that Molmo establishes a robust alternative to closed systems, raising the bar for open multimodal AI. Additionally, Google DeepMind robotics researcher Ted Xiao praised Molmo’s incorporation of pointing data, a vital advancement for visual grounding in robotics, enhancing interaction with physical environments.
Advanced Architecture and Training
Molmo’s architecture is designed to balance efficiency and performance. Each model uses OpenAI’s ViT-L/14 336px CLIP model as its vision encoder, converting multi-scale image crops into vision tokens. These tokens pass through a multi-layer perceptron (MLP) connector before being fed into the language model.
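To make that pattern concrete, here is a toy PyTorch sketch of a vision-encoder-to-language-model connector. The class name, layer sizes, and token counts are illustrative placeholders, not Molmo's actual configuration.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Toy sketch of the CLIP-encoder -> MLP-connector -> LLM pattern.

    Dimensions are placeholders, not Molmo's real configuration.
    """

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Projects vision tokens from the CLIP feature space into the
        # language model's embedding space.
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens, text_embeddings):
        # vision_tokens: (batch, num_patches, vision_dim) from the vision encoder
        # text_embeddings: (batch, seq_len, llm_dim) from the LLM's embedding layer
        projected = self.connector(vision_tokens)
        # Prepend projected vision tokens to the text sequence; the combined
        # sequence is then consumed by the language model's transformer layers.
        return torch.cat([projected, text_embeddings], dim=1)


# Example with random tensors standing in for real encoder outputs.
connector = VisionLanguageConnector()
vision_tokens = torch.randn(1, 576, 1024)   # e.g. 24x24 patches from a 336px image
text_embeddings = torch.randn(1, 32, 4096)
fused = connector(vision_tokens, text_embeddings)
print(fused.shape)  # torch.Size([1, 608, 4096])
```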
The training protocol consists of two crucial stages:
- Multimodal Pre-training: Models are first trained for caption generation on PixMo, a newly collected, high-quality dataset of detailed image descriptions provided by human annotators.
- Supervised Fine-Tuning: Models are fine-tuned on a diverse dataset that includes academic benchmarks and newly developed datasets, equipping them for complex tasks such as document reading and visual reasoning.
Unlike many contemporary models, Molmo does not rely on reinforcement learning from human feedback (RLHF); instead, it uses a carefully tuned training pipeline that updates all model parameters from their pre-trained states.
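As a rough illustration of this kind of pipeline, the toy sketch below runs the same full-parameter training loop for both stages. The function, learning rates, and dataloader names are hypothetical, and it assumes a Hugging Face-style model whose forward pass returns a loss.

```python
import torch

def train_stage(model, dataloader, epochs, lr):
    # All parameters (vision encoder, connector, language model) receive
    # gradients in every stage -- no modules are frozen, and no RLHF step follows.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            loss = model(**batch).loss   # standard next-token cross-entropy
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1: caption generation on detailed image descriptions (PixMo-style data).
# train_stage(model, caption_loader, epochs=1, lr=1e-5)       # placeholder values
# Stage 2: supervised fine-tuning on a mixture of academic and new datasets.
# train_stage(model, sft_mixture_loader, epochs=1, lr=5e-6)   # placeholder values
```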
Benchmark Performance
The Molmo models exhibit outstanding results across various benchmarks, notably outpacing proprietary models. For example, Molmo-72B scores 96.3 on DocVQA and 85.5 on TextVQA, surpassing both Gemini 1.5 Pro and Claude 3.5 Sonnet. It also excels on Ai2D, with a score of 96.3, the highest among all model families.
Notably, Molmo-72B excels in visual grounding tasks, achieving top scores on RealWorldQA, making it a promising candidate for robotics and complex multimodal reasoning applications.
Open Access and Future Developments
Ai2 has made these models and datasets freely accessible on its Hugging Face space, ensuring compatibility with popular AI frameworks like Transformers. This initiative is part of Ai2’s mission to promote innovation and collaboration within the AI community.
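For readers who want to try a checkpoint directly, the minimal loading sketch below shows what inference through Transformers might look like. The repository ID and the remote-code helpers (processor.process, model.generate_from_batch) follow the pattern in the published model cards, so verify them against the card for the specific checkpoint you choose.

```python
# Minimal sketch of loading a Molmo checkpoint via Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image

repo_id = "allenai/Molmo-7B-D-0924"  # assumed checkpoint name; check Ai2's Hugging Face org

processor = AutoProcessor.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Build a multimodal prompt: one image plus a text instruction.
inputs = processor.process(images=[Image.open("example.jpg")], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate a caption with the model's custom batch-generation helper.
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
new_tokens = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))
```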
In the coming months, Ai2 plans to release additional models, training code, and an expanded version of its technical report, further enhancing the resources available to researchers. For those interested in exploring Molmo’s capabilities, a public demo and model checkpoints are available now on Molmo’s official page.