The prevailing belief is that major companies like Google, OpenAI, and Anthropic, armed with endless funding and hundreds of elite researchers, are the only entities capable of developing state-of-the-art foundation models. However, as one of them famously stated, they “have no moat.” Ai2 just demonstrated this with the launch of Molmo, a groundbreaking multimodal AI model that rivals the best in the industry—while being compact, free, and genuinely open source.
To clarify, Molmo (short for multimodal open language model) serves as a visual understanding engine rather than a comprehensive chatbot akin to ChatGPT. It lacks an API, is not designed for enterprise integration, and does not perform web searches. Instead, you can think of it as the component within larger models that interprets images, comprehends them, and provides descriptions or answers to questions about them.
Molmo comes in three sizes: 72B, 7B, and 1B parameters. Like other multimodal models, it can identify and respond to queries about a variety of everyday scenarios and objects. For instance, it can assist with questions like, “How do you operate this coffee maker?” or “How many dogs in this picture have their tongues out?” It’s designed for visual understanding tasks that have been demonstrated over the years with varying levels of success.
What sets Molmo apart isn’t just its capabilities, which you can explore in the public demo, but the approach it takes to achieve them. Visual understanding encompasses a wide range of tasks, from counting sheep in a field to gauging a person's emotional state or summarizing menu options. As Ai2 CEO Ali Farhadi explained during a demo at the organization's Seattle headquarters, evaluating these models side by side shows how similar their capabilities are.
“One thing we’re highlighting today is that open equals closed,” he stated, “and small equals big.” (He clarified that he meant approximate equivalence, not strict identity.)
Historically, the mantra in AI development has been “bigger is better.” This means more training data, more model parameters, and more computational power. However, there comes a point where increasing size becomes impractical due to limitations in available data or soaring costs. Hence, the focus has shifted towards maximizing the potential of existing resources—effectively doing more with less.
Farhadi pointed out that Molmo performs on par with leading models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet at roughly a tenth of their size. That ratio is an estimate, since the sizes of those models have not been published.
“There are numerous benchmarks for evaluation, and while I’m not fond of this scientifically, I had to present some figures,” he noted. “Our largest model, the 72B, outperforms GPTs, Claudes, and Geminis on these benchmarks. However, I urge caution; it doesn’t definitively mean it’s better. Still, it indicates we’re competing on the same level.”
If you wish to challenge its abilities, feel free to explore the public demo, which is also mobile-friendly. (You can even refresh or edit the original prompt to change the image without having to log in.)
The secret behind Molmo’s success lies in utilizing a smaller, yet higher-quality dataset. Instead of relying on billions of images that are difficult to quality control or describe, Ai2 meticulously curated a set of 600,000 images. Though still substantial, this is a fraction of what is typically used in the industry. While this selection process may omit some less common examples, it yields high-quality annotations and definitions.
Curious about their method? They ask people to describe images aloud rather than in writing. This approach results in rich, conversational descriptions that are both accurate and practically useful.
Molmo showcases this in its ability to “point” to relevant areas in images. For instance, if it needs to count the dogs in a photo, it places a dot on each dog’s face. When asked about tongues, it marks each one accordingly. Because the model can point at things it has not been specifically trained on, this capability opens the door to zero-shot actions. For example, it can navigate web pages without looking at the website’s code, filling in and submitting forms along the way.
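To make the pointing-and-counting idea concrete, here is a minimal Python sketch of how an application might turn a model's pointed answer into a count and a set of pixel coordinates. The reply format is an assumption for illustration: it supposes the model writes its points as XML-like tags with coordinates given as percentages of the image's width and height. The sample reply, the `parse_points` and `to_pixels` helpers, and the 800x600 image size are all hypothetical, not part of Ai2's release.

```python
import re
from typing import List, Tuple

# Assumed reply format (not confirmed by the article): points appear as
# attributes such as x="12.5" y="40.0" or x1="..." y1="..." x2="..." y2="...",
# with each value a percentage (0-100) of the image width or height.
POINT_ATTRS = re.compile(r'\bx(\d*)="([\d.]+)"\s+y\1="([\d.]+)"')


def parse_points(reply: str) -> List[Tuple[float, float]]:
    """Extract every (x, y) percentage pair found in the reply text."""
    return [(float(x), float(y)) for _, x, y in POINT_ATTRS.findall(reply)]


def to_pixels(points: List[Tuple[float, float]],
              width: int, height: int) -> List[Tuple[int, int]]:
    """Convert percentage coordinates to pixel coordinates."""
    return [(round(x * width / 100), round(y * height / 100))
            for x, y in points]


# Hypothetical reply to "How many dogs are in this picture?"
reply = ('Counting the <points x1="12.5" y1="40.0" x2="50.0" y2="42.5" '
         'x3="87.5" y3="37.5" alt="dogs">dogs</points> shows a total of 3.')

points = parse_points(reply)
print(len(points))                    # 3
print(to_pixels(points, 800, 600))    # [(100, 240), (400, 255), (700, 225)]
```

The count comes from the number of points rather than from a number the model states, which is what makes the answer easy to check by eye. The same pixel coordinates could, in principle, be handed to a browser automation tool as click targets.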
So why is this significant? With new models emerging nearly every day, from Google’s recent announcements to OpenAI’s upcoming demo, there’s constant hype. Yet Molmo stands out because it is entirely free and open source. It can operate locally, eliminating the need for APIs, subscriptions, or high-performance GPU setups. The model aims to empower developers and creators to build AI-driven applications and services without depending on large tech companies.
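Whether a given variant fits on local hardware depends largely on how much memory its weights need. The back-of-the-envelope Python sketch below multiplies the nominal parameter counts (72B, 7B, 1B) by a few common storage precisions. The bytes-per-parameter values are general conventions, not figures from Ai2, and the estimate covers the weights only, ignoring activations, caches, and any other runtime overhead.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9


# Nominal sizes from the article; precisions are assumptions for illustration.
sizes = {"72B": 72, "7B": 7, "1B": 1}
precisions = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}

for name, billions in sizes.items():
    row = ", ".join(
        f"{label}: {weight_memory_gb(billions, nbytes):.1f} GB"
        for label, nbytes in precisions.items()
    )
    print(f"{name} -> {row}")
```

By this rough measure, the 7B and 1B variants fall within reach of a single consumer GPU or a well-equipped laptop, while the 72B variant at 16-bit precision needs about 144 GB for its weights alone.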
Farhadi expressed that the objective is to reach a broad audience, including researchers, developers, and those new to AI models. “Our goal is to make it more accessible,” he stated. “We’re releasing everything we’ve developed: data, cleaning methods, annotations, training processes, code, checkpoints, and evaluation details.”
He expects developers to begin using the dataset and code right away, including deep-pocketed competitors that routinely absorb publicly available data for their own use.
The pace of innovation in the AI landscape is rapid. Meanwhile, major players are racing to cut prices while raising substantial funds to sustain their operations. If similar capabilities can be had through free, open-source alternatives, one must question whether the astronomical valuations of these companies are justified. Ultimately, Molmo illustrates that while the emperor’s attire may be debatable, it is clear he lacks a protective moat.