As excitement builds around the capabilities of the new GPT-4o-mini, Apple has expanded its collection of compact AI models with the release of several open DataComp for Language Models (DCLM) models on Hugging Face.
The release includes two models: one with 7 billion parameters and another with 1.4 billion. Both perform strongly on benchmarks, the larger one in particular, which outperforms Mistral-7B and approaches the performance of other leading open models such as Llama 3 and Gemma.
Vaishaal Shankar from the Apple ML team refers to these models as the “best-performing” open-source options available. Notably, the project has fully embraced open-source principles by releasing model weights, training code, and the pretraining dataset.
Overview of Apple DCLM Models
The DataComp project is a collaborative initiative involving researchers from Apple, the University of Washington, Tel Aviv University, and the Toyota Research Institute. Its goal is to create high-quality datasets for training AI models, particularly in the multimodal domain. The team uses a standardized framework with fixed model architectures, training code, hyperparameters, and evaluations, so that different data curation strategies can be compared purely on how well they optimize model performance.
Early experiments revealed that model-based filtering—where machine learning models filter and select high-quality data from larger datasets—plays a critical role in assembling superior training sets. Using this curation technique, the team developed the DCLM-Baseline dataset, which was instrumental in training the 7 billion and 1.4 billion parameter decoder-only transformer models from scratch.
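For readers unfamiliar with the technique, the sketch below shows the general shape of model-based filtering: a lightweight quality classifier scores each raw document and only high-scoring documents are kept for the training set. The classifier file, label name, and threshold are illustrative assumptions, not the exact DCLM-Baseline pipeline, which is detailed in the paper.

```python
# Illustrative sketch of model-based filtering: a quality classifier scores
# each raw document and only high-scoring documents are kept. The classifier
# file, label name, and threshold below are assumptions for illustration,
# not the actual DCLM-Baseline setup.
import fasttext

QUALITY_THRESHOLD = 0.8  # assumed cutoff, not a published value


def filter_documents(documents, classifier_path="quality_classifier.bin"):
    """Return the subset of `documents` the classifier rates as high quality."""
    model = fasttext.load_model(classifier_path)  # pre-trained quality classifier
    kept = []
    for doc in documents:
        # fastText predicts on a single line of text; predict() returns the
        # top-k labels and their probabilities.
        labels, probs = model.predict(doc.replace("\n", " "), k=1)
        if labels[0] == "__label__high_quality" and probs[0] >= QUALITY_THRESHOLD:
            kept.append(doc)
    return kept
```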
The 7B model, trained on 2.5 trillion tokens using OpenLM pretraining recipes, features a 2K context window and achieves 63.7% 5-shot accuracy on the MMLU benchmark. This marks a 6.6 percentage point improvement over MAP-Neo, the previous leader in open data language models, while utilizing 40% less computing power during training.
Crucially, its MMLU performance is close to that of leading models with open weights but closed data, such as Mistral-7B-v0.3 (62.7%), Llama3 8B (66.2%), Google’s Gemma (64.3%), and Microsoft’s Phi-3 (69.9%).
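For context, 5-shot MMLU accuracy is measured by prepending five worked examples from the same subject to each test question and checking the model's predicted answer letter. The helper below is a rough sketch of that prompt format under the common MMLU convention; it is not Apple's evaluation harness.

```python
# Rough sketch of 5-shot MMLU prompting: five worked examples from the same
# subject are prepended to the test question, and the model's predicted answer
# letter is compared with the gold answer. Follows the common MMLU convention,
# not Apple's exact evaluation code.

def format_example(question, choices, answer=None):
    letters = ["A", "B", "C", "D"]
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_five_shot_prompt(subject, dev_examples, test_question, test_choices):
    """`dev_examples` is a list of five (question, choices, answer) tuples."""
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    shots = "\n\n".join(format_example(q, c, a) for q, c, a in dev_examples[:5])
    return header + shots + "\n\n" + format_example(test_question, test_choices)


if __name__ == "__main__":
    toy_shot = ("2 + 2 = ?", ["3", "4", "5", "6"], "B")
    prompt = build_five_shot_prompt("elementary mathematics", [toy_shot] * 5,
                                    "7 * 6 = ?", ["42", "36", "48", "40"])
    print(prompt)
```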
Additionally, when researchers lengthened the model's context to 8K and trained on an additional 100 billion tokens using the Dataset Decomposition technique, they observed further performance improvements across the Core and Extended benchmarks, although the MMLU result remained consistent.
“Our findings underscore the significance of dataset design in training language models and serve as a foundation for ongoing research in data curation,” the researchers stated in a paper on DataComp-LM.
Impressive Performance of the Smaller Model
Similar to the DCLM-7B, the smaller 1.4B model—developed collaboratively with the Toyota Research Institute using 2.6 trillion tokens—also shows remarkable performance in MMLU, Core, and Extended tests. In the 5-shot MMLU assessment, it achieved 41.9%, surpassing other models in its category, including Hugging Face’s SmolLM, which had an MMLU score of 39.97%. Qwen-1.5B and Phi-1.5B followed with scores of 37.87% and 35.90%, respectively.
Currently, the 7B model is available under Apple’s Sample Code License, while the 1.4B model has been released under Apache 2.0, permitting commercial use, distribution, and modification. An instruction-tuned version of the 7B model is also available on Hugging Face.
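As a practical starting point, the snippet below sketches how such a checkpoint might be loaded through the standard transformers API. The repository ID is an assumption for illustration; the exact name, any extra dependencies (the DCLM model cards reference the OpenLM codebase), and the license terms should be confirmed on the Hugging Face model card.

```python
# Minimal sketch of loading one of the released checkpoints with the standard
# transformers API. The repository ID below is an assumption for illustration;
# verify the exact name and any extra dependencies on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "apple/DCLM-7B"  # assumed repository ID; verify on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

inputs = tokenizer("The key idea behind dataset curation is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```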
It is worth highlighting that this release is early research aimed at demonstrating the effectiveness of data curation. The models are not intended for Apple devices and may exhibit biases from their training datasets or produce potentially harmful responses.