Microsoft is pushing the boundaries of AI innovation beyond its partnership with OpenAI. Today, the tech giant unveiled three new models in its Phi series of language and multimodal models, positioning itself as a formidable player in the AI landscape in its own right.
Introducing the Phi-3.5 Models
The newly released Phi 3.5 models include:
- Phi-3.5-mini-instruct: 3.82 billion parameters
- Phi-3.5-MoE-instruct: 41.9 billion parameters in total (roughly 6.6 billion active per token)
- Phi-3.5-vision-instruct: 4.15 billion parameters
Each model is optimized for specific tasks: Phi-3.5-mini for basic, rapid reasoning, Phi-3.5-MoE for more advanced reasoning, and Phi-3.5-vision for image and video analysis. Developers can download, customize, and fine-tune these models on Hugging Face under an MIT License that permits unrestricted commercial use.
Remarkably, these models deliver near state-of-the-art performance across various third-party benchmarks, outperforming notable competitors like Google's Gemini 1.5, Meta's Llama 3.1, and even OpenAI's GPT-4o in certain tests. This impressive performance has sparked praise for Microsoft across social media platforms.
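For developers who want to try them, the models load like any other Hugging Face checkpoint. The sketch below assumes the transformers and torch libraries and the Hub id microsoft/Phi-3.5-mini-instruct; the precision and device settings are illustrative choices, not requirements.

```python
# A minimal sketch of loading Phi-3.5-mini-instruct from Hugging Face.
# Assumes `torch` and `transformers` are installed; dtype and device
# settings here are illustrative, not requirements.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps memory modest
    device_map="auto",           # spread layers across available devices
    trust_remote_code=True,      # Phi models ship custom modeling code
)
```

Once loaded, the model behaves like any other causal language model in transformers, so the usual customization and fine-tuning tooling applies.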
Model Overviews
1. Phi-3.5 Mini Instruct: For Compute-Constrained Environments
The Phi-3.5 Mini Instruct model, with its 3.8 billion parameters, is designed for environments with limited memory and computing power. It supports a 128k token context length, making it well suited to tasks such as code generation, mathematical problem-solving, and logic-based reasoning. Despite its smaller size, it delivers competitive performance in multilingual and multi-turn conversations and outperforms other similarly sized models on long-context code understanding.
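Because the model is instruction-tuned for chat, prompting goes through the tokenizer's chat template. Here is a hedged sketch of a multi-turn, code-oriented exchange, reusing the model and tokenizer from the loading example above; the prompts are placeholders.

```python
# Sketch of a multi-turn exchange with Phi-3.5-mini-instruct, reusing
# the `model` and `tokenizer` from the loading example above.
messages = [
    {"role": "user", "content": "Write a Python function that reverses a string."},
    {"role": "assistant", "content": "def reverse(s):\n    return s[::-1]"},
    {"role": "user", "content": "Now make it reverse each word in a sentence instead."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```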
2. Phi-3.5 MoE: Mixture of Experts
The Phi-3.5 MoE model represents Microsoft's foray into the Mixture of Experts architecture, which routes each input token through a small subset of specialized expert sub-networks rather than a single dense network. With 41.9 billion parameters in total, of which only about 6.6 billion (two of its 16 experts) are active for any given token, and a 128k token context length, this model delivers scalable performance across various reasoning tasks. It frequently surpasses larger models in benchmarks, including significant strides in STEM and humanities subjects on the MMLU (Massive Multitask Language Understanding) test.
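To make the architecture concrete, the toy PyTorch layer below implements top-2 expert routing: a small gating network scores the experts, each token is sent to its two highest-scoring experts, and their outputs are blended. This is a conceptual sketch of mixture-of-experts routing in general, not Microsoft's actual implementation; only the 16-expert, 2-active configuration mirrors the figures reported for Phi-3.5-MoE.

```python
# Toy top-2 mixture-of-experts layer (conceptual sketch, not
# Microsoft's implementation). 16 experts with 2 active per token
# mirrors the configuration reported for Phi-3.5-MoE.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)  # router scores each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.gate(x)                       # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)        # normalize their mixing weights
        out = torch.zeros_like(x)
        # Only k experts run per token, so the "active" parameter count
        # is a small fraction of the total parameter count.
        for slot in range(self.k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e            # tokens routed to expert e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

x = torch.randn(8, 512)    # a batch of 8 token embeddings
print(TopKMoE()(x).shape)  # torch.Size([8, 512])
```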
3. Phi-3.5 Vision Instruct: Advanced Multimodal Reasoning
The Phi-3.5 Vision Instruct model combines text and image processing, excelling at tasks such as image comprehension, optical character recognition, and video summarization. Like its counterparts, it supports a 128k token context length, allowing it to handle complex visual tasks. Microsoft trained this model on a mix of synthetic and publicly available datasets, emphasizing high-quality, reasoning-rich data.
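A single-image query follows the prompt convention documented on the Phi vision model cards, where images are referenced with numbered <|image_1|> tags. The sketch below assumes the Hub id microsoft/Phi-3.5-vision-instruct and a placeholder local image; exact processor arguments can vary across transformers versions.

```python
# Sketch: single-image Q&A with Phi-3.5-vision-instruct. The image path
# is a placeholder; processor arguments may differ across versions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

image = Image.open("report_chart.png")  # placeholder local file
# Images are referenced in the prompt with numbered <|image_N|> tags.
messages = [{"role": "user", "content": "<|image_1|>\nSummarize what this chart shows."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))
```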
Training the Phi Trio
- Phi-3.5 Mini Instruct: Trained on 3.4 trillion tokens over 10 days using 512 H100-80G GPUs.
- Phi-3.5 Vision Instruct: Trained on 500 billion tokens over 6 days with 256 A100-80G GPUs.
- Phi-3.5 MoE: Trained on 4.9 trillion tokens over 23 days using 512 H100-80G GPUs.
Open Source Commitment
All three Phi-3.5 models are released under the MIT license, underscoring Microsoft's commitment to the open-source community. The license allows developers to use, modify, and distribute the software freely, with the standard caveat that it is provided “as is,” without warranty of any kind.
Microsoft's introduction of the Phi-3.5 series marks a pivotal advancement in multilingual and multimodal AI, equipping developers to integrate cutting-edge capabilities into their applications and driving innovation in both commercial and research sectors.