As Google unveiled a series of artificial intelligence advancements at its Cloud Next conference, Mistral AI, a rising star in the AI field, launched its latest sparse mixture-of-experts (SMoE) model, Mixtral 8x22B. Rather than the demo video or blog post its competitors typically favor, the Paris-based startup took an unconventional route, sharing a torrent link on X that let users download and test the new model directly.
The release makes Mixtral 8x22B the third major model introduction in the industry in just a few days, following the rollouts of OpenAI’s GPT-4 Turbo with Vision and Google’s Gemini 1.5 Pro. Meta has also teased the launch of Llama 3 next month.
Mistral's torrent file includes four components totaling 262GB, and while detailed capabilities of Mixtral 8x22B are yet to be disclosed, AI enthusiasts expressed excitement over its potential. However, running the model locally could pose challenges. As one Reddit user noted, “When I bought my M1 Max Macbook, I thought 32 GB would be overkill… I never thought my interest in AI would suddenly make that far from enough.”
Shortly after announcing Mixtral 8x22B, Mistral made the model available on Hugging Face for further training and deployment, emphasizing that the pretrained model lacks moderation mechanisms. Together AI has also provided access for users to experiment with it.
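For readers who want to experiment with the weights, a minimal loading sketch using the Hugging Face transformers library might look like the following. The repository name, precision, and device settings here are assumptions for illustration, not details from Mistral's announcement, and the raw pretrained checkpoint has no moderation or instruction tuning.

```python
# Minimal sketch of loading the released base model from Hugging Face.
# The repo ID, dtype, and device settings are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x22B-v0.1"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce the memory footprint
    device_map="auto",           # shard the large checkpoint across available devices
)

prompt = "The Mistral wind blows through"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

In practice, a checkpoint of this size needs multiple high-memory GPUs or aggressive quantization; hosted options such as Together AI sidestep that requirement.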
With its sparse MoE approach, Mistral aims to deliver a powerful combination of specialized models, each tailored to specific tasks, for optimized performance and cost efficiency. "At every layer, for every token, a router network selects two of these ‘experts’ to process the token and combines their outputs additively. This method increases the number of model parameters while controlling cost and latency, since the model activates only a fraction of the total parameters for each token," Mistral explains on its website.
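To make the routing idea concrete, below is a minimal, self-contained top-2 mixture-of-experts layer written in PyTorch. The layer sizes, the simple feed-forward experts, and the softmax over the two selected gate logits are illustrative assumptions; this is a sketch of the general technique, not Mistral's implementation.

```python
# Illustrative top-2 sparse MoE layer (not Mistral's code): a router picks
# two experts per token and combines their outputs additively.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                           # x: (num_tokens, d_model)
        logits = self.router(x)                     # (num_tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)        # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                 # only the selected experts run per token
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)                        # 4 tokens, toy hidden size
print(Top2MoELayer()(tokens).shape)                 # torch.Size([4, 512])
```

The key property is that each token only pays the compute cost of the two experts it is routed to, even though all eight sets of expert weights contribute to the model's total parameter count.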
Previously, the company introduced Mixtral 8x7B, which has 46.7 billion total parameters but uses only 12.9 billion per token, so it processes input and generates output at the same speed and cost as a 12.9-billion-parameter model. For the new release, Reddit discussions suggest roughly 130 billion total parameters, with about 38 billion active parameters engaged per token, assuming two experts are activated for each token.
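As a rough sanity check on those figures, an MoE's active-parameter count can be estimated as the parameters shared by every token (attention, embeddings, router) plus the parameters of the experts actually selected. The shared/per-expert split below is a hypothetical assumption chosen only to reproduce the Reddit estimate, not a published specification of Mixtral 8x22B.

```python
# Back-of-the-envelope MoE parameter accounting (all figures in billions).
# The shared/per-expert split is a hypothetical assumption chosen to match
# the ~130B-total / ~38B-active Reddit estimate, not a published spec.
NUM_EXPERTS = 8        # experts per MoE layer
EXPERTS_PER_TOKEN = 2  # the router activates two experts per token

def moe_param_counts(shared_b: float, per_expert_b: float) -> tuple[float, float]:
    """Return (total, active-per-token) parameter counts in billions."""
    total = shared_b + NUM_EXPERTS * per_expert_b
    active = shared_b + EXPERTS_PER_TOKEN * per_expert_b
    return total, active

total, active = moe_param_counts(shared_b=7.3, per_expert_b=15.3)
print(f"total ~{total:.0f}B parameters, ~{active:.0f}B active per token")
# -> total ~130B parameters, ~38B active per token
```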
While Mixtral 8x22B’s actual benchmark performance remains to be determined, expectations are high. Users believe it will build on the success of Mixtral 8x7B, which outperformed both Meta’s Llama 2 70B and OpenAI’s GPT-3.5 across numerous benchmarks, including GSM-8K and MMLU, while delivering faster inference.