Mixture-of-Experts (MoE): A Scalable Approach for Large Language Models
Mixture-of-Experts (MoE) has gained traction as an effective method for scaling large language models (LLMs) while keeping computational costs under control. Rather than using the model's full capacity for every input, MoE architectures route each input to specialized “expert” modules. This lets LLMs grow their parameter count without a proportional increase in inference cost, and the approach is used in notable models such as Mixtral, DBRX, Grok, and reportedly GPT-4.
Despite its advantages, current MoE techniques face constraints that limit the number of experts available. In a recent paper, Google DeepMind introduced Parameter Efficient Expert Retrieval (PEER), a groundbreaking architecture designed to scale MoE models to millions of experts, enhancing the performance-compute tradeoff in large language models.
The Challenge of Scaling LLMs
Recent years have demonstrated that increasing the parameter count of language models generally enhances their performance and capabilities. However, this scaling comes with computational and memory bottlenecks.
In each transformer block of an LLM, attention layers model the relationships among input tokens, while feedforward (FFW) layers store the model's knowledge. FFW layers account for roughly two-thirds of the model's parameters, making them a major scaling bottleneck. In the classic transformer architecture, all FFW parameters are used for every token during inference, so their computational cost grows in direct proportion to their size.
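As a rough, back-of-the-envelope illustration of where those parameters sit, the sketch below counts the weights in one standard transformer block; the model width and the usual 4x FFW expansion are assumptions chosen for the example, not figures from the paper.

```python
# Parameter count for a single standard transformer block (illustrative numbers).
d = 4096  # hypothetical model width

# Attention: query, key, value, and output projections, each d x d.
attention_params = 4 * d * d

# Feedforward (FFW): up-projection d -> 4d and down-projection 4d -> d.
ffw_params = 2 * (d * 4 * d)

total = attention_params + ffw_params
print(f"FFW share of block parameters: {ffw_params / total:.0%}")  # ~67%
```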
MoE addresses this issue by replacing the single dense FFW layer with sparsely activated expert modules. Each expert contains a subset of the layer's parameters and specializes in certain kinds of inputs. A router assigns each input to the expert or experts expected to handle it best.
By increasing the number of experts, MoE enhances LLM capacity without escalating computational costs.
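A minimal sketch of this routing pattern in PyTorch is shown below; the layer sizes, the top-2 routing, and the class name are assumptions chosen for illustration rather than details from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Sparsely activated MoE layer: each token is processed by only its top-k experts."""

    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # scores each expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                 # x: (num_tokens, d_model)
        scores = self.router(x)                           # (num_tokens, num_experts)
        top_w, top_idx = scores.topk(self.k, dim=-1)      # keep the k best experts per token
        top_w = F.softmax(top_w, dim=-1)                  # normalize weights over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Only k of the num_experts feedforward blocks run for any given token, so capacity
# grows with num_experts while per-token compute stays roughly constant.
layer = TopKMoELayer()
print(layer(torch.randn(16, 512)).shape)  # torch.Size([16, 512])
```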
Finding Optimal MoE Granularity
Research indicates that the optimal number of experts in an MoE model depends on several factors, including the number of training tokens and the compute budget. When these elements are balanced well, MoE models tend to outperform dense models trained with the same amount of compute.
Moreover, increasing the “granularity” of an MoE model—referring to the expert count—can lead to performance improvements, especially when coupled with larger model sizes and training datasets. High-granularity MoE also facilitates more efficient learning, allowing models to adapt to ongoing data changes, which is crucial in dynamic deployment environments.
Current MoE methods, however, top out at a relatively small number of experts. They typically rely on routers designed for a fixed, predetermined set of experts, which must be readjusted whenever new experts are added.
Parameter Efficient Expert Retrieval (PEER)
DeepMind’s PEER architecture tackles the challenge of scaling MoE to millions of experts. It replaces fixed routers with a learned index to efficiently route input data to a vast expert pool. For each input, PEER conducts a rapid initial computation to generate a shortlist of potential experts before activating the most promising ones. This process allows MoE to manage a large expert pool without sacrificing speed.
Unlike traditional MoE architectures, where experts are comparable in size to the replaced FFW layers, PEER employs compact experts featuring a single neuron in the hidden layer. This design fosters improved knowledge transfer and parameter efficiency by enabling shared hidden neurons among experts. To address the small expert size, PEER incorporates a multi-head retrieval mechanism akin to the multi-head attention used in transformer models.
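The sketch below illustrates this retrieval-style design under several simplifying assumptions: it brute-force scores every expert instead of using PEER's product-key index, it uses a much smaller expert pool, and the class name, dimensions, and hyperparameters are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedPEERLayer(nn.Module):
    """Illustrative PEER-style layer: a large pool of single-neuron experts,
    multiple retrieval heads, and top-k expert activation per head.
    (The actual PEER layer uses product-key retrieval so it never has to
    score the full expert pool; that indexing trick is omitted here.)"""

    def __init__(self, d_model=256, num_experts=65_536, num_heads=8, k=16):
        super().__init__()
        self.k, self.num_heads = k, num_heads
        # Each expert i is a single hidden neuron: an input vector and an output vector.
        self.down = nn.Embedding(num_experts, d_model)           # expert input weights u_i
        self.up = nn.Embedding(num_experts, d_model)             # expert output weights v_i
        self.keys = nn.Embedding(num_experts, d_model)           # retrieval keys
        self.queries = nn.Linear(d_model, num_heads * d_model)   # one query per head

    def forward(self, x):                                            # x: (T, d_model)
        q = self.queries(x).view(-1, self.num_heads, x.shape[-1])    # (T, H, d)
        scores = torch.einsum("thd,ed->the", q, self.keys.weight)    # score all experts
        top_s, top_i = scores.topk(self.k, dim=-1)                   # shortlist per head
        gate = F.softmax(top_s, dim=-1)                              # (T, H, k)
        down, up = self.down(top_i), self.up(top_i)                  # (T, H, k, d)
        # Single-neuron expert: activation = GELU(x . u_i), weighted by its gate.
        act = F.gelu(torch.einsum("td,thkd->thk", x, down)) * gate
        # Combine expert outputs v_i and sum across retrieval heads.
        return torch.einsum("thk,thkd->td", act, up)

layer = SimplifiedPEERLayer()
print(layer(torch.randn(4, 256)).shape)  # torch.Size([4, 256])
```

In the actual architecture, the retrieval keys are factored into product sub-keys, which lets the layer shortlist candidates from a pool of over a million experts without computing a score for each one.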
A PEER layer can be added to an existing transformer model or used to replace an FFW layer. It also connects to parameter-efficient fine-tuning (PEFT) techniques, which adapt a model to new tasks by modifying as few parameters as possible. In PEER, parameter efficiency means that only a small fraction of the MoE layer's parameters are active for any given input, which directly reduces computation and memory usage during pre-training and inference.
The PEER architecture could also be used to dynamically select PEFT adapters at runtime, allowing LLMs to pick up new knowledge and functionality on the fly.
PEER in Action
Researchers evaluated PEER's performance on several benchmarks, comparing it with transformer models that use dense FFW layers and with other MoE architectures. The findings show that PEER models achieve a better performance-compute tradeoff, reaching lower perplexity scores than their counterparts under the same computational budget.
Additionally, increasing the number of experts in a PEER model correlates with further perplexity reductions.
“This design demonstrates a superior compute-performance trade-off in our experiments, positioning it as a competitive alternative to dense FFW layers for scaling foundation models,” the researchers conclude.
This research challenges the prevailing notion that MoE models are most efficient with a limited number of experts. PEER illustrates that, with optimal retrieval and routing strategies, it is feasible to scale MoE to millions of experts, significantly reducing the cost and complexity of training and deploying massive language models.