Researchers at ETH Zurich have developed a technique that substantially improves the efficiency of neural networks. By modifying how the networks are evaluated at inference time, they have significantly reduced their computational demands.
In their experiments with BERT, a widely used transformer model for language tasks, the researchers cut the required computation by more than 99%. The method also applies to the transformer architectures behind large language models (LLMs) such as GPT-3, paving the way for faster and more efficient language processing.
Understanding Fast Feedforward Networks
Transformers, the backbone of LLMs, consist of multiple layers, including attention and feedforward layers. The feedforward layers hold a large share of the model’s parameters and are computationally intensive because, for each input, every neuron in the layer must be multiplied against every input dimension.
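To make that cost concrete, here is a minimal PyTorch sketch of a conventional dense feedforward block with BERT-base-like widths (the class name and dimensions are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class DenseFeedForward(nn.Module):
    """Conventional transformer feedforward block: every neuron is evaluated for every input."""
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)    # d_model x d_ff multiply-adds per token
        self.act = nn.GELU()
        self.down = nn.Linear(d_ff, d_model)  # another d_ff x d_model multiply-adds

    def forward(self, x):
        # All d_ff intermediate neurons fire regardless of the input.
        return self.down(self.act(self.up(x)))
```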
The researchers found that not all neurons in the feedforward layers need to be activated for every input during inference. They introduced “fast feedforward” layers (FFF) to replace conventional feedforward layers.
FFF employs conditional matrix multiplication (CMM), a mathematical operation that replaces the dense matrix multiplications (DMM) of traditional feedforward networks. Whereas DMM multiplies every input dimension by the weights of every neuron, CMM uses only a subset of neurons for each input, streamlining the processing and reducing the computational burden.
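The contrast can be pictured with a toy PyTorch comparison. The selection rule below is a stand-in; in the actual method the subset of neurons comes from a learned routing structure:

```python
import torch

def dense_mm(x, W):
    # DMM: every input dimension meets every one of the d_out neurons.
    return x @ W                       # cost ~ d_in * d_out per token

def conditional_mm(x, W, selected):
    # CMM (illustrative): only the neurons indexed by `selected` are computed.
    # How `selected` is chosen is the heart of the method and is not shown here.
    return x @ W[:, selected]          # cost ~ d_in * len(selected) per token

x = torch.randn(768)                   # one token's hidden state (illustrative size)
W = torch.randn(768, 3072)             # weights of a 3072-neuron feedforward layer
y_dense = dense_mm(x, W)               # touches all 3072 neurons
y_cond = conditional_mm(x, W, torch.tensor([5, 99, 1024]))  # touches only 3
```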
FastBERT: A Game-Changer in Language Processing
To test their technique, the researchers developed FastBERT, a modified version of Google’s BERT model. FastBERT replaces the standard feedforward layers with fast feedforward layers that organize their neurons into a balanced binary tree and execute only one branch of the tree for a given input.
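A simplified sketch of tree-routed inference might look like the following, assuming hard left/right decisions from a learned projection at each node and a tiny feedforward block at each leaf; this illustrates the idea rather than reproducing the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TreeRoutedFFN(nn.Module):
    """Toy fast-feedforward-style layer: a balanced binary tree of routing nodes
    whose leaves each hold a tiny feedforward block. Only one root-to-leaf path,
    and hence one leaf block, is evaluated per input at inference time."""
    def __init__(self, d_model=768, depth=3, leaf_width=1):
        super().__init__()
        self.depth = depth
        # One routing projection per internal node (2**depth - 1 nodes).
        self.node_w = nn.Parameter(torch.randn(2**depth - 1, d_model) * 0.02)
        # One tiny up/down projection pair per leaf (2**depth leaves).
        self.leaf_up = nn.Parameter(torch.randn(2**depth, leaf_width, d_model) * 0.02)
        self.leaf_down = nn.Parameter(torch.randn(2**depth, d_model, leaf_width) * 0.02)

    def forward(self, x):            # x: (d_model,), a single token for clarity
        node = 0
        for _ in range(self.depth):  # descend the tree: one dot product per level
            go_right = (x @ self.node_w[node]) > 0
            node = 2 * node + (2 if go_right else 1)
        leaf = node - (2**self.depth - 1)          # index of the chosen leaf
        h = F.gelu(self.leaf_up[leaf] @ x)         # only this leaf's neurons run
        return self.leaf_down[leaf] @ h
```

With depth 3, a single input triggers three routing dot products and one leaf block, even though the layer as a whole contains eight leaf blocks; deeper trees keep the per-input cost logarithmic in the total neuron count.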
To assess FastBERT's capabilities, the team fine-tuned various models on the General Language Understanding Evaluation (GLUE) benchmark—a suite designed to evaluate natural language understanding systems.
The results were striking: FastBERT performed on par with base BERT models of comparable size and training. Variants fine-tuned for just one day on a single A6000 GPU retained at least 96.0% of BERT's performance. Notably, the best variant matched BERT's performance while using only 0.3% of its neurons.
The researchers argue that integrating fast feedforward networks into LLMs could deliver substantial speed gains. In GPT-3, for example, the feedforward network in each transformer layer contains 49,152 neurons; with FFF, inference would need to use only 16 of them, roughly 0.03% of the layer’s neurons.
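One way to see where the 16 comes from: a balanced binary tree over roughly 49,000 neurons is about 16 levels deep, so a single root-to-leaf pass touches on the order of 16 neurons. A quick back-of-the-envelope check (ours, not code from the paper):

```python
import math

neurons_per_layer = 49_152                       # feedforward neurons per GPT-3 layer
depth = math.ceil(math.log2(neurons_per_layer))  # levels in a balanced binary tree
print(depth)                                     # 16 -> one neuron per level on the chosen path
print(f"{16 / neurons_per_layer:.4%}")           # 0.0326%, i.e. roughly 0.03%
```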
Addressing Optimization Challenges
While dense matrix multiplication has seen substantial optimization over the years, the same cannot be said for conditional matrix multiplication. The researchers noted, “Dense matrix multiplication is the most optimized mathematical operation in computing history.” Current deep learning frameworks offer limited support for CMM, predominantly through high-level simulations.
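A high-level simulation of CMM in today's frameworks might look roughly like this PyTorch sketch, which gathers the rows of the selected neurons and then falls back on a small dense multiplication (the layout and names are ours and purely illustrative):

```python
import torch

def simulated_cmm(x, W, b, selected):
    """Simulate CMM with gather + small dense matmul.
    W holds one row of weights per neuron; only the rows in `selected` are used.
    A true low-level CMM kernel would avoid the gather step entirely."""
    W_sel = W.index_select(0, selected)   # (k, d_model) weights of the chosen neurons
    b_sel = b.index_select(0, selected)   # (k,) their biases
    return W_sel @ x + b_sel              # k rows of multiply-adds instead of all of them

x = torch.randn(768)                      # one token's hidden state (illustrative size)
W = torch.randn(4096, 768)                # a 4096-neuron layer
b = torch.zeros(4096)
y = simulated_cmm(x, W, b, torch.tensor([3, 41, 977]))  # evaluates 3 of 4096 neurons
```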
To advance this research, the team developed their own implementation of CMM operations, which yielded a 78x speedup during inference. They believe that with better hardware and low-level algorithm implementations, the speedup could exceed 300x. That would go a long way toward one of the pressing challenges in language models: generating tokens more rapidly.
Conclusion
The theoretical speedup of 341x for BERT-base models highlights the potential of the work. The researchers hope to inspire further development of conditional neural execution primitives within device programming interfaces. This research is an important step toward addressing the memory and computational limitations of large language models, fostering the development of more efficient and robust AI systems.