Researchers at ETH Zurich have developed a technique that substantially improves the efficiency of neural networks. By modifying how the networks are evaluated at inference time, they have significantly reduced their computational demands.
In their experiments with BERT, a widely used transformer model for language tasks, the researchers cut the required computation by more than 99%. The method also applies to the transformer architectures behind large language models (LLMs) such as GPT-3, paving the way for faster and more efficient language processing.
Understanding Fast Feedforward Networks
Transformers, the backbone of LLMs, consist of multiple layers, including attention and feedforward layers. The feedforward layers hold a large share of the model’s parameters and are computationally intensive because, for each input, every neuron in the layer must be multiplied against every input dimension.
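To make that cost concrete, here is a minimal PyTorch sketch of a conventional dense feedforward block with BERT-base-like widths (the class name and dimensions are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class DenseFeedForward(nn.Module):
    """Conventional transformer feedforward block: every neuron is evaluated for every input."""
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)    # d_model x d_ff multiply-adds per token
        self.act = nn.GELU()
        self.down = nn.Linear(d_ff, d_model)  # another d_ff x d_model multiply-adds

    def forward(self, x):
        # All d_ff intermediate neurons fire regardless of the input.
        return self.down(self.act(self.up(x)))
```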
The researchers found that not all neurons in the feedforward layers need to be activated for every input during inference. They introduced “fast feedforward” layers (FFF) to replace conventional feedforward layers.
FFF employs conditional matrix multiplication (CMM), a mathematical operation that replaces the dense matrix multiplications (DMM) of traditional feedforward networks. Whereas DMM multiplies every input dimension by the weights of every neuron, CMM uses only a subset of neurons for each input, streamlining the processing and reducing the computational burden.
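The contrast can be pictured with a toy PyTorch comparison. The selection rule below is a stand-in; in the actual method the subset of neurons comes from a learned routing structure:

```python
import torch

def dense_mm(x, W):
    # DMM: every input dimension meets every one of the d_out neurons.
    return x @ W                       # cost ~ d_in * d_out per token

def conditional_mm(x, W, selected):
    # CMM (illustrative): only the neurons indexed by `selected` are computed.
    # How `selected` is chosen is the heart of the method and is not shown here.
    return x @ W[:, selected]          # cost ~ d_in * len(selected) per token

x = torch.randn(768)                   # one token's hidden state (illustrative size)
W = torch.randn(768, 3072)             # weights of a 3072-neuron feedforward layer
y_dense = dense_mm(x, W)               # touches all 3072 neurons
y_cond = conditional_mm(x, W, torch.tensor([5, 99, 1024]))  # touches only 3
```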
FastBERT: A Game-Changer in Language Processing
To test their technique, the researchers developed FastBERT, a modified version of Google’s BERT model. FastBERT replaces the standard feedforward layers with fast feedforward layers that organize their neurons into a balanced binary tree and execute only one branch of the tree for a given input.
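A simplified sketch of tree-routed inference might look like the following, assuming hard left/right decisions from a learned projection at each node and a tiny feedforward block at each leaf; this illustrates the idea rather than reproducing the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TreeRoutedFFN(nn.Module):
    """Toy fast-feedforward-style layer: a balanced binary tree of routing nodes
    whose leaves each hold a tiny feedforward block. Only one root-to-leaf path,
    and hence one leaf block, is evaluated per input at inference time."""
    def __init__(self, d_model=768, depth=3, leaf_width=1):
        super().__init__()
        self.depth = depth
        # One routing projection per internal node (2**depth - 1 nodes).
        self.node_w = nn.Parameter(torch.randn(2**depth - 1, d_model) * 0.02)
        # One tiny up/down projection pair per leaf (2**depth leaves).
        self.leaf_up = nn.Parameter(torch.randn(2**depth, leaf_width, d_model) * 0.02)
        self.leaf_down = nn.Parameter(torch.randn(2**depth, d_model, leaf_width) * 0.02)

    def forward(self, x):            # x: (d_model,), a single token for clarity
        node = 0
        for _ in range(self.depth):  # descend the tree: one dot product per level
            go_right = (x @ self.node_w[node]) > 0
            node = 2 * node + (2 if go_right else 1)
        leaf = node - (2**self.depth - 1)          # index of the chosen leaf
        h = F.gelu(self.leaf_up[leaf] @ x)         # only this leaf's neurons run
        return self.leaf_down[leaf] @ h
```

With depth 3, a single input triggers three routing dot products and one leaf block, even though the layer as a whole contains eight leaf blocks; deeper trees keep the per-input cost logarithmic in the total neuron count.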
To assess FastBERT's capabilities, the team fine-tuned various models on the General Language Understanding Evaluation (GLUE) benchmark—a suite designed to evaluate natural language understanding systems.
The results were striking: FastBERT performed on par with base BERT models of comparable size and training. Variants fine-tuned for just one day on a single A6000 GPU retained at least 96.0% of BERT's performance. Notably, the best variant matched BERT's performance while using only 0.3% of its neurons.
The researchers argue that integrating fast feedforward networks into LLMs could deliver substantial speed gains. In GPT-3, for example, the feedforward network in each transformer layer contains 49,152 neurons; with FFF, inference would need to use only 16 of them, roughly 0.03% of the layer’s neurons.
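One way to see where the 16 comes from: a balanced binary tree over roughly 49,000 neurons is about 16 levels deep, so a single root-to-leaf pass touches on the order of 16 neurons. A quick back-of-the-envelope check (ours, not code from the paper):

```python
import math

neurons_per_layer = 49_152                       # feedforward neurons per GPT-3 layer
depth = math.ceil(math.log2(neurons_per_layer))  # levels in a balanced binary tree
print(depth)                                     # 16 -> one neuron per level on the chosen path
print(f"{16 / neurons_per_layer:.4%}")           # 0.0326%, i.e. roughly 0.03%
```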
Addressing Optimization Challenges
While dense matrix multiplication has seen substantial optimization over the years, the same cannot be said for conditional matrix multiplication. The researchers noted, “Dense matrix multiplication is the most optimized mathematical operation in computing history.” Current deep learning frameworks offer limited support for CMM, predominantly through high-level simulations.
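A high-level simulation of CMM in today's frameworks might look roughly like this PyTorch sketch, which gathers the rows of the selected neurons and then falls back on a small dense multiplication (the layout and names are ours and purely illustrative):

```python
import torch

def simulated_cmm(x, W, b, selected):
    """Simulate CMM with gather + small dense matmul.
    W holds one row of weights per neuron; only the rows in `selected` are used.
    A true low-level CMM kernel would avoid the gather step entirely."""
    W_sel = W.index_select(0, selected)   # (k, d_model) weights of the chosen neurons
    b_sel = b.index_select(0, selected)   # (k,) their biases
    return W_sel @ x + b_sel              # k rows of multiply-adds instead of all of them

x = torch.randn(768)                      # one token's hidden state (illustrative size)
W = torch.randn(4096, 768)                # a 4096-neuron layer
b = torch.zeros(4096)
y = simulated_cmm(x, W, b, torch.tensor([3, 41, 977]))  # evaluates 3 of 4096 neurons
```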
To advance this research, the team developed their own implementation of CMM operations, which yielded a 78x speedup during inference. They believe that with better hardware and low-level algorithm implementations, the speedup could exceed 300x. That would go a long way toward one of the pressing challenges in language models: generating tokens more rapidly.
Conclusion
The theoretical speedup of 341x for BERT-base models highlights the potential of the work. The researchers hope to inspire further development of conditional neural execution primitives within device programming interfaces. This research is an important step toward addressing the memory and computational limitations of large language models, fostering the development of more efficient and robust AI systems.