Large language models like ChatGPT and Llama 2 are well-known for their extensive memory and computational requirements, which make them expensive to operate. Reducing even a small portion of their size can lead to significant cost savings.
To tackle this challenge, researchers at ETH Zurich have introduced a simplified version of the transformer block, the basic building unit of the deep learning architecture behind language models. The new design significantly shrinks the block while preserving accuracy and improving inference speed, showcasing a promising approach for creating more efficient language models.
Understanding Transformer Blocks
Language models rely on transformer blocks, which are uniform units designed to process sequential data, such as text passages.
A classic transformer block comprises two key components: the attention mechanism and the multi-layer perceptron (MLP). The attention mechanism selectively highlights parts of the input data (like words in a sentence), capturing their context and significance in relation to one another. This capability allows the model to understand word relationships, even when they are distant in the text.
Following the attention mechanism, the MLP—a smaller neural network—further refines the highlighted information, transforming it into a more sophisticated representation that captures complex relationships.
Additional components like residual connections and normalization layers enhance learning and address common challenges in deep neural networks. As these transformer blocks are stacked to form a language model, their ability to recognize complex relationships grows, enabling the advanced tasks performed by modern language models. Despite their revolutionary impact, the basic design of the transformer block has remained largely unchanged since its inception.
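To make that structure concrete, here is a minimal sketch of a conventional pre-LayerNorm transformer block in PyTorch. The dimensions, module names, and activation choice are illustrative defaults, not details taken from any particular model or from the paper.

```python
import torch
import torch.nn as nn

class ClassicTransformerBlock(nn.Module):
    """Conventional transformer block: attention, then MLP, each wrapped with
    normalization and a residual (skip) connection."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Attention sub-layer: normalize, attend, add back via a residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # MLP sub-layer: normalize, transform, add back via a second residual connection.
        x = x + self.mlp(self.norm2(x))
        return x

# Example: a batch of 4 sequences, each with 16 token embeddings of size 512.
block = ClassicTransformerBlock()
out = block(torch.randn(4, 16, 512))
```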
Enhancing Transformer Efficiency
According to the ETH Zurich researchers, “Given the exorbitant cost of training and deploying large transformer models nowadays, any efficiency gains in the training and inference pipelines for the transformer architecture represent significant potential savings.” They argue that simplifying the transformer block by removing non-essential components minimizes the parameter count and boosts model throughput.
Their experiments reveal that streamlining the transformer block compromises neither training speed nor performance. Traditional transformer blocks use multiple attention heads, each with its own key (K), query (Q), and value (V) projections, which together determine how input tokens attend to and are combined with one another. The researchers found that eliminating the V projection and the associated output projection layer did not diminish effectiveness.
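As a rough illustration of what attention looks like once the value and output projections are removed, the single-head sketch below applies the learned attention weights directly to the input, in effect treating the value projection as the identity. The class name and layout are hypothetical; the paper's multi-head formulation and exact parameterization differ.

```python
import math
import torch.nn as nn
import torch.nn.functional as F

class IdentityValueAttention(nn.Module):
    """Single-head attention with the value and output projections removed.
    Queries and keys are still learned, but the attention weights mix the
    input tokens directly rather than a projected value tensor. Illustrative
    sketch only."""
    def __init__(self, d_model=512):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        # No value projection and no output projection: that is where the
        # parameter savings come from.

    def forward(self, x):
        q, k = self.q_proj(x), self.k_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        weights = F.softmax(scores, dim=-1)
        # Apply the attention weights to the raw input (V = identity).
        return weights @ x
```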
Additionally, they removed the skip connections that transformer blocks normally rely on to mitigate the “vanishing gradients” problem, in which gradients shrink as they propagate through many layers and make deep networks hard to train.
New Transformer Block Design
The redesigned transformer block processes attention heads and the MLP concurrently, departing from traditional sequential processing. To counterbalance the reduction in parameters, researchers adjusted other non-learnable parameters, refined their training methods, and made architectural tweaks. These innovations collectively preserve the model's learning capabilities despite its leaner framework.
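The sketch below illustrates that parallel layout, reusing the IdentityValueAttention class from the earlier example. It captures only the structural idea of the redesign (shared normalization, attention and MLP branches computed side by side, no residual stream carrying the input forward); the compensating initialization and training adjustments the researchers describe are not modeled here.

```python
import torch.nn as nn

class SimplifiedParallelBlock(nn.Module):
    """Sketch of a simplified block: attention and the MLP read the same
    normalized input in parallel and their outputs are summed, with no skip
    connection. The paper's full recipe also depends on careful initialization
    and other adjustments omitted here."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = IdentityValueAttention(d_model)  # from the earlier sketch
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        h = self.norm(x)
        # Both branches see the same input rather than running one after the other.
        return self.attn(h) + self.mlp(h)
```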
Testing the Improved Transformer Block
The ETH Zurich team assessed their compact transformer block across various language model depths. They achieved a remarkable reduction in the conventional transformer's size by approximately 16% without sacrificing accuracy, while also securing faster inference times. For instance, applying this architecture to a large model like GPT-3, with 175 billion parameters, could save around 50 GB of memory.
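A back-of-the-envelope check of that memory figure, assuming the full ~16% reduction carries over to GPT-3's parameter count and that weights are stored in 16-bit precision (both assumptions for illustration, not details from the paper):

```python
# Rough estimate of memory saved by a ~16% parameter reduction on a GPT-3-scale model.
gpt3_params = 175e9      # GPT-3 parameter count
reduction = 0.16         # ~16% fewer parameters
bytes_per_param = 2      # 16-bit (fp16/bf16) storage

saved_bytes = gpt3_params * reduction * bytes_per_param
print(f"~{saved_bytes / 1e9:.0f} GB saved")  # roughly 56 GB, in the same ballpark as the ~50 GB cited
```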
“Our simplified models not only train faster but also better utilize the additional capacity provided by greater depth,” the researchers noted. While this technique has shown effectiveness on a smaller scale, its application to larger models remains to be explored. The potential for further enhancements, such as customizing AI processors for this streamlined architecture, could significantly amplify its impact.
The researchers conclude, “We believe our work can lead to simpler architectures being adopted in practice, bridging the gap between theory and application in deep learning, and reducing the costs associated with large transformer models.”