A recent study by researchers at Tsinghua University highlights how rearranging computations and hardware configurations for large language models (LLMs) can significantly lower inference costs. They introduce a technique called “attention offloading,” which utilizes cost-effective GPUs for memory-intensive tasks, allowing high-performance accelerators to focus on compute-heavy operations.
With high-end AI accelerators being expensive, scarce, and in high demand, attention offloading presents an opportunity for companies to optimize their hardware resources when deploying LLMs at scale.
Two Types of Computations
LLM inference involves a mix of operations that must be organized carefully to make the most of available memory and processing capability. These operations fall into two main categories: compute-bound and memory-bound. Compute-bound operations benefit from fast accelerators such as the A100 and H100. Memory-bound operations, most notably the self-attention computation performed for every newly generated token, must read the growing key-value (KV) cache and therefore depend primarily on memory capacity and bandwidth rather than raw compute.
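To make the distinction concrete, here is a rough PyTorch sketch of the two kinds of work inside a single decode step (our illustration, not code from the paper); the batch size, head count, and dimensions are arbitrary stand-ins:

```python
# Illustrative only: the two kinds of work inside one decode step.
# All shapes below are hypothetical stand-ins, not values from the paper.
import torch

batch, n_heads, head_dim, seq_len = 8, 32, 128, 4096
d_model = n_heads * head_dim

# Compute-bound: dense projections reuse the same weights for every request,
# so arithmetic intensity is high and fast tensor cores pay off.
W_qkv = torch.randn(d_model, 3 * d_model)
x = torch.randn(batch, d_model)              # one new token per request
qkv = x @ W_qkv                              # FLOP-heavy matmul

# Memory-bound: attention for each new token must stream the whole KV cache
# from memory while doing relatively little arithmetic per byte read.
q = torch.randn(batch, n_heads, 1, head_dim)
k_cache = torch.randn(batch, n_heads, seq_len, head_dim)
v_cache = torch.randn(batch, n_heads, seq_len, head_dim)
scores = (q @ k_cache.transpose(-1, -2)) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v_cache
```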
The researchers note, “This memory-bound workload conflicts with the strengths of modern accelerators, leading to overwhelmed memory controllers while computational cores stay idle.” This resource imbalance worsens with increasing sequence lengths, such as during extended user prompts or conversations with the model.
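A quick back-of-envelope calculation shows why. Assuming roughly 13B-class model dimensions (our figures, not the paper's), the KV cache per sequence grows linearly with context length:

```python
# Rough arithmetic with assumed 13B-class dimensions (40 layers, 40 heads,
# head size 128, fp16); these are our figures, not the paper's.
n_layers, n_heads, head_dim, bytes_per_elem = 40, 40, 128, 2
kv_bytes_per_token = 2 * n_layers * n_heads * head_dim * bytes_per_elem  # keys + values

for seq_len in (1_024, 8_192, 32_768):
    gib = seq_len * kv_bytes_per_token / 2**30
    print(f"{seq_len:>6} tokens -> ~{gib:.1f} GiB of KV cache per sequence")
```

At batch sizes large enough to keep a compute-optimized accelerator busy, this cache quickly dwarfs the model weights themselves.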
The Innovative Solution: Attention Offloading
Current approaches typically focus on scaling uniform architectures of high-end accelerators for inference. Companies often invest heavily in H100 processors to expand their inference capabilities, resulting in inflated costs and less-than-optimal hardware use.
The researchers argue, “The unique demands of the LLM generation phase necessitate a heterogeneous architecture for improved efficiency and reduced costs.”
Their study suggests that different types of accelerators are suited to specific facets of LLM inference. For instance, consumer-grade GPUs are economical options for memory-bound tasks, offering three times the memory capacity and bandwidth per dollar compared to high-end models. However, exclusively relying on these lower-cost options can be inefficient due to their limited compute power.
Attention computations, however, are highly parallelizable and can be distributed across multiple budget-friendly, memory-efficient GPUs.
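A minimal sketch of why this works (our illustration, not Lamina's code): attention heads are independent of one another, so they can be split across several inexpensive workers and the partial results simply concatenated:

```python
# Heads are independent, so attention can be partitioned across several
# workers (e.g. cheap GPUs) and the outputs concatenated.
import torch

def attend(q, k, v):
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

batch, n_heads, head_dim, seq_len = 4, 32, 128, 2048
q = torch.randn(batch, n_heads, 1, head_dim)
k = torch.randn(batch, n_heads, seq_len, head_dim)
v = torch.randn(batch, n_heads, seq_len, head_dim)

n_workers = 4  # e.g. four consumer GPUs; each chunk would live on its own device
chunks = zip(q.chunk(n_workers, dim=1), k.chunk(n_workers, dim=1), v.chunk(n_workers, dim=1))
partials = [attend(qc, kc, vc) for qc, kc, vc in chunks]
out = torch.cat(partials, dim=1)

assert torch.allclose(out, attend(q, k, v), atol=1e-5)  # matches the unsplit version
```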
Implementing a Heterogeneous Architecture
The attention offloading technique involves creating two distinct pools of accelerators: one focused on computational capabilities and the other optimized for memory bandwidth. This way, attention tasks are handled by lower-cost GPUs while high-end accelerators manage other operations.
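Here is a hedged sketch of how such a split might look for one decode step. The device assignments and dimensions are placeholders of our own, not Lamina's implementation:

```python
# Placeholder devices: with two GPUs in one box this might be "cuda:0" (high-end)
# and "cuda:1" (consumer); here we fall back to CPU so the sketch still runs.
import torch

compute_dev = torch.device("cuda:0" if torch.cuda.device_count() > 0 else "cpu")
memory_dev = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

batch, n_heads, head_dim, seq_len = 4, 32, 128, 2048
d_model = n_heads * head_dim

# Model weights live with the compute pool; the KV cache lives with the memory pool.
W_qkv = torch.randn(d_model, 3 * d_model, device=compute_dev)
W_out = torch.randn(d_model, d_model, device=compute_dev)
k_cache = torch.randn(batch, n_heads, seq_len, head_dim, device=memory_dev)
v_cache = torch.randn(batch, n_heads, seq_len, head_dim, device=memory_dev)

def decode_step(x):
    # 1) Compute-bound projections run on the high-end accelerator.
    q, k, v = (x @ W_qkv).chunk(3, dim=-1)
    # 2) Only the new token's query crosses to the memory pool (a small transfer);
    #    appending k and v to the cache is omitted for brevity.
    q = q.view(batch, n_heads, 1, head_dim).to(memory_dev)
    scores = (q @ k_cache.transpose(-1, -2)) / head_dim ** 0.5
    attn = torch.softmax(scores, dim=-1) @ v_cache
    # 3) The small attention output comes back for the output projection.
    return attn.reshape(batch, d_model).to(compute_dev) @ W_out

out = decode_step(torch.randn(batch, d_model, device=compute_dev))
```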
The researchers explain, “This heterogeneous architecture allows for a service system that efficiently combines computational power, memory capacity, and bandwidth to enhance LLM inference without excessive costs.”
This strategic alignment of hardware strengths with operational requirements enables companies to maximize their budgets by investing in a balanced mix of memory and compute-optimized accelerators.
Addressing Architectural Challenges
The study also examines the challenges of this heterogeneous architecture, particularly the bandwidth needed to connect the two accelerator pools. The findings indicate that standard system buses like PCIe 4.0 suffice, and that networking technologies such as 200Gb InfiniBand and Ethernet, already common in AI data centers, are also adequate.
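A rough estimate of the traffic involved (our assumptions, not the paper's measurements) helps explain why: only small per-token activations cross the link, never the KV cache itself.

```python
# Our assumptions, not the paper's measurements: roughly 13B-class hidden size,
# fp16 activations, and an aggressive batch size and decode rate.
hidden, n_layers, bytes_per_elem = 5120, 40, 2
batch_tokens = 256        # tokens decoded per step across the whole batch
round_trips = 2           # activations go to the memory pool and come back
steps_per_sec = 50        # assumed decode iterations per second

bytes_per_step = batch_tokens * n_layers * hidden * bytes_per_elem * round_trips
gbps_needed = bytes_per_step * steps_per_sec * 8 / 1e9
print(f"~{gbps_needed:.0f} Gb/s")   # ~84 Gb/s: fits within a 200Gb link or PCIe 4.0 x16
```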
Careful scheduling and pipelining help hide the latency introduced by the non-uniform architecture, keeping the memory and compute pools busy in parallel rather than waiting on each other's sequential steps.
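Conceptually, the pipelining looks something like the following sketch (our simplification, not Lamina's scheduler): while the memory pool runs attention for one micro-batch, the compute pool already starts on the next.

```python
# Our simplification, not Lamina's scheduler: submit attention for one
# micro-batch to the memory pool, then immediately start the compute-bound
# work for the next micro-batch instead of waiting.
from concurrent.futures import ThreadPoolExecutor

def pipeline(micro_batches, run_compute, run_attention):
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as memory_pool:
        for mb in micro_batches:
            staged = run_compute(mb)                  # compute pool works now
            if pending is not None:
                results.append(pending.result())      # previous attention finishes
            pending = memory_pool.submit(run_attention, staged)
        if pending is not None:
            results.append(pending.result())
    return results

# Stand-in functions just to show the interleaving:
print(pipeline(range(4), run_compute=lambda x: x * 10, run_attention=lambda x: x + 1))
# -> [1, 11, 21, 31]
```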
Introducing Lamina
The researchers developed Lamina, a distributed heterogeneous LLM inference system that employs attention offloading. Lamina uses consumer GPUs to store the key-value (KV) cache and perform attention operations, while high-end accelerators hold the model parameters and handle the remaining inference work. These devices can sit in the same physical machine or be spread across multiple nodes.
By offloading KV cache storage and attention computations to memory-efficient GPUs, Lamina can handle batches 10.7–64 times larger than those managed by vLLM, a widely used LLM serving platform. This is crucial for making optimal use of costly compute-optimized accelerators, especially in large-scale LLM deployments.
Experimental evaluations show that Lamina achieves 1.48 to 12.1 times higher throughput per unit cost than existing solutions on 13B and 33B models.
As LLMs become mainstream, companies will need innovative strategies for cost-effective inference and reduced capital outlays on accelerators—an objective that attention offloading successfully addresses. Although the researchers have not yet released the code for Lamina, the fundamentals are clearly outlined, making it likely to attract swift implementation by the open-source community.