How Smaller LLMs Can Significantly Reduce Generative AI Costs

The escalating costs associated with the large language models (LLMs) that drive generative AI are sparking considerable concern within the tech industry. However, smaller models present a promising solution. “The emergence of LLMs like GPT-4 has showcased remarkable advancements in performance, but these improvements have also led to surging costs,” said Adnan Masood, chief AI architect at UST, in a recent interview. He pointed out that LLMs, with their massive sizes and billions of parameters, demand enormous computational power. That intensity translates into substantial energy consumption, which in turn drives up operational expenses and raises environmental concerns.

“With model sizes frequently exceeding GPU memory capacities, there is an increasing reliance on specialized hardware or complex model parallelism, which compounds infrastructure costs,” Masood added. He emphasized that smaller language models, when carefully fine-tuned, can both lower costs and improve efficiency. Techniques such as model distillation and quantization help produce these compact, optimized models: distillation trains a smaller “student” model to reproduce the outputs of a larger “teacher,” while quantization reduces the numerical precision of a model's weights, yielding a model that is both smaller and faster.
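
To make those two techniques concrete, here is a minimal sketch in PyTorch: a soft-target distillation loss that pushes a student model to match a teacher's output distribution, and a naive per-tensor int8 quantization routine. The `teacher`/`student` models and the symmetric quantization scheme are illustrative assumptions, not a description of any specific production pipeline.

```python
# Illustrative sketch only: knowledge distillation loss + naive int8 quantization.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: the student matches the teacher's softened output
    distribution via KL divergence, scaled by T^2 (Hinton-style distillation)."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2

def quantize_int8(weight: torch.Tensor):
    """Naive symmetric post-training quantization of one weight tensor:
    store int8 values plus a single float scale factor per tensor."""
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

# At inference time the quantized weights are approximately recovered as
# q.float() * scale, trading a little accuracy for a roughly 4x smaller
# footprint versus float32.
```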

The reduced parameter count of smaller models translates directly into lower computational requirements, allowing faster inference and potentially shorter training times. “This compact footprint enables seamless integration within standard GPU memory, effectively eliminating the necessity for more expensive, specialized hardware setups,” he elaborated. The reduction in compute and memory usage lowers energy consumption and, with it, operational costs. Organizations also benefit from the lower per-token pricing of smaller models served through APIs, whether for early-stage proofs-of-concept and prototypes or for production workloads that need to scale. Masood cautioned, however, that relying solely on larger language models can cause costs to spike sharply when applications experience rapid growth.
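
The scaling argument is easy to see with back-of-envelope arithmetic. The sketch below compares monthly spend for a large and a small model under a simple linear cost model; the per-token prices are hypothetical placeholders, not published rates for any provider.

```python
# Hypothetical per-1K-token prices, for illustration only.
LARGE_MODEL_USD_PER_1K_TOKENS = 0.03
SMALL_MODEL_USD_PER_1K_TOKENS = 0.002

def monthly_cost(requests_per_day: int, tokens_per_request: int,
                 usd_per_1k_tokens: float) -> float:
    """Linear cost model: tokens processed per month times the per-token rate."""
    tokens_per_month = requests_per_day * 30 * tokens_per_request
    return tokens_per_month / 1000 * usd_per_1k_tokens

for daily_requests in (1_000, 100_000):
    large = monthly_cost(daily_requests, 1_000, LARGE_MODEL_USD_PER_1K_TOKENS)
    small = monthly_cost(daily_requests, 1_000, SMALL_MODEL_USD_PER_1K_TOKENS)
    print(f"{daily_requests:>7} req/day: large model ~${large:,.0f}/month, "
          f"small model ~${small:,.0f}/month")
```

At 1,000 requests a day the gap is modest in absolute terms; at 100,000 requests a day the same per-token difference separates a manageable bill from a runaway one, which is the cost spike Masood warns about.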

In addition to reducing training time and costs, smaller language models can significantly alleviate cloud infrastructure expenses, as highlighted by Matt Barrington, the Americas emerging technology leader for EY. For example, fine-tuning a domain-specific model on cloud platforms results in lower resource utilization. This shift enables companies to allocate their AI resources more effectively, focusing on areas that bring them closer to the end user. “By adopting compact language models in edge computing, businesses can decrease reliance on expensive cloud resources, leading to substantial cost savings,” he affirmed.
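
As an illustration of that edge-oriented pattern, the sketch below runs a compact open model locally with the Hugging Face transformers library instead of calling a hosted API. The model name is only an example, and a reasonably recent transformers release plus PyTorch are assumed.

```python
# Minimal sketch: local inference with a small open model (no hosted API call).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/phi-1_5",  # example ~1.3B-parameter model; swap in any
                                # small model that fits the local hardware
)

result = generator(
    "Summarize the quarterly report in one sentence:",
    max_new_tokens=64,
)
print(result[0]["generated_text"])
```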

There are already several promising examples of efficient models in deployment. Recent small models like phi-1.5 deliver performance that rivals models many times their size, according to Masood. Specialized models such as Med-PaLM 2 are crafted specifically for the healthcare sector, while Sec-PaLM is designed for security applications. Open models like Llama 2 70B are also emerging as cost-effective alternatives, priced well below proprietary competitors such as Google's PaLM 2. Notably, Meta's 13-billion-parameter LLaMA has outperformed the much larger GPT-3 on several benchmarks.

Initiatives like the BabyLM Challenge at Johns Hopkins University aim to make smaller models competitive with LLMs. Amazon, meanwhile, offers a marketplace of compact models that companies can tailor to their own data. Providers such as Anyscale and MosaicML also serve the 70-billion-parameter Llama 2 at low per-token rates, underscoring a growing shift toward effective, budget-friendly options.

As large language model costs continue to surge, the search for economically viable alternatives is becoming more urgent. Training these models is expensive, driven in large part by the cost of GPUs such as Nvidia’s H100, which can run more than $30,000 each. “There’s a waitlist for such GPUs, with some venture capitalists even using them to attract startups for funding,” noted Muddu Sudhakar, CEO of Aisera.

Even after acquiring GPUs, companies need to generate meaningful revenue to offset their high costs, Sudhakar pointed out. He referenced a recent blog post from the venture capital firm Sequoia highlighting a significant monetization gap that could hinder the growth of the generative AI market. “Once the GPU is secured, companies face the challenge of recruiting data scientists, whose compensation packages can be substantial,” he explained. “Moreover, operationalizing LLMs is costly due to the ongoing demands of processing interactions, managing and upgrading models, and addressing various security issues.”

Looking ahead, Masood envisions fine-tuned smaller models reaching performance levels akin to their much larger counterparts at a fraction of the cost. The open-source community is already tackling practical limitations with innovations like LongLoRA, which significantly extends context windows. “If current trends are any indication, we may soon witness a synthesis of open-source models and smaller LLMs, forming the foundation of the next-generation language modeling ecosystem,” he concluded.
