You’ve proudly positioned your services as “AI-powered” through the integration of large language models (LLMs). Your website's homepage showcases the transformative impact of your AI-driven solutions with interactive demos and case studies, marking your entry into the global generative AI landscape.
Your small yet dedicated user base appreciates the enhanced customer experience, and growth opportunities are emerging. However, just three weeks into the month, an email from OpenAI takes you by surprise:
“Are you ready for AI agents?”
A week prior, you were conversing with customers and evaluating product-market fit; now, a sudden surge in traffic crashes your AI-powered services.
This spike not only frustrates existing users but also deters new ones. A quick fix could involve increasing your usage limit, but this leaves you uneasy about relying on a single provider and losing control over your AI costs.
You ponder, “Should I self-host?”
Fortunately, open-source LLMs are readily available on platforms like Hugging Face, presenting the option for self-hosting. However, many of the leading models come with billions of parameters and require significant resources to scale, especially for low-latency applications.
While you're confident in your team's ability to build the necessary infrastructure, the potential costs of such a transition are daunting:
- Fine-tuning costs
- Hosting expenses
- Serving costs
So, the pressing question remains: Should you increase the usage limit or pursue self-hosting?
Evaluating LLaMA 2
Take your time; this is a significant decision.
Consulting with your machine learning engineers, you discover LLaMA 2, an open-source LLM that performs comparably to GPT-3, your current model. It comes in three sizes: 7 billion, 13 billion, and 70 billion parameters. You opt for the largest to remain competitive.
LLaMA 2 is trained in bfloat16 format, which takes 2 bytes per parameter, so the 70-billion-parameter model comes to 140 GB of weights.
Worried about the complexity of fine-tuning a model this size? Don’t be. With LoRA, you may only need to fine-tune approximately 0.1% of the parameters—around 70 million—consuming just 0.14 GB.
To cover memory overhead during fine-tuning (gradients, optimizer state, and activations from backpropagation), budget about five times the memory of the trainable parameters:
- Fixed LLaMA 2 model weights: 140 GB (no memory overhead)
- LoRA fine-tuning weights: 0.14 GB * 5 = 0.7 GB
This brings the total to approximately 141 GB during fine-tuning.
If you lack training infrastructure, consider using AWS. The on-demand pricing averages ~$2.80 per hour for compute, totaling ~$67 per day for fine-tuning—an affordable cost, especially since fine-tuning won't take long.
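Here's a minimal Python sketch of the arithmetic above; the 0.1% LoRA fraction, the 5x overhead multiplier, and the ~$2.80/hour rate are this walkthrough's own assumptions:

```python
# Back-of-envelope fine-tuning budget for LLaMA 2 70B.
GB = 1e9  # decimal gigabytes, matching the round numbers above

params = 70e9
bytes_per_param = 2  # bfloat16

base_weights_gb = params * bytes_per_param / GB       # 140 GB, frozen
lora_params = params * 0.001                          # ~70M trainable (0.1%)
lora_weights_gb = lora_params * bytes_per_param / GB  # 0.14 GB
lora_overhead_gb = lora_weights_gb * 5                # 0.7 GB for gradients,
                                                      # optimizer state, activations

total_finetune_gb = base_weights_gb + lora_overhead_gb  # ~140.7 GB
daily_compute_cost = 2.80 * 24                          # ~$67/day on-demand

print(f"Fine-tuning memory: ~{total_finetune_gb:.1f} GB")
print(f"Fine-tuning compute: ~${daily_compute_cost:.0f}/day")
```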
Understanding Serving Costs
In deployment, you must maintain two sets of weights in memory:
- Model weights: 140 GB
- LoRA fine-tuning weights: 0.14 GB
Totaling approximately 140.14 GB.
Inference requires no gradient computation, yet it's prudent to provision about 1.5 times the resident weights for unforeseen overhead (about 210 GB).
On AWS, GPU computing costs around $3.70 per hour—or ~$90 per day—resulting in a monthly expenditure of roughly $2,700.
Additionally, plan for contingencies. To prevent service interruptions, consider maintaining a redundant model, increasing costs to about $180 per day or $5,400 per month—nearly matching your current OpenAI expenses.
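The serving math follows the same pattern. Here's a sketch using the figures above; the 1.5x headroom factor and the two-replica redundancy setup are this walkthrough's assumptions, which you'd tune to your own traffic:

```python
# Serving-side memory and cost estimate.
base_weights_gb = 140.0
lora_weights_gb = 0.14

resident_gb = base_weights_gb + lora_weights_gb  # ~140.14 GB in memory
provisioned_gb = resident_gb * 1.5               # ~210 GB with headroom

hourly_rate = 3.70                               # per GPU node, on-demand
replicas = 2                                     # one live, one redundant

daily_cost = hourly_rate * 24 * replicas         # ~$178/day (~$180 rounded)
monthly_cost = daily_cost * 30                   # ~$5,300/month (~$5,400 rounded)

print(f"Provisioned memory: ~{provisioned_gb:.0f} GB per replica")
print(f"Serving cost: ~${daily_cost:.0f}/day, ~${monthly_cost:,.0f}/month")
```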
Analyzing Cost Break-Even Points
How much data could OpenAI process each day for the same money you'd spend fine-tuning LLaMA 2 yourself? Here's the comparison:
Fine-tuning GPT-3.5 Turbo costs $0.008 per 1K tokens. Assuming two tokens per word, matching the open-source model's fine-tuning budget ($67/day) means processing roughly 4.2 million words daily, about 14,000 pages at roughly 300 words per page.
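You can verify the break-even volume directly; the words-per-page conversion below is an assumption added here for the page count:

```python
# How much data could OpenAI fine-tune for the same $67/day?
daily_budget = 67.0            # self-hosted fine-tuning cost per day
price_per_1k_tokens = 0.008    # GPT-3.5 Turbo fine-tuning price
tokens_per_word = 2
words_per_page = 300           # assumed conversion

tokens_per_day = daily_budget / price_per_1k_tokens * 1000  # ~8.4M tokens
words_per_day = tokens_per_day / tokens_per_word            # ~4.2M words
pages_per_day = words_per_day / words_per_page              # ~14,000 pages

print(f"~{words_per_day / 1e6:.1f}M words/day (~{pages_per_day:,.0f} pages)")
```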
That volume is infeasible for most organizations to gather, which makes OpenAI typically the more economical choice for fine-tuning.
Summing Up: When Is Ownership Worth It?
Self-hosting AI may seem enticing at first glance, but be wary of the hidden costs. Third-party providers alleviate many of the challenges of managing LLMs and offer real advantages, particularly for services that leverage AI rather than center on it.
For large corporations, the annual ownership cost of $65,000 may seem manageable, but for most businesses, it’s a significant figure. Don't overlook additional expenses for talent and maintenance, which can inflate total costs to $200,000-250,000 or more each year.
Owning a model grants control over your data and usage, but to come out ahead you'd need to exceed roughly 22.2 million words per day in user requests, and to have the operational resources to handle that demand. For many use cases, the financial benefit of self-hosting over using an API remains unclear.
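The break-even threshold is just the daily self-hosting cost divided by the API's effective per-word price. A small helper makes it easy to plug in your own numbers; the $0.004-per-1K-tokens rate below is an illustrative placeholder, not a quoted price:

```python
def break_even_words_per_day(self_host_cost_per_day: float,
                             api_price_per_1k_tokens: float,
                             tokens_per_word: float = 2.0) -> float:
    """Daily word volume at which API spend equals self-hosting spend."""
    tokens = self_host_cost_per_day / api_price_per_1k_tokens * 1000
    return tokens / tokens_per_word

# ~$180/day of self-hosted serving against a hypothetical $0.004 per 1K
# tokens lands near the ~22 million words/day threshold cited above.
print(f"{break_even_words_per_day(180, 0.004) / 1e6:.1f}M words/day")
```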