Exploring the Costs and Benefits of AI with Serverless Infrastructure
Running AI applications incurs various costs, with GPU power for inference being one of the most critical expenses.
Traditionally, organizations running AI inference have relied on always-on cloud instances or on-premises hardware. However, Google Cloud is now previewing a solution that could change how AI applications are deployed: Nvidia L4 GPU support for its Cloud Run serverless offering, allowing organizations to perform serverless inference.
Harnessing the Power of Serverless Inference
The primary advantage of serverless architecture is its cost-efficiency; services operate only when needed, so users pay solely for actual usage. Unlike conventional cloud instances that run continuously, serverless GPUs are allocated only while requests are being handled.
Serverless inference on Cloud Run can use Nvidia NIM as well as frameworks such as vLLM, PyTorch, and Ollama. Nvidia L4 GPU support, currently in preview, has been highly anticipated.
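As a concrete (and hedged) illustration, the sketch below shows what a container entrypoint for a Cloud Run GPU service might look like when serving a small model with vLLM. Cloud Run expects the process to listen on the port given in the PORT environment variable; the model name, route, and request schema here are illustrative choices rather than part of Google's announcement, and vLLM, FastAPI, and Uvicorn are assumed to be installed in the image.

```python
# app.py - illustrative inference server for a Cloud Run GPU instance.
# Assumes the container image includes vllm, fastapi, and uvicorn; the model
# name and request schema are placeholders, not part of Google's announcement.
import os

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()

# The model loads once at startup; on Cloud Run this happens during a cold
# start, which is one reason smaller checkpoints start noticeably faster.
llm = LLM(model=os.environ.get("MODEL_ID", "google/gemma-2b"))


class Prompt(BaseModel):
    text: str
    max_tokens: int = 256


@app.post("/generate")
def generate(prompt: Prompt) -> dict:
    params = SamplingParams(temperature=0.7, max_tokens=prompt.max_tokens)
    outputs = llm.generate([prompt.text], params)
    return {"completion": outputs[0].outputs[0].text}


if __name__ == "__main__":
    import uvicorn

    # Cloud Run tells the container which port to listen on via PORT.
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", "8080")))
```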
“As customers increasingly adopt AI, they want to deploy AI workloads on familiar platforms,” said Sagar Randive, Product Manager for Google Cloud Serverless. “Cloud Run’s efficiency and flexibility are crucial, and users have requested GPU support.”
The Shift to a Serverless AI Environment
Google’s Cloud Run, a fully managed serverless platform, has gained popularity among developers for its ease of container deployment and management. As AI workloads grow—especially those requiring real-time processing—the need for enhanced computational resources has become evident.
The addition of GPU support opens various possibilities for Cloud Run developers, such as:
- Real-time inference with lightweight models like Gemma 2B/7B or Llama 3 (8B), facilitating the development of responsive chatbots and dynamic document summarization tools (a client-side sketch follows this list).
- Custom fine-tuned generative AI models, enabling scalable image generation applications tailored to specific brands.
- Accelerated compute-intensive tasks, including image recognition, video transcoding, and 3D rendering, which can scale down to zero when idle.
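For the chatbot and summarization use case above, invoking a deployed Cloud Run service is an ordinary authenticated HTTP call. The sketch below is illustrative only: it assumes the hypothetical /generate route from the earlier server example, a service that requires authentication, and a placeholder service URL.

```python
# client.py - call a Cloud Run-hosted model endpoint (illustrative sketch).
# The service URL and route are placeholders; the google-auth and requests
# packages are assumed to be installed.
import requests
from google.auth.transport.requests import Request
from google.oauth2 import id_token

SERVICE_URL = "https://my-inference-service-xyz-uc.a.run.app"  # hypothetical

# Cloud Run services that require authentication expect an identity token
# whose audience is the service URL.
token = id_token.fetch_id_token(Request(), SERVICE_URL)

resp = requests.post(
    f"{SERVICE_URL}/generate",
    headers={"Authorization": f"Bearer {token}"},
    json={"text": "Summarize this document in two sentences: ...", "max_tokens": 128},
    timeout=120,  # leave headroom for a cold start while a GPU instance spins up
)
resp.raise_for_status()
print(resp.json()["completion"])
```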
Performance Considerations for Serverless AI Inference
One common concern with serverless architectures is performance, particularly cold starts. Google Cloud has published cold-start figures to address this: for models including Gemma 2B, Gemma 2 9B, Llama 2 7B/13B, and Llama 3.1 8B, cold start times range from 11 to 35 seconds.
Each Cloud Run instance can be equipped with one Nvidia L4 GPU, providing up to 24 GB of VRAM, which is adequate for most AI inference tasks. Google Cloud aims to remain model-agnostic, though it recommends models with fewer than 13 billion parameters for optimal performance.
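To see why the sub-13-billion-parameter guidance lines up with a 24 GB card, the back-of-envelope sketch below estimates the memory taken by model weights alone. Real deployments also need room for the KV cache, activations, and framework overhead, so the practical ceiling is lower than the raw weight sizes suggest; the byte-per-parameter figures are standard fp16 and int8 sizes, not Google-published numbers.

```python
# Weights-only VRAM estimate; real usage adds KV cache, activations, and
# framework overhead, so treat these figures as lower bounds.
GPU_VRAM_GB = 24  # one Nvidia L4 per Cloud Run instance


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold the model weights, in GB."""
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB


for name, params in [("Gemma 2B", 2), ("Llama 3 8B", 8), ("Llama 2 13B", 13)]:
    fp16 = weights_gb(params, 2)  # 16-bit weights
    int8 = weights_gb(params, 1)  # 8-bit quantized weights
    print(f"{name}: ~{fp16:.0f} GB at fp16, ~{int8:.0f} GB at int8 "
          f"(L4 offers {GPU_VRAM_GB} GB)")
```

At fp16, a 13-billion-parameter model already needs roughly 26 GB for weights alone, which is why quantization or a smaller model is the practical choice on a single L4.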
Cost-Efficiency of Serverless AI Inference
A significant advantage of the serverless model is its potential for better hardware utilization, which can translate to cost savings. However, whether serverless AI inference proves cheaper than traditional long-running servers depends on the specific application and expected traffic patterns.
“This is nuanced,” Randive explained. “We will update our pricing calculator to reflect the new GPU pricing with Cloud Run, allowing customers to compare their total operational costs across different platforms.”
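As a rough illustration of how traffic patterns tilt that comparison, the sketch below uses placeholder rates (not Google's published pricing; plug in figures from the pricing calculator instead) to contrast an always-on GPU instance with serverless billing that accrues only while requests are being served.

```python
# Back-of-envelope cost comparison with PLACEHOLDER rates; substitute real
# prices from the Cloud Run pricing calculator before drawing conclusions.
ALWAYS_ON_RATE_PER_HOUR = 0.80       # hypothetical hourly rate for a dedicated L4 VM
SERVERLESS_RATE_PER_SECOND = 0.0004  # hypothetical per-second rate while serving
HOURS_PER_MONTH = 730


def monthly_costs(requests_per_day: int, seconds_per_request: float) -> tuple[float, float]:
    """Return (serverless, always-on) monthly cost under the placeholder rates."""
    busy_seconds = requests_per_day * seconds_per_request * 30
    serverless = busy_seconds * SERVERLESS_RATE_PER_SECOND
    always_on = ALWAYS_ON_RATE_PER_HOUR * HOURS_PER_MONTH
    return serverless, always_on


for rpd in (1_000, 50_000, 500_000):
    sls, fixed = monthly_costs(rpd, seconds_per_request=2.0)
    print(f"{rpd:>7,} requests/day: serverless ~${sls:,.0f}/mo vs always-on ~${fixed:,.0f}/mo")
```

With placeholder numbers like these, light or bursty traffic clearly favors serverless billing, while sustained heavy traffic can make an always-on instance cheaper; the actual break-even point depends on real rates, request durations, and concurrency.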
By adopting this emerging serverless model, organizations can optimize their AI deployment strategies while managing costs effectively.