As companies seek to leverage the growing interest in artificial intelligence (AI), Cloudflare is introducing a suite of innovative products designed to assist customers in building, deploying, and managing AI models at the network edge.
Among the new offerings, Workers AI enables clients to access geographically close GPUs provided by Cloudflare partners, facilitating AI model execution on a flexible, pay-as-you-go basis. Another product, Vectorize, serves as a vector database that securely stores vector embeddings—mathematical representations of data—generated by models using Workers AI. Additionally, AI Gateway offers metrics that help clients effectively manage the costs associated with running AI applications.
Cloudflare CEO Matthew Prince emphasizes that the launch of this AI-focused product suite stems from customer demand for a more straightforward, cost-effective AI management solution. “Current market offerings are often overly complex, requiring a patchwork of vendors and leading to rapid expense increases,” Prince shared in an email interview. “There’s also a significant lack of visibility surrounding AI spending; as costs rise, observability becomes a key issue. We aim to streamline these processes for developers.”
To optimize user experience, Workers AI ensures AI inference occurs on GPUs closest to users, which minimizes latency and enhances performance. By utilizing ONNX—a Microsoft-backed toolkit that helps convert between diverse AI frameworks—Workers AI allows models to run where bandwidth, latency, connectivity, and processing can be efficiently managed.
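As a concrete illustration of that portability (this is not Cloudflare’s internal code, just a minimal sketch using the open-source onnxruntime-node package, with a hypothetical model file and tensor names), a model exported to ONNX from any framework can be loaded and run the same way regardless of which machine hosts it:

```ts
// Sketch: running an ONNX-exported model with onnxruntime-node.
// The model file and tensor names ("input", "logits") are hypothetical;
// the point is that any framework can export to ONNX, letting the same
// artifact run on whichever hardware happens to be closest to the user.
import * as ort from "onnxruntime-node";

async function classify(pixels: Float32Array): Promise<Float32Array> {
  // Load the framework-agnostic model; sessions can be created once and reused.
  const session = await ort.InferenceSession.create("./image-classifier.onnx");

  // Wrap the raw data in a tensor with the shape the model expects.
  const input = new ort.Tensor("float32", pixels, [1, 3, 224, 224]);

  // Run inference; outputs are keyed by the model's declared output names.
  const results = await session.run({ input });
  return results.logits.data as Float32Array;
}
```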
Users of Workers AI can select models from a diverse catalog, featuring large language models (LLMs) like Meta’s Llama 2, automatic speech recognition systems, image classifiers, and sentiment analysis tools. Importantly, data processed through Workers AI stays in the server region where it originated, and inference inputs, such as LLM prompts or images, are not used to train current or future models.
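In practice, invoking one of those catalog models from a Worker looks something like the sketch below, based on the @cloudflare/ai package Cloudflare documented at launch; the binding name and exact model ID are illustrative and may differ per account:

```ts
import { Ai } from "@cloudflare/ai";

export interface Env {
  AI: any; // Workers AI binding, configured in wrangler.toml
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ai = new Ai(env.AI);

    // Run a catalog model on a nearby GPU; here, Meta's Llama 2 chat model.
    const answer = await ai.run("@cf/meta/llama-2-7b-chat-int8", {
      prompt: "Summarize what a vector embedding is in one sentence.",
    });

    return Response.json(answer);
  },
};
```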
“Optimal inference occurs near the user to ensure low-latency performance,” Prince noted. “However, many devices lack the computational power to manage sizable models such as LLMs. Simultaneously, traditional centralized clouds are often based far from end users, particularly in the U.S., complicating matters for international businesses that prefer not to export data. Cloudflare addresses both challenges effectively.”
For Workers AI, Cloudflare has already partnered with AI startup Hugging Face, which will optimize generative AI models for the platform. In turn, Cloudflare will serve as the first serverless GPU partner for deploying Hugging Face models. Similarly, Databricks plans to bring AI inference to Workers AI via MLflow, its open-source platform for managing machine learning workflows. Cloudflare will join the MLflow project as an active contributor, making MLflow capabilities available to developers building on the Workers AI platform.
Vectorize caters to clients that require storage for vector embeddings used in AI models. Vector embeddings are crucial as they represent training data in a compact manner while retaining essential characteristics.
Through Workers AI, users can generate embeddings to be stored in Vectorize. Alternatively, clients can store embeddings produced by third-party models from vendors like OpenAI and Cohere. While vector databases are not novel—startups like Pinecone and major cloud providers such as AWS, Azure, and Google Cloud offer similar services—Prince argues that Vectorize benefits from Cloudflare’s expansive global network, enabling queries to occur closer to users, thereby minimizing latency and improving inference times.
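Based on the launch documentation, the two products are meant to compose roughly as follows; the binding names and the embedding model here are assumptions for illustration, not a definitive implementation:

```ts
import { Ai } from "@cloudflare/ai";

export interface Env {
  AI: any;           // Workers AI binding
  VECTOR_INDEX: any; // Vectorize index binding (hypothetical name)
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ai = new Ai(env.AI);

    // 1. Turn text into a vector embedding with a Workers AI catalog model.
    const { data } = await ai.run("@cf/baai/bge-base-en-v1.5", {
      text: ["Cloudflare adds a vector database to its AI suite"],
    });

    // 2. Store the embedding in Vectorize under a document ID.
    await env.VECTOR_INDEX.insert([{ id: "doc-1", values: data[0] }]);

    // 3. Query for the nearest stored vectors; proximity to the user
    //    is what Cloudflare says keeps these lookups fast.
    const matches = await env.VECTOR_INDEX.query(data[0], { topK: 5 });
    return Response.json(matches);
  },
};
```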
“Assembling an AI application today requires management of infrastructure that many developers find inaccessible,” Prince remarked. “We’re dedicated to simplifying the process from the outset. By integrating this technology into our existing network, we enhance performance while lowering costs.”
The final component of the AI suite, AI Gateway, features observability tools that help track AI traffic. This includes monitoring the number of model inference requests, their durations, user engagement, and overall application costs. Furthermore, AI Gateway offers cost-reduction features such as caching responses from LLMs to frequently asked questions, reducing the need for generating new responses from scratch, alongside rate limiting capabilities that bolster control over application scalability.
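Because AI Gateway sits in front of an existing provider, adopting it is mostly a URL change. Here is a hedged sketch assuming an OpenAI backend, with ACCOUNT_ID and GATEWAY_NAME standing in for values from the Cloudflare dashboard:

```ts
// Sketch: routing an OpenAI call through AI Gateway so Cloudflare can
// observe, cache, and rate-limit it. ACCOUNT_ID and GATEWAY_NAME are
// placeholders; the request body and auth are unchanged from a direct
// OpenAI call, which is what lets the gateway cache repeated questions.
const GATEWAY_URL =
  "https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_NAME/openai";

async function ask(prompt: string, apiKey: string): Promise<string> {
  const res = await fetch(`${GATEWAY_URL}/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const json = (await res.json()) as any;
  return json.choices[0].message.content;
}
```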
Prince claims that with AI Gateway, Cloudflare stands out as one of the few providers that lets developers and companies pay only for the compute they actually use. While other tools, like GPTCache, can replicate some of this caching functionality, Prince believes Cloudflare’s solution is more streamlined than competing offerings.
The effectiveness of this offering remains to be seen. “Presently, customers are funding substantial amounts of idle compute, in the form of virtual machines and GPUs, which often go unused,” Prince explained. “We recognize an opportunity to mitigate the complexities associated with machine learning operations today, providing a cohesive solution for developers.”