Anthropic has launched a public beta of prompt caching for its API, letting developers reuse context across API calls instead of re-sending the same material in every prompt. The feature is currently available for Claude 3.5 Sonnet and Claude 3 Haiku, with support for the largest model, Claude 3 Opus, expected soon.
Prompt caching, outlined in a 2023 paper, enables users to save frequently used context during a session and reuse it across requests. Because that context no longer has to be re-sent at full price, developers can include far more background information and example responses without a proportional increase in cost, which is especially valuable when sending extensive context that must be referenced across multiple conversations. The same mechanism lets developers steer model responses more precisely with long in-prompt examples.
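In practice, a developer flags a reusable block of the prompt as cacheable and sends it once; later calls that begin with the same prefix hit the cache. The sketch below is a minimal illustration using the anthropic Python SDK with the beta header from Anthropic's documentation; the model string, file name, and system prompt are illustrative assumptions, not code from the announcement.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Illustrative placeholder: a large document the app reuses on every call.
    LARGE_DOCUMENT = open("knowledge_base.txt").read()

    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        # Beta feature flag from the launch-era documentation.
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
        system=[
            {"type": "text", "text": "You answer questions about the attached document."},
            {
                "type": "text",
                "text": LARGE_DOCUMENT,
                # Marks this block as cacheable: the first call pays the
                # higher cache-write rate; later calls that start with the
                # same prefix pay the discounted cache-read rate instead.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": "Summarize the key points."}],
    )
    print(response.content[0].text)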
Early adopters have reported significant speed and cost improvements with prompt caching across a range of applications: embedding a comprehensive knowledge base in the prompt, including 100-shot examples, and carrying conversation history across turns.
Potential use cases for prompt caching include reducing costs and latency for lengthy instructions, streamlining document uploads for conversational agents, enhancing code autocompletion, and embedding entire documents within prompts.
Pricing for Cached Prompts
One of the primary advantages of caching prompts is the reduced cost per token. Anthropic indicates that using cached prompts is "significantly cheaper" than the standard input token price.
For Claude 3.5 Sonnet, writing a prompt to the cache costs $3.75 per million tokens (MTok), while reading a cached prompt costs only $0.30 per MTok. With a base input price of $3/MTok, the one-time 25% premium on the caching call buys a 10x discount on every subsequent use of that context.
For Claude 3 Haiku, the cost is $0.30/MTok to cache prompts and $0.03/MTok for using stored prompts. Although prompt caching isn't yet available for Claude 3 Opus, pricing has been announced: caching will cost $18.75/MTok, while accessing cached prompts will be $1.50/MTok.
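The break-even arithmetic behind those numbers is straightforward. Here is a back-of-the-envelope sketch using the published Claude 3.5 Sonnet rates; the helper functions are illustrative, not part of any API:

    # Compare the cost of re-sending a large context versus caching it,
    # per MTok of cached content, at the published Claude 3.5 Sonnet rates.
    BASE_INPUT = 3.00    # $/MTok, standard input tokens
    CACHE_WRITE = 3.75   # $/MTok, first send that writes the cache (25% premium)
    CACHE_READ = 0.30    # $/MTok, each later read of the cached prefix (10% of base)

    def cost_without_cache(calls: int) -> float:
        # The full context is re-sent at the base rate on every call.
        return BASE_INPUT * calls

    def cost_with_cache(calls: int) -> float:
        # One cache write, then discounted reads for the remaining calls.
        return CACHE_WRITE + CACHE_READ * (calls - 1)

    for calls in (2, 10, 100):
        print(calls, cost_without_cache(calls), cost_with_cache(calls))
    # Caching already wins at two calls ($4.05 vs $6.00 per MTok of context)
    # and approaches the full 10x saving as reuse grows.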
Notably, as AI influencer Simon Willison pointed out, Anthropic’s cache has only a five-minute lifetime, with the clock refreshed each time the cached content is used.
Competitive Landscape
Anthropic is vying for a competitive edge in the AI field through aggressive pricing. Before releasing the Claude 3 family of models, the company cut token prices to stay competitive with rivals like Google and OpenAI, all of which are locked in a race to offer ever-lower prices to third-party developers.
A Highly Requested Feature
Prompt caching is not exclusive to Anthropic; other platforms offer it as well. Lamina, an LLM inference system, uses KV caching to reduce GPU costs, and a glance at OpenAI's developer forums turns up numerous requests for prompt caching.
It’s important to differentiate between cached prompts and large language model memory. OpenAI’s GPT-4o, for instance, offers a memory feature that retains user preferences and facts across sessions, but it does not store the literal prompts and responses the way prompt caching does.