Prompt caching - Mistral AI
Summary
Prompt caching allows for the reuse of previously computed prompt tokens when API requests share an identical prefix, significantly reducing costs and latency. Cached prompt tokens are billed at 10% of the standard input token price. This feature is particularly beneficial for multi-turn conversations, applications with repeated system prompts, fill-in-the-middle requests, and agent completion requests that maintain consistent context. To enable caching, users set a stable application-level identifier, such as a conversation or session ID, as the `prompt_cache_key` in their requests. The API reports cached token usage in the `usage.prompt_tokens_details.cached_tokens` field of the completion response. Cache blocks are 64 tokens in size, meaning prompts with fewer than 64 tokens will not benefit from caching.
Key takeaway
For AI Engineers managing LLM inference costs and latency, implementing prompt caching with `prompt_cache_key` can yield substantial savings and performance gains. You should identify workloads with repeated prompt prefixes, such as conversational agents or applications using consistent system instructions, and integrate a stable application-level identifier. Monitor `usage.prompt_tokens_details.cached_tokens` to verify cache effectiveness and optimize your billing.
Key insights
Prompt caching reuses shared prompt prefixes to reduce LLM inference costs and latency.
Principles
- Cached tokens cost 10% of standard input.
- Cache hits reduce response latency.
- Cache blocks are 64 tokens.
Method
Set a consistent `prompt_cache_key` for requests sharing a prefix, like a conversation ID, to enable prompt caching and track `cached_tokens` in usage details.
In practice
- Use for multi-turn chat sessions.
- Apply to repeated system prompts.
- Avoid for short or unrelated prompts.
Topics
- Prompt Caching
- Mistral AI API
- Token Billing Optimization
- prompt_cache_key
- Multi-turn Chat
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by mistral.ai via Google News.