Lessons from building Claude Code: Prompt caching is everything
Summary
Anthropic's Claude Code team shares best practices for optimizing prompt caching, a critical technique for reducing latency and cost in long-running agentic AI products. Prompt caching works by prefix matching, where the API reuses computation from previous requests if the initial part of the prompt remains identical. The article emphasizes structuring prompts with static content first, followed by dynamic elements like conversation messages, to maximize cache hit rates. Key strategies include using messages for updates instead of modifying the system prompt, avoiding mid-session changes to models or tool sets, and implementing cache-safe forking for operations like conversation compaction. The team monitors cache hit rates as a critical metric, treating low rates as severe incidents due to their impact on cost and user experience.
Key takeaway
For AI Engineers building agentic applications, optimizing prompt caching is paramount for managing operational costs and ensuring responsive user experiences. You should design your agent's prompt structure from the outset to prioritize cache hits by placing static elements before dynamic ones. Avoid changing models or tool sets mid-session, as this invalidates the cache; instead, use messages for updates and employ techniques like tool search with `defer_loading` to maintain prefix stability. Monitor your cache hit rate diligently, as even small drops can significantly impact expenses and performance.
Key insights
Prompt caching, based on prefix matching, is essential for cost-effective, low-latency agentic AI applications.
Principles
- Static content first, dynamic content last.
- Cache hit rate impacts cost and latency.
- Prefix changes invalidate the cache.
Method
Structure prompts with static system instructions and tools, followed by project context, session context, and conversation messages. Use messages for updates and defer tool loading to maintain a stable cached prefix.
In practice
- Organize prompts: static system, CLAUDE.md, session context, messages.
- Use `defer_loading` for tools to maintain cache.
- Implement cache-safe forking for summarization.
Topics
- Prompt Caching
- Claude Code
- Agentic Products
- Prefix Matching
- Context Window Compaction
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Claude Blog.