Lessons from building Claude Code: Prompt caching is everything

2026-04-30 · Source: Claude Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

Anthropic's Claude Code team shares best practices for optimizing prompt caching, a critical technique for reducing latency and cost in long-running agentic AI products. Prompt caching works by prefix matching, where the API reuses computation from previous requests if the initial part of the prompt remains identical. The article emphasizes structuring prompts with static content first, followed by dynamic elements like conversation messages, to maximize cache hit rates. Key strategies include using messages for updates instead of modifying the system prompt, avoiding mid-session changes to models or tool sets, and implementing cache-safe forking for operations like conversation compaction. The team monitors cache hit rates as a critical metric, treating low rates as severe incidents due to their impact on cost and user experience.

Key takeaway

For AI Engineers building agentic applications, optimizing prompt caching is paramount for managing operational costs and ensuring responsive user experiences. You should design your agent's prompt structure from the outset to prioritize cache hits by placing static elements before dynamic ones. Avoid changing models or tool sets mid-session, as this invalidates the cache; instead, use messages for updates and employ techniques like tool search with `defer_loading` to maintain prefix stability. Monitor your cache hit rate diligently, as even small drops can significantly impact expenses and performance.

Key insights

Prompt caching, based on prefix matching, is essential for cost-effective, low-latency agentic AI applications.

Principles

Static content first, dynamic content last.
Cache hit rate impacts cost and latency.
Prefix changes invalidate the cache.

Method

Structure prompts with static system instructions and tools, followed by project context, session context, and conversation messages. Use messages for updates and defer tool loading to maintain a stable cached prefix.

In practice

Organize prompts: static system, CLAUDE.md, session context, messages.
Use `defer_loading` for tools to maintain cache.
Implement cache-safe forking for summarization.

Topics

Prompt Caching
Claude Code
Agentic Products
Prefix Matching
Context Window Compaction

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Claude Blog.