Prompt Caching / KV Cache
Dramatically reduce costs and latency by reusing computed attention states across requests with matching prefixes.
How KV Caching Works
During inference, transformer models compute Key and Value matrices for every token in the context. This is the most expensive part of processing. KV caching stores these matrices so they can be reused when the same prefix appears again.
WITHOUT CACHING
───────────────
Request 1:
┌──────────────────────┐
│ System: "You are..." │ ◄─ Compute K,V
│ User: "Hello"        │
└──────────────────────┘

Request 2:
┌──────────────────────┐
│ System: "You are..." │ ◄─ Compute K,V AGAIN
│ User: "Hi there"     │
└──────────────────────┘

Cost: Full price for every request

WITH CACHING
────────────
Request 1:
┌──────────────────────┐
│ System: "You are..." │ ◄─ Compute K,V
│ User: "Hello"        │
└──────────────────────┘
           │
           ▼
┌──────────────────────┐
│       KV CACHE       │
│  Store K,V matrices  │
└──────────────────────┘

Request 2:
┌──────────────────────┐
│ System: "You are..." │ ◄─ Cache HIT!
│ User: "Hi there"     │    (reuse K,V)
└──────────────────────┘
           │
           ▼
Only compute K,V for "Hi there"
Cost: 90% reduction on cached portion
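To make the diagram concrete, here is a toy, single-head sketch of the idea in NumPy. It is purely illustrative (real inference engines cache per-layer, per-head tensors inside the serving stack); every name in it, such as d_model, encode, kv_cache, and kv_for, is invented for the example.
import numpy as np

d_model = 64
W_k = np.random.randn(d_model, d_model)   # toy K projection weights
W_v = np.random.randn(d_model, d_model)   # toy V projection weights

kv_cache = {}  # maps prefix string -> (K, V) matrices computed for that prefix

def encode(text: str) -> np.ndarray:
    # Stand-in for tokenization + embedding: one deterministic vector per character.
    rng = np.random.default_rng([ord(c) for c in text])
    return rng.standard_normal((len(text), d_model))

def kv_for(prefix: str, suffix: str):
    """Return K, V for prefix+suffix, reusing cached prefix K/V when possible."""
    if prefix in kv_cache:
        K_prefix, V_prefix = kv_cache[prefix]          # cache hit: no recompute
    else:
        X = encode(prefix)
        K_prefix, V_prefix = X @ W_k, X @ W_v          # cache miss: full compute
        kv_cache[prefix] = (K_prefix, V_prefix)
    X_new = encode(suffix)                             # only the new tokens
    K = np.vstack([K_prefix, X_new @ W_k])
    V = np.vstack([V_prefix, X_new @ W_v])
    return K, V

# Request 1 pays for the prefix; request 2 reuses it and only projects "Hi there".
kv_for("You are a helpful assistant.", "Hello")
kv_for("You are a helpful assistant.", "Hi there")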
Provider Implementations
| Provider | Mechanism | Discount | TTL |
|---|---|---|---|
| Anthropic | Explicit cache_control markers | 90% off cached reads | 5 minutes (ephemeral) |
| OpenAI | Automatic prefix matching | 50% off cached tokens | 5-10 minutes |
| Google (Gemini) | Context caching API | Varies by model | Configurable |
| Self-hosted | vLLM/TGI prefix caching | N/A (latency benefit) | Memory-dependent |
How different providers implement prompt caching
When to Use Explicit vs Automatic Caching
- Explicit (Anthropic): More control, can cache specific sections, clear cost visibility
- Automatic (OpenAI): No code changes needed, works transparently
- Either: Requires careful prompt structure for maximum benefit
Impact: Cost and Latency
COST IMPACT (Example: 50K token system prompt)

Without Caching:
├── Request 1: 50,000 tokens × $0.003/1K = $0.15
├── Request 2: 50,000 tokens × $0.003/1K = $0.15
├── Request 3: 50,000 tokens × $0.003/1K = $0.15
└── Total: $0.45

With Caching (90% discount on cached):
├── Request 1: 50,000 tokens × $0.003/1K  = $0.15  (cache created)
├── Request 2: 50,000 tokens × $0.0003/1K = $0.015 (cache hit)
├── Request 3: 50,000 tokens × $0.0003/1K = $0.015 (cache hit)
└── Total: $0.18 (60% savings)

At scale (1,000 requests):
├── Without: $150.00
├── With: $15.14 (1 cache creation + 999 cache hits)
└── Savings: $134.86 (~90%)

─────────────────────────────────────────────────────────

LATENCY IMPACT

Without Caching:
Time to First Token (TTFT): ~2-3 seconds (compute K,V for all tokens)

With Caching:
Time to First Token (TTFT): ~0.3-0.5 seconds (compute K,V only for new tokens)

Improvement: 70-85% reduction in TTFT
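The cost arithmetic above is easy to reproduce. The sketch below assumes the example's prices ($0.003 per 1K input tokens, 90% discount on cache reads) and ignores output tokens and any cache-write surcharge; request_cost is an invented helper, not a provider API.
def request_cost(tokens: int, cached: int = 0,
                 price_per_1k: float = 0.003, cache_discount: float = 0.90) -> float:
    """Input-token cost of one request, given how many tokens hit the cache."""
    uncached = tokens - cached
    return (uncached * price_per_1k + cached * price_per_1k * (1 - cache_discount)) / 1000

prefix = 50_000

# Without caching: every request pays full price for the prefix.
without = 3 * request_cost(prefix)                                            # $0.45

# With caching: request 1 creates the cache, requests 2-3 read it.
with_cache = request_cost(prefix) + 2 * request_cost(prefix, cached=prefix)   # $0.18

print(f"without=${without:.2f}  with=${with_cache:.2f}  "
      f"savings={1 - with_cache / without:.0%}")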
| Use Case | Cached Prefix Size | Cost Reduction | TTFT Reduction |
|---|---|---|---|
| RAG with fixed docs | 20-50K tokens | 70-85% | 60-80% |
| Agent with many tools | 10-30K tokens | 50-75% | 50-70% |
| Multi-turn chat | 5-15K tokens | 40-60% | 40-60% |
| Code assistant | 30-100K tokens | 80-90% | 70-85% |
Measured impact metrics by use case
Maximum Benefit Scenarios
Caching pays off most when a large, stable prefix (tens of thousands of tokens of documentation, catalog data, or tool definitions) is reused across many requests within the cache TTL, as in the code-assistant and fixed-document RAG rows above.
Basic Implementation
Here's how to implement prompt caching with different providers:
# Prompt caching works by reusing computed attention states
# WITHOUT CACHING (every request computes full attention)
function generateWithoutCache(systemPrompt, userMessage):
fullContext = systemPrompt + userMessage
# Compute attention for ALL tokens every time
# Cost: O(n²) where n = total tokens
kvStates = computeAttention(fullContext)
return generateFromStates(kvStates)
# WITH CACHING (reuse attention for stable prefix)
function generateWithCache(systemPrompt, userMessage):
cacheKey = hash(systemPrompt)
if cache.has(cacheKey):
# Reuse precomputed attention states
prefixStates = cache.get(cacheKey)
# Only compute attention for new tokens
newStates = computeIncrementalAttention(
prefixStates,
userMessage
)
else:
# First time: compute and cache
prefixStates = computeAttention(systemPrompt)
cache.set(cacheKey, prefixStates)
newStates = computeIncrementalAttention(
prefixStates,
userMessage
)
return generateFromStates(newStates)
# Key insight: The "prefix" must be IDENTICAL for cache hit
# Even a single character difference = cache miss
from anthropic import Anthropic
# Anthropic's prompt caching example
# Uses special cache_control markers
client = Anthropic()
# System prompt designed for caching
# Large, stable content that doesn't change
SYSTEM_PROMPT = """You are an expert assistant for our
e-commerce platform. Here is our complete product catalog:
[... imagine 50,000 tokens of product data ...]
Use this catalog to answer customer questions accurately.
Always cite specific product IDs when making recommendations.
"""
def query_with_caching(user_question: str) -> str:
"""Query with prompt caching enabled."""
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=[
{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{"role": "user", "content": user_question}
]
)
# Check cache performance
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
# First call: cache_creation_input_tokens = ~50,000
# cache_read_input_tokens = 0
# Subsequent: cache_creation_input_tokens = 0
# cache_read_input_tokens = ~50,000
return response.content[0].text
# OpenAI approach (automatic, prefix-based)
from openai import OpenAI
openai_client = OpenAI()
def query_with_openai_caching(user_question: str) -> str:
"""OpenAI caches automatically based on prefix matching."""
# Same system prompt = cache hit
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": SYSTEM_PROMPT # Must be identical
},
{
"role": "user",
"content": user_question
}
]
)
# OpenAI reports cached tokens in usage
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
return response.choices[0].message.content
using Anthropic;
using OpenAI;
using OpenAI.Chat;
// Anthropic approach with explicit cache control
public class AnthropicCachingExample
{
private readonly AnthropicClient _client;
private const string SystemPrompt = """
You are an expert assistant for our platform.
Here is our complete documentation:
[... large stable content ...]
""";
public async Task<string> QueryWithCachingAsync(string question)
{
var message = await _client.Messages.CreateAsync(
model: "claude-sonnet-4-20250514",
maxTokens: 1024,
system: new[]
{
new SystemMessage
{
Text = SystemPrompt,
CacheControl = new CacheControl { Type = "ephemeral" }
}
},
messages: new[]
{
new Message { Role = "user", Content = question }
}
);
// Log cache performance
Console.WriteLine($"Cache read: {message.Usage.CacheReadInputTokens}");
Console.WriteLine($"Cache created: {message.Usage.CacheCreationInputTokens}");
return message.Content[0].Text;
}
}
// OpenAI automatic prefix caching
public class OpenAICachingExample
{
private readonly ChatClient _client;
private const string SystemPrompt = """
You are an expert assistant for our platform.
Here is our complete documentation:
[... large stable content ...]
""";
public OpenAICachingExample(string apiKey)
{
_client = new ChatClient("gpt-4o", apiKey);
}
public async Task<string> QueryAsync(string question)
{
// OpenAI automatically caches matching prefixes
var messages = new List<ChatMessage>
{
ChatMessage.CreateSystemMessage(SystemPrompt),
ChatMessage.CreateUserMessage(question)
};
var response = await _client.CompleteChatAsync(messages);
// Check for cached tokens in usage
var usage = response.Value.Usage;
Console.WriteLine($"Input tokens: {usage.InputTokenCount}");
// Cached tokens reported in InputTokenDetails
return response.Value.Content[0].Text;
}
}
// Optimizing for cache hits
public class CacheOptimizedAgent
{
private readonly ChatClient _client;
// Keep system prompt EXACTLY the same across calls
private static readonly string _systemPrompt = BuildSystemPrompt();
private static string BuildSystemPrompt()
{
// Load once, reuse always
var docs = File.ReadAllText("documentation.txt");
var tools = File.ReadAllText("tool_definitions.json");
// Consistent ordering is crucial for cache hits
return $"""
# Documentation
{docs}
# Available Tools
{tools}
# Instructions
Help users with their questions using the above context.
""";
}
public async Task<string> ProcessAsync(
List<ChatMessage> conversationHistory,
string newMessage)
{
// Build messages with stable prefix
var messages = new List<ChatMessage>
{
// 1. System prompt (stable - cacheable)
ChatMessage.CreateSystemMessage(_systemPrompt)
};
// 2. Add conversation history
messages.AddRange(conversationHistory);
// 3. Add new message
messages.Add(ChatMessage.CreateUserMessage(newMessage));
var response = await _client.CompleteChatAsync(messages);
return response.Value.Content[0].Text;
}
}
Multi-Turn Conversation Caching
In multi-turn conversations, the system prompt and tool definitions remain stable while the conversation history grows. Structure your requests to maximize cache hits:
Turn 1: [CACHE CREATED]
├── System prompt (50K tokens)
├── Tool definitions (5K tokens)
└── User: "Hello"

Turn 2: [CACHE HIT]
├── System prompt (50K tokens)    ✓ CACHED
├── Tool definitions (5K tokens)  ✓ CACHED
├── User: "Hello"
├── Assistant: "Hi! How can I help?"
└── User: "What's the weather?"

Cost breakdown:
- Turn 1: 55K tokens at full price (cache creation)
- Turn 2: 55K tokens at 90% discount + ~50 new tokens at full price
- Turn 3+: Same pattern, savings compound
# Multi-turn conversation with optimal caching
class CachedConversation:
function init(systemPrompt, tools):
self.systemPrompt = systemPrompt # Large, stable
self.tools = tools # Tool definitions
self.messages = [] # Conversation history
function addUserMessage(content):
self.messages.append({role: "user", content: content})
function addAssistantMessage(content):
self.messages.append({role: "assistant", content: content})
function getResponse():
# Structure for optimal caching:
# [CACHED PREFIX] | [DYNAMIC SUFFIX]
#
# Cached (stable across turns):
# - System prompt
# - Tool definitions
#
# Dynamic (changes each turn):
# - Conversation history
# - Current message
response = llm.generate(
system: [
{
text: self.systemPrompt,
cacheControl: "ephemeral" # Mark for caching
},
{
text: formatTools(self.tools),
cacheControl: "ephemeral"
}
],
messages: self.messages # Dynamic part
)
self.addAssistantMessage(response.content)
return response
# Cache behavior over conversation:
#
# Turn 1: Cache MISS (creates cache for system + tools)
# Cost: Full price for all tokens
#
# Turn 2: Cache HIT on system + tools
# Cost: Reduced for cached portion, full for new messages
#
# Turn 3+: Cache HIT continues
# Savings compound as prefix stays stable
from anthropic import Anthropic
from dataclasses import dataclass, field
from typing import List, Dict
@dataclass
class CachedConversation:
"""Multi-turn conversation optimized for prompt caching."""
client: Anthropic
system_prompt: str
tool_definitions: str
model: str = "claude-sonnet-4-20250514"
messages: List[Dict] = field(default_factory=list)
total_cache_reads: int = 0
total_cache_creates: int = 0
def chat(self, user_message: str) -> str:
"""Send message and get response with caching."""
self.messages.append({
"role": "user",
"content": user_message
})
response = self.client.messages.create(
model=self.model,
max_tokens=2048,
system=[
# Large system prompt - cached
{
"type": "text",
"text": self.system_prompt,
"cache_control": {"type": "ephemeral"}
},
# Tool definitions - cached
{
"type": "text",
"text": self.tool_definitions,
"cache_control": {"type": "ephemeral"}
}
],
messages=self.messages
)
# Track cache performance
usage = response.usage
self.total_cache_reads += usage.cache_read_input_tokens
self.total_cache_creates += usage.cache_creation_input_tokens
assistant_message = response.content[0].text
self.messages.append({
"role": "assistant",
"content": assistant_message
})
return assistant_message
def get_cache_stats(self) -> dict:
"""Get cumulative cache statistics."""
return {
"total_cache_reads": self.total_cache_reads,
"total_cache_creates": self.total_cache_creates,
"estimated_savings": self._calculate_savings()
}
def _calculate_savings(self) -> str:
"""Estimate cost savings from caching."""
# Anthropic pricing: cached reads are 90% cheaper
if self.total_cache_reads == 0:
return "0%"
full_cost = self.total_cache_reads + self.total_cache_creates
actual_cost = (
self.total_cache_creates +
self.total_cache_reads * 0.1 # 90% discount
)
savings = (1 - actual_cost / full_cost) * 100
return f"{savings:.1f}%"
# Usage example
def demo_caching():
client = Anthropic()
# Load large context once
system_prompt = open("company_knowledge_base.txt").read()
tools = open("tool_definitions.json").read()
conv = CachedConversation(
client=client,
system_prompt=system_prompt, # ~50K tokens
tool_definitions=tools # ~5K tokens
)
# Turn 1: Cache created
print(conv.chat("What products do we offer?"))
print(f"Stats after turn 1: {conv.get_cache_stats()}")
# Turn 2: Cache hit on system + tools
print(conv.chat("Tell me more about the premium tier"))
print(f"Stats after turn 2: {conv.get_cache_stats()}")
# Turn 3+: Continued cache hits
print(conv.chat("How does pricing compare to competitors?"))
print(f"Stats after turn 3: {conv.get_cache_stats()}")
# Expected output:
# Stats after turn 1: {'total_cache_reads': 0, 'total_cache_creates': 55000, ...}
# Stats after turn 2: {'total_cache_reads': 55000, 'total_cache_creates': 55000, ...}
# Stats after turn 3: {'total_cache_reads': 110000, 'total_cache_creates': 55000, ...}
using Anthropic;
public class CachedConversation
{
private readonly AnthropicClient _client;
private readonly string _systemPrompt;
private readonly string _toolDefinitions;
private readonly List<Message> _messages = new();
private long _totalCacheReads = 0;
private long _totalCacheCreates = 0;
public CachedConversation(
AnthropicClient client,
string systemPrompt,
string toolDefinitions)
{
_client = client;
_systemPrompt = systemPrompt;
_toolDefinitions = toolDefinitions;
}
public async Task<string> ChatAsync(string userMessage)
{
_messages.Add(new Message
{
Role = "user",
Content = userMessage
});
var response = await _client.Messages.CreateAsync(
model: "claude-sonnet-4-20250514",
maxTokens: 2048,
system: new[]
{
// Cached: system prompt
new SystemMessage
{
Text = _systemPrompt,
CacheControl = new CacheControl { Type = "ephemeral" }
},
// Cached: tool definitions
new SystemMessage
{
Text = _toolDefinitions,
CacheControl = new CacheControl { Type = "ephemeral" }
}
},
messages: _messages.ToArray()
);
// Track cache performance
_totalCacheReads += response.Usage.CacheReadInputTokens;
_totalCacheCreates += response.Usage.CacheCreationInputTokens;
var assistantMessage = response.Content[0].Text;
_messages.Add(new Message
{
Role = "assistant",
Content = assistantMessage
});
return assistantMessage;
}
public CacheStats GetCacheStats()
{
var fullCost = _totalCacheReads + _totalCacheCreates;
var actualCost = _totalCacheCreates + _totalCacheReads * 0.1;
var savings = fullCost > 0
? (1 - actualCost / fullCost) * 100
: 0;
return new CacheStats
{
TotalCacheReads = _totalCacheReads,
TotalCacheCreates = _totalCacheCreates,
EstimatedSavings = $"{savings:F1}%"
};
}
}
public record CacheStats
{
public long TotalCacheReads { get; init; }
public long TotalCacheCreates { get; init; }
public string EstimatedSavings { get; init; }
}
// Usage
public async Task DemoCachingAsync()
{
var client = new AnthropicClient(apiKey);
var systemPrompt = await File.ReadAllTextAsync("knowledge_base.txt");
var tools = await File.ReadAllTextAsync("tools.json");
var conv = new CachedConversation(client, systemPrompt, tools);
// Turn 1: Cache created
Console.WriteLine(await conv.ChatAsync("What products do we offer?"));
Console.WriteLine($"Stats: {conv.GetCacheStats()}");
// Turn 2+: Cache hits
Console.WriteLine(await conv.ChatAsync("Tell me about premium"));
Console.WriteLine($"Stats: {conv.GetCacheStats()}");
}
Optimization Strategies
Maximize cache hit rates by carefully structuring your prompts:
MOST STABLE (highest cache benefit)
│
▼
┌───────────────────────────────────────┐
│ 1. Static Instructions │ ← Never changes
│ "You are a helpful assistant..." │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 2. Reference Data / Knowledge Base │ ← Changes rarely
│ Documentation, product catalog... │ (daily/weekly)
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 3. Tool Definitions │ ← Changes occasionally
│ Sorted alphabetically for │ (with deployments)
│ deterministic ordering │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 4. Session Context │ ← Changes per session
│ User preferences, auth info... │ (no caching)
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 5. Conversation History │ ← Changes per turn
│ Previous messages... │ (no caching)
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 6. Current Query │ ← Changes every request
│ User's current message │ (no caching)
└───────────────────────────────────────┘
│
▼
LEAST STABLE (no cache benefit)
# Strategies for maximizing cache hit rate
# Strategy 1: Consistent prompt ordering
function buildOptimalPrompt(components):
# Order from most stable to least stable
# Cache prefix ends at first difference
prompt = []
# 1. Static instructions (never changes)
prompt.append(components.staticInstructions)
# 2. Reference data (changes rarely)
prompt.append(components.referenceData)
# 3. Tool definitions (changes occasionally)
prompt.append(components.toolDefinitions)
# 4. Session context (changes per session)
prompt.append(components.sessionContext)
# 5. Conversation history (changes per turn)
prompt.append(components.conversationHistory)
# 6. Current query (changes every request)
prompt.append(components.currentQuery)
return prompt
# Strategy 2: Chunked caching for partial updates
function buildChunkedPrompt(components):
# Separate cache breakpoints for different TTLs
return [
{
text: components.staticInstructions,
cacheControl: "persistent" # Long TTL
},
{
text: components.referenceData,
cacheControl: "ephemeral" # Session TTL
},
{
# No cache control = not cached
text: components.dynamicContent
}
]
# Strategy 3: Normalizing dynamic content
function normalizeForCaching(content):
# Remove variation that breaks cache
# Bad: "Current time: 2024-01-15 14:32:05"
# Good: "Current date: 2024-01-15" (updated daily)
# Bad: "Request ID: abc123"
# Good: Move request ID to message, not system
# Bad: Random ordering of examples
# Good: Deterministic ordering
return removeUnnecessaryVariation(content)
from dataclasses import dataclass
from typing import List, Optional
import hashlib
import json
@dataclass
class CacheOptimizedPromptBuilder:
"""Build prompts optimized for cache hit rates."""
static_instructions: str
reference_data: str
tool_definitions: List[dict]
def build_system_content(
self,
session_context: Optional[str] = None
) -> List[dict]:
"""Build system content with optimal cache structure."""
content = []
# Layer 1: Static instructions (highest cache stability)
# This NEVER changes - maximum cache benefit
content.append({
"type": "text",
"text": self._normalize(self.static_instructions),
"cache_control": {"type": "ephemeral"}
})
# Layer 2: Reference data (changes rarely)
# Updated daily/weekly - still high cache benefit
content.append({
"type": "text",
"text": self._normalize(self.reference_data),
"cache_control": {"type": "ephemeral"}
})
# Layer 3: Tool definitions (normalized for consistency)
# Sort tools alphabetically for deterministic ordering
sorted_tools = sorted(
self.tool_definitions,
key=lambda t: t.get("name", "")
)
tools_text = json.dumps(sorted_tools, sort_keys=True)
content.append({
"type": "text",
"text": f"Available tools:\n{tools_text}",
"cache_control": {"type": "ephemeral"}
})
# Layer 4: Session context (if any, no cache control)
# Changes per session - don't cache
if session_context:
content.append({
"type": "text",
"text": session_context
# No cache_control = not cached
})
return content
def _normalize(self, text: str) -> str:
"""Normalize text to maximize cache hits."""
# Remove trailing whitespace that might vary
lines = [line.rstrip() for line in text.split('\n')]
# Consistent line endings
normalized = '\n'.join(lines)
# Remove any dynamic elements that shouldn't be there
# (timestamps, request IDs, etc.)
return normalized
def get_cache_key(self) -> str:
"""Get a hash representing the cacheable prefix."""
# Useful for debugging cache behavior
cacheable = (
self._normalize(self.static_instructions) +
self._normalize(self.reference_data) +
json.dumps(sorted(
self.tool_definitions,
key=lambda t: t.get("name", "")
), sort_keys=True)
)
return hashlib.sha256(cacheable.encode()).hexdigest()[:16]
class DynamicContentIsolator:
"""Isolate dynamic content to preserve cache prefix."""
@staticmethod
def separate_timestamp(message: str) -> tuple[str, dict]:
"""Move timestamp from message to metadata."""
# Instead of: "As of 2024-01-15 14:32, the status is..."
# Use: "The current status is..." + metadata
import re
from datetime import datetime
# Extract and remove timestamps
# Non-capturing group so re.findall returns full timestamps, not just the seconds part
timestamp_pattern = r'\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(?::\d{2})?\b'
timestamps = re.findall(timestamp_pattern, message)
clean_message = re.sub(timestamp_pattern, '[TIMESTAMP]', message)
metadata = {
"extracted_timestamps": timestamps,
"processed_at": datetime.now().isoformat()
}
return clean_message, metadata
@staticmethod
def deterministic_examples(
examples: List[dict],
max_examples: int = 5
) -> List[dict]:
"""Select examples deterministically for caching."""
# Bad: random.sample(examples, 5)
# Good: deterministic selection
# Sort by a stable key
sorted_examples = sorted(
examples,
key=lambda e: e.get("id", str(e))
)
# Take first N (or use content hash for variety)
return sorted_examples[:max_examples]
# Usage example
def create_cached_agent():
# Load stable content once
with open("instructions.md") as f:
instructions = f.read()
with open("knowledge_base.txt") as f:
knowledge = f.read()
with open("tools.json") as f:
tools = json.load(f)
builder = CacheOptimizedPromptBuilder(
static_instructions=instructions,
reference_data=knowledge,
tool_definitions=tools
)
print(f"Cache key: {builder.get_cache_key()}")
# Build system content for API call
system_content = builder.build_system_content(
session_context="User is a premium subscriber"
)
return system_content
using System.Linq;
using System.Security.Cryptography;
using System.Text.RegularExpressions;
using System.Text;
using System.Text.Json;
public class CacheOptimizedPromptBuilder
{
private readonly string _staticInstructions;
private readonly string _referenceData;
private readonly List<ToolDefinition> _toolDefinitions;
public CacheOptimizedPromptBuilder(
string staticInstructions,
string referenceData,
List<ToolDefinition> toolDefinitions)
{
_staticInstructions = staticInstructions;
_referenceData = referenceData;
_toolDefinitions = toolDefinitions;
}
public List<SystemContent> BuildSystemContent(
string? sessionContext = null)
{
var content = new List<SystemContent>();
// Layer 1: Static instructions (highest stability)
content.Add(new SystemContent
{
Text = Normalize(_staticInstructions),
CacheControl = new CacheControl { Type = "ephemeral" }
});
// Layer 2: Reference data
content.Add(new SystemContent
{
Text = Normalize(_referenceData),
CacheControl = new CacheControl { Type = "ephemeral" }
});
// Layer 3: Tool definitions (sorted for consistency)
var sortedTools = _toolDefinitions
.OrderBy(t => t.Name)
.ToList();
var toolsJson = JsonSerializer.Serialize(sortedTools,
new JsonSerializerOptions { WriteIndented = false });
content.Add(new SystemContent
{
Text = $"Available tools:\n{toolsJson}",
CacheControl = new CacheControl { Type = "ephemeral" }
});
// Layer 4: Session context (not cached)
if (!string.IsNullOrEmpty(sessionContext))
{
content.Add(new SystemContent
{
Text = sessionContext
// No CacheControl = not cached
});
}
return content;
}
private static string Normalize(string text)
{
var lines = text.Split('\n')
.Select(line => line.TrimEnd());
return string.Join("\n", lines);
}
public string GetCacheKey()
{
var sortedTools = _toolDefinitions
.OrderBy(t => t.Name)
.ToList();
var cacheable = Normalize(_staticInstructions) +
Normalize(_referenceData) +
JsonSerializer.Serialize(sortedTools);
using var sha256 = SHA256.Create();
var hash = sha256.ComputeHash(Encoding.UTF8.GetBytes(cacheable));
return Convert.ToHexString(hash)[..16];
}
}
public static class DynamicContentIsolator
{
public static (string CleanMessage, Dictionary<string, object> Metadata)
SeparateTimestamp(string message)
{
var timestampPattern = @"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?\b";
var matches = Regex.Matches(message, timestampPattern);
var cleanMessage = Regex.Replace(
message, timestampPattern, "[TIMESTAMP]");
var metadata = new Dictionary<string, object>
{
["extracted_timestamps"] = matches.Select(m => m.Value).ToList(),
["processed_at"] = DateTime.UtcNow.ToString("O")
};
return (cleanMessage, metadata);
}
public static List<T> DeterministicExamples<T>(
List<T> examples,
Func<T, string> keySelector,
int maxExamples = 5)
{
// Deterministic selection for caching
return examples
.OrderBy(keySelector)
.Take(maxExamples)
.ToList();
}
}
public record SystemContent
{
public string Text { get; init; } = "";
public CacheControl? CacheControl { get; init; }
}
public record CacheControl
{
public string Type { get; init; } = "ephemeral";
}
public record ToolDefinition
{
public string Name { get; init; } = "";
public string Description { get; init; } = "";
}
| Pattern | Problem | Fix |
|---|---|---|
| Timestamps in prompt | Changes every second/minute | Move to metadata or round to day |
| Request IDs in system | Unique per request | Move to message, not system prompt |
| Random example order | Different each time | Sort deterministically |
| User name in system | Breaks cache per user | Move to conversation, after cache boundary |
| Inconsistent whitespace | Trailing spaces vary | Normalize with .strip() |
Common cache-breaking patterns and fixes
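As a concrete illustration of these fixes, a cache-friendly prompt assembly might look like the hypothetical sketch below: the date is rounded to the day, per-request identifiers and timestamps live in the user message after the cache boundary, and few-shot examples are sorted before insertion. The helpers and inputs (build_system_prompt, build_user_message, the example list) are invented for the example.
from datetime import date, datetime, timezone

def build_system_prompt(instructions: str, examples: list[dict]) -> str:
    """Cache-friendly prefix: no per-request IDs, no sub-day timestamps,
    deterministic example ordering, normalized whitespace."""
    ordered = sorted(examples, key=lambda e: e["id"])          # not random.sample(...)
    lines = [instructions.strip(),
             f"Current date: {date.today().isoformat()}",      # changes daily, not per request
             "Examples:"]
    lines += [f"- {e['text'].strip()}" for e in ordered]
    return "\n".join(line.rstrip() for line in lines)          # no trailing-space drift

def build_user_message(question: str, request_id: str) -> str:
    # Per-request details go after the cache boundary, in the message itself.
    stamp = datetime.now(timezone.utc).isoformat()
    return f"[request {request_id} at {stamp}]\n{question}"

examples = [{"id": 2, "text": "Return JSON."}, {"id": 1, "text": "Be concise."}]
print(build_system_prompt("You are a support agent.", examples))
print(build_user_message("Where is my order?", "req-123"))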
Monitoring and Evaluation
Track cache performance to ensure you're getting the expected benefits:
| Metric | What it Measures | Target |
|---|---|---|
| Cache hit rate | % of requests using cached prefix | >90% for steady workloads |
| Cached token ratio | Cached tokens / total input tokens | Higher is better (varies by use case) |
| TTFT improvement | Latency reduction from caching | 50-80% for large prefixes |
| Cost savings | Actual vs theoretical cost | Track against baseline |
| Cache creation rate | New caches created / total requests | <10% indicates good stability |
Key metrics for cache performance
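A minimal way to track these numbers is to accumulate the usage counters returned with each response. The sketch below assumes Anthropic-style usage fields (input_tokens excluding the cached and cache-creation portions, plus cache_read_input_tokens and cache_creation_input_tokens); the CacheMetrics class itself is invented for the example.
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class CacheMetrics:
    requests: int = 0
    cache_hits: int = 0        # requests with cache_read_input_tokens > 0
    cache_creates: int = 0     # requests that created a new cache entry
    cached_tokens: int = 0
    input_tokens: int = 0

    def record(self, usage) -> None:
        """Update counters from one response's usage object."""
        read = getattr(usage, "cache_read_input_tokens", 0) or 0
        created = getattr(usage, "cache_creation_input_tokens", 0) or 0
        self.requests += 1
        self.cache_hits += read > 0
        self.cache_creates += created > 0
        self.cached_tokens += read
        # Assumes input_tokens excludes cached/creation tokens, as in Anthropic's usage block.
        self.input_tokens += usage.input_tokens + read + created

    def report(self) -> dict:
        return {
            "cache_hit_rate": self.cache_hits / max(self.requests, 1),
            "cache_creation_rate": self.cache_creates / max(self.requests, 1),
            "cached_token_ratio": self.cached_tokens / max(self.input_tokens, 1),
        }

# Example with stand-in usage objects: first request creates the cache, second reads it.
m = CacheMetrics()
m.record(SimpleNamespace(input_tokens=120, cache_read_input_tokens=0, cache_creation_input_tokens=55_000))
m.record(SimpleNamespace(input_tokens=80, cache_read_input_tokens=55_000, cache_creation_input_tokens=0))
print(m.report())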
Debugging Cache Misses
# Anthropic: Check usage in response
response.usage.cache_read_input_tokens # Should be > 0 if cache hit
response.usage.cache_creation_input_tokens # > 0 means cache was created
# If cache_creation keeps happening on every request:
# 1. Check for dynamic content in system prompt
# 2. Verify exact byte-level match of cached content
# 3. Ensure requests happen within TTL (5 min)
# Compute cache key hash to verify stability
import hashlib
cache_key = hashlib.sha256(system_prompt.encode()).hexdigest()[:16]
print(f"Cache key: {cache_key}")
# If this changes between requests, you have a caching problem
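Because a single differing character before the intended cache boundary forces a miss, it also helps to diff the prefixes of two consecutive requests and report where they first diverge. first_divergence below is a hypothetical helper, not part of any SDK; the two prompt variables are placeholders for the prompts you actually sent.
# The two system prompts sent on consecutive requests (substitute your own).
previous_system_prompt = 'You are a helpful assistant. Current time: 14:32'
current_system_prompt = 'You are a helpful assistant. Current time: 14:33'

def first_divergence(prompt_a: str, prompt_b: str) -> int | None:
    """Index of the first differing character, or None if one prompt is a
    prefix of the other (which keeps the cached prefix valid)."""
    for i, (a, b) in enumerate(zip(prompt_a, prompt_b)):
        if a != b:
            return i
    return None

idx = first_divergence(previous_system_prompt, current_system_prompt)
if idx is not None:
    start = max(0, idx - 30)
    print(f"Prompts diverge at character {idx}:")
    print(f"  previous: ...{previous_system_prompt[start:idx + 30]!r}")
    print(f"  current:  ...{current_system_prompt[start:idx + 30]!r}")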