Context Engineering
The discipline of optimizing what goes into the context window. Context engineering is replacing "prompt engineering" as the key skill for agent developers.
Why Context Engineering Matters
As context windows grow larger (128K, 200K, even 1M tokens), the challenge shifts from "fitting everything in" to "including the right things." Poor context management leads to:
- Performance degradation — models attend poorly to irrelevant content
- Higher costs — more tokens processed per request
- Slower responses — latency scales with context size
- Confused reasoning — conflicting or outdated information
The New Paradigm
The Four Strategies
┌──────────────────────────────────────────────────────────┐
│                      CONTEXT WINDOW                      │
│                                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │  WRITE   │  │  SELECT  │  │ COMPRESS │  │ ISOLATE  │  │
│  │          │  │          │  │          │  │          │  │
│  │Scratchpad│  │ Retrieve │  │Summarize │  │ Separate │  │
│  │ Working  │  │ relevant │  │ Dedupe   │  │ concerns │  │
│  │ memory   │  │  only    │  │ Prune    │  │into parts│  │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘  │
│       │             │             │             │        │
│       ▼             ▼             ▼             ▼        │
│   Add useful    Filter out   Reduce size    Multiple     │
│  intermediate   irrelevant   of existing     focused     │
│     state        content       content      contexts     │
└──────────────────────────────────────────────────────────┘
| Strategy | Purpose | When to Use |
|---|---|---|
| Write | Add scratchpads and working memory to context | Multi-step reasoning, accumulating findings |
| Select | Retrieve only relevant information | Large knowledge bases, RAG scenarios |
| Compress | Summarize, deduplicate, prune content | Long conversations, large tool outputs |
| Isolate | Separate concerns into different contexts | Complex workflows, planning vs execution |
Overview of the four context engineering strategies
Strategy 1: Write
The Write strategy adds working memory to the context — scratchpads, intermediate results, and accumulated knowledge that helps the agent maintain coherence across steps.
Step 1 Step 2 Step 3
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Agent │ │ Agent │ │ Agent │
│ thinks │ │ thinks │ │ thinks │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────────────────────────┐
│ SCRATCHPAD │
│ │
│ - User wants to analyze Q4 sales data │
│ - Found 3 relevant CSV files in /data/sales/ │
│ - Total records: 45,000 across all files │
│ - Key finding: Revenue up 23% YoY │
│ - Pending: Generate visualization │
└──────────────────────────────────────────────────────────┘
function agentWithScratchpad(task, tools):
# Initialize scratchpad in context
scratchpad = ""
messages = [
systemPrompt(tools),
userMessage(task)
]
while not complete:
response = llm.generate(messages)
if response.hasToolCall:
result = executeTools(response.toolCalls)
# Write key findings to scratchpad
scratchpad = updateScratchpad(scratchpad, result)
# Include scratchpad in next iteration
messages.append(assistantMessage(response))
messages.append(toolResult(result))
messages.append(systemMessage(
"Current scratchpad:\n" + scratchpad
))
else:
return response.content
function updateScratchpad(current, newInfo):
# Extract key facts, discard noise
keyFacts = extractKeyFacts(newInfo)
# Merge with existing, avoiding duplicates
return deduplicate(current + "\n" + keyFacts)
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
class AgentState(TypedDict):
messages: list
scratchpad: str # Working memory within context
task_complete: bool
llm = ChatOpenAI(model="gpt-4")
def think_and_act(state: AgentState) -> AgentState:
"""Agent step with scratchpad for working memory."""
# Include scratchpad in the prompt
scratchpad_prompt = f"""
Current Scratchpad (key findings so far):
{state['scratchpad'] or 'Empty - no findings yet'}
Use this scratchpad to track important discoveries.
Update it with new relevant information.
"""
messages = state['messages'] + [
{"role": "system", "content": scratchpad_prompt}
]
response = llm.invoke(messages)
# Extract any scratchpad updates from response
new_scratchpad = extract_scratchpad_update(
response.content,
state['scratchpad']
)
return {
**state,
"messages": state['messages'] + [response],
"scratchpad": new_scratchpad
}
def extract_scratchpad_update(
response: str,
current: str
) -> str:
"""Extract key facts from response to update scratchpad."""
# Use LLM to summarize key findings
summary_prompt = f"""
Extract only the key facts from this response that
should be remembered for future steps. Be concise.
Response: {response}
Current scratchpad: {current}
Output just the updated scratchpad content:
"""
summary = llm.invoke([{"role": "user", "content": summary_prompt}])
return summary.content
# Build graph with scratchpad state
graph = StateGraph(AgentState)
graph.add_node("agent", think_and_act)
graph.add_edge(START, "agent")
# ... add conditional edges based on task completion
using Microsoft.Extensions.AI;
using System.Text;
public class ScratchpadAgent
{
private readonly IChatClient _client;
private StringBuilder _scratchpad = new();
public ScratchpadAgent(IChatClient client)
{
_client = client;
}
public async Task<string> RunAsync(string task)
{
var messages = new List<ChatMessage>
{
new(ChatRole.System, GetSystemPrompt()),
new(ChatRole.User, task)
};
while (true)
{
// Inject scratchpad into context
if (_scratchpad.Length > 0)
{
messages.Add(new(ChatRole.System,
$"Current scratchpad:\n{_scratchpad}"));
}
var response = await _client.GetResponseAsync(messages);
if (response.FinishReason == ChatFinishReason.ToolCalls)
{
var result = await ExecuteToolsAsync(response);
// Update scratchpad with key findings
await UpdateScratchpadAsync(result);
messages.Add(response.Messages.Last());
messages.Add(new(ChatRole.Tool, result));
}
else
{
return response.Text;
}
}
}
private async Task UpdateScratchpadAsync(string newInfo)
{
// Use LLM to extract and deduplicate key facts
var extractPrompt = $"""
Extract key facts to remember from:
{newInfo}
Current scratchpad:
{_scratchpad}
Output updated scratchpad (concise, no duplicates):
""";
var summary = await _client.GetResponseAsync(extractPrompt);
_scratchpad.Clear();
_scratchpad.Append(summary.Text);
}
}
Best Practice
Strategy 2: Select
The Select strategy retrieves only the most relevant information for the current task. This is the foundation of RAG (Retrieval-Augmented Generation) but applies broadly to any context curation.
Query: "How do I handle API rate limits?"
│
▼
┌───────────────────────────────────┐
│ RELEVANCE SCORING │
│ │
│ Semantic Similarity × 0.6 │
│ (embedding distance) │
│ │
│ Recency Score × 0.2 │
│ (newer = more relevant) │
│ │
│ Importance Score × 0.2 │
│ (metadata-based) │
└───────────────────────────────────┘
│
▼
┌───────────────────────────────────┐
│ TOKEN BUDGET │
│ │
│ Max: 4000 tokens │
│ │
│ ✓ rate-limiting.md (0.92) 800t │
│ ✓ api-errors.md (0.84) 650t │
│ ✓ retry-patterns.md (0.78) 720t │
│ ✗ auth-setup.md (0.45) --- │
│ │
│ Total: 2170 tokens (under limit) │
└───────────────────────────────────┘
function selectRelevantContext(query, availableContext):
# Score each piece of context for relevance
scoredContext = []
for item in availableContext:
# Semantic similarity
similarity = embeddings.similarity(query, item.content)
# Recency boost (newer = more relevant)
recencyScore = calculateRecency(item.timestamp)
# Importance (based on metadata or past usage)
importance = item.metadata.importance or 1.0
finalScore = (similarity * 0.6) +
(recencyScore * 0.2) +
(importance * 0.2)
scoredContext.append({ item, finalScore })
# Sort by score and take top K items
scoredContext.sort(by: "finalScore", descending: true)
selected = scoredContext[:maxItems]
# Ensure we stay within token budget
return fitToTokenBudget(selected, maxTokens)
function fitToTokenBudget(items, maxTokens):
result = []
currentTokens = 0
for item in items:
itemTokens = countTokens(item.content)
if currentTokens + itemTokens <= maxTokens:
result.append(item)
currentTokens += itemTokens
else:
break
return result
from typing import List, Dict, Any
import numpy as np
from datetime import datetime, timedelta
from sentence_transformers import SentenceTransformer
class ContextSelector:
def __init__(
self,
max_tokens: int = 4000,
embedding_model: str = "all-MiniLM-L6-v2"
):
self.max_tokens = max_tokens
self.embedder = SentenceTransformer(embedding_model)
def select(
self,
query: str,
context_items: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
"""Select most relevant context items for query."""
# Compute query embedding
query_embedding = self.embedder.encode(query)
# Score each item
scored_items = []
for item in context_items:
score = self._score_item(
query_embedding,
item
)
scored_items.append((item, score))
# Sort by score descending
scored_items.sort(key=lambda x: x[1], reverse=True)
# Select within token budget
return self._fit_to_budget(scored_items)
def _score_item(
self,
query_embedding: np.ndarray,
item: Dict[str, Any]
) -> float:
"""Compute relevance score for an item."""
# Semantic similarity (cosine)
item_embedding = self.embedder.encode(item["content"])
similarity = np.dot(query_embedding, item_embedding) / (
np.linalg.norm(query_embedding) *
np.linalg.norm(item_embedding)
)
# Recency score (decay over time)
if "timestamp" in item:
age = datetime.now() - item["timestamp"]
# Exponential decay with a ~7-day time constant
recency = np.exp(-age.days / 7)
else:
recency = 0.5
# Importance from metadata
importance = item.get("importance", 1.0)
# Weighted combination
return (
similarity * 0.6 +
recency * 0.2 +
importance * 0.2
)
def _fit_to_budget(
self,
scored_items: List[tuple]
) -> List[Dict[str, Any]]:
"""Select items within token budget."""
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
selected = []
current_tokens = 0
for item, score in scored_items:
tokens = len(enc.encode(item["content"]))
if current_tokens + tokens <= self.max_tokens:
selected.append(item)
current_tokens += tokens
else:
break
return selected
# Usage
selector = ContextSelector(max_tokens=4000)
relevant = selector.select(
query="How do I handle API rate limits?",
context_items=all_docs
)
using Microsoft.Extensions.AI;
using Microsoft.ML.Tokenizers;
using OpenAI;
public class ContextSelector
{
private readonly IEmbeddingGenerator<string, Embedding<float>> _embedder;
private readonly Tokenizer _tokenizer;
private readonly int _maxTokens;
public ContextSelector(string apiKey, int maxTokens = 4000)
{
_embedder = new OpenAIClient(apiKey)
.GetEmbeddingClient("text-embedding-3-small")
.AsIEmbeddingGenerator();
_maxTokens = maxTokens;
_tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
}
public async Task<List<ContextItem>> SelectAsync(
string query,
List<ContextItem> items)
{
// Get query embedding
var queryResult = await _embedder.GenerateAsync(query);
var queryEmbedding = queryResult.Single().Vector;
// Score all items
var scored = new List<(ContextItem Item, float Score)>();
foreach (var item in items)
{
var itemResult = await _embedder.GenerateAsync(item.Content);
var itemEmbedding = itemResult.Single().Vector;
var score = ScoreItem(queryEmbedding, itemEmbedding, item);
scored.Add((item, score));
}
// Sort by score descending
scored.Sort((a, b) => b.Score.CompareTo(a.Score));
// Select within token budget
return FitToBudget(scored);
}
private float ScoreItem(
ReadOnlyMemory<float> queryEmb,
ReadOnlyMemory<float> itemEmb,
ContextItem item)
{
// Cosine similarity
var similarity = CosineSimilarity(queryEmb.Span, itemEmb.Span);
// Recency score
var age = DateTime.UtcNow - item.Timestamp;
var recency = (float)Math.Exp(-age.TotalDays / 7);
// Weighted combination
return similarity * 0.6f +
recency * 0.2f +
item.Importance * 0.2f;
}
private List<ContextItem> FitToBudget(
List<(ContextItem Item, float Score)> scored)
{
var selected = new List<ContextItem>();
var currentTokens = 0;
foreach (var (item, _) in scored)
{
var tokens = _tokenizer.CountTokens(item.Content);
if (currentTokens + tokens <= _maxTokens)
{
selected.Add(item);
currentTokens += tokens;
}
else break;
}
return selected;
}
private static float CosineSimilarity(
ReadOnlySpan<float> a,
ReadOnlySpan<float> b)
{
float dot = 0, normA = 0, normB = 0;
for (int i = 0; i < a.Length; i++)
{
dot += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}
}
Pitfall: Over-Selection
Strategy 3: Compress
The Compress strategy reduces the size of existing context through summarization, deduplication, and pruning. Essential for long-running conversations and agents that accumulate large tool outputs.
BEFORE COMPRESSION (12,000 tokens)
┌────────────────────────────────────────────┐
│ Message 1: User asks about project setup │
│ Message 2: Agent explains 3 options │
│ Message 3: User chooses option B │
│ Message 4: Agent runs npm install │
│ Message 5: Tool output (3000 tokens!) │
│ Message 6: Agent summarizes result │
│ ... │
│ Message 20: Current question │
└────────────────────────────────────────────┘
│
│ Compress
▼
AFTER COMPRESSION (3,500 tokens)
┌────────────────────────────────────────────┐
│ Summary: "User setting up project with │
│ option B. npm install completed with │
│ 847 packages. Currently working on..." │
│ │
│ Message 18: [kept intact] │
│ Message 19: [kept intact] │
│ Message 20: Current question │
└────────────────────────────────────────────┘
function compressConversation(messages, targetTokens):
currentTokens = countTokens(messages)
if currentTokens <= targetTokens:
return messages
# Strategy 1: Summarize older messages
threshold = length(messages) * 0.6 # Keep recent 40% intact
oldMessages = messages[:threshold]
recentMessages = messages[threshold:]
summary = llm.summarize(
oldMessages,
instruction: "Summarize the key points, decisions,
and pending tasks from this conversation."
)
compressedMessages = [
systemMessage("Previous conversation summary:\n" + summary),
...recentMessages
]
# Strategy 2: If still too large, truncate tool outputs
if countTokens(compressedMessages) > targetTokens:
compressedMessages = truncateLargeToolOutputs(
compressedMessages,
maxOutputSize: 500
)
return compressedMessages
function truncateLargeToolOutputs(messages, maxOutputSize):
for message in messages:
if message.role == "tool" and
countTokens(message.content) > maxOutputSize:
message.content = summarizeToolOutput(
message.content,
maxOutputSize
)
return messages
from typing import List
import tiktoken
from langchain_openai import ChatOpenAI
from langchain_core.messages import BaseMessage, SystemMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate
class ConversationCompressor:
def __init__(
self,
target_tokens: int = 8000,
preserve_recent_ratio: float = 0.4
):
self.llm = ChatOpenAI(model="gpt-4o-mini") # Fast model for compression
self.target_tokens = target_tokens
self.preserve_ratio = preserve_recent_ratio
self.enc = tiktoken.get_encoding("cl100k_base")
def compress(self, messages: List[BaseMessage]) -> List[BaseMessage]:
"""Compress conversation to fit token budget."""
current_tokens = self._count_tokens(messages)
if current_tokens <= self.target_tokens:
return messages
# Split into old and recent
split_idx = int(len(messages) * (1 - self.preserve_ratio))
old_messages = messages[:split_idx]
recent_messages = messages[split_idx:]
# Summarize old messages
summary = self._summarize(old_messages)
compressed = [
SystemMessage(content=f"Previous conversation summary:\n{summary}"),
*recent_messages
]
# If still too large, truncate tool outputs
if self._count_tokens(compressed) > self.target_tokens:
compressed = self._truncate_tool_outputs(compressed)
return compressed
def _summarize(self, messages: List[BaseMessage]) -> str:
"""Summarize messages using LangChain."""
formatted = "\n".join([
f"{m.type.upper()}: {m.content[:500]}"
for m in messages
])
prompt = ChatPromptTemplate.from_messages([
("user", """Summarize this conversation, preserving:
1. Key decisions made
2. Important facts discovered
3. Current task status
4. Any pending questions
Conversation:
{conversation}
Concise summary:""")
])
chain = prompt | self.llm
response = chain.invoke({"conversation": formatted})
return response.content
def _truncate_tool_outputs(
self, messages: List[BaseMessage], max_output_tokens: int = 300
) -> List[BaseMessage]:
"""Truncate large tool outputs."""
result = []
for msg in messages:
if msg.type == "tool":
tokens = len(self.enc.encode(str(msg.content)))
if tokens > max_output_tokens:
summary = self._summarize_output(msg.content)
result.append(msg.model_copy(update={"content": summary}))
else:
result.append(msg)
else:
result.append(msg)
return result
def _summarize_output(self, content: str) -> str:
"""Summarize a tool output."""
prompt = ChatPromptTemplate.from_messages([
("user", "Summarize this tool output in 2-3 sentences, "
"keeping essential data:\n{content}")
])
chain = prompt | self.llm
return chain.invoke({"content": content[:2000]}).content
def _count_tokens(self, messages: List[BaseMessage]) -> int:
"""Count tokens in messages."""
return sum(len(self.enc.encode(str(m.content))) for m in messages)
using Microsoft.Extensions.AI;
using Microsoft.ML.Tokenizers;
public class ConversationCompressor
{
private readonly IChatClient _client;
private readonly Tokenizer _tokenizer;
private readonly int _targetTokens;
private readonly float _preserveRatio;
public ConversationCompressor(
IChatClient client,
int targetTokens = 8000,
float preserveRecentRatio = 0.4f)
{
_client = client;
_targetTokens = targetTokens;
_preserveRatio = preserveRecentRatio;
_tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
}
public async Task<List<ChatMessage>> CompressAsync(
List<ChatMessage> messages)
{
var currentTokens = CountTokens(messages);
if (currentTokens <= _targetTokens)
return messages;
// Split into old and recent
var splitIdx = (int)(messages.Count * (1 - _preserveRatio));
var oldMessages = messages.Take(splitIdx).ToList();
var recentMessages = messages.Skip(splitIdx).ToList();
// Summarize old messages
var summary = await SummarizeAsync(oldMessages);
var compressed = new List<ChatMessage>
{
new(ChatRole.System,
$"Previous conversation summary:\n{summary}")
};
compressed.AddRange(recentMessages);
// Truncate tool outputs if still too large
if (CountTokens(compressed) > _targetTokens)
{
compressed = await TruncateToolOutputsAsync(compressed);
}
return compressed;
}
private async Task<string> SummarizeAsync(
List<ChatMessage> messages)
{
var formatted = string.Join("\n",
messages.Select(m =>
$"{m.Role.ToString().ToUpper()}: " +
$"{Truncate(m.Text ?? "", 500)}"));
var prompt = $"""
Summarize this conversation, preserving:
1. Key decisions made
2. Important facts discovered
3. Current task status
Conversation:
{formatted}
Concise summary:
""";
var response = await _client.GetResponseAsync(prompt);
return response.Text;
}
private async Task<List<ChatMessage>> TruncateToolOutputsAsync(
List<ChatMessage> messages,
int maxOutputTokens = 300)
{
var result = new List<ChatMessage>();
foreach (var msg in messages)
{
if (msg.Role == ChatRole.Tool &&
_tokenizer.CountTokens(msg.Text ?? "") > maxOutputTokens)
{
var summary = await SummarizeOutputAsync(msg.Text!);
result.Add(new ChatMessage(ChatRole.Tool, summary));
}
else
{
result.Add(msg);
}
}
return result;
}
private async Task<string> SummarizeOutputAsync(string content)
{
var response = await _client.GetResponseAsync(
$"Summarize in 2-3 sentences: {Truncate(content, 2000)}");
return response.Text;
}
private int CountTokens(List<ChatMessage> messages) =>
messages.Sum(m => _tokenizer.CountTokens(m.Text ?? ""));
private static string Truncate(string s, int max) =>
s.Length <= max ? s : s[..max] + "...";
}
| Technique | Token Reduction | Information Loss | Best For |
|---|---|---|---|
| Summarization | 70-90% | Medium | Older conversation history |
| Truncation | Variable | High (for cut content) | Tool outputs with known structure |
| Deduplication | 10-30% | None | Repeated information across sources |
| Selective retention | 50-80% | Low (if done well) | Mixed-importance content |
Compression techniques and their trade-offs
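Deduplication appears in the table above but is not shown in the compressor code. Below is a minimal sketch of embedding-based near-duplicate pruning, assuming the same sentence-transformers model used in the Select examples; the function name and the 0.9 similarity threshold are illustrative choices, not a prescribed API.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def deduplicate_items(items: list[str], threshold: float = 0.9) -> list[str]:
    """Drop items that are near-duplicates of something already kept."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    kept: list[str] = []
    kept_embeddings = []
    for item in items:
        emb = embedder.encode(item, convert_to_tensor=True)
        # Keep the item only if it is not too similar to anything already kept
        if all(cos_sim(emb, prev).item() < threshold for prev in kept_embeddings):
            kept.append(item)
            kept_embeddings.append(emb)
    return kept
Because nothing is summarized, this technique loses no information; the trade-off is only the cost of embedding each candidate item.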
Strategy 4: Isolate
The Isolate strategy separates different concerns into distinct contexts. Instead of one massive context, use multiple focused contexts for different stages or aspects of the task.
┌─────────────────────────────────────────────────────────────┐
│ PLANNING CONTEXT │
│ │
│ System: "You are a planning agent. Break down tasks..." │
│ History: Previous plans and their outcomes │
│ Tools: NONE (planning only) │
│ │
│ Output: Step-by-step plan │
└─────────────────────────────────────────────────────────────┘
│
│ Plan steps
▼
┌─────────────────────────────────────────────────────────────┐
│ EXECUTION CONTEXT │
│ │
│ System: "Execute this step precisely..." │
│ History: Recent execution results only (bounded) │
│ Memory: Relevant facts retrieved on-demand │
│ Tools: All available tools │
│ │
│ Output: Step results, extracted facts │
└─────────────────────────────────────────────────────────────┘
│
│ Facts
▼
┌─────────────────────────────────────────────────────────────┐
│ MEMORY CONTEXT │
│ │
│ Long-term storage of extracted facts │
│ Searchable by relevance to current step │
│ Persists across conversation sessions │
└─────────────────────────────────────────────────────────────┘
# Strategy: Separate contexts for different concerns
class IsolatedContextAgent:
def __init__():
self.planningContext = [] # High-level reasoning
self.executionContext = [] # Tool interactions
self.memoryContext = [] # Long-term facts
function plan(task):
# Planning uses only high-level context
response = llm.generate(
systemPrompt: "You are a planning agent...",
messages: self.planningContext + [task],
tools: [] # No tools during planning
)
self.planningContext.append(task)
self.planningContext.append(response)
return parsePlan(response)
function execute(step):
# Execution uses separate context with tools
relevantMemory = self.memoryContext.search(step)
response = llm.generate(
systemPrompt: "Execute this step precisely...",
messages: [
memoryContext(relevantMemory),
stepInstruction(step)
],
tools: allTools
)
# Update execution context (bounded size)
self.executionContext.append(step, response)
self.executionContext = keepRecent(self.executionContext, 10)
# Extract facts for memory
facts = extractFacts(response)
self.memoryContext.store(facts)
return response
function run(task):
plan = self.plan(task)
for step in plan.steps:
result = self.execute(step)
# Check if replanning needed
if result.requiresReplan:
plan = self.plan(
"Replan given: " + result.summary
)
return synthesizeResults(plan)
from typing import List, Dict
from dataclasses import dataclass, field
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
from langchain_core.prompts import ChatPromptTemplate
@dataclass
class IsolatedContextAgent:
"""Agent with separate contexts for planning vs execution."""
llm: ChatOpenAI = field(default_factory=lambda: ChatOpenAI(model="gpt-4"))
fast_llm: ChatOpenAI = field(default_factory=lambda: ChatOpenAI(model="gpt-4o-mini"))
planning_context: List = field(default_factory=list)
execution_context: List[Dict] = field(default_factory=list)
memory_store: List[Dict] = field(default_factory=list)
max_execution_history: int = 10
def plan(self, task: str) -> List[str]:
"""Plan using isolated planning context (no tools)."""
messages = [
SystemMessage(content="""You are a planning agent. Break down tasks
into clear, executable steps. Do not execute - only plan.
Output format:
STEP 1: [action]
STEP 2: [action]
..."""),
*self.planning_context,
HumanMessage(content=task)
]
response = self.llm.invoke(messages)
plan_response = response.content
# Update planning context
self.planning_context.append(HumanMessage(content=task))
self.planning_context.append(AIMessage(content=plan_response))
return self._parse_steps(plan_response)
def execute(self, step: str, tools: List) -> Dict:
"""Execute step in isolated execution context."""
relevant_memory = self._search_memory(step)
# Bind tools for execution
llm_with_tools = self.llm.bind_tools(tools)
messages = [
SystemMessage(content=f"""Execute this step precisely.
Relevant context from memory:
{relevant_memory}
Complete the step and report results."""),
HumanMessage(content=f"Execute: {step}")
]
response = llm_with_tools.invoke(messages)
# Update bounded execution context
self.execution_context.append({"step": step, "result": response.content})
self.execution_context = self.execution_context[-self.max_execution_history:]
# Extract and store facts
self._extract_and_store_facts(response.content)
return {"content": response.content, "tool_calls": response.tool_calls}
def _search_memory(self, query: str, top_k: int = 3) -> str:
"""Search memory store for relevant facts."""
relevant = []
query_words = set(query.lower().split())
for item in self.memory_store:
item_words = set(item["content"].lower().split())
overlap = len(query_words & item_words)
if overlap > 0:
relevant.append((item, overlap))
relevant.sort(key=lambda x: x[1], reverse=True)
return "\n".join([f"- {item['content']}" for item, _ in relevant[:top_k]]) or "No relevant memory found."
def _extract_and_store_facts(self, content: str):
"""Extract key facts and add to memory."""
prompt = ChatPromptTemplate.from_messages([
("user", """Extract 1-3 key facts from this that should be remembered. Output one fact per line.
Content: {content}""")
])
chain = prompt | self.fast_llm
extraction = chain.invoke({"content": content})
facts = extraction.content.strip().split("\n")
for fact in facts:
if fact.strip():
self.memory_store.append({"content": fact.strip(), "source": "execution"})
def _parse_steps(self, plan: str) -> List[str]:
"""Parse plan into list of steps."""
steps = []
for line in plan.split("\n"):
if line.strip().startswith("STEP"):
parts = line.split(":", 1)
if len(parts) > 1:
steps.append(parts[1].strip())
return steps
# Usage
agent = IsolatedContextAgent()
# Planning happens in isolated context
steps = agent.plan("Research competitors and create summary report")
# Each step executes in separate context
for step in steps:
result = agent.execute(step, tools=research_tools)
print(f"Completed: {step}") using Microsoft.Extensions.AI;
public class IsolatedContextAgent
{
private readonly IChatClient _client;
private readonly List<ChatMessage> _planningContext = new();
private readonly Queue<ExecutionRecord> _executionContext = new();
private readonly List<MemoryItem> _memoryStore = new();
private const int MaxExecutionHistory = 10;
public IsolatedContextAgent(IChatClient client)
{
_client = client;
}
public async Task<List<string>> PlanAsync(string task)
{
// Planning uses isolated context - no tools
var messages = new List<ChatMessage>
{
new(ChatRole.System, """
You are a planning agent. Break down tasks
into clear, executable steps. Do not execute.
Output format:
STEP 1: [action]
STEP 2: [action]
""")
};
messages.AddRange(_planningContext);
messages.Add(new(ChatRole.User, task));
var response = await _client.GetResponseAsync(
messages,
new ChatOptions { Tools = [] } // No tools for planning
);
// Update planning context
_planningContext.Add(new(ChatRole.User, task));
_planningContext.Add(new(ChatRole.Assistant, response.Text));
return ParseSteps(response.Text);
}
public async Task<ExecutionResult> ExecuteAsync(
string step,
IList<AITool> tools)
{
// Retrieve relevant memory
var relevantMemory = SearchMemory(step);
var messages = new List<ChatMessage>
{
new(ChatRole.System, $"""
Execute this step precisely.
Relevant context from memory:
{relevantMemory}
Complete the step and report results.
"""),
new(ChatRole.User, $"Execute: {step}")
};
var response = await _client.GetResponseAsync(
messages,
new ChatOptions { Tools = tools }
);
// Update bounded execution context
_executionContext.Enqueue(new(step, response.Text));
while (_executionContext.Count > MaxExecutionHistory)
_executionContext.Dequeue();
// Extract and store facts
await ExtractAndStoreFactsAsync(response.Text);
// Collect any function-call contents from the response messages
var toolCalls = response.Messages
    .SelectMany(m => m.Contents)
    .OfType<FunctionCallContent>()
    .ToList();
return new ExecutionResult(response.Text, toolCalls);
}
private string SearchMemory(string query, int topK = 3)
{
var queryWords = query.ToLower().Split(' ').ToHashSet();
var relevant = _memoryStore
.Select(m => (Item: m,
Score: m.Content.ToLower().Split(' ')
.Count(w => queryWords.Contains(w))))
.Where(x => x.Score > 0)
.OrderByDescending(x => x.Score)
.Take(topK)
.Select(x => $"- {x.Item.Content}");
var lines = relevant.ToList();
return lines.Count > 0
    ? string.Join("\n", lines)
    : "No relevant memory.";
}
private async Task ExtractAndStoreFactsAsync(string content)
{
var response = await _client.GetResponseAsync($"""
Extract 1-3 key facts to remember:
{content}
""");
foreach (var fact in response.Text.Split('\n'))
{
if (!string.IsNullOrWhiteSpace(fact))
_memoryStore.Add(new(fact.Trim(), "execution"));
}
}
private static List<string> ParseSteps(string plan)
{
return plan.Split('\n')
.Where(l => l.TrimStart().StartsWith("STEP"))
.Select(l => l.Split(':', 2).LastOrDefault()?.Trim() ?? "")
.Where(s => !string.IsNullOrEmpty(s))
.ToList();
}
}
public record ExecutionRecord(string Step, string Result);
public record MemoryItem(string Content, string Source);
public record ExecutionResult(string Content, IList<FunctionCallContent>? ToolCalls);
When to Isolate
Combining Strategies
In practice, effective agents combine multiple strategies. Here's a typical pattern:
User Query
│
▼
┌─────────────────┐
│ SELECT │ ◄── Retrieve relevant docs/memory
└────────┬────────┘
│
▼
┌─────────────────┐
│ COMPRESS │ ◄── Summarize if over budget
└────────┬────────┘
│
▼
┌─────────────────┐
│ ISOLATE │ ◄── Route to appropriate context
└────────┬────────┘ (planning vs execution)
│
▼
┌─────────────────┐
│ WRITE │ ◄── Update scratchpad with findings
└────────┬────────┘
│
▼
Next iteration
| Scenario | Primary Strategy | Supporting Strategies |
|---|---|---|
| RAG chatbot | Select | Compress (for long docs) |
| Coding agent | Write + Isolate | Select (for relevant files) |
| Research assistant | Select + Write | Compress (for sources) |
| Multi-step workflow | Isolate | Write (scratchpad), Compress (history) |
Strategy combinations for common scenarios
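The sketch below wires one loop iteration together in the order shown in the diagram, reusing the ContextSelector, ConversationCompressor, and extract_scratchpad_update pieces defined earlier; run_iteration is a hypothetical helper, and llm stands in for any chat model instance. It is one plausible arrangement under those assumptions, not the only one.
from langchain_core.messages import SystemMessage, HumanMessage

def run_iteration(query, docs, messages, scratchpad, llm,
                  selector: ContextSelector,
                  compressor: ConversationCompressor):
    # SELECT: pull only the documents relevant to this query
    relevant_docs = selector.select(query, docs)
    # COMPRESS: shrink the running conversation if it is over budget
    messages = compressor.compress(messages)
    # ISOLATE: retrieved docs and scratchpad are injected for this turn only,
    # not appended to the persistent history
    turn_context = [
        SystemMessage(content="Relevant documents:\n" +
                      "\n".join(d["content"] for d in relevant_docs)),
        SystemMessage(content=f"Scratchpad:\n{scratchpad}"),
    ]
    response = llm.invoke(messages + turn_context + [HumanMessage(content=query)])
    # WRITE: fold key findings back into the scratchpad
    scratchpad = extract_scratchpad_update(response.content, scratchpad)
    return response, messages, scratchpad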
Evaluation Approach
Measuring context engineering effectiveness requires tracking both efficiency and quality metrics:
| Metric | What it Measures | Target |
|---|---|---|
| Token efficiency | Useful tokens / total tokens | Higher is better (aim for >70%) |
| Retrieval precision | Relevant items retrieved / total retrieved | >80% for Select strategy |
| Compression fidelity | Key facts retained after compression | >95% for critical facts |
| Task success rate | Tasks completed correctly | Compare before/after optimization |
| Cost per task | Total tokens × price per token | Track reduction over baseline |
Key metrics for context engineering evaluation
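A minimal sketch of how the first two metrics and cost per task might be computed; the function names are illustrative, and deciding which tokens were "useful" and which retrieved items were "relevant" requires ground-truth labels (human or LLM-as-judge) that are assumed to exist here.
def token_efficiency(useful_tokens: int, total_tokens: int) -> float:
    """Share of the context the model actually needed (target: > 0.7)."""
    return useful_tokens / total_tokens if total_tokens else 0.0

def retrieval_precision(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved items that were relevant (target: > 0.8 for Select)."""
    return len(retrieved_ids & relevant_ids) / len(retrieved_ids) if retrieved_ids else 0.0

def cost_per_task(prompt_tokens: int, completion_tokens: int,
                  prompt_price: float, completion_price: float) -> float:
    """Dollar cost of one task given per-token prices."""
    return prompt_tokens * prompt_price + completion_tokens * completion_price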
A/B Testing Approach
Run the same tasks with different context strategies and compare: (1) task completion rate, (2) response quality (LLM-as-judge), (3) average tokens per task, (4) latency. The best strategy maximizes quality while minimizing tokens.
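A sketch of such a harness, assuming each strategy is wrapped as a callable that takes a task and returns (success, quality_score, tokens_used); RunRecord and compare_strategies are hypothetical names, and the quality score would come from an LLM-as-judge call that is not shown.
from dataclasses import dataclass
import statistics
import time

@dataclass
class RunRecord:
    success: bool
    quality: float      # e.g. LLM-as-judge score in [0, 1]
    tokens: int
    latency_s: float

def compare_strategies(tasks, strategies):
    """Run every task under every strategy and report the four comparison metrics."""
    report = {}
    for name, run_agent in strategies.items():
        records = []
        for task in tasks:
            start = time.time()
            success, quality, tokens = run_agent(task)
            records.append(RunRecord(success, quality, tokens, time.time() - start))
        report[name] = {
            "completion_rate": sum(r.success for r in records) / len(records),
            "avg_quality": statistics.mean(r.quality for r in records),
            "avg_tokens": statistics.mean(r.tokens for r in records),
            "avg_latency_s": statistics.mean(r.latency_s for r in records),
        }
    return report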