Agent Memory Systems
How agents maintain context, learn from past interactions, and build persistent knowledge across sessions.
Why Memory Matters
Without memory systems, every agent interaction starts from scratch. Memory enables agents to:
- Remember user preferences and past decisions
- Learn from successful (and failed) task completions
- Maintain context across long conversations
- Build knowledge bases from interactions
- Personalize responses based on history
Types of Agent Memory
┌─────────────────────────────────────────────────────────────┐
│ Memory Architecture │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ WORKING MEMORY │
│ ───────────────── │
│ Current conversation in context window │
│ Scope: Current turn │ Storage: Context window │
│ Capacity: Model's context limit (4K-1M tokens) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ SHORT-TERM MEMORY │
│ ───────────────────── │
│ Facts extracted from current session │
│ Scope: Current session │ Storage: In-memory │
│ Example: "User asked about Python decorators" │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LONG-TERM MEMORY │
│ ──────────────────── │
│ Persistent facts and knowledge │
│ Scope: Cross-session │ Storage: Vector DB │
│ Example: "User prefers concise responses" │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ EPISODIC MEMORY │
│ ──────────────────── │
│ Specific past experiences and outcomes │
│ Scope: Cross-session │ Storage: Indexed experiences │
│ Example: "Task X succeeded with approach Y" │
└─────────────────────────────────────────────────────────────┘

| Type | Scope | Implementation | Use Case |
|---|---|---|---|
| Working | Current conversation | Context window | Immediate task context |
| Short-term | Current session | In-memory store | Session-specific facts |
| Long-term | Cross-session | Vector DB + metadata | User preferences, knowledge |
| Episodic | Specific interactions | Indexed experiences | Learning from past tasks |
Comparison of memory types
class AgentMemory:
# Working Memory: Current conversation context
workingMemory = [] # Lives in context window
# Short-term Memory: Current session facts
shortTermMemory = InMemoryStore()
# Long-term Memory: Persistent across sessions
longTermMemory = VectorDatabase()
# Episodic Memory: Specific past experiences
episodicMemory = IndexedExperienceStore()
function remember(information, memoryType):
if memoryType == "working":
workingMemory.append(information)
elif memoryType == "short_term":
shortTermMemory.store(information)
elif memoryType == "long_term":
embedding = embed(information)
longTermMemory.store(embedding, information)
elif memoryType == "episodic":
episode = createEpisode(information)
episodicMemory.index(episode)
function recall(query, memoryTypes = ["all"]):
results = []
if "working" in memoryTypes or "all" in memoryTypes:
results += searchWorkingMemory(query)
if "short_term" in memoryTypes or "all" in memoryTypes:
results += shortTermMemory.search(query)
if "long_term" in memoryTypes or "all" in memoryTypes:
embedding = embed(query)
results += longTermMemory.similaritySearch(embedding)
if "episodic" in memoryTypes or "all" in memoryTypes:
results += episodicMemory.searchRelevant(query)
return rankAndMerge(results)

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.memory import ConversationSummaryBufferMemory
from langchain_core.messages import HumanMessage, AIMessage
from datetime import datetime
class AgentMemorySystem:
def __init__(self, persist_directory: str = "./memory_store"):
# Working memory: LangChain's conversation buffer with summarization
self.llm = ChatOpenAI(model="gpt-4")
self.working_memory = ConversationSummaryBufferMemory(
llm=self.llm,
max_token_limit=2000,
return_messages=True
)
# Short-term: In-memory for current session
self.short_term: list[dict] = []
# Long-term: Chroma with LangChain integration
self.embeddings = OpenAIEmbeddings()
self.long_term = Chroma(
collection_name="agent_memory",
embedding_function=self.embeddings,
persist_directory=persist_directory
)
def add_to_working_memory(self, human_msg: str, ai_msg: str):
"""Add exchange to working memory with auto-summarization."""
self.working_memory.save_context(
{"input": human_msg},
{"output": ai_msg}
)
def store_short_term(self, content: str, metadata: dict = None):
"""Store fact for current session."""
self.short_term.append({
"content": content,
"metadata": metadata or {},
"timestamp": datetime.now()
})
def store_long_term(self, content: str, metadata: dict = None):
"""Store in persistent vector database."""
self.long_term.add_texts(
texts=[content],
metadatas=[metadata or {}],
ids=[f"mem_{datetime.now().timestamp()}"]
)
def recall(self, query: str, n_results: int = 5) -> list[str]:
"""Retrieve relevant memories from all sources."""
results = []
# Search long-term memory with similarity search
long_term_docs = self.long_term.similarity_search(
query, k=n_results
)
results.extend([doc.page_content for doc in long_term_docs])
# Include relevant short-term memories
for mem in self.short_term:
if self._is_relevant(query, mem["content"]):
results.append(mem["content"])
# Get summarized working memory
working = self.working_memory.load_memory_variables({})
if working.get("history"):
results.append(f"Recent context: {working['history']}")
return results[:n_results]

using Microsoft.Extensions.AI;
using Microsoft.Extensions.VectorData;
using Azure.AI.OpenAI;
public class AgentMemorySystem
{
private readonly List<ChatMessage> _workingMemory = new();
private readonly List<MemoryRecord> _shortTermMemory = new();
private readonly IVectorStore _vectorStore;
private readonly IEmbeddingGenerator<string, Embedding<float>> _embedder;
private readonly string _collectionName = "agent_memories";
public AgentMemorySystem(
IVectorStore vectorStore,
IEmbeddingGenerator<string, Embedding<float>> embedder)
{
_vectorStore = vectorStore;
_embedder = embedder;
}
// Working Memory: Current conversation
public void AddToWorkingMemory(ChatMessage message)
{
_workingMemory.Add(message);
if (CountTokens(_workingMemory) > 6000)
{
CompressWorkingMemory();
}
}
// Short-term: Current session only
public void StoreShortTerm(string content, Dictionary<string, string>? metadata = null)
{
_shortTermMemory.Add(new MemoryRecord
{
Content = content,
Type = MemoryType.ShortTerm,
Timestamp = DateTime.UtcNow,
Metadata = metadata ?? new()
});
}
// Long-term: Persistent with vector search
public async Task StoreLongTermAsync(string content, string? id = null)
{
var memoryId = id ?? $"mem_{DateTime.UtcNow.Ticks}";
var embedding = await _embedder.GenerateEmbeddingAsync(content);
var collection = _vectorStore.GetCollection<string, MemoryRecord>(_collectionName);
await collection.UpsertAsync(new MemoryRecord
{
Id = memoryId,
Content = content,
Embedding = embedding.Vector,
Timestamp = DateTime.UtcNow
});
}
// Recall relevant memories using vector similarity
public async Task<List<string>> RecallAsync(string query, int limit = 5)
{
var results = new List<string>();
var queryEmbedding = await _embedder.GenerateEmbeddingAsync(query);
var collection = _vectorStore.GetCollection<string, MemoryRecord>(_collectionName);
var searchResults = await collection.VectorizedSearchAsync(
queryEmbedding.Vector,
new VectorSearchOptions { Top = limit }
);
await foreach (var result in searchResults.Results)
{
results.Add(result.Record.Content);
}
// Include relevant short-term memories
results.AddRange(_shortTermMemory
.Where(m => IsRelevant(query, m.Content))
.Select(m => m.Content)
.Take(limit));
return results.Take(limit).ToList();
}
private void CompressWorkingMemory() { /* ... */ }
}

Working Memory: Summarization
The context window has limits. When conversations get long, you need strategies to compress history while preserving important information:
class ConversationSummarizer:
threshold = 4000 # tokens
summaryRatio = 0.3 # Compress to 30%
function maybeCompress(messages):
tokens = countTokens(messages)
if tokens < threshold:
return messages
# Split into chunks to summarize
toSummarize = messages[:-5] # Keep recent 5
recent = messages[-5:]
# Generate summary
summary = llm.generate(
prompt: "Summarize this conversation concisely:",
content: toSummarize
)
# Return compressed version
return [
systemMessage(f"Summary of earlier conversation: {summary}"),
...recent
]
function progressiveSummarize(messages, levels = 3):
# Multi-level summarization for very long conversations
current = messages
for level in range(levels):
if countTokens(current) < threshold:
break
# Summarize in chunks
chunks = splitIntoChunks(current, chunkSize=10)
summaries = [summarize(chunk) for chunk in chunks]
current = summaries
return current

from langchain_openai import ChatOpenAI
from langchain.memory import ConversationSummaryBufferMemory
from langchain_core.messages import get_buffer_string
class ConversationSummarizer:
def __init__(self, max_token_limit: int = 4000):
self.llm = ChatOpenAI(model="gpt-4")
# Built-in summarization when buffer exceeds limit
self.memory = ConversationSummaryBufferMemory(
llm=self.llm,
max_token_limit=max_token_limit,
return_messages=True
)
def add_exchange(self, human_input: str, ai_output: str):
"""Add conversation exchange with auto-summarization."""
# LangChain automatically summarizes when limit exceeded
self.memory.save_context(
{"input": human_input},
{"output": ai_output}
)
def get_context(self) -> dict:
"""Get current memory context (summarized + recent)."""
return self.memory.load_memory_variables({})
def clear(self):
"""Clear all memory."""
self.memory.clear()
# For more control, use ConversationSummaryMemory directly
from langchain.memory import ConversationSummaryMemory
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
class CustomSummarizer:
def __init__(self):
self.llm = ChatOpenAI(model="gpt-4")
# Custom summarization prompt
self.summary_prompt = PromptTemplate(
input_variables=["summary", "new_lines"],
template="""Progressively summarize the conversation, adding to the summary.
Current summary: {summary}
New lines: {new_lines}
New summary:"""
)
self.memory = ConversationSummaryMemory(
llm=self.llm,
prompt=self.summary_prompt
)
# Usage with agent
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver
agent = create_react_agent(
llm,
tools,
# Checkpointer provides built-in memory persistence
checkpointer=MemorySaver()
)

using Microsoft.Extensions.AI;
using Microsoft.ML.Tokenizers;
public class ConversationSummarizer
{
private readonly IChatClient _client;
private readonly Tokenizer _tokenizer;
private readonly int _threshold;
public ConversationSummarizer(
IChatClient client,
int threshold = 4000)
{
_client = client;
_threshold = threshold;
_tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
}
public int CountTokens(IEnumerable<ChatMessage> messages)
{
var text = string.Join("\n",
messages.Select(m => m.Text ?? ""));
return _tokenizer.CountTokens(text);
}
public async Task<string> SummarizeAsync(
IEnumerable<ChatMessage> messages)
{
var content = string.Join("\n",
messages.Select(m => $"{m.Role}: {m.Text}"));
var response = await _client.GetResponseAsync(new[]
{
new ChatMessage(ChatRole.System,
"Summarize this conversation concisely. " +
"Preserve key facts, decisions, and context."),
new ChatMessage(ChatRole.User, content)
});
return response.Text;
}
public async Task<List<ChatMessage>> CompressIfNeededAsync(
List<ChatMessage> messages,
int keepRecent = 5)
{
if (CountTokens(messages) < _threshold)
return messages;
var systemMsgs = messages
.Where(m => m.Role == ChatRole.System)
.ToList();
var otherMsgs = messages
.Where(m => m.Role != ChatRole.System)
.ToList();
var toSummarize = otherMsgs
.Take(otherMsgs.Count - keepRecent)
.ToList();
var recent = otherMsgs
.Skip(otherMsgs.Count - keepRecent)
.ToList();
if (toSummarize.Count == 0)
return messages;
var summary = await SummarizeAsync(toSummarize);
var result = new List<ChatMessage>(systemMsgs);
result.Add(new ChatMessage(ChatRole.System,
$"Summary of earlier conversation: {summary}"));
result.AddRange(recent);
return result;
}
}

Summarization Strategy
Episodic Memory: Learning from Experience
Episodic memory stores complete interaction trajectories, enabling agents to learn from past successes and failures:
class EpisodicMemory:
# Store complete interaction episodes for learning
episodes = []
function recordEpisode(task, trajectory, outcome):
episode = {
task: task,
trajectory: trajectory, # Full action sequence
outcome: outcome, # Success/failure + details
timestamp: now(),
embedding: embed(task + outcome)
}
episodes.append(episode)
persistToDatabase(episode)
function retrieveSimilarEpisodes(currentTask, k = 3):
# Find past experiences relevant to current task
taskEmbedding = embed(currentTask)
similar = vectorSearch(
episodes,
taskEmbedding,
topK = k
)
# Prioritize successful episodes
return sortBySuccess(similar)
function learnFromEpisode(episode):
if episode.outcome.success:
# Extract successful strategy
return {
type: "positive",
lesson: "When {task}, this approach worked: {summary}"
}
else:
# Learn from failure
return {
type: "negative",
lesson: "When {task}, avoid: {failureReason}"
}

from dataclasses import dataclass, field
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document
from datetime import datetime
import json
@dataclass
class Episode:
task: str
trajectory: list[dict] # List of {thought, action, observation}
outcome: dict # {success: bool, result: str, error: str?}
timestamp: datetime = field(default_factory=datetime.now)
class EpisodicMemory:
def __init__(self, persist_path: str = "./episodic_memory"):
self.embeddings = OpenAIEmbeddings()
self.vectorstore = Chroma(
collection_name="episodes",
embedding_function=self.embeddings,
persist_directory=persist_path
)
def record_episode(
self,
task: str,
trajectory: list[dict],
outcome: dict
) -> str:
"""Record a complete interaction episode."""
episode_id = f"ep_{datetime.now().timestamp()}"
# Create searchable content
search_content = f"{task}\n{outcome.get('result', '')}"
# Store as LangChain Document with metadata
doc = Document(
page_content=search_content,
metadata={
"task": task,
"trajectory": json.dumps(trajectory),
"outcome": json.dumps(outcome),
"success": outcome.get("success", False),
"timestamp": datetime.now().isoformat()
}
)
self.vectorstore.add_documents([doc], ids=[episode_id])
return episode_id
def retrieve_similar(
self,
current_task: str,
k: int = 3,
success_only: bool = False
) -> list[Episode]:
"""Find similar past episodes using semantic search."""
filter_dict = {"success": True} if success_only else None
# LangChain similarity search with metadata filtering
results = self.vectorstore.similarity_search(
current_task,
k=k * 2,
filter=filter_dict
)
episodes = [
Episode(
task=doc.metadata["task"],
trajectory=json.loads(doc.metadata["trajectory"]),
outcome=json.loads(doc.metadata["outcome"]),
timestamp=datetime.fromisoformat(doc.metadata["timestamp"])
)
for doc in results
]
# Sort by success first, then recency
episodes.sort(
key=lambda e: (not e.outcome.get("success"), -e.timestamp.timestamp())
)
return episodes[:k]
def generate_few_shot_examples(self, current_task: str, n: int = 2) -> str:
"""Generate few-shot examples from past successes."""
episodes = self.retrieve_similar(current_task, k=n, success_only=True)
examples = []
for ep in episodes:
lines = [f"Task: {ep.task}"]
for step in ep.trajectory[-3:]:
lines.append(f"Thought: {step.get('thought', '')}")
lines.append(f"Action: {step.get('action', '')}")
lines.append(f"Result: {ep.outcome.get('result', '')}")
examples.append("\n".join(lines))
return "\n---\n".join(examples) using Microsoft.Extensions.AI;
using Microsoft.Extensions.VectorData;
using System.Text.Json;
public class EpisodicMemory
{
private readonly IVectorStore _vectorStore;
private readonly IEmbeddingGenerator<string, Embedding<float>> _embedder;
private const string CollectionName = "episodes";
public record Episode(
string Task,
List<TrajectoryStep> Trajectory,
Outcome Outcome,
DateTime Timestamp
);
public record TrajectoryStep(
string Thought,
string Action,
string Observation
);
public record Outcome(
bool Success,
string Result,
string? Error = null
);
public EpisodicMemory(
IVectorStore vectorStore,
IEmbeddingGenerator<string, Embedding<float>> embedder)
{
_vectorStore = vectorStore;
_embedder = embedder;
}
public async Task<string> RecordEpisodeAsync(
string task,
List<TrajectoryStep> trajectory,
Outcome outcome)
{
var episodeId = $"ep_{DateTime.UtcNow.Ticks}";
var searchContent = $"{task}\n{outcome.Result}";
var embedding = await _embedder.GenerateEmbeddingAsync(searchContent);
var collection = _vectorStore.GetCollection<string, EpisodeRecord>(CollectionName);
await collection.UpsertAsync(new EpisodeRecord
{
Id = episodeId,
Content = searchContent,
Embedding = embedding.Vector,
Task = task,
TrajectoryJson = JsonSerializer.Serialize(trajectory),
OutcomeJson = JsonSerializer.Serialize(outcome),
Success = outcome.Success,
Timestamp = DateTime.UtcNow
});
return episodeId;
}
public async Task<List<Episode>> RetrieveSimilarAsync(
string currentTask,
int k = 3,
bool successOnly = false)
{
var episodes = new List<Episode>();
var queryEmbedding = await _embedder.GenerateEmbeddingAsync(currentTask);
var collection = _vectorStore.GetCollection<string, EpisodeRecord>(CollectionName);
var results = await collection.VectorizedSearchAsync(
queryEmbedding.Vector,
new VectorSearchOptions { Top = k * 2 }
);
await foreach (var result in results.Results)
{
if (successOnly && !result.Record.Success) continue;
episodes.Add(new Episode(
result.Record.Task,
JsonSerializer.Deserialize<List<TrajectoryStep>>(result.Record.TrajectoryJson)!,
JsonSerializer.Deserialize<Outcome>(result.Record.OutcomeJson)!,
result.Record.Timestamp
));
if (episodes.Count >= k) break;
}
return episodes
.OrderByDescending(e => e.Outcome.Success)
.ThenByDescending(e => e.Timestamp)
.Take(k)
.ToList();
}
public async Task<string> GenerateFewShotExamplesAsync(
string currentTask,
int nExamples = 2)
{
var episodes = await RetrieveSimilarAsync(currentTask, nExamples, successOnly: true);
var examples = episodes.Select(ep =>
{
var steps = string.Join("\n",
ep.Trajectory.TakeLast(3).Select(s =>
$"Thought: {s.Thought}\nAction: {s.Action}"));
return $"Task: {ep.Task}\n{steps}\nResult: {ep.Outcome.Result}";
});
return string.Join("\n---\n", examples);
}
}

Few-Shot from Experience
Production Memory: Mem0
Mem0 is a popular framework for production memory systems, handling the complexity of memory extraction, storage, and retrieval:
from mem0 import Memory
# Initialize Mem0 with configuration
config = {
"llm": {
"provider": "openai",
"config": {
"model": "gpt-4",
"temperature": 0.1
}
},
"embedder": {
"provider": "openai",
"config": {
"model": "text-embedding-3-small"
}
},
"vector_store": {
"provider": "chroma",
"config": {
"collection_name": "agent_memories",
"path": "./mem0_data"
}
}
}
memory = Memory.from_config(config)
# Add memories with user context
memory.add(
"User prefers dark mode and uses VS Code",
user_id="user_123",
metadata={"category": "preferences"}
)
memory.add(
"User is working on a Python FastAPI project",
user_id="user_123",
metadata={"category": "context"}
)
# Search memories
results = memory.search(
"What IDE does the user prefer?",
user_id="user_123"
)
# Results include relevance scores
for result in results:
print(f"Memory: {result['memory']}")
print(f"Score: {result['score']}")
# Get all memories for a user
all_memories = memory.get_all(user_id="user_123")
# Update a memory
memory.update(
memory_id=results[0]["id"],
data="User prefers dark mode, uses VS Code with Vim keybindings"
)
# Memory statistics: count stored memories for this user
print(f"Total memories: {len(all_memories)}")

# Integrating Mem0 with an agent
class MemoryAugmentedAgent:
memory = Mem0()
llm = ChatModel()
function respond(userMessage, userId):
# 1. Retrieve relevant memories
memories = memory.search(userMessage, userId)
memoryContext = formatMemories(memories)
# 2. Build prompt with memory context
prompt = f"""
User memories:
{memoryContext}
Current request: {userMessage}
"""
# 3. Generate response
response = llm.generate(prompt)
# 4. Extract and store new memories
newFacts = extractFacts(userMessage, response)
for fact in newFacts:
memory.add(fact, userId)
return response
# Key benefit: 80% token reduction while preserving fidelity
# Instead of keeping the entire history, store and retrieve facts

Mem0 Key Features
- Automatic memory extraction from conversations
- User-scoped and agent-scoped memories (see the sketch after this list)
- Conflict resolution for contradicting facts
- Memory decay and importance ranking
- Multiple vector store backends
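The scoping feature is worth illustrating. A minimal sketch, reusing the `memory` instance configured above and assuming Mem0's `agent_id` parameter alongside `user_id` (the identifiers here are hypothetical):

# Scope a fact to one agent for this user, so other agents don't retrieve it
memory.add(
    "Prefers citations in APA style when doing research",
    user_id="user_123",
    agent_id="research_agent",
    metadata={"category": "preferences"}
)

# Search within the same scope; another agent's searches won't surface this fact
research_prefs = memory.search(
    "How should sources be cited?",
    user_id="user_123",
    agent_id="research_agent"
)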
Memory Design Patterns
| Pattern | Description | When to Use |
|---|---|---|
| Rolling Window | Keep last N messages only | Simple chatbots, low-stakes tasks |
| Summarize + Recent | Summarize old, keep recent verbatim | Most agent applications |
| Entity Memory | Track entities and their states | Complex workflows, state machines |
| Knowledge Graph | Store facts as relationships | Domain-specific agents, reasoning |
| Hierarchical | Multiple summary levels | Very long conversations (100+ turns) |
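As a baseline for comparison, here is a minimal sketch of the simplest pattern above, Rolling Window; the window size is an assumption to tune per task:

from collections import deque

class RollingWindowMemory:
    """Keep only the last N messages; older ones are silently dropped."""

    def __init__(self, max_messages: int = 20):
        self.messages: deque = deque(maxlen=max_messages)

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_context(self) -> list[dict]:
        # Anything beyond the window has already been evicted by the deque
        return list(self.messages)

The Summarize + Recent pattern shown earlier (ConversationSummaryBufferMemory) is usually the better default once conversations run past a few dozen turns.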
Evaluation Approach
Memory systems should be evaluated on both retrieval quality and downstream task performance:
| Metric | What it Measures | How to Measure |
|---|---|---|
| Recall Accuracy | Can agent retrieve relevant facts? | Insert facts, query later, measure hit rate |
| Recall@Turn N | Accuracy degradation over turns | Track accuracy vs conversation length |
| Token Efficiency | Tokens used vs full history | Compare memory system vs raw context |
| Latency Impact | Time added by memory operations | Benchmark retrieval + storage time |
| Task Performance | Does memory improve outcomes? | A/B test with vs without memory |
Key metrics for memory system evaluation
Test: Information Retention Over Conversation Length

Turn 1:  Insert fact: "Project deadline is March 15"
Turn 5:  Query: "When is the deadline?" → Should recall
Turn 10: Insert distractors about other dates
Turn 15: Query: "What's the project deadline?" → Still recall?
Turn 25: Heavy topic changes
Turn 30: Query: "Remind me of the deadline" → Can still recall?

Expected results: recall accuracy typically starts near 100% and degrades toward 70-80% by turn 30 as distractors and topic changes accumulate.

Analysis questions:
- At what turn count does recall degrade?
- Does summarization help or hurt?
- What's the optimal compression threshold?
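A harness for this test can stay small. A hedged sketch, assuming the AgentMemorySystem class defined earlier and a caller-supplied list of (fact, probe) pairs plus distractor turns:

def measure_recall(memory_system, fact_probes, distractor_turns, probe_every=5):
    """Insert facts up front, add distractor turns, and probe recall periodically."""
    for fact, _ in fact_probes:
        memory_system.store_long_term(fact)

    hits, total = 0, 0
    for turn, distractor in enumerate(distractor_turns, start=1):
        memory_system.store_short_term(distractor)
        if turn % probe_every == 0:
            for fact, probe in fact_probes:
                recalled = memory_system.recall(probe)
                total += 1
                if any(fact in item for item in recalled):
                    hits += 1
    return hits / total if total else 0.0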
Common Pitfalls
- Memory pollution: storing everything indiscriminately buries useful facts under noise; be selective about what gets written.
- Conflicting memories: new facts can contradict stored ones ("user prefers X" vs. "user now prefers Y"); update or invalidate instead of appending (see the sketch below).
- Over-summarization: aggressive compression loses the specific details (names, numbers, decisions) the agent later needs.
- Retrieval latency: every memory lookup adds a round trip; retrieval that is too slow or too frequent hurts response time.
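For the conflicting-memories case, the usual mitigation is to search before writing and revise the stored fact rather than append a contradiction. Mem0 does this automatically; if you manage the store yourself, a minimal sketch in the same spirit, reusing the Mem0 calls shown above (the score threshold is an assumption to tune):

def store_fact(memory, fact: str, user_id: str, score_threshold: float = 0.8):
    """Update an existing memory on the same topic instead of adding a duplicate."""
    existing = memory.search(fact, user_id=user_id)
    if existing and existing[0]["score"] >= score_threshold:
        # Revise the old fact so stale and current versions don't coexist
        memory.update(memory_id=existing[0]["id"], data=fact)
    else:
        memory.add(fact, user_id=user_id)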
Implementation Checklist
- 1. Define memory types needed (working, short-term, long-term, episodic)
- 2. Choose vector database (Chroma, Pinecone, Weaviate, Qdrant)
- 3. Implement summarization with token threshold
- 4. Design memory extraction logic (what to store, when)
- 5. Build retrieval with relevance filtering (see the sketch after this checklist)
- 6. Add memory update/invalidation for changing facts
- 7. Test recall accuracy at various conversation lengths
- 8. Monitor token usage and latency in production
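For item 5, relevance filtering usually means dropping low-scoring matches rather than always returning the top k. A sketch against the LangChain Chroma store used earlier; the 0.7 cutoff is an assumption to calibrate per embedding model:

def recall_filtered(long_term, query: str, k: int = 5, min_score: float = 0.7) -> list[str]:
    """Return only memories whose normalized relevance score clears the threshold."""
    scored = long_term.similarity_search_with_relevance_scores(query, k=k)
    return [doc.page_content for doc, score in scored if score >= min_score]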