Introduction: Memory as the Foundation of Intelligence
A Large Language Model without memory is like an expert with amnesia: brilliant at answering individual questions, but unable to build a continuous relationship, learn from past interactions, or accumulate knowledge over time. Memory is what transforms a stateless LLM into a persistent agent, one that remembers who you are, what you have discussed, and which decisions you have made together.
Without a memory system, every interaction starts from scratch. The agent does not know that yesterday you fixed a bug in the authentication module, does not remember that you prefer Python over JavaScript, and has no idea that the project you are working on has a deadline in two weeks. With memory, however, the agent becomes a true collaborator: it accumulates context, recognizes recurring patterns, and improves its responses over time.
In this article we will explore memory systems for AI agents, from the simplest strategies like conversation history to advanced approaches using vector embeddings, knowledge graphs, and the hybrid Mem0 pattern that is emerging as a best practice in 2026. We will analyze the trade-offs between completeness and cost, accuracy and latency, providing pseudocode and concrete architectures for implementing each approach.
What You Will Learn in This Article
- The four types of memory in AI agents: sensory, short-term, long-term, and episodic
- Conversation history management strategies: sliding window, summarization, importance sampling
- How vector embeddings work and the RAG (Retrieval-Augmented Generation) pattern
- Knowledge graphs with Neo4j for relational reasoning
- The hybrid Mem0 pattern: vector store + knowledge graph combined
- Context window optimization and token budgets
- Benchmarking and trade-offs between different memory approaches
Types of Memory in AI Agents
Research on memory systems for AI agents draws inspiration from human cognitive psychology, adapting concepts like short-term and long-term memory to the context of language models. We can identify four fundamental types of memory, each with distinct characteristics and use cases.
The Four Types of Memory
| Type | Description | Duration | Example |
|---|---|---|---|
| Sensory Memory | Immediate input: the current prompt and tool results just received | Single iteration | The user message, the output of a SQL query |
| Short-term Memory | Current session context: the accumulated conversation history | Single session | Previous messages in the chat, decisions made |
| Long-term Memory | Persistent knowledge across different sessions, saved to external storage | Permanent | User preferences, company documents, source code |
| Episodic Memory | Specific memories of past interactions with relational context | Permanent | "The time we fixed the bug in the JSON parser" |
Sensory Memory: Immediate Input
Sensory memory represents everything the agent "perceives" at the current moment: the user message, the results of tools just executed, the responses from called APIs. It is the lowest and most transient level of memory, existing only for the duration of a single iteration of the agent loop. It requires no persistence mechanism because it lives entirely within the current prompt.
In practical terms, sensory memory corresponds to the parameters passed to the model in a single call: the system prompt, the user and assistant messages, and tool call results. It is limited by the model's context window, which defines the maximum number of tokens that can be processed in a single inference.
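In code, sensory memory is nothing more than the payload of a single model call. A minimal sketch, assuming the common chat-completion message shape; the function name and role values are illustrative, not tied to any specific SDK:

```python
def build_sensory_input(system_prompt: str, user_message: str,
                        tool_results: list[str]) -> list[dict]:
    """Assemble the transient, single-iteration input for the agent.
    Nothing here is persisted: it exists only for this one inference."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
    # Tool outputs just received are part of sensory memory too
    for result in tool_results:
        messages.append({"role": "tool", "content": result})
    return messages

payload = build_sensory_input(
    "You are a coding assistant.",
    "Why does the JSON parser crash?",
    ["SELECT count(*) FROM errors -> 42"],
)
print(len(payload))  # 3 messages, discarded after this iteration
```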
Short-term Memory: Session Context
Short-term memory maintains context within a single conversation session. It is implemented as the list of messages exchanged between user and agent, including intermediate tool call results. This memory allows the agent to reference what was previously said, resolve contextual ambiguities, and maintain coherence in dialogue.
The main problem with short-term memory is linear growth: the longer the conversation continues, the more tokens are consumed. With models that have 128K-200K token context windows, an intense technical conversation can reach the limit within a few hours of work. This makes it essential to implement compression and prioritization strategies.
Long-term Memory: Persistent Knowledge
Long-term memory is what allows the agent to remember information across different sessions. Unlike short-term memory that lives in the conversation, long-term memory requires external storage: a vector database, a relational database, a structured file system. The agent saves relevant information during interactions and retrieves it when needed in future sessions.
Examples of long-term memory include: user preferences (preferred programming language, coding style, naming conventions), indexed company documents, a project knowledge base, technical specifications and requirements. This type of memory transforms the agent from a generic assistant into a personalized, contextualized collaborator.
Episodic Memory: Specific Recollections
Episodic memory is the most sophisticated type: it is not just about memorizing facts, but about remembering specific experiences with their relational context. The agent remembers not only that "there is a bug in the parser", but that "on January 15th we diagnosed a bug in the JSON parser caused by incorrect encoding, which the user resolved by adding UTF-8 validation in pre-processing".
Episodic memory is typically implemented with knowledge graphs, where entities (people, projects, bugs, decisions) are nodes and relationships between them (resolved_by, caused_by, depends_on) are edges. This allows the agent to navigate connections between events and provide rich, pertinent context.
Conversation History Management
Conversation history is the most immediate form of memory: the complete list of messages exchanged in the current session. Managing it efficiently is crucial because the context window has a fixed limit, and exceeding it means losing information or, worse, receiving errors from the model. There are three main strategies, each with its own trade-offs.
Strategy 1: Sliding Window
The simplest approach: keep only the last N messages and discard older ones. It is easy to implement, predictable in token consumption, and requires no additional model calls. The disadvantage is that it completely loses the initial context of the conversation, which can lead to inconsistencies when the agent references decisions made at the beginning of the dialogue.
class SlidingWindowMemory:
    def __init__(self, max_messages: int = 20):
        self.max_messages = max_messages
        self.messages: list[dict] = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Always keep the system prompt (index 0)
        if len(self.messages) > self.max_messages:
            system = self.messages[0]
            self.messages = [system] + self.messages[-(self.max_messages - 1):]

    def get_context(self) -> list[dict]:
        return self.messages.copy()

    def token_count(self) -> int:
        # Rough heuristic: ~4 characters per token
        return sum(len(m["content"]) // 4 for m in self.messages)
Strategy 2: Progressive Summarization
Instead of discarding old messages, you progressively summarize them. When the conversation exceeds a token threshold, the older messages are compressed into a summary that preserves key information: decisions made, important facts, expressed preferences. The summary is prepended to the active conversation, maintaining context without consuming too many tokens.
This strategy requires additional model calls to generate summaries, which increases cost and latency. However, it preserves context much more effectively than the sliding window, because important information is never completely lost.
class SummarizationMemory:
    def __init__(self, max_tokens: int = 4000, summary_threshold: int = 3000):
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold
        self.summary: str = ""
        self.active_messages: list[dict] = []

    def add_message(self, role: str, content: str):
        self.active_messages.append({"role": role, "content": content})
        if self._count_tokens(self.active_messages) > self.summary_threshold:
            self._compress()

    def _count_tokens(self, messages: list[dict]) -> int:
        # Rough heuristic: ~4 characters per token
        return sum(len(m["content"]) // 4 for m in messages)

    def _compress(self):
        # Take the older messages (first half)
        half = len(self.active_messages) // 2
        old_messages = self.active_messages[:half]
        self.active_messages = self.active_messages[half:]
        # Generate a summary of the old messages
        prompt = "Summarize the key points of this conversation:\n"
        for msg in old_messages:
            prompt += f"{msg['role']}: {msg['content']}\n"
        new_summary = llm_call(prompt)  # llm_call: any LLM client function
        self.summary = f"{self.summary}\n{new_summary}".strip()

    def get_context(self) -> list[dict]:
        context = []
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Previous conversation summary:\n{self.summary}"
            })
        context.extend(self.active_messages)
        return context
Strategy 3: Importance Sampling
The most sophisticated approach: each message is evaluated for its importance and only relevant ones are kept in context. Importance can be determined by various factors: the presence of explicit decisions, references to key project entities, expression of preferences, definition of constraints or requirements.
This strategy produces the best results in terms of context quality, but is also the most complex to implement. It requires a model (or heuristic) to evaluate the importance of each message, and the risk of eliminating seemingly irrelevant but crucial information is always present.
class ImportanceSamplingMemory:
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages: list[dict] = []
        self.importance_scores: list[float] = []

    def add_message(self, role: str, content: str):
        score = self._evaluate_importance(content)
        self.messages.append({"role": role, "content": content})
        self.importance_scores.append(score)
        self._prune()

    def _evaluate_importance(self, content: str) -> float:
        score = 0.5  # Base score
        # Explicit decisions
        if any(kw in content.lower() for kw in ["decided", "choice", "will use", "implement"]):
            score += 0.3
        # Constraints and requirements
        if any(kw in content.lower() for kw in ["must", "requirement", "constraint", "deadline"]):
            score += 0.2
        # Errors and bugs
        if any(kw in content.lower() for kw in ["error", "bug", "fix", "problem"]):
            score += 0.2
        return min(score, 1.0)

    def _count_tokens(self) -> int:
        # Rough heuristic: ~4 characters per token
        return sum(len(m["content"]) // 4 for m in self.messages)

    def _prune(self):
        # The system prompt (index 0) and the most recent message are
        # always protected; protecting the last message acts as a
        # built-in recency bonus.
        while self._count_tokens() > self.max_tokens and len(self.messages) > 2:
            # Remove the message with the lowest importance
            min_idx = min(
                range(1, len(self.importance_scores) - 1),
                key=lambda i: self.importance_scores[i]
            )
            self.messages.pop(min_idx)
            self.importance_scores.pop(min_idx)
Strategy Trade-offs
| Strategy | Complexity | Additional Cost | Context Quality | Use Case |
|---|---|---|---|---|
| Sliding Window | Low | Zero | Medium | Short conversations, prototypes |
| Summarization | Medium | 1-2 LLM calls | High | Long sessions, complex tasks |
| Importance Sampling | High | Heuristic or LLM | Very High | Specialized agents, production |
Vector Embeddings and Retrieval-Augmented Generation
Vector embeddings are dense numerical representations of text in a multi-dimensional space. Each sentence, paragraph, or document is transformed into a vector of numbers (typically 768 or 1536 dimensions) where the geometric distance between vectors reflects the semantic similarity between the corresponding texts. Two sentences with similar meaning will have nearby vectors in space, even if they use completely different words.
How Embeddings Work
The embedding process uses a specialized neural model (such as text-embedding-3-small from OpenAI or all-MiniLM-L6-v2 from Sentence Transformers) that takes a text string as input and produces a numerical vector. These models are trained on enormous amounts of text with the goal of positioning semantically similar sentences close together in vector space.
Similarity search works by computing the cosine similarity or Euclidean distance between the query vector and the indexed document vectors. Documents with the smallest distance (or highest similarity) are the most relevant to the query.
import numpy as np

def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    dot_product = np.dot(vec_a, vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    return dot_product / (norm_a * norm_b)

# Example: embedding two sentences (embed() stands in for any embedding model)
embedding_1 = embed("How to fix a bug in the JSON parser")
embedding_2 = embed("Fixing an error in JSON file parsing")
embedding_3 = embed("Recipe for chocolate cake")

# The first two sentences will have high similarity (~0.85-0.95)
# The third will be very distant (~0.05-0.15)
print(cosine_similarity(embedding_1, embedding_2))  # ~0.91
print(cosine_similarity(embedding_1, embedding_3))  # ~0.08
Vector Databases: Key Options
A vector database is a storage system optimized for storing and searching high-dimensional vectors. Unlike relational databases that use B-tree indexes, vector databases utilize algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) to perform approximate searches in sub-linear time.
Vector Database Comparison
| Database | Type | Strength | Use Case |
|---|---|---|---|
| Pinecone | Managed (cloud) | Scalability, zero ops, serverless indexes | Large-scale production, teams without infra |
| Weaviate | Self-hosted / Cloud | Hybrid search (vector + BM25), GraphQL API | Advanced search with filters, e-commerce |
| Chroma | Embedded / Self-hosted | Lightweight, native Python API, local development | Prototypes, local apps, development |
| Qdrant | Self-hosted / Cloud | Performance, advanced filters, rich payloads | High-performance applications |
| pgvector | PostgreSQL Extension | Integration with existing stack, SQL | Teams already using PostgreSQL |
RAG Pipeline: Retrieval-Augmented Generation
The RAG (Retrieval-Augmented Generation) pattern is the fundamental technique for integrating long-term memory into AI agents. The idea is simple: before answering a question, the agent searches the knowledge base for the most relevant documents and includes them in the prompt as additional context. This allows the model to base its responses on specific, up-to-date information, reducing hallucinations.
class RAGPipeline:
    def __init__(self, vector_db, embedding_model, llm):
        self.vector_db = vector_db
        self.embedding_model = embedding_model
        self.llm = llm

    def ingest(self, documents: list[str], metadata: list[dict] = None):
        """Index documents into the knowledge base."""
        for i, doc in enumerate(documents):
            # 1. Chunking: split the document into fragments
            chunks = self._chunk_document(doc, chunk_size=512, overlap=50)
            for chunk in chunks:
                # 2. Embedding: generate the vector
                vector = self.embedding_model.embed(chunk)
                # 3. Store: save to the vector database
                self.vector_db.upsert(
                    id=f"doc_{i}_{hash(chunk)}",
                    vector=vector,
                    text=chunk,
                    metadata=metadata[i] if metadata else {}
                )

    def query(self, question: str, top_k: int = 5) -> str:
        """Answer a question using RAG."""
        # 1. Embed the query
        query_vector = self.embedding_model.embed(question)
        # 2. Retrieval: find the most similar documents
        results = self.vector_db.search(
            vector=query_vector,
            top_k=top_k,
            threshold=0.7  # Minimum similarity threshold
        )
        # 3. Context assembly: build the context
        context = "\n\n---\n\n".join([r.text for r in results])
        # 4. Generation: generate the response
        prompt = f"""Answer the question based ONLY on the provided context.
If the context does not contain the answer, state so explicitly.

Context:
{context}

Question: {question}

Answer:"""
        return self.llm.generate(prompt)

    def _chunk_document(self, doc: str, chunk_size: int, overlap: int) -> list[str]:
        """Split a document into chunks with overlap."""
        words = doc.split()
        chunks = []
        for i in range(0, len(words), chunk_size - overlap):
            chunks.append(" ".join(words[i:i + chunk_size]))
        return chunks
Chunking Best Practices
- Chunk size: 256-1024 tokens. Too small loses context, too large introduces noise
- Overlap: 10-20% of chunk size to avoid splitting concepts in half
- Semantic chunking: prefer splitting by paragraphs or logical sections rather than character count
- Metadata: associate each chunk with source, date, and document type for result filtering
- Re-ranking: after retrieval, use a cross-encoder model to reorder results by relevance
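The semantic-chunking advice above can be sketched with a paragraph-based splitter. This is a simplification: production systems typically fall back to sentence or token boundaries when a paragraph exceeds the limit, and the character budget here stands in for a token budget.

```python
def semantic_chunks(doc: str, max_chars: int = 2000) -> list[str]:
    """Split on paragraph boundaries, packing paragraphs into chunks
    that stay under max_chars instead of cutting mid-concept."""
    paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would overflow
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = ("First concept explained here.\n\n"
       "Second concept, logically separate.\n\n"
       "Third concept.")
print(semantic_chunks(doc, max_chars=40))  # 3 chunks, one per paragraph
```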
Knowledge Graphs for Relational Memory
Knowledge graphs represent knowledge as a network of entities (nodes) connected by relationships (edges), each with associated properties. Unlike vector databases that excel at semantic similarity search, knowledge graphs shine in relational reasoning: understanding how entities connect to each other, following chains of relationships, and inferring new knowledge from existing connections.
Knowledge Graph Structure
A knowledge graph is composed of three fundamental elements:
- Nodes (Entities): represent domain objects. Each node has a type (label) and a set of properties. Example: (:User {name: "Federico", role: "developer"})
- Edges (Relationships): connect two nodes and have a type and direction. Each edge can have properties. Example: [:RESOLVED {date: "2026-01-15", time_spent: "2h"}]
- Properties: key-value pairs associated with nodes or edges that add detail and context
Neo4j and the Cypher Language
Neo4j is the most widely used graph database, with a declarative query language called Cypher that makes graph navigation intuitive. Cypher uses a visual syntax that mirrors graph structure: nodes are written in parentheses () and relationships in square brackets [], with arrows indicating direction.
// Create nodes and relationships for episodic memory
CREATE (u:User {name: "Federico", role: "developer"})
CREATE (p:Project {name: "E-commerce API", language: "Python"})
CREATE (b:Bug {id: "BUG-142", description: "JSON Parser crash on UTF-8"})
CREATE (s:Session {date: "2026-01-15", duration: "45min"})
// Relationships with properties
CREATE (u)-[:WORKS_ON {since: "2025-11-01"}]->(p)
CREATE (b)-[:FOUND_IN]->(p)
CREATE (u)-[:RESOLVED {method: "UTF-8 validation", time: "2h"}]->(b)
CREATE (s)-[:DISCUSSED]->(b)
CREATE (s)-[:INVOLVED]->(u)
// Query: Find all bugs resolved by Federico with context
MATCH (u:User {name: "Federico"})-[r:RESOLVED]->(b:Bug)-[:FOUND_IN]->(p:Project)
RETURN b.description, r.method, r.time, p.name
// Query: Reconstruct the timeline of a session
MATCH (s:Session {date: "2026-01-15"})-[:DISCUSSED]->(topic)
MATCH (s)-[:INVOLVED]->(participant)
RETURN s.date, collect(topic.description), collect(participant.name)
// Query: Find related bugs by project and type
MATCH (b1:Bug)-[:FOUND_IN]->(p:Project)<-[:FOUND_IN]-(b2:Bug)
WHERE b1 <> b2 AND b1.description CONTAINS "parser"
RETURN b1.description, b2.description, p.name
Advantages of Knowledge Graphs for Memory
Knowledge graphs offer unique advantages for AI agent memory that vector databases cannot replicate:
- Relational reasoning: the agent can navigate relationships between entities to discover non-obvious connections. "Federico resolved bug BUG-142 which was in the E-commerce project, and the same project has another similar bug BUG-198"
- Efficient context: instead of including the entire history in the prompt, the agent extracts only the relevant subgraph for the current query, saving tokens
- Inference: it is possible to define inference rules that derive new knowledge from existing relationships. If A depends on B and B depends on C, then A transitively depends on C
- Incremental updates: new information is added as new nodes and edges without needing to rebuild the entire index
- Explainability: the agent can trace the reasoning path through the graph, providing transparent explanations
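The transitive-dependency rule mentioned above can be expressed as a simple reachability computation. A sketch over an in-memory edge list (in a real deployment this would be a Cypher query or a graph-database inference rule, not application code):

```python
def transitive_deps(edges: dict[str, set[str]], start: str) -> set[str]:
    """Follow depends_on edges to collect all direct and indirect deps."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for dep in edges.get(node, set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# A depends on B, B depends on C -> A transitively depends on C
edges = {"A": {"B"}, "B": {"C"}}
print(transitive_deps(edges, "A"))  # {'B', 'C'}
```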
Hybrid Approach: The Mem0 Pattern
The Mem0 Pattern: 2026 Best Practice
The emerging pattern in 2026 for AI agent memory systems combines vector store and knowledge graph in a single cohesive system. The name "Mem0" (pronounced "mem-zero") refers both to the eponymous open source project and the architectural pattern it introduced: a memory that starts from zero and grows organically with every interaction.
Mem0 Pattern Architecture
The fundamental idea is simple: when the agent receives new information, it saves it simultaneously in both storage systems. When it needs to retrieve context, it performs a hybrid search that combines results from both systems to maximize retrieval quality.
class HybridMemory:
    """Mem0 Pattern: Vector Store + Knowledge Graph combined.

    Note: _extract_entities, _merge_results, and _parse_extraction are
    elided helpers; they follow the same extraction/fusion logic shown
    in _extract_knowledge below."""

    def __init__(self, vector_db, graph_db, embedding_model, llm):
        self.vector_db = vector_db
        self.graph_db = graph_db
        self.embedding_model = embedding_model
        self.llm = llm

    def store(self, interaction: str, metadata: dict):
        """Save a new memory in both systems."""
        # 1. Extract entities and relationships from text
        entities, relations = self._extract_knowledge(interaction)
        # 2. Vector Store: save the embedding of the full text
        vector = self.embedding_model.embed(interaction)
        self.vector_db.upsert(
            id=metadata.get("id", str(hash(interaction))),
            vector=vector,
            text=interaction,
            metadata=metadata
        )
        # 3. Knowledge Graph: save entities and relationships
        for entity in entities:
            self.graph_db.merge_node(entity.label, entity.properties)
        for relation in relations:
            self.graph_db.merge_relation(
                relation.source, relation.type, relation.target,
                properties=relation.properties
            )

    def retrieve(self, query: str, top_k: int = 5) -> dict:
        """Retrieve context from both systems."""
        # 1. Vector similarity search
        query_vector = self.embedding_model.embed(query)
        vector_results = self.vector_db.search(query_vector, top_k=top_k)
        # 2. Extract entities from the query for graph traversal
        query_entities = self._extract_entities(query)
        # 3. Graph traversal: expand relational context
        graph_context = []
        for entity in query_entities:
            neighbors = self.graph_db.get_neighbors(entity, depth=2, limit=10)
            graph_context.extend(neighbors)
        # 4. Combine and deduplicate results
        return {
            "vector_results": vector_results,
            "graph_context": graph_context,
            "combined_context": self._merge_results(vector_results, graph_context)
        }

    def _extract_knowledge(self, text: str):
        """Use an LLM to extract entities and relationships."""
        prompt = f"""Extract entities and relationships from the following text.
Output format:
ENTITIES: [type:name:properties, ...]
RELATIONS: [source-RELATION_TYPE->target, ...]

Text: {text}"""
        response = self.llm.generate(prompt)
        return self._parse_extraction(response)
Hybrid Retrieval Flow
Hybrid retrieval follows a two-phase process that leverages the strengths of both systems:
- Vector Similarity for Narrowing: vector search quickly filters semantically relevant documents, reducing the search space from thousands to a few dozen results
- Graph Traversal for Enrichment: starting from entities found in vector results, graph traversal expands context by following relationships in the knowledge graph, adding correlated information that semantic similarity alone would not have found
- Re-ranking and Fusion: combined results are reordered by overall relevance, weighing both semantic similarity and graph distance
- Context Assembly: the final context is assembled respecting the token budget, with the most relevant results having priority
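The re-ranking and fusion step can be sketched as a weighted combination of the two signals. The weights and the conversion from graph distance (hops) to a 0..1 score are illustrative assumptions, not part of any specific library:

```python
def fuse_scores(vector_hits: list[tuple[str, float]],
                graph_hits: list[tuple[str, int]],
                w_vec: float = 0.7, w_graph: float = 0.3) -> list[str]:
    """Combine cosine similarity (0..1) with graph proximity.
    Graph distance d in hops is converted to a score as 1 / (1 + d)."""
    scores: dict[str, float] = {}
    for doc_id, sim in vector_hits:
        scores[doc_id] = scores.get(doc_id, 0.0) + w_vec * sim
    for doc_id, distance in graph_hits:
        scores[doc_id] = scores.get(doc_id, 0.0) + w_graph / (1 + distance)
    # Documents found by BOTH systems accumulate both contributions
    return sorted(scores, key=scores.get, reverse=True)

ranked = fuse_scores(
    vector_hits=[("doc_a", 0.92), ("doc_b", 0.75)],
    graph_hits=[("doc_b", 0), ("doc_c", 1)],
)
print(ranked)  # doc_b wins: found by both vector search and the graph
```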
Context Window Optimization
The context window is an AI agent's most precious resource. Every token wasted on irrelevant context is one less token for the useful response. Context window optimization requires a systematic approach that balances the amount of available information with the quality and pertinence of the selected context.
Token Counting and Budget Allocation
The first step is defining a token budget for each prompt component. A typical allocation for an agent with a 128K token context window might be:
Token Budget per Component
| Component | Tokens Allocated | Percentage | Notes |
|---|---|---|---|
| System Prompt | 2,000 - 4,000 | 2-3% | Instructions, personality, constraints |
| Tool Definitions | 3,000 - 8,000 | 3-6% | Description and schema of available tools |
| Long-term Memory | 5,000 - 15,000 | 5-12% | RAG results, knowledge graph context |
| Conversation History | 20,000 - 50,000 | 15-40% | Recent messages, summaries |
| Reserved for Response | 8,000 - 16,000 | 6-12% | Space for the model's response |
| Safety Buffer | 5,000 - 10,000 | 4-8% | Margin for unexpected tool results |
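The table above can be enforced programmatically. A minimal sketch, assuming the common ~4-characters-per-token estimate; the component names and mid-range caps are illustrative, and real systems would use a proper tokenizer and smarter truncation than a hard cut:

```python
# Hypothetical per-component token caps, mid-range values from the table
BUDGET = {
    "system_prompt": 3000,
    "tools": 5000,
    "long_term_memory": 10000,
    "history": 35000,
}

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic: ~4 characters per token

def fit_to_budget(components: dict[str, str]) -> dict[str, str]:
    """Truncate each prompt component to its allocated token budget."""
    fitted = {}
    for name, text in components.items():
        cap_chars = BUDGET[name] * 4
        fitted[name] = text[:cap_chars]
    return fitted

prompt_parts = fit_to_budget({
    "system_prompt": "x" * 20000,  # oversized: will be truncated
    "history": "y" * 1000,         # under budget: passed through intact
})
print(estimate_tokens(prompt_parts["system_prompt"]))  # capped at 3000
```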
Semantic Caching
Semantic caching is an optimization technique that avoids recomputing embeddings and queries for questions similar to those already processed. When the agent receives a query, it first checks whether a semantically similar query has already been processed recently. If similarity exceeds a threshold (typically 0.95), the cached result is returned without executing a new search in the vector database.
This approach significantly reduces latency and cost of retrieval operations, especially in scenarios where the user rephrases the same question in different ways or where successive queries are variations on the same theme.
Hierarchical Retrieval
The hierarchical retrieval approach organizes the knowledge base across multiple levels of granularity. At the highest level are general summaries of entire documents. At the intermediate level are section summaries. At the lowest level are the original chunks with all details.
The search starts from the highest level: if general summaries indicate a document is relevant, it descends to the section level, and then to the chunk level only for the most pertinent sections. This approach drastically reduces the number of embedding comparisons needed and improves result quality by eliminating noise.
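The descent described above can be sketched over a three-level index. Keyword overlap here is a toy stand-in for embedding similarity, and the index structure is an illustrative assumption:

```python
def hierarchical_search(index: dict, query_terms: set[str]) -> list[str]:
    """Descend document summary -> section summary -> chunks, expanding
    a level only when its summary overlaps the query."""
    def relevant(text: str) -> bool:
        return bool(query_terms & set(text.lower().split()))

    hits = []
    for doc in index["documents"]:
        if not relevant(doc["summary"]):
            continue  # prune the whole document at the top level
        for section in doc["sections"]:
            if not relevant(section["summary"]):
                continue  # prune the section
            hits.extend(c for c in section["chunks"] if relevant(c))
    return hits

index = {"documents": [
    {"summary": "auth module jwt bugs",
     "sections": [{"summary": "jwt parsing errors",
                   "chunks": ["jwt decode fails on padding",
                              "refresh token rotation"]}]},
    {"summary": "billing reports", "sections": []},
]}
print(hierarchical_search(index, {"jwt"}))
```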
Benchmarking and Trade-offs
Choosing the right memory system depends on the specific application context. There is no universally best solution: each approach has its own strengths and weaknesses, and the optimal choice depends on latency, cost, accuracy, and implementation complexity requirements.
Complete Memory Approach Comparison
| Approach | Latency | Accuracy | Cost | Complexity | Ideal For |
|---|---|---|---|---|---|
| Sliding Window | ~0ms | Low | Zero | Minimal | Simple chats, prototypes |
| Summarization | 200-500ms | Medium-High | Medium | Low | Long sessions, coding |
| Vector DB (RAG) | 50-200ms | High | Medium | Medium | Knowledge bases, documentation |
| Knowledge Graph | 100-500ms | Very High | High | High | Relational reasoning |
| Hybrid (Mem0) | 200-800ms | Excellent | High | Very High | Enterprise agents, production |
When to Use What
- Sliding Window Only: simple chatbots, brief interactions, quick prototypes where persistence is not needed
- Sliding Window + Summarization: long coding sessions, personal assistants where conversation context is sufficient
- Vector DB (RAG): when the agent needs to access an external knowledge base (documentation, source code, company FAQs) and queries are independent
- Knowledge Graph: when relationships between entities matter (project management, CRM, diagnostic systems) and multi-hop reasoning is needed
- Hybrid (Mem0): enterprise production agents where both semantic search and relational reasoning are needed, and the budget allows it
Practical Implementation: Production Considerations
Implementing a memory system in production requires attention to several aspects beyond the simple choice of algorithm. Concurrency management, data consistency, monitoring, and privacy are all critical concerns.
- Privacy and GDPR: memories can contain personal data. Always implement selective deletion mechanisms and explicit consent
- Temporal decay: old memories lose relevance. Implement a decay system that progressively reduces the weight of unaccessed memories
- Conflicts and updates: when new information contradicts existing memories, the agent must decide which version to keep. Always favor the most recent information with an update log
- Monitoring: track metrics like retrieval hit rate, average latency, token distribution per component, and perceived response quality
- Graceful fallback: if the memory system fails (vector DB down, corrupted graph), the agent must continue functioning by degrading to current conversation only
Conclusions
Memory is the component that transforms a stateless LLM into a truly intelligent and persistent agent. From simple conversation history to sophisticated hybrid systems with vector databases and knowledge graphs, each approach offers a different balance of complexity, cost, and context quality.
The Mem0 pattern, which combines vector store and knowledge graph, is emerging as the best practice for production agents in 2026. The key is not to choose a single approach, but to understand the trade-offs and select the right combination for your use case, starting simple and adding complexity only when necessary.
In the next article we will explore advanced tool calling: how agents integrate REST APIs, web services, and custom tools to act in the real world. We will cover JSON Schema for tool definitions, input validation and sanitization strategies, and how to build a reusable tool framework.