AI agent memory: CoALA, Mem0, MemGPT, and A-Mem compared

AI agent memory without the hype: CoALA, MemGPT, Mem0, and A-Mem compared honestly. What each stores, where each fails, and whether you actually need memory.

A founder messaged us last month with a familiar request. Their agent kept "forgetting" things. Customer name in turn three, contract terms in turn seven. The fix, they were sure, was memory. They had read three Medium posts about Mem0 and were ready to wire it in over the weekend. Could we sanity check the plan.

We told them what we tell most teams that ask. Memory is rarely the first answer. Usually the third or fourth. Before reaching for it, be confident your retrieval is good, your context window is being used well, and your state is actually structured rather than smeared across a chat log. Most teams that think they need memory need better retrieval and a clearer data model. The wins are bigger, the cost is lower, and the failure modes are auditable.

That is the position this article defends. We walk through the four serious architectures the field has converged on (CoALA, MemGPT, Mem0, A-Mem): what each one does, where it came from, what it stores, how it evicts, and what it is good and bad at. Then we turn to the question the field underweights: how do you evaluate any of this? The benchmarks (LoCoMo, LongMemEval) exist and they are useful. They are also young enough that most teams shipping memory features have never run them.

We close with the question we ask every team that comes in with a memory problem. Do you actually need memory, or do you need better retrieval. Most need the latter.

The Distinction the Field Keeps Blurring: Context, RAG, Memory

Before any architecture comparison makes sense, three things have to be pulled apart. They sit on adjacent shelves in most blog posts and the result is that teams pour effort into the wrong layer.

Context window is the working memory of a single model call. It is bounded by the model (200K tokens for Claude Sonnet 4.5, 1M tokens for Gemini 1.5 Pro), it costs money on every token, and it resets the moment the call ends. Calling it "short-term memory" is fair as a metaphor but misleading as an architecture decision. The context window isn't a memory system. It's the canvas a memory system paints on.

RAG (retrieval-augmented generation) fetches documents from an external corpus at query time and stuffs the relevant snippets into the context window. The corpus is static. You wrote it. The agent did not learn it. RAG solves "the model does not know our pricing page," not "the model does not remember what the user told it yesterday."

Agent memory grows out of what the agent does. Facts extracted from conversations. Conclusions drawn from past tasks. Procedural patterns learned from successful runs. It is dynamic, agent-managed, and (the part most teams underestimate) hard to evaluate. The agent decides what to remember, what to forget, and what to surface back into context. Every one of those decisions is a place where the system can go quietly wrong.

The field has been calling all three "memory" interchangeably for two years and the conflation is the source of a lot of bad architecture choices. A team that says "let's add memory to our agent" often means "let's improve the retrieval over our docs," which is RAG. A team that builds Mem0 on top of a chatbot that has no real long-running user relationship is paying for an architecture they will never use. A team that puts everything in context and calls that "Claude with memory" is going to find out about token costs at scale.

The first move on any memory project is not picking an architecture. It is drawing the boundary. What needs to live longer than one session, and why. "User preferences for personalization across weeks" is memory. "Our product documentation" is RAG. "This 50-turn conversation about one deal" might just be a longer context window and a summarization pass.

The rest of this article is about architectures for the first case. But the framing matters. None of these systems substitute for thinking carefully about what your agent actually needs to carry across time.

CoALA: The Framework That Gave the Field a Vocabulary

Before we look at running systems, we have to spend a section on a paper that does not ship code. CoALA is a conceptual framework, not an implementation, and reading the rest of the field without it is like reading distributed systems papers without having read the CAP theorem first.

Sumers, Yao, Narasimhan, and Griffiths, Cognitive Architectures for Language Agents, 2024 (arXiv:2309.02427).

What CoALA did was supply a taxonomy. The authors borrow from decades of cognitive science (Newell's Soar, ACT-R, the broader cognitive architecture literature) and apply that vocabulary to LLM agents. They split memory into two top-level categories, working and long-term, and then break long-term memory into three subtypes that have become the field's default carving:

Working memory. What is in the context window right now. The agent's scratchpad. Volatile, bounded, expensive per token.

Episodic memory. Specific past episodes. "In the session last Tuesday, this user asked about pricing for the enterprise tier." Time-indexed, narrative, often verbatim or near-verbatim.

Semantic memory. Generalized facts about the world and about the user. "Angelina is the founder of Garden Research." "Our enterprise tier starts at €30K." Atomic, deduplicated, retrievable by subject.

Procedural memory. Learned skills and action patterns. "When a user asks about pricing, first ask about team size, then about use case, then quote a range." This is the category most production systems handle worst, because it overlaps with system prompts, tool selection, and the agent's own learned heuristics in ways that are hard to clean up.

If you do nothing else with this article, internalize that four-way split. When a vendor tells you their system gives your agent "memory," ask them which of these four they mean. Most products do one or two well and the others not at all. Mem0 is mostly semantic extraction with light episodic indexing. MemGPT is closer to a working-memory pager that also persists episodes. A-Mem is mostly semantic with strong relational structure. None handle procedural memory natively in a way that survives close inspection.

The other thing CoALA introduces is the distinction between memory as passive storage and memory as agent-controlled state. A memory system worth the name has to expose explicit read, write, and update operations to the agent's decision loop. That framing is what makes MemGPT a real memory system (the agent manages memory through tool calls) and what makes a vector database with conversation embeddings less of one (the system is passive; the agent does not control the policy).
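To make the taxonomy and the "agent-controlled state" point concrete, here is a minimal sketch. CoALA ships no code, so every name below (MemoryKind, MemoryItem, AgentMemory) is our own illustration, not anything from the paper. The only thing it is meant to show is that each item is tagged with one of the four kinds and that read, write, and update are explicit operations an agent loop could call.

```python
from dataclasses import dataclass, field
from enum import Enum


class MemoryKind(Enum):
    WORKING = "working"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"


@dataclass
class MemoryItem:
    key: str
    kind: MemoryKind
    content: str


@dataclass
class AgentMemory:
    """Hypothetical store exposing explicit read / write / update operations."""
    items: dict[str, MemoryItem] = field(default_factory=dict)

    def write(self, key: str, kind: MemoryKind, content: str) -> None:
        self.items[key] = MemoryItem(key, kind, content)

    def update(self, key: str, content: str) -> bool:
        # Returns False instead of silently creating an entry that never existed.
        item = self.items.get(key)
        if item is None:
            return False
        item.content = content
        return True

    def read(self, kind: MemoryKind) -> list[MemoryItem]:
        return [item for item in self.items.values() if item.kind is kind]


memory = AgentMemory()
memory.write("user.founder", MemoryKind.SEMANTIC,
             "Angelina is the founder of Garden Research.")
memory.write("session.tuesday", MemoryKind.EPISODIC,
             "User asked about enterprise-tier pricing.")
memory.update("session.tuesday",
              "User asked about enterprise-tier pricing last Tuesday.")
semantic_facts = memory.read(MemoryKind.SEMANTIC)
```

A passive vector store would offer only the equivalent of read. The write and update paths are where the agent's policy lives.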

CoALA does not give you code. It gives you a way to talk about what you are building. Every serious memory paper since 2024 either builds on the CoALA taxonomy or explicitly argues with it. Read it. The first two-thirds is the part that matters.

MemGPT and Letta: Virtual Context as an Operating System

The first paper that turned the CoALA vocabulary into a running system worth taking seriously is MemGPT, now productized as Letta.

Packer, Wooders, Lin, Fang, Patil, Stoica, Gonzalez, MemGPT: Towards LLMs as Operating Systems, 2023 (arXiv:2310.08560).

The framing is the part to lead with. The authors notice that "LLM with a bounded context window that needs to handle more data than fits" is structurally the same problem operating systems solved decades ago with virtual memory: paging between physical memory and disk to give the program the illusion of unbounded address space.

MemGPT applies the same idea to LLMs. The agent has three explicit memory tiers:

Core memory. Always in the context window. Small. Things the agent considers essential: who the user is, what the current task is, key facts that should never have to be re-fetched.
Recall memory. The history of past interactions. Searchable but not in the current context unless paged in.
Archival memory. Larger storage. Documents, accumulated facts, long-form notes. Searchable, not in context unless paged in.

The agent manages tier-to-tier movement through function calls. It writes to core memory ("the user prefers concise replies, save that to core"), searches recall ("did we discuss the contract last week?"), searches archival ("look up the API spec"), and edits any tier. The model is doing the OS kernel's work.

The methodological move is the important part. Memory in MemGPT isn't a data structure. It's a policy about what the agent decides to remember, when, and how to retrieve. The infrastructure is simple (a few databases and an LLM-callable interface). The intelligence is in the agent's choices.
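The three tiers and the function-call interface can be sketched in a few lines. This is not Letta's API; TieredMemory and its method names are hypothetical, search is plain substring matching, and the oldest-entry eviction rule is a stand-in we chose for illustration. In MemGPT the model itself decides what to promote and demote, which is exactly the part a sketch cannot capture.

```python
from dataclasses import dataclass, field


@dataclass
class TieredMemory:
    core_limit: int = 3                                   # assumed cap on in-context entries
    core: dict[str, str] = field(default_factory=dict)    # always in context
    recall: list[str] = field(default_factory=list)       # past interaction history
    archival: list[str] = field(default_factory=list)     # documents, long-form notes

    def core_write(self, key: str, value: str) -> None:
        """Tool: pin a fact into core; demote the oldest entry to archival if full."""
        if key not in self.core and len(self.core) >= self.core_limit:
            oldest_key = next(iter(self.core))
            self.archival.append(f"{oldest_key}: {self.core.pop(oldest_key)}")
        self.core[key] = value

    def recall_search(self, query: str) -> list[str]:
        """Tool: page matching history into context (substring match as a stand-in)."""
        return [turn for turn in self.recall if query.lower() in turn.lower()]

    def archival_search(self, query: str) -> list[str]:
        """Tool: page matching archival entries into context."""
        return [doc for doc in self.archival if query.lower() in doc.lower()]

    def build_context(self, paged_in: list[str]) -> str:
        # Core is always present; everything else appears only when paged in.
        core_lines = [f"{k}: {v}" for k, v in self.core.items()]
        return "\n".join(core_lines + paged_in)


mem = TieredMemory(core_limit=2)
mem.recall.append("user: can we revisit the contract terms next week?")
mem.core_write("user_name", "Maya")
mem.core_write("reply_style", "concise")
mem.core_write("current_task", "contract review")   # core is full: user_name is demoted
context = mem.build_context(mem.recall_search("contract"))
```

Note what the last write did: it pushed the user's name out of core. A mechanical rule makes that mistake predictably; an agent-managed policy makes it unpredictably.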

That also explains where MemGPT struggles. The agent makes mistakes about what to write to core. It forgets to update entries. It retrieves the wrong tier. It pages context inefficiently when conversations get long. All are policy failures, not infrastructure failures, and policy failures are hard to fix from outside. You can train a model to be better at memory management, but you cannot patch the rules from a config file.

Letta is the productized version: open-source with a hosted offering, faithful to the paper's three tiers, agent-managed transitions, persistent storage in Postgres. Letta is the right starting point for prototyping the MemGPT idea without writing the orchestration. The free DeepLearning.AI short course LLMs as Operating Systems: Agent Memory uses Letta hands-on and is the fastest way to feel what the architecture does in practice.

What MemGPT is good at. Long-running single-agent conversations where the relationship matures over time. Cases where the agent genuinely benefits from deciding what to remember (a coding assistant that learns your preferences, a personal assistant that builds a model of your work patterns). Research and prototype settings.

What MemGPT is bad at. High-throughput multi-user systems where the policy cost of the agent managing its own memory becomes prohibitive at scale. Systems where memory quality has to be audited; the agent-as-pager model makes failure modes harder to reason about. Workflows where the data is structured (orders, tickets, transactions) and a relational database is just better.

MemGPT was the first serious answer to "how does an agent handle more than fits in context." It is still a defensible answer. It is not the answer everyone needs.

Mem0: Production Semantic Extraction with Hybrid Stores

If MemGPT is the academic inflection point, Mem0 is what showed up next and started getting deployed.

Chhikara, Khant, Aryan, Singh, Yadav, Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory, 2025 (arXiv:2504.19413).

The crucial architectural decision in Mem0 is upstream of the storage layer. Mem0 does not store the raw conversation. It runs an extraction step that pulls out atomic facts from the dialogue and only those facts go to the store. The raw turns are either dropped or kept in a separate log that the retrieval path mostly does not touch.

This sounds small. It is not. It changes the entire economics of memory:

The store stays compact. Atomic facts deduplicate. "User's name is Maya" written once is the same fact whether she mentioned it twice or twenty times. Raw conversation compounds at the rate of dialogue; an extracted fact base compounds at the rate of new information.
Retrieval becomes precise. Semantic search over atomic facts returns clean, targeted results. Semantic search over raw conversation returns whatever paragraph happens to contain the topic phrase.
Conflicts become visible. When Mem0 extracts a fact that contradicts an existing one ("user lives in Berlin" vs. earlier "user lives in Munich"), the conflict resolution policy fires. Raw logs hide contradictions; extracted stores surface them.
Privacy and governance get simpler. You can audit extracted facts in a way you cannot audit a million conversation turns. You can delete a specific fact on request. You can show the user what the system remembers about them.
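The deduplication and conflict behavior described above can be sketched without any library. This is not Mem0's implementation: Fact, FactStore, the (subject, attribute) key, and the "latest write wins" rule are all assumptions we picked to keep the example small. The extraction step itself (an LLM call in practice) is out of scope here; the sketch starts from facts that have already been extracted.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    turn: int          # turn index at which the fact was extracted


class FactStore:
    def __init__(self) -> None:
        self.current: dict[tuple[str, str], Fact] = {}
        self.superseded: list[Fact] = []       # kept for audit, not for retrieval

    def write(self, fact: Fact) -> str:
        key = (fact.subject, fact.attribute)
        existing = self.current.get(key)
        if existing is None:
            self.current[key] = fact
            return "added"
        if existing.value == fact.value:
            return "duplicate"                 # same fact restated: store does not grow
        # Conflict on write. Assumed policy: the latest write wins.
        self.superseded.append(existing)
        self.current[key] = fact
        return "superseded"


store = FactStore()
outcomes = [
    store.write(Fact("user", "name", "Maya", turn=1)),
    store.write(Fact("user", "name", "Maya", turn=9)),
    store.write(Fact("user", "city", "Munich", turn=3)),
    store.write(Fact("user", "city", "Berlin", turn=12)),
]
```

Four writes, two stored facts, one superseded fact on record. A raw log of the same conversation would hold all four statements and leave the Munich/Berlin contradiction for retrieval to trip over.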

On the retrieval side, Mem0 uses multi-signal retrieval. Three searches run in parallel: dense semantic over embeddings, sparse BM25 keyword, and entity-matching over a graph of extracted entities. The three result sets are merged and reranked. Any single retrieval method has predictable failure modes; combining them covers more of the question space.
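One common way to merge several ranked lists is reciprocal rank fusion (RRF). We are not claiming this is how Mem0 merges its results; RRF is a technique we are swapping in for illustration, and the fact IDs and the conventional k=60 constant are assumptions. The sketch shows why combining signals helps: a fact that ranks reasonably in all three lists beats one that ranks first in only one.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["f3", "f1", "f7"]     # semantic search over embeddings
sparse = ["f1", "f9", "f3"]    # BM25 keyword search
entity = ["f1", "f4"]          # entity match over the graph
merged = reciprocal_rank_fusion([dense, sparse, entity])
```

Here f1 comes out on top because it appears in all three lists, even though the dense search alone ranked it second. A reranker would then run over the head of the merged list.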

Mem0 also handles temporal reasoning explicitly. The store knows when each fact was extracted and indexes by temporal class: "now," "were" (past states), "will be" (planned). A query about current preferences gets different ranking than one about historical preferences. Mundane in description, rare in practice; most production memory systems we look at do not handle time well.
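A rough sketch of what temporal classes buy you. The validity-interval representation (valid_from, valid_to) and the class names are our assumptions, not Mem0's schema. The point is only that a query about the present should not be answered from a fact that describes the past or a plan.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TimedFact:
    text: str
    valid_from: date
    valid_to: Optional[date] = None     # None means still valid


def temporal_class(fact: TimedFact, today: date) -> str:
    if fact.valid_from > today:
        return "planned"
    if fact.valid_to is not None and fact.valid_to < today:
        return "past"
    return "now"


def facts_for(facts: list[TimedFact], wanted: str, today: date) -> list[str]:
    """Keep only the facts whose temporal class matches the query's intent."""
    return [f.text for f in facts if temporal_class(f, today) == wanted]


today = date(2025, 6, 1)    # fixed date so the example is reproducible
facts = [
    TimedFact("user lives in Munich", date(2020, 1, 1), date(2024, 12, 31)),
    TimedFact("user lives in Berlin", date(2025, 1, 1)),
    TimedFact("user moves to Lisbon", date(2026, 1, 1)),
]
current = facts_for(facts, "now", today)
history = facts_for(facts, "past", today)
```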

The Mem0 paper reports strong benchmark numbers and the company has shipped a hosted product and an open-source SDK that drops into common agent frameworks. The numbers we have verified independently are less impressive than the paper's, which is normal for any benchmark-led release, but the architecture is sound and the patterns it codifies are the right ones.

What Mem0 is good at. Multi-user production systems where extracted facts are more valuable than raw conversation and the team needs something that drops in alongside an existing agent. Customer-support agents, personal assistants with many concurrent users, sales agents that need a fact base per account.

What Mem0 is bad at. Cases where the value is in the texture of the conversation, not the extracted facts ("how did the user phrase that complaint" matters more than "the user complained about latency"). Cases where extraction quality cannot be guaranteed and a missed fact has a real cost (medical, legal, compliance). Highly relational domains where a graph-native architecture would be cleaner. Procedural memory.

The patterns Mem0 codifies (extract atomic facts, not raw turns; resolve conflicts on write; multi-signal retrieval; explicit temporal classes) are valuable independent of the library. We have built systems with those patterns using Postgres + pgvector and no Mem0 SDK. The ideas travel. The library is one way to ship them.

A-Mem: Zettelkasten-Inspired Graph Memory

The most interesting recent paper in the area takes the architecture in a direction that will be familiar to anyone who has spent time with Niklas Luhmann's note-taking method.

Xu, Liang, Mei, Gao, Tan, Zhang, A-MEM: Agentic Memory for LLM Agents, NeurIPS 2025 (arXiv:2502.12110).

A-Mem treats memory as a graph of interconnected atomic notes rather than a flat list. The inspiration is the Zettelkasten: thousands of small index cards, each with a single idea, connected through explicit links. Luhmann used it for forty years; the claim is that the linking structure itself does cognitive work the cards alone cannot.

The A-Mem implementation runs that idea through an LLM pipeline. When a new memory is added:

Generate a structured note. The fact is captured with rich attributes: contextual description, keywords, tags, source episode, timestamps.
Analyze the existing graph. The system queries the historical repository for related notes by semantic similarity, shared attributes, and entity co-occurrence.
Establish typed links. Connections are written between the new note and relevant existing ones, typed (elaboration, contradiction, exemplification, temporal-next).
Update affected notes. Existing notes can be amended: tags refined, summaries rewritten, outgoing links updated.

That last step is the move that makes A-Mem different. Most memory systems treat entries as immutable after write. A-Mem treats the whole graph as a living artifact that gets refined every time new information arrives.

Retrieval changes accordingly. Instead of a single semantic search returning top-k facts, A-Mem can do graph traversal: start from a seed fact, walk typed edges, return a connected subgraph. For queries that span multiple related facts, this returns substantially more coherent context than flat retrieval.
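The write path and the traversal can be sketched as a small graph. This is not the A-Mem codebase: Note, NoteGraph, and the shared-tag linking rule are hypothetical stand-ins for what the paper does with LLM calls, and every link gets the same type here because choosing the type is the LLM's job. The sketch shows two things: a new note amends existing notes on write, and retrieval walks edges from a seed instead of returning a flat top-k.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Note:
    note_id: str
    text: str
    tags: set[str] = field(default_factory=set)
    links: dict[str, str] = field(default_factory=dict)   # target note_id -> link type


class NoteGraph:
    def __init__(self) -> None:
        self.notes: dict[str, Note] = {}

    def add(self, note: Note, link_type: str = "elaboration") -> None:
        # Stand-in for LLM-driven linking: connect notes that share a tag.
        for other in self.notes.values():
            if note.tags & other.tags:
                note.links[other.note_id] = link_type
                other.links[note.note_id] = link_type    # the existing note is amended too
        self.notes[note.note_id] = note

    def neighborhood(self, seed_id: str, max_hops: int = 2) -> list[str]:
        """Breadth-first walk over links, returning note IDs in visit order."""
        seen = {seed_id}
        order = [seed_id]
        queue = deque([(seed_id, 0)])
        while queue:
            current, hops = queue.popleft()
            if hops == max_hops:
                continue
            for target in sorted(self.notes[current].links):
                if target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append((target, hops + 1))
        return order


graph = NoteGraph()
graph.add(Note("n1", "Deal with Acme stalled on pricing", {"acme", "pricing"}))
graph.add(Note("n2", "Acme's CFO wants annual billing", {"acme", "billing"}))
graph.add(Note("n3", "Annual billing requires legal review", {"billing", "legal"}))
graph.add(Note("n4", "Office plants need watering", {"facilities"}))
subgraph = graph.neighborhood("n1", max_hops=2)
```

Starting from the stalled deal, two hops reach the legal-review note even though it shares no tag with the seed. That is the connected-context win. It is also the failure mode: one spurious link and the traversal drags in unrelated notes.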

Parallel work. A-Mem is not alone. Zep: A Temporal Knowledge Graph Architecture for Agent Memory, the MAGMA (Multi-Graph Agentic Memory Architecture) line, and commercial implementations like Graphiti all push graph-structured memory. The shared bet: relational structure is the next frontier after flat semantic stores. We think it is right in the long run and overcooked in the short run, because most production teams do not yet have the operational maturity to debug a graph memory system when it goes wrong.

What A-Mem is good at. Research and knowledge-work agents where the relationships between facts matter as much as the facts themselves. Agents that work on long, evolving projects where context spans dozens of episodes and graph traversal is the right retrieval primitive. Cases where you want the memory itself to surface non-obvious connections (the Zettelkasten claim).

What A-Mem is bad at. Production deployments where you need predictable latency and behavior; the graph operations are expensive and the system's behavior is harder to reason about. Cases where the agent's workload does not actually produce a richly connected fact graph (most customer-facing chatbots, where each session is largely independent). Teams that do not have the evaluation discipline to catch when the graph starts producing wrong or spurious links, which it will.

A-Mem is the most interesting architecture on this list and the one we would default to least often in production. The distance between "interesting" and "we should ship this" is large in this space.

Comparison Table: What Each One Is For

The four architectures cover different parts of the design space. The table below is the compact version of the comparison. Read it as a starting point, not a verdict.

| Architecture | What it stores | Retrieval method | Eviction / update policy | Best for | Bad for |
| --- | --- | --- | --- | --- | --- |
| CoALA | (Framework, not a store) | (Framework) | (Framework) | Vocabulary, taxonomy, design discipline | Anything operational; you cannot run CoALA |
| MemGPT / Letta | Three tiers: core (in-context), recall (history), archival (long-term docs) | Agent-managed function calls; tier-by-tier search | Agent decides what to promote to core, what to demote, what to summarize | Long-running single-agent conversations; cases where the OS analogy is genuinely illuminating | High-throughput multi-user systems; workloads where structured DBs are a better fit |
| Mem0 | Atomic facts extracted from conversations; entity graph; temporal indexing | Hybrid: semantic (dense) + BM25 (sparse) + entity match, merged and reranked | Conflict resolution on write; explicit temporal classes (now, past, planned); facts can be updated or superseded | Multi-user production systems; clean fact-based memory that scales; teams that want a drop-in library | Cases where conversation texture matters more than extracted facts; procedural memory; high-stakes domains where extraction errors are costly |
| A-Mem | Atomic notes with structured attributes (descriptions, keywords, tags) connected via typed edges in a graph | Semantic + graph traversal over typed links; can return connected subgraphs | New notes trigger updates to related existing notes; the whole graph gets curated continuously | Research and knowledge-work agents; long, evolving projects; cases where relational structure adds real value | Production systems needing predictable latency; workloads where the graph stays sparse; teams without strong evaluation discipline |

A few observations the table does not show explicitly.

None handle procedural memory well. All four are oriented toward declarative knowledge (facts, conversations, notes). Procedural memory (learned action patterns) sits awkwardly in all four. Teams that need it generally roll their own with system prompts, few-shot examples, and a separate store of successful task templates. Interesting work is happening here (case-based reasoning, agent self-reflection, episodic-to-procedural distillation) but no shipped library is ready.

The "best for" column is narrower than the marketing. Each architecture has a marketing claim that it handles "AI agent memory" broadly. Each has a sweet spot and a wide zone where the right answer is a different architecture or no memory layer at all.

Cost scales in different shapes. MemGPT's cost grows with conversation length per user. Mem0's cost grows with fact volume across the user base (extraction runs on every interaction). A-Mem's cost grows superlinearly with graph size. The right architecture depends partly on which scaling dimension you can afford and which you cannot.

How to Evaluate Memory: LoCoMo, LongMemEval, and Why Most Teams Skip This

This section exists because it is the part of the field most teams pretend does not exist. Memory systems are notoriously hard to evaluate and the benchmarks that do exist are not yet treated as table stakes. They should be.

The classical information-retrieval metrics (precision@k, NDCG, recall@k) do not work cleanly for agent memory. They assume a fixed corpus and a known set of relevant documents. Agent memory is a moving target: the corpus grows during evaluation, relevance is context-dependent, and the agent's decisions about reads and writes are part of what is being measured. Treating memory like IR misses most of what makes it interesting and most of what makes it fail.

The benchmarks built specifically for this problem are still young. The two worth knowing about, plus a newer third:

LoCoMo (arXiv:2402.17753). Long Conversation Memory benchmark. Tests very-long-running conversational memory: up to 35 sessions, 300+ turns, 9 to 16K tokens per conversation. Three task types: factual QA against the history, event summarization across sessions, dialogue generation that requires the system to behave consistently with what it has been told.

LongMemEval. Stress-tests recall across large temporal gaps. Probes whether facts mentioned hundreds of turns earlier are still accessible, whether later statements correctly override earlier ones, and whether the system can combine facts from different points in the history. Harder than LoCoMo; most current systems fail meaningfully on long-gap recall.

MemBench. Newer benchmark that separates factual from reflective memory. Beyond "what did the user say" to "what did the user mean, given everything else they have said."

The standard reference is the survey Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers (arXiv:2603.07670), worth reading end to end if you are about to ship memory as a feature.

The survey's central methodological point is one most production teams have not absorbed: agent memory evaluation must jointly measure memory quality and decision quality. Knowing that a fact is in the store is not enough; the question is whether the agent's behavior is correct given that the fact exists. A system can have perfect recall and still produce a worse agent if retrieval surfaces irrelevant context at the wrong moments.

The survey also lists concerns that classical IR ignores and that agent memory needs to handle:

Staleness. Old facts that were true and no longer are. The store has to know when to demote them.
Contradiction. Two facts in the store that disagree. The store has to know which wins and why.
Forgetting quality. Memory systems that never forget grow unbounded and become noisier over time. Good forgetting (deleting low-value facts, summarizing redundant ones, archiving outdated context) is a feature, not a bug. Most current systems forget badly or not at all.
Governance compliance. GDPR, deletion rights, data residency. A memory system that cannot honor a deletion request is not production-ready in the EU. We have seen several teams discover this the hard way.
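The deletion requirement is the easiest of these to get wrong, because extracted stores contain facts derived from other facts. A minimal sketch, assuming each stored fact records its sources in a derived_from list (our invention, along with StoredFact and delete_with_derivatives): deleting a fact also deletes everything that was inferred from it.

```python
from dataclasses import dataclass, field


@dataclass
class StoredFact:
    fact_id: str
    user_id: str
    text: str
    derived_from: list[str] = field(default_factory=list)   # IDs of source facts


def delete_with_derivatives(store: dict[str, StoredFact], fact_id: str) -> list[str]:
    """Delete a fact and, transitively, every fact derived from it."""
    to_delete = [fact_id]
    removed: list[str] = []
    while to_delete:
        current = to_delete.pop()
        if current not in store:
            continue                      # already removed via another path
        del store[current]
        removed.append(current)
        # Queue anything that listed the deleted fact as a source.
        to_delete.extend(
            fid for fid, fact in store.items() if current in fact.derived_from
        )
    return sorted(removed)


store = {
    "f1": StoredFact("f1", "u1", "User lives in Berlin"),
    "f2": StoredFact("f2", "u1", "User is in the CET timezone", derived_from=["f1"]),
    "f3": StoredFact("f3", "u1", "Schedule calls before 17:00 CET", derived_from=["f2"]),
    "f4": StoredFact("f4", "u1", "User prefers concise replies"),
}
removed = delete_with_derivatives(store, "f1")
```

The rule is deliberately conservative: a fact with several sources is removed when any one of them goes. If your store does not track provenance at all, this function cannot be written, and that is the real finding.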

For a team evaluating memory architectures for their own workload, the honest path is:

Run LoCoMo or LongMemEval on your candidate architectures with your model. Public numbers are useful for shortlisting but they do not predict your specific behavior.
Build an evaluation set from your own data. Twenty real conversations annotated with what the agent should remember and what it should retrieve at which points. This is tedious. There is no shortcut.
Measure both memory quality (was the fact stored and retrieved correctly) and decision quality (did the agent behave correctly given the retrieved context). Score them separately.
Run the eval before launch, after launch, and on a recurring cadence. Memory systems drift; the eval is the only way to catch the drift.
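Scoring the two qualities separately takes very little code once the annotations exist. The sketch below is a bare-bones version under loud assumptions: EvalCase and score are hypothetical names, the four cases are made-up illustrations and not real results, and decision quality is judged by substring match, where a real harness would use a human or an LLM judge.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    expected_fact: str       # annotation: what the memory layer should surface
    expected_answer: str     # annotation: what a correct reply must contain
    retrieved: list[str]     # what the memory layer actually returned
    answer: str              # what the agent actually said


def score(cases: list[EvalCase]) -> dict[str, float]:
    memory_hits = sum(case.expected_fact in case.retrieved for case in cases)
    decision_hits = sum(
        case.expected_answer.lower() in case.answer.lower() for case in cases
    )
    return {
        "memory_quality": memory_hits / len(cases),
        "decision_quality": decision_hits / len(cases),
    }


cases = [
    EvalCase("Where does the user live?", "user lives in Berlin", "Berlin",
             ["user lives in Berlin"], "You are based in Berlin."),
    EvalCase("What reply style do they want?", "user prefers concise replies", "concise",
             ["user prefers concise replies", "user lives in Berlin"],
             "Here is a long, detailed walkthrough of every option."),
    EvalCase("Which tier are they on?", "user is on the enterprise tier", "enterprise",
             [], "You are on the enterprise tier."),
    EvalCase("When does the contract renew?", "contract renews in March", "March",
             ["contract renews in March"], "It renews in May."),
]
report = score(cases)
```

In this toy set, memory quality is 0.75 and decision quality is 0.5. Two of the decision failures happened with the right fact retrieved, which a single blended score would hide.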

Most teams we look at skip steps two through four. They install Mem0 or Letta, watch a few interactions, decide it looks good, and ship. Then they spend the next quarter debugging mysterious behavior they cannot reproduce, because they have no evaluation harness and no ground truth.

The Honest Question: Do You Even Need Memory?

This is the section the field's marketing skips. We do not.

Most teams that arrive at Garden asking about memory do not need a memory layer. They need one of three things and they have rounded them up to "memory" because that is the word in the air:

Better retrieval over their own documents. This is RAG. If your agent feels stupid about your product, your pricing, your internal policies, the answer is usually that it cannot find or read those documents at the right moment. Better embeddings, better chunking, hybrid retrieval (semantic + keyword + entity), and a reranker will close more of the gap than any memory layer. The library you want is something like Voyage embeddings + Cohere reranker + a clean retrieval pipeline, not Mem0 or Letta.

Structured state, not memory. A lot of what teams call "memory" is actually domain state that should live in a normal database with an explicit schema. The user's plan tier. The deal stage. The current support ticket. The order history. Stuffing this into a memory system is a category error. Domain state belongs in Postgres with a schema you control, retrieved through explicit tool calls. "The agent should remember the user's plan" sounds like a memory problem; it is a query problem against a database that already knows the user's plan.
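What "a query problem" looks like in practice, as a minimal sketch. SQLite from the Python standard library stands in for Postgres so the example runs anywhere; the accounts table, its columns, and get_plan_tier are hypothetical. The function is the kind of thing you would expose to the agent as a tool.

```python
import sqlite3
from typing import Optional


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for the production database (schema is illustrative)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "user_id TEXT PRIMARY KEY, plan_tier TEXT NOT NULL, deal_stage TEXT)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?, ?)",
        [("u1", "enterprise", "negotiation"), ("u2", "starter", None)],
    )
    return conn


def get_plan_tier(conn: sqlite3.Connection, user_id: str) -> Optional[str]:
    """Tool the agent calls instead of 'remembering' the plan."""
    row = conn.execute(
        "SELECT plan_tier FROM accounts WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


conn = make_db()
tier = get_plan_tier(conn, "u1")
unknown = get_plan_tier(conn, "nobody")
```

The answer is always current, there is exactly one source of truth, and an unknown user returns None instead of a plausible-sounding guess.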

A longer context window plus summarization. For single conversations that need coherence over hundreds of turns without learning anything that persists beyond the session, the right answer is often "use a longer model and summarize aggressively." Sonnet 4.5's 200K window plus a periodic summarization pass handles a lot of cases that teams try to solve with memory.
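A periodic summarization pass can be as simple as the sketch below. Everything here is an assumption for illustration: estimate_tokens uses a crude four-characters-per-token heuristic where you would use your model's tokenizer, summarize is a placeholder for an LLM call, and the budget and keep_recent values are arbitrary.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def summarize(turns: list[str]) -> str:
    # Placeholder for an LLM summarization call.
    return f"[summary of {len(turns)} earlier turns]"


def compact_history(turns: list[str], budget: int, keep_recent: int = 4) -> list[str]:
    """Once over budget, replace older turns with one summary; keep recent turns verbatim."""
    total = sum(estimate_tokens(turn) for turn in turns)
    if total <= budget or len(turns) <= keep_recent:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(older)] + recent


turns = [f"turn {i}: " + "x" * 400 for i in range(10)]
compacted = compact_history(turns, budget=500)
```

Nothing persists past the session and nothing needs to. The failure mode is also visible: if the summary drops a detail, you can read the summary and see that it did.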

When is memory the right answer? Three conditions have to hold simultaneously:

The relationship between agent and user extends meaningfully across sessions. A support agent that handles one ticket and forgets the user is fine without memory. A personal assistant that builds up a model of how you work over months might benefit.
The information that needs to persist is not better stored as explicit domain state. "Preferences for response length and tone" is memory. "Current subscription tier" is Postgres.
You have the evaluation discipline to know whether the memory system is helping. Without LoCoMo, LongMemEval, or a similar ground-truth eval running on a recurring basis, you have no way to know whether the system is improving, degrading, or holding constant.
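The three conditions read naturally as a checklist. The sketch below encodes them; Workload, recommend, and the order in which the checks run are our own framing for illustration, not a formal decision procedure.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    spans_sessions: bool          # does the relationship extend across sessions?
    fits_explicit_schema: bool    # is the persistent data really domain state?
    has_recurring_eval: bool      # is a ground-truth eval run on a cadence?


def recommend(workload: Workload) -> str:
    if workload.fits_explicit_schema:
        return "structured state in a database"
    if not workload.spans_sessions:
        return "longer context plus summarization, or better RAG"
    if not workload.has_recurring_eval:
        return "build the evaluation harness first"
    return "memory layer"


support_bot = recommend(Workload(spans_sessions=False, fits_explicit_schema=False,
                                 has_recurring_eval=True))
personal_assistant = recommend(Workload(spans_sessions=True, fits_explicit_schema=False,
                                        has_recurring_eval=True))
```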

When all three hold, memory is a real win. In our consulting work, we hit this combination in maybe one engagement in four. In the other three, the right answer was better RAG, better state management, or both.

"Add memory to your agent" has become a verbal tic, much like "use vector search" was three years ago. Both are sometimes right, both are often the wrong instinct, and both are oversold by the vendors who benefit from them being installed. Memory is expensive in tokens, latency, and operational complexity. It is leaky. It is hard to evaluate. Install it when it is the right tool. Do not install it when it is not.

When Memory Is the Wrong Answer

To make the previous section concrete, here are the patterns where we have specifically recommended against installing a memory layer in client work, even when the team came in convinced they needed one.

Customer support chatbots with per-ticket scope. A support agent handles a ticket, the ticket closes, the next ticket is largely independent. The right architecture is a CRM lookup at the start of each conversation, a structured state machine for the ticket, and RAG over the knowledge base. Most "memory" the agent needs is the user's account record, which is already in the CRM.

Sales and lead-qualification agents. Same shape. Deal state lives in the CRM. A memory layer creates a second source of truth that has to be reconciled with the CRM, which is a recipe for inconsistency.

Document Q&A. "Our agent answers questions about our policies." This is pure RAG. The agent does not need to remember anything across sessions; it needs to retrieve well. Investing in retrieval quality returns more than investing in a memory layer.

Single-task workflows. A coding agent that does one task per session. A research agent that produces one report. A data-extraction agent that handles one document. These are session-scoped. Memory would store facts that no one ever uses again.

Compliance-heavy domains where audit matters more than personalization. Medical, legal, financial advice. The win from "the agent remembers you" is usually outweighed by the audit cost of a system that can produce different answers for the same user over time. Memory adds risk here, not value.

Multi-tenant systems with high churn. If users come and go and per-user memory mostly accumulates without being used, the cost of maintaining the layer is not justified.

The pattern across all of these: when the value of "the agent remembers you" is low relative to the cost of a memory system, do not install one. The cost is not zero, and the marketing implies it is. When we audit, the question we put to teams is the one we put to ourselves: what is the minimum architecture that makes the agent useful? Memory is one of the layers we recommend against most often. Not because it is bad, but because it is the wrong tool for many of the workloads we see.

Frequently Asked Questions

What is the difference between RAG and AI agent memory? RAG retrieves from a static, externally curated corpus (your documents, your knowledge base). The agent does not write to it; you do. Agent memory is dynamic and agent-managed: the agent decides what to store, update, and surface from its own interactions. RAG handles "the model does not know our content"; memory handles "the model does not remember what the user said yesterday." Most teams asking for memory actually need better RAG.

Should I use Mem0 or MemGPT? Mem0 if you have a multi-user production system where extracted atomic facts are more valuable than raw conversation, and you want a library that drops in alongside an existing agent. MemGPT / Letta if you have long-running single-agent conversations where the OS analogy is illuminating and the team can absorb the conceptual overhead. Many teams should use neither and start with better RAG plus structured state in a database.

What is CoALA and do I need to read the paper? CoALA (Sumers et al., 2024) is the conceptual framework that gave the field its vocabulary: working memory, episodic memory, semantic memory, procedural memory. You do not need to implement CoALA (it is a framework, not a system), but you should read the first two-thirds of the paper before evaluating any memory product. It is the only way to know what category of memory the product actually handles and what it does not.

How do I evaluate a memory system? Run LoCoMo or LongMemEval on your candidate architectures with your model and your data, build a small ground-truth eval set from twenty real conversations, and measure both memory quality (was the fact stored and retrieved correctly) and decision quality (did the agent behave correctly given the context). Run the eval before launch and on a recurring cadence. Most teams skip this step and pay for it later.

Is A-Mem ready for production? For most teams, no. A-Mem is the most interesting recent architecture in this space, and it is the one we would default to least often in production. Graph operations are expensive, the behavior is harder to reason about than flat fact stores, and the evaluation discipline required to catch spurious links is significant. Read the paper. Prototype if you have a clear use case. Default to Mem0 or Letta for production if you need any memory layer at all.

Does memory add GDPR risk? Yes. A memory system stores user-derived data that has to be discoverable, exportable, and deletable on request. If your memory layer cannot honor a deletion request cleanly (delete a specific fact and any derivatives, propagate the deletion through the graph if there is one), it is not production-ready in the EU. Build the governance hooks before launch, not after.

Can I just use a longer context window instead? For session-scoped conversations, often yes. Sonnet 4.5's 200K window plus aggressive summarization handles a lot of cases that teams try to solve with memory. The cost is straightforward (input tokens) and the failure modes are visible. Memory becomes the right answer when the relationship genuinely extends across sessions and the persistent information is not better stored as explicit domain state.

Closing

The four architectures we walked through (CoALA as framework, MemGPT as OS-inspired system, Mem0 as production extraction layer, A-Mem as graph-structured frontier) are the serious answers the field has produced. Each is the right answer for some workloads. None is the right answer for most agent builds we audit.

The instinct to add memory is usually the wrong one. The upside is narrower than the marketing suggests, the cost is broader, and the layer duplicates state that often belongs in a database. It also obscures the question worth thinking carefully about before installing any architecture: what does your agent actually need to carry across time, and where does that state really belong.

When we do install memory, we install what matches the workload: Mem0 for multi-user systems where atomic facts matter, Letta for long-running single-agent conversations, occasionally a custom build when the off-the-shelf options do not fit. When we do not, we install better retrieval, better state management, and a clearer separation between what belongs in the context window, what belongs in RAG, and what belongs in Postgres.

If you are trying to figure out which of these matters for your team, that is the conversation we have in a Garden audit. Email a@gardenresearch.eu.