The Amnesia Tax

Every conversation with an AI agent costs tokens. Some of those tokens are your question. Most of them are context — the system prompt, workspace files, tool definitions, and conversation history the model needs to pretend it knows what’s going on. A typical OpenClaw session injects 15-50k tokens of context before you say a word. At Claude Opus pricing ($5/MTok input), that’s 7.5–25 cents per conversation just for the agent to remember its own name.
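The arithmetic behind those figures is worth making explicit. A quick sketch, using only the token counts and the $5/MTok price quoted above (not live pricing):

```python
# Per-session cost of context injection, using the figures quoted above:
# 15-50k context tokens at $5 per million input tokens.
def context_cost_usd(context_tokens: int, price_per_mtok: float = 5.0) -> float:
    """Dollar cost of injecting `context_tokens` at a given $/MTok price."""
    return context_tokens / 1_000_000 * price_per_mtok

print(context_cost_usd(15_000))  # 0.075 -> 7.5 cents per session
print(context_cost_usd(50_000))  # 0.25  -> 25 cents per session
# At 50 sessions a day, that is $3.75-$12.50 per day spent on
# re-injecting context the agent has already seen.
```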

Without persistent memory, the agent re-discovers the same facts every session. Who’s on the team. What channels exist. Which domains are allowed for email. What was decided last Tuesday. Each rediscovery burns tokens, burns time, and — more insidiously — occasionally gets things wrong because the context files haven’t been updated to reflect the latest decision.

This is the amnesia tax: the cumulative cost of an agent that can’t learn.

Key Takeaway

The amnesia tax isn’t just token cost. It’s the drift between what the agent should know and what the context window actually contains at any given moment. Memory systems close that gap.

Stateless vs. Stateful: A Taxonomy

Most AI agents today are stateless. Each request arrives, gets processed against whatever context was injected, and the response disappears into the void. The next request starts fresh. This is fine for one-shot tasks — “summarize this PR,” “write a regex” — but it falls apart for anything that requires continuity.

Stateful agents maintain knowledge across sessions. The question is how:

```mermaid
graph TD
    A[Agent Session Starts] --> B{Has Memory System?}
    B -->|No| C[Read context files only]
    B -->|Yes| D[Read context files]
    D --> E[Search memory for relevant facts]
    E --> F[Inject recalled memories into prompt]
    C --> G[Process user message]
    F --> G
    G --> H[Generate response]
    H --> I{Has Memory System?}
    I -->|No| J[Response sent. Nothing learned.]
    I -->|Yes| K[Extract facts from conversation]
    K --> L[Store new memories]
    L --> M[Response sent. Knowledge retained.]
```


There are three common approaches to agent memory:

| Approach | How It Works | Tradeoff |
| --- | --- | --- |
| Context files | Manually maintained markdown (MEMORY.md, USER.md) | High-quality but manual. Doesn’t scale. |
| Conversation history | Append full chat logs | Complete but expensive. Grows linearly forever. |
| Semantic memory | Extract and embed facts, retrieve by relevance | Scales well. Requires infrastructure. |

The interesting move is combining all three — context files for standing facts, conversation history for the current session, and semantic memory for everything in between.
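The combined approach amounts to layering three sources into one prompt. A minimal sketch of that assembly step (function and section names are illustrative, not from any particular framework):

```python
def assemble_prompt(context_files: list,
                    recalled_memories: list,
                    session_history: list,
                    user_message: str) -> str:
    """Combine the three memory layers into a single prompt.

    context_files:     standing facts, manually curated (MEMORY.md etc.)
    recalled_memories: semantically relevant facts from the memory store
    session_history:   the current conversation, verbatim
    """
    parts = []
    if context_files:
        parts.append("## Standing context\n" + "\n".join(context_files))
    if recalled_memories:
        parts.append("## Relevant memories\n"
                     + "\n".join(f"- {m}" for m in recalled_memories))
    if session_history:
        parts.append("## Conversation so far\n" + "\n".join(session_history))
    parts.append("## User\n" + user_message)
    return "\n\n".join(parts)
```

Each layer stays cheap: standing facts are small, recalled memories are top-k filtered, and only the current session's history rides along in full.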

How Semantic Memory Works

Semantic memory systems have two phases: capture and recall. During capture, the system watches conversations and extracts discrete facts. During recall, it searches stored facts by semantic similarity to the current conversation.

```mermaid
sequenceDiagram
    participant U as User
    participant G as Gateway
    participant LLM as LLM (Claude)
    participant M as Memory (Mem0)
    participant E as Embedder (Ollama)
    participant V as Vector Store (SQLite)

    Note over G: Session starts
    U->>G: "What did we decide about email domains?"
    G->>M: search("email domains", user="q-agent")
    M->>E: embed("email domains")
    E-->>M: [0.23, -0.41, 0.87, ...]
    M->>V: nearest neighbors(vector, k=5)
    V-->>M: ["acme.io and widgets.co only", ...]
    M-->>G: inject memories into prompt
    G->>LLM: [system + context + memories + user message]
    LLM-->>G: "We decided to restrict to acme.io and widgets.co..."
    G-->>U: response

    Note over G: After response
    G->>M: add(conversation, user="q-agent")
    M->>LLM: extract facts from conversation
    LLM-->>M: ["User asked about email domain policy"]
    M->>E: embed(new facts)
    E-->>M: vectors
    M->>V: upsert(vectors)
```

The critical insight is that the LLM does double duty: it generates responses and decides what’s worth remembering. This is more sophisticated than dumping everything into a database — the extraction step acts as an editorial filter. Mem0 uses the LLM to extract “memory-worthy” facts, compare them against existing memories for conflicts, and decide whether to add, update, or discard. It’s opinionated compression, not raw logging.
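In Mem0's self-hosted mode, that division of labor is wired up by naming a provider for each role. A hypothetical configuration matching the stack described here; the provider/config nesting follows Mem0's documented pattern, but exact keys and provider names vary by version, so verify against the Mem0 docs before use:

```python
# Hypothetical Mem0 self-hosted config. Provider names, model strings,
# and the vector store path are illustrative assumptions, not canon.
config = {
    "llm": {  # the extraction model; can be cheaper than the chat model
        "provider": "anthropic",
        "config": {"model": "claude-3-5-sonnet-latest"},
    },
    "embedder": {  # local embeddings via Ollama
        "provider": "ollama",
        "config": {"model": "mxbai-embed-large"},
    },
    "vector_store": {  # embedded store; this article's setup is SQLite-backed
        "provider": "qdrant",  # stand-in: use whatever your deployment supports
        "config": {"path": "/tmp/vector_store"},
    },
}

# Usage sketch (requires the mem0 package; commented out to stay self-contained):
# from mem0 import Memory
# m = Memory.from_config(config)
# m.add(messages, user_id="q-agent")            # capture: extract + store facts
# m.search("email domains", user_id="q-agent")  # recall: top-k relevant facts
```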

Tip

The extraction LLM doesn’t need to be the same model that generates responses. Using a smaller, cheaper model (like Sonnet) for memory extraction while keeping Opus for conversation is a common cost optimization.

The Embedding Layer

Embeddings are the bridge between human-readable text and machine-searchable vectors. When you store the memory “Alice’s preferred timezone is US/Pacific,” the embedding model converts it into a 1024-dimensional vector — a point in high-dimensional space where semantically similar concepts cluster together.

Later, when someone asks “what timezone is Alice in?”, that query gets embedded into the same space. The vector store finds the nearest neighbors, and “Alice’s preferred timezone is US/Pacific” surfaces as relevant — even though the words barely overlap.

This is the superpower of embeddings over keyword search: they understand meaning, not just text.
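Mechanically, "understanding meaning" reduces to vector similarity. A toy sketch with hand-made 3-dimensional vectors standing in for real 1024-dimensional model output, chosen so that related concepts point in similar directions:

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "embeddings": nearby meanings get nearby vectors.
memories = {
    "Alice's preferred timezone is US/Pacific": [0.9, 0.1, 0.0],
    "Deploys happen from the main branch":      [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "what timezone is Alice in?"

best = max(memories, key=lambda text: cosine(query, memories[text]))
print(best)  # the timezone memory wins despite little word overlap
```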

Local vs. Cloud Embeddings

There’s a meaningful architectural choice here:

| | Cloud Embeddings | Local Embeddings |
| --- | --- | --- |
| Latency | 50-200ms per call | 5-20ms per call |
| Cost | Per-token pricing | Zero marginal cost |
| Privacy | Data leaves your network | Data stays local |
| Maintenance | None | Model updates, server management |
| Quality | Generally higher (larger models) | Competitive for retrieval tasks |

For a memory system that fires on every conversation turn — both recall and capture — cloud embeddings put a per-token charge and a 50-200ms network round-trip on the critical path of every request. At typical prices (~$0.02/MTok) the dollar amounts stay small even at 50+ conversations per day; the latency, the external dependency, and the privacy exposure are the real costs. Local embeddings eliminate all three at the cost of ~670MB of disk space and negligible CPU.

A model like mxbai-embed-large — 334M parameters, 1024 dimensions, running locally through Ollama — matches or exceeds OpenAI’s text-embedding-3-large on standard retrieval benchmarks. For a memory system, where the primary operation is “find the most relevant stored fact for this query,” retrieval quality is the only metric that matters.

Definition

Embedding dimensions determine the resolution of the semantic space. More dimensions capture finer distinctions but require more storage and computation. 1024 dimensions is the current sweet spot for retrieval — enough resolution for nuanced similarity without the diminishing returns of 3072+ dimensions.

The Storage Question

Once you have vectors, you need somewhere to put them. The vector store landscape is crowded — Pinecone, Qdrant, Chroma, Weaviate, pgvector, LanceDB — and the choice matters less than you’d think for small-to-medium memory systems.

```mermaid
graph LR
    subgraph "Server-Based"
        Q[Qdrant]
        P[Pinecone]
        C[Chroma]
    end

    subgraph "Embedded"
        S[SQLite + brute force]
        L[LanceDB]
    end

    subgraph "Database Extensions"
        PG[pgvector]
    end
```


For an agent running on a single machine with fewer than 100K memories, an embedded solution wins on simplicity. SQLite with brute-force nearest-neighbor search handles tens of thousands of vectors in single-digit milliseconds. You don’t need a separate server process, you don’t need network calls, and your data is a single file you can back up with cp. Brute-force search over 10,000 1024-dim vectors takes ~2ms on an M-series Mac. Approximate nearest neighbor (ANN) indexes like HNSW only become necessary above ~100K vectors.
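A minimal version of that embedded store fits in a few lines of stdlib Python. The schema and function names below are illustrative (a real store would also keep metadata and timestamps), but the shape — float32 BLOBs, full scan, sort by cosine — is the whole technique:

```python
import math
import sqlite3
import struct

def pack(vec):
    """Serialize a float vector to a BLOB for SQLite storage (float32)."""
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    return struct.unpack(f"{len(blob) // 4}f", blob)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

db = sqlite3.connect(":memory:")  # a file path makes this `cp`-backupable
db.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, text TEXT, vec BLOB)")

def add(text, vec):
    db.execute("INSERT INTO memories (text, vec) VALUES (?, ?)",
               (text, pack(vec)))

def search(query_vec, k=5):
    """Brute-force nearest neighbors: full scan + sort. Fine below ~100K rows."""
    rows = db.execute("SELECT text, vec FROM memories").fetchall()
    rows.sort(key=lambda r: cosine(query_vec, unpack(r[1])), reverse=True)
    return [text for text, _ in rows[:k]]

add("Alice's timezone is US/Pacific", [1.0, 0.0])
add("Deploys happen from main",       [0.0, 1.0])
print(search([0.9, 0.1], k=1))  # -> the timezone memory
```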

The scale-up path exists when you need it — migrate to pgvector for database integration, or Qdrant for dedicated vector operations — but premature infrastructure is premature infrastructure.

Memory Lifecycle

A memory system isn’t a write-once archive. Memories have a lifecycle: they’re created, they’re recalled (which validates their utility), they conflict with newer information, and sometimes they become stale.

```mermaid
stateDiagram-v2
    [*] --> Extracted: LLM identifies fact
    Extracted --> Compared: Check against existing memories
    Compared --> Added: New fact, no conflicts
    Compared --> Updated: Conflicts with existing memory
    Compared --> Discarded: Duplicate or trivial
    Added --> Active
    Updated --> Active
    Active --> Recalled: Used in a session
    Recalled --> Active
    Active --> Stale: No recalls in 90+ days
    Stale --> Reviewed: Health check flags it
    Reviewed --> Active: Still relevant
    Reviewed --> Deleted: Outdated
    Discarded --> [*]
    Deleted --> [*]
```

The conflict resolution step is where most memory systems differentiate. When a new conversation reveals that “the team now uses LanceDB” but an existing memory says “the team uses in-memory vector storage,” the system needs to decide: update the old memory, keep both, or flag the conflict for human review.

Mem0 handles this through the extraction LLM — it sees both the new information and the existing memories, and decides whether to ADD, UPDATE, or ignore. This is imperfect (LLMs make judgment calls, and judgment calls can be wrong), but it’s dramatically better than the alternative of unbounded accumulation.
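The shape of that decision can be sketched as a similarity-threshold heuristic. To be clear, this is not Mem0's actual logic: the `difflib` text similarity and the thresholds below are cheap stand-ins for the LLM's judgment, useful only to make the ADD/UPDATE/DISCARD branching concrete:

```python
from difflib import SequenceMatcher

def decide(new_fact, existing):
    """Return (action, target): ADD a new fact, UPDATE a conflicting one,
    or DISCARD a near-duplicate. Thresholds are illustrative."""
    best, score = None, 0.0
    for fact in existing:
        s = SequenceMatcher(None, new_fact.lower(), fact.lower()).ratio()
        if s > score:
            best, score = fact, s
    if score > 0.9:
        return ("DISCARD", best)  # effectively a duplicate
    if score > 0.5:
        return ("UPDATE", best)   # same topic, new information
    return ("ADD", None)

print(decide("the team uses LanceDB for vectors",
             ["the team uses SQLite for vectors"]))
# ('UPDATE', 'the team uses SQLite for vectors')
```

In production the comparison runs against semantically similar memories retrieved from the vector store, not against the full set, which keeps the step cheap as the store grows.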

Warning

Memory systems that only add and never update or delete will accumulate contradictions. A memory from March saying “we use Google embeddings” sitting alongside a memory from April saying “we switched to Ollama” creates confusion. The extraction LLM’s ability to resolve conflicts is not a nice-to-have — it’s load-bearing.

The Value Proposition

So what does all this buy you? Three things, in order of how quickly you notice them:

1. Reduced context overhead. Instead of cramming every possible fact into workspace files (which get injected into every session), you store facts in memory and recall only what’s relevant. A session about email policy doesn’t need to load Slack channel IDs. A session about deployment doesn’t need team member bios.

2. Organic knowledge accumulation. The agent learns from every conversation without anyone manually updating MEMORY.md. Decisions get captured. Preferences get recorded. Corrections persist. Over weeks, the agent develops a progressively richer understanding of the team, the codebase, and the domain.

3. Institutional continuity. When a new team member joins, the agent can answer “what did we decide about X?” with actual history — not just what someone remembered to document. When the PM arrives, they inherit months of captured decisions, preferences, and patterns.

Example: Memory in Practice
@Alex Hey Q, what embedding model are we using for memory?
@Q mxbai-embed-large, running locally through Ollama. 1024 dimensions, about 670MB on disk. We switched from Google's gemini-embedding-001 on March 20th after the billing issue. The vector store is SQLite-backed at ~/.mem0/vector_store.db.
@Alex When did we decide that?
@Q Last Thursday night. You and I evaluated mxbai-embed-large vs snowflake-arctic-embed. Went with mxbai — better MTEB benchmarks for retrieval, single well-tuned model vs. a size suite. I re-indexed 78 existing memories, which took two partial runs because the exec timeout kept killing the process. All 150 memories are indexed now.

That second response — with the date, the alternatives considered, and the specific reasoning — isn’t coming from a context file. It’s coming from memories captured during the actual conversation where the decision happened.

Architecture for Small Teams

For a team of 5-10 people with one AI agent, the architecture is refreshingly simple:

```mermaid
graph TD
    subgraph "Agent Host (Mac Studio)"
        G[OpenClaw Gateway]
        M[Mem0 Plugin]
        O[Ollama Server]
        VS[(SQLite Vector Store)]
        HDB[(History DB)]
    end

    subgraph "External Services"
        A[Anthropic API]
        SL[Slack]
    end

    SL -->|messages| G
    G -->|conversation| A
    A -->|response| G
    G -->|extract & recall| M
    M -->|embed| O
    M -->|store/search| VS
    M -->|audit trail| HDB
    G -->|response| SL
```


The only external dependency is the LLM API. Embeddings are local (Ollama). Storage is local (SQLite). The memory system adds no new cloud services, no new billing accounts, and no new failure modes beyond “is the Ollama process running?” Running Ollama as a macOS LaunchAgent with KeepAlive=true ensures it survives reboots and process crashes. Total resource footprint: ~670MB disk, negligible idle CPU, ~200MB RSS when serving requests.

This is Gall’s Law in practice: a complex system that works, evolved from a simple system that worked. Start with context files. Add semantic memory when context files stop scaling. Add local embeddings when cloud embedding costs or privacy concerns justify it. Each layer proves itself before the next one arrives.

Key Takeaway

The best infrastructure is the infrastructure you forget is there. A memory system should be invisible in normal operation — the agent just knows things it learned last week. The only time you notice it is when you ask “how did you know that?” and the answer is “I remembered.”

What Could Go Wrong

No system is without failure modes. Memory systems introduce a few worth naming:

Stale memories. A fact captured in January that’s no longer true in March. Without active maintenance (health checks, freshness reviews), stale memories poison the well. The agent confidently states something that was true six months ago.

Duplicate accumulation. If the extraction or dedup logic isn’t robust, the same fact gets stored multiple times with slightly different wording. This wastes storage and can skew search results toward over-represented topics.

Hallucinated memories. The extraction LLM might infer a fact that wasn’t explicitly stated — “user prefers dark mode” from a conversation about UI themes. These phantom memories are rare but insidious because they look authoritative.

Embedding drift. If you change embedding models (say, from 1536-dim Google embeddings to 1024-dim Ollama), all existing vectors become incompatible. You need to re-embed everything — a migration, not an upgrade.
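The re-embedding migration itself is mechanically simple; the care is in discarding every old vector rather than mixing models. A sketch, with `new_embed` standing in for the replacement model's API (the toy embedder here exists only to make the loop runnable):

```python
def reembed_all(memories, new_embed):
    """Rebuild every vector with the new model. Old vectors are discarded:
    vectors from different models (or dimensions) are not comparable."""
    migrated = {}
    for text in memories:  # iterate over stored texts, not the stale vectors
        migrated[text] = new_embed(text)
    return migrated

# Toy stand-in for the new embedder (the real one: a 1024-dim model via Ollama).
fake_new_embed = lambda text: [float(len(text)), 0.0]

old = {"fact A": [0.1] * 3, "fact B": [0.2] * 3}  # pretend 3-dim old vectors
new = reembed_all(old, fake_new_embed)
```

Because this is a full rewrite of the store, it is worth running against a copy and swapping files atomically once the new index passes a search-quality check.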

The mitigation for all of these is the same: automated health checks. A weekly cron that tests search quality, counts duplicates, flags stale entries, and verifies the embedding service is alive. Monitor the system that monitors everything else.
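The check does not need to be elaborate. A sketch covering two of the failure modes above, over an in-memory list of `(text, days_since_last_recall)` records — the record shape and the 90-day threshold are illustrative choices, not a standard:

```python
from collections import Counter

def health_report(memories, stale_after_days=90):
    """Count exact duplicates (after normalization) and stale entries.
    memories: list of (text, days_since_last_recall) tuples."""
    texts = [t.strip().lower() for t, _ in memories]
    dupes = sum(c - 1 for c in Counter(texts).values() if c > 1)
    stale = sum(1 for _, days in memories if days > stale_after_days)
    return {"total": len(memories), "duplicates": dupes, "stale": stale}

report = health_report([
    ("we use ollama for embeddings", 3),
    ("We use Ollama for embeddings", 10),   # duplicate (differs by case only)
    ("we use google embeddings", 200),      # stale, likely outdated
])
print(report)  # {'total': 3, 'duplicates': 1, 'stale': 1}
```

A weekly cron that runs this, plus a canned search whose expected top hit is known, covers staleness, duplication, and "is the embedder alive" in one pass.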

Summary

  1. The amnesia tax is real — stateless agents re-discover facts every session, burning tokens and occasionally getting things wrong.
  2. Semantic memory closes the gap — LLM-powered extraction captures facts from conversations; embedding-based retrieval surfaces them when relevant.
  3. Local embeddings are viable — models like mxbai-embed-large match cloud quality for retrieval tasks at zero marginal cost.
  4. Start simple — SQLite vectors, local embeddings, automated health checks. Scale infrastructure only when you've proven the value at each layer.
  5. Maintain what you build — memory systems need monitoring. Stale facts, duplicates, and embedding drift are predictable failure modes with straightforward mitigations.

Discussion Prompts

  • What facts does our agent re-discover most often? Which of those should be in persistent memory vs. context files?
  • How should we handle memory conflicts — should the agent auto-resolve, flag for review, or keep both versions with timestamps?
  • As we add more team members, should memory be per-user, per-project, or shared? What are the privacy implications of a shared memory store?

References

  1. Mem0 Open Source Documentation — Architecture overview and configuration guide for self-hosted Mem0 deployments.
  2. mxbai-embed-large-v1 Technical Report — Mixedbread's BERT-large embedding model, SOTA on MTEB at time of release, outperforming OpenAI text-embedding-3-large.
  3. Ollama — Local LLM and embedding model server. Supports macOS, Linux, and Windows with GPU acceleration.
  4. Anthropic Prompt Caching — How prompt caching reduces the token cost of repeated context injection — the primary cost driver for stateless agents.
  5. MTEB Leaderboard — Massive Text Embedding Benchmark. The standard comparison for embedding model quality across retrieval, classification, and clustering tasks.
  6. Gall's Law — "A complex system that works is invariably found to have evolved from a simple system that worked." The architectural principle behind incremental memory system design.