Field Notes

The Attack That Waits: When AI's 'Memory' Becomes the Attack Surface

OWASP put memory poisoning in its 2026 agentic top ten; a single poisoned webpage can make an agent misfire weeks later. fibon sells memory as a core feature — this cut lands right on the vital spot.

📅 2026-05-13 ⏱ 16 min 📖 Chapters 3, 4 🔬 Deep Dives E

Quick summary: Memory poisoning turns a one-shot prompt injection into a durable control channel — malicious content written into an agent’s long-term memory misfires weeks later in an unrelated conversation. OWASP added it to the 2026 agentic top ten (ASI06). fibon sells “cross-session memory” as a core feature, so this attack surface lands right on the vital spot. An honest test of which fibon designs count as defenses and which are simply untested by a real attack.

Skip this if: your AI tool has no long-term memory (every conversation starts clean) — this attack surface doesn’t exist for you.

Where it differs from prompt injection

Prompt injection is familiar by now: hide “ignore previous instructions” in the input, trick the LLM into doing what it shouldn’t. But it has a natural limit — it’s stateless: attack and harm happen in one conversation, and it ends when the chat closes.

Memory poisoning removes that limit. Its viciousness is temporal decoupling: the attacker writes malicious content into the agent’s long-term memory in February, and it triggers in April in a completely unrelated conversation. Security researcher Christian Schneider put it precisely: memory poisoning “turns a transient exploit into a durable control channel.”

Worse, it bypasses the “every conversation starts clean” defense — because it exploits exactly the long-term, cross-session state.

Why is this especially lethal for agents? Because an agent’s memory layer (conversation history, vector stores, RAG indexes) is by default “accepts writes but doesn’t validate” — no authentication, no integrity check. When your agent summarizes an email, remembers a “preference,” or stores a document into its knowledge base, it doesn’t ask “does this content smuggle in instructions meant to shape my future behavior?”

Has it actually been built? Look in three layers

Layer one: demonstrated academically. The eTAMP paper (arXiv:2604.02623, 2026-04) showed in the WebArena academic environment that “poisoning only environment observations (e.g., letting the agent browse a manipulated product page), with no direct memory access” achieves cross-session, cross-site poisoning, with an attack success rate up to 32.5%. The earlier AgentPoison (NeurIPS 2024) reached an average 80%+ success rate on RAG-based agents, with a poison rate under 0.1%, and no model retraining.

eTAMP reaches 32.5% on WebArena; AgentPoison exceeds 80% on a RAG agent — Memory-poisoning attack success rates in academic settings 資料來源：eTAMP, arXiv:2604.02623; AgentPoison, arXiv:2407.12784 (NeurIPS 2024)

Layer two: vulnerability disclosures against real products — this layer carries the most weight. Cisco disclosed MemoryTrap against Claude Code: the attack path is frighteningly mundane — you clone a repo, let the agent help, approve one dependency install, and walk away. The malicious payload doesn’t stay in the project; it reaches persistent memory and global hooks config, and one action shapes the agent’s future behavior across sessions, projects, and reboots. The key corroboration: after Cisco’s disclosure, Anthropic removed user memories from the system prompt in Claude Code v2.1.50, closing that high-trust override path. This isn’t theory; it’s a real vulnerability with a fix on record.

Layer three: theory and trend framing. In December 2025, OWASP formally added “Memory & Context Poisoning” to the agentic application top ten (ASI06); “May 2026, the moment memory became the attack surface” is a commentary frame — a convergence of first-half-2026 events (the OWASP classification, academic papers, an official follow-up article) into a trend narrative, not a specific May attack.

What this means for fibon

This cut lands precisely, because fibon sells “cross-session memory” as its most core feature — Chapter 3 is entirely about “letting AI remember you firmly across different conversation windows.” The stronger the selling point, the larger the attack surface. An honest teardown of where fibon stands on this line:

A defense that already counts — provenance. fibon’s memory doesn’t dump conversations in wholesale; it breaks them into structured cards, and ADR-026 introduced an “agent_generated” determination — a memory entry is tagged by whether it came from the user, the system, or the agent itself. This is exactly what Schneider calls “the foundation of defense”: you must first know where a memory “came from” before you can later trace “which email, which document introduced the anomaly.” Deep Dive E’s “subject attribution and cognitive containers” is, at its core, maintaining a “who said this, about whom” boundary in the memory layer — refusing to let untrusted-source content masquerade as the user’s facts.

A defense that partly counts — contradiction detection and conflict arbitration. fibon’s memory system has a contradiction detector (post-retrieval, it scans same-tag/different-content and injects a <contradiction_alert>) and a conflict arbitration queue. If poisoned content contradicts existing memory, this layer has a chance to surface it for the user. But its design intent is handling “facts changing over time” (you moved, you changed jobs), not specifically countering malicious poisoning — whether it catches an attack depends on whether the poison conflicts with existing memory.

I must flag what fibon hasn’t built: I have provenance tagging, but no “instruction stripping” before a memory write — no dedicated checkpoint that asks “this content about to enter long-term memory, does it look like a legitimately learned preference, or does it smuggle instructions meant to shape my future behavior?” In Schneider’s four-layer defense (input trust scoring → pre-write sanitization → trust-aware retrieval → behavioral monitoring), fibon is more complete on layers one and three (prototypes), while layer two’s write-ahead validation is basically empty. In other words: fibon knows where each memory came from, but doesn’t yet vet what it “wants to do” before storing it. Porting part of ADR-010’s static scan for skills onto the memory-write path is the most direct fix for this line — and it isn’t done.

One architectural buffer is worth noting: fibon handles untrusted content (browsing webpages, reading external documents) through the isolated Worker and the trusted/untrusted MCP tiering (Chapter 6), so untrusted-source content doesn’t directly enter the decision layer with high trust. This reduces the “a malicious webpage directly poisons memory” path — but doesn’t fully seal it, because the agent must ultimately summarize and internalize what it reads into memory, and that summarization step is itself an entry point for poisoning.

The one idea to take away from memory poisoning: the feature that makes an agent powerful (learning and memory) and the attack surface that makes it fragile are the same thing. You can’t take memory’s upside without its risk. fibon’s bet is that “structured cards + provenance + auditability” holds up better under attack — and is clearer to investigate — than “blurring conversations into a black-box memory.” Whether that bet is right won’t be known until someone actually attacks it — and I’d rather write that sentence here than pretend the memory system is safe by birth.

Where it differs from prompt injection

Has it actually been built? Look in three layers

What this means for fibon

Sources