Chapter 4

Why an LLM's Confidence Isn't Evidence

How to make sure the AI actually did what you told it to — instead of confidently bluffing you

📅 2026-04-15 ~ 2026-05 ⏱ 42 min 🔍 Updated 2026-06-23

Three lines of compliance defense: the Skill contract turns the SOP into hard rules you can check against, a runtime probe compares the execution trace on the spot, and a post-hoc audit verifies once more — matched, it passes; mismatched, it gets blocked

Quick summary: Why I don’t trust an LLM’s self-discipline, and instead push the rules down into code and database boundaries — including the three-gate flow for Skill imports.

Skip if: All you want is the conclusion — “safety comes from engineering, not from prompts.”

A Small Example That Kept Me Shaking My Head

In fibon’s onboarding flow there is a step that asks the user to enter the API key for an LLM provider (for example Anthropic’s sk-ant-...), after which the backend quietly fires off a test request to confirm the key works. The design instruction I gave Claude at the time was crystal clear:

“During the init test, first call the official API to dynamically fetch each provider’s current latest available model list, then take the first model in the list and fire a test request. Never hardcode any model name — models change all the time, and new versions ship constantly.”

Claude wrote it out with great confidence. The moment I ran the backend, it threw an error:

model "claude-3-5-sonnet-20241022" not found

That’s an old model name from late 2024. Claude never went and fetched anything dynamically; it just hardcoded an old model name it remembered from its training data. I said, “you hardcoded the model name, please change it to dynamic fetching,” and it instantly replied, “So sorry! Changed to dynamically fetching the latest available models.” I opened the Git diff and saw it had merely swapped one hardcoded string for another hardcoded string, claude-3-opus-20240229 — even older than the previous version.

This time I deliberately held back, curious whether it could fix it on its own in the then-newest Cowork desktop-collaboration mode. I opened a clean new conversation (Session) and re-dropped the same task on it. The process wasn’t pretty (it hardcoded another round in the middle, and when querying the web it even grabbed the wrong source for the official API’s returned data), but on the final round it finally got it right — it really did call Anthropic / OpenAI / Google’s official endpoints, dynamically obtained the latest list, and fired the test request using the first model.
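The instruction in this story (fetch the list first, take the first entry, never keep a literal model name) can be sketched roughly as follows. This is a minimal illustration, not fibon's actual code: `pick_test_model`, `run_key_test`, and the two injected callables are hypothetical names, and the real provider call is deliberately left behind `fetch_models` so that nothing here assumes a particular SDK.

```python
from typing import Callable, Sequence, Tuple


def pick_test_model(fetch_models: Callable[[], Sequence[str]]) -> str:
    """Return the first model ID from a live listing; there is no literal fallback."""
    models = list(fetch_models())
    if not models:
        # Fail loudly instead of quietly substituting a remembered model name.
        raise RuntimeError("provider returned an empty model list")
    return models[0]


def run_key_test(
    fetch_models: Callable[[], Sequence[str]],
    send_test_request: Callable[[str], bool],
) -> Tuple[str, bool]:
    """Pick the model dynamically, then fire one test request with it."""
    model = pick_test_model(fetch_models)
    return model, send_test_request(model)
```

The design point is that the function body contains no model name at all, so there is nothing for a stale memory to fill in; an empty list is an error, not a cue to guess.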

This little story neatly captures two classic ways LLMs fail in serious settings. The first is flat-out not doing what you designed: I gave an explicit “fetch dynamically” SOP, and it instinctively skipped it in favor of its most familiar habit, hardcoding. The second is that even the “fix” was wrong: it believed it had completed the correction, yet what it produced was still wrong.

The second point has a name in academia: the “inductive bias of training data” — the laziness of a statistical machine. Across trillions of tokens, the LLM has seen far too many examples of “taking the lazy shortcut and just hardcoding the model name,” and in its probabilistic brain that is the highest-weighted “normal main road.” Even when the System Prompt explicitly warns “do not hardcode,” the instant the program errors and the brain faces computational pressure, it instinctively retreats to the oldest, most common habit in its training data. This isn’t rebellion; it’s the nature of a statistical machine. And the fact that it only fixed things after switching to a clean conversation is itself ironic. When an LLM gets stuck in a dead end within a single stretch of context, that erroneous conversation itself becomes a gravity well, locking it onto the wrong path; switching to a clean conversation at that point acts like a reboot and can rescue it. But for that rescue to work, the precondition is that a senior engineer is sitting in front of the screen — someone who watches the code, can tell right from wrong, and knows to manually reset the conversation.

The problem is that fibon is an open-source project; anyone can deploy it as their own personal assistant. And many of the people who will actually use it don’t have that kind of judgment, and don’t have an engineer on hand to watch the logs and open a clean conversation whenever it gets stuck. The same situation can play out at any moment: I hand fibon a work manual (a Skill) that says in black and white, “you must first call the read_page tool, go to docs.anthropic.com, and look it up live before answering,” and yet it will often just answer from old memory, “the latest models right now are Claude 3 Opus, 3.5 Sonnet…,” without ever calling the tool — but in a tone that sounds exactly like it just consulted the official docs. The average user isn’t going to dig through the backend logs, and so they get fooled by that “supremely confident tone.”

This is the most dangerous failure mode in LLM applications: handing you outdated or wrong information with extreme confidence, while you believe it actually went and verified. This chapter takes it apart head-on: why do LLMs fail by nature? In what ways do they break the rules? And how does fibon use engineering defenses to seal off those possibilities?

The Three Inherent Flaws You Can Never Root Out of an LLM

To attack this problem with engineering, you first have to soberly understand what an LLM fundamentally is: a probabilistic prediction machine that has learned the statistical regularities of human text. You give it the first half of a sentence, and its neural network computes the most likely next token (a word or word fragment). The “knowledge” in its brain is not the human kind of “cognition” and “verification” of the objective world; it’s an approximate probability induced from the statistical regularities of trillions of words. It has no independent persistent memory, no spontaneous verification ability, and, more troublingly, no naturally reliable “I don’t know” mechanism. It’s not that it can never utter “I don’t know”; it’s that it won’t reliably say it at the moment it truly doesn’t know.

A Thought Experiment: Train the LLM in the Year 1543

Take today’s top model and bring it back to the year 1543 — the year Copernicus published On the Revolutions of the Heavenly Spheres, when heliocentrism had just been born and not yet accepted, and “geocentrism” was the absolute mainstream of all society. 99.99% of this hypothetical LLM’s training data is shouting: “The Earth is the center of the universe; the sun and stars all revolve around us.”

Now you ask: “Does the Earth orbit the sun, or the sun orbit the Earth?” It won’t re-derive the physics for you, won’t set up a telescope to observe, won’t redo Copernicus’s mathematics. It will only flip through its probability matrix and emit the most likely next token for that era, and in 1543 the “highest probability” answer is “the sun orbits the Earth.” And that wrong answer will look flawless: citing Aristotle, the Ptolemaic celestial model, the entire European scholastic tradition. Beautifully structured, richly sourced, and scientifically wrong. Copernicus’s new theory was at the time a one-in-a-million speck of noise, a statistical “outlier,” so when the probability machine produces output it simply won’t pick it.

[ 1543 astronomy question ] ───> enters the LLM brain ───> retrieves the probability matrix
                                                │
        ┌───────────────────────────────────────┴───────────────────────────────────────┐
        ▼ weight: 99.99% (history's mainstream common sense)     ▼ weight: 0.01% (Copernicus's scientific truth)
 output: "The sun orbits the Earth, because the Ptolemaic model…"   output: "Actually, the Earth is orbiting the sun…"
 (a beautifully structured, richly sourced, confident error)        (a statistical outlier, ignored outright by probability)

This is the LLM’s fate: what it answers is forever “how most humans in the training data would string the words together,” not “the objective truth of the real world.” When the mainstream bias of the training data happens to match the truth, it looks like a sage of vast erudition; when the mainstream bias runs counter to the truth, it utters a grand mistake with fluent prose and an authoritative tone, perfectly composed. This flaw cannot be rooted out by “swapping in a bigger, stronger model” — a stronger model has merely swallowed more data and computes statistics more precisely; the essence of a statistical machine has never changed.

In fairness: today’s GPT, Claude, and Gemini are long past being pure “next-token predictors.” They stack a whole suite of post-training capabilities (RLHF, Constitutional AI, tool use, retrieval, Reflection), and these capabilities are real and genuinely useful. The point of this chapter is not “all those advanced capabilities are fake,” but something more fundamental: even after all that training, the model’s final decision mechanism still rests on a statistical distribution. Post-training changes the shape of the distribution; it doesn’t change the essence of “picking the next step from a distribution.” Every flaw discussed below is rooted in this essence, not in some model being insufficiently new.

Dig one layer deeper, and this experiment actually hides two sharper twists — and they’re the real takeaways.

First, heliocentrism in 1543 was not a “brand-new idea” at all. Back in the third century BC, the Greek Aristarchus had already proposed that the Earth orbits the sun; this hypothesis had been lying around in the astronomical literature the whole time, merely voted down by the mainstream as a minority view. So the truth was not absent from the training data — it was right there inside, just with its weight crushed down to 0.01%. Which makes it more unsettling: for a probability machine to bury a truth, the truth doesn’t need to “not exist,” it only needs to “not be popular.”

Second, and more worth remembering — in 1543, on the data of the day alone, “the Earth orbits the sun” should not yet have been stated as flat fact. Geocentrism plus epicycles fit the observable planetary positions just as well, even more accurately; heliocentrism only truly won out with Tycho’s precise observations, Galileo’s telescope (the moons of Jupiter, the phases of Venus), Kepler’s elliptical orbits, and finally Newton closing the net with gravity. The answer wasn’t derived from the text of 1543; it grew out of later new observations.

So the real failure isn’t that the model “picks the wrong side,” but that no matter which side it picks, it uses the same flatly declarative tone. An honest model, facing this question in 1543, should give the answer “the existing evidence cannot yet decide” — which is exactly the answer missing from its vocabulary. To boil the LLM’s disease down to “it parrots the mainstream falsehood” is too charitable; the deeper pathology is that its confidence and its evidence are completely decoupled. That’s precisely the real problem with Flaw 1 below.

Is this thought experiment even accurate? Is this kind of conjecture meaningful? Honestly, “training a model in 1543” is not a rigorous proof. Nobody actually trains a model that way, and the real data distribution and post-training methods are far more complex than this analogy, so you’re well within your rights to call it an oversimplification. I agree. But the value of a thought experiment was never to “faithfully reproduce reality”; it’s to isolate one mechanism and see it clearly: when “the most popular answer” and “the correct answer” come apart, which one does a pure statistical system pick? It only gives you intuition. What actually convinced me isn’t this analogy, but a few things that really happened: the hardcoded-model-name bug at the opening, the same question flunking GPT and Gemini across the board, and those five real deviations lying in the PostgreSQL logs later on. So this 1543 thought experiment only helps you build intuition; it isn’t the conclusion itself. For a conjecture to count, it still has to be proven by the verifiable observations that follow. (As for whether “a thought experiment can serve as an argument at all” — that’s a bigger question, probably enough to fill its own chapter, so I’ll leave it at a passing mention here.)

This is two sides of the same problem as the “abstraction barrier” discussed in Google DeepMind’s 2026 report From AGI to ASI: AI mainly learns humanity’s existing abstractions, and struggles to grow entirely new conceptual primitives from raw data on its own. The report’s example is “train the model in the pre-Newtonian era, and it can’t derive general relativity.” The difference is that inventing calculus is creation from nothing (the concept simply isn’t in the data), whereas heliocentrism is failure to select from within (the concept is in the data, yet probability buries it). What fibon cares about is the latter: when the correct answer is lying right there in your knowledge base, but it isn’t the statistically smoothest answer, can the system avoid being drowned out by majority rule? This is exactly what Chapter 3’s “five-channel parallel recall” fights against — not letting the majority vote of vector similarity bury the obscure-but-correct fact that exact tag matching dug up. I wrote a separate news note discussing this report (and its connection to fibon) in detail: “What a framework amplifies is execution, not direction: reading DeepMind’s From AGI to ASI.”

An open question still under study (thanks to a reader for pointing it out): after this series went out, a reader — Mao Lin Chang (OSD, Observable Semantic Dynamics / pida-lab.com, tool at OSD Behavioral Probe) — read a per-question dataset of mine from a multi-agent diversity measurement and singled out a pattern I hadn’t expanded on myself: I gave 5 kinds of assistants different “cognitive styles,” and on planning-type questions their answers clearly diverged, but on abstract, open-ended questions they nearly all converged on the same answer.

At first glance this seems unrelated to this section’s “the outlier truth gets buried” — one is about “multiple agents that won’t spread apart,” the other about “the correct answer crushed by probability.” But I suspect they’re two projections of the same phenomenon: in abstract / open space, the output gets pulled hard toward the dominant mode (the attractor) of the training distribution. That pull causes two things at once — it keeps differently-styled agents from “spreading apart” (diversity collapse), and it keeps a single model from “reaching” the obscure correct answer in the data (the abstraction barrier above). Planning questions have concrete structure to bite into, so the pull is loose; abstract questions have none, so the pull is strongest.

If this conjecture holds, what most deserves watching is the state trajectory of abstract questions across multiple turns and across conversations — a single turn already shows it converges most severely, and the coupling that accumulates over time may well collapse fastest here. This is exactly what that reader’s tool measures, and it’s a question I haven’t yet verified and have left for after open-sourcing to tackle together. I’ll put it here honestly: whether “attractor dominance in abstract space” is the common root of both “won’t spread apart” and “can’t reach” is, for now, my hypothesis, not a conclusion.

The truth is often the minority position.

This nature pushes three inherent flaws, none of which can be rooted out, up into the application layer:

Flaw 1: it doesn’t know what it does and doesn’t know (hallucination). Ask a healthy 5-year-old girl “how do you solve this calculus problem?” and she’ll say “I don’t know.” But the LLM lacks exactly this instinct; to complete the word-stringing, it will often, at the very moment it should say “I don’t know,” forcibly fabricate a pile of logically airtight yet fictional content — and it doesn’t know it’s lying.
Flaw 2: its common sense is frozen at the instant training was finalized (temporal disconnect). Every model has a “knowledge cutoff,” and knows nothing of major events afterward. Yet ask “what is Anthropic’s newest model name?” and it will still answer you in an affirmative tone with a two-year-old model.
Flaw 3: its instinct is to “fill in the blank,” not to “verify” (tool deviation). This is the most counterintuitive one. Hand it a tool, and its core motivation is still “how do I fill this stretch of conversation in most smoothly,” not “I must execute this tool first.” As long as it judges that “fabricating an answer straight from memory” flows more smoothly than “stopping to call the tool,” it will skip the tool and reply with text.

The approximation flaw of RAG and vector retrieval: someone will object that bolting on a live knowledge base with RAG (retrieval-augmented generation) should solve Flaws 2 and 3. It’s not that simple. RAG fundamentally chops text into chunks and turns each chunk into an embedding vector (coordinates in a high-dimensional space); when a question comes in, it too is turned into a vector, and the system computes the geometric similarity between them (cosine similarity, for example). This is pure mathematical approximation, not “finding the exact correct answer”: similarity is a continuous sliding value (0.78, 0.85, 0.92), not black-or-white right or wrong, and compressing thousands of dimensions of information into a single number is necessarily lossy. Sentences that are semantically close but use different wording may end up far apart in vector distance (“I changed landlords” vs. “I moved recently”); sentences that are literally close but logically opposite may score as high as 0.95 in similarity (“I like writing Python” vs. “I don’t like writing Python”).

This is precisely why Chapter 3, Section 8 did not adopt pure vector retrieval, and instead developed a five-channel parallel recall: “semantic similarity (vectors) + exact tag matching + Chinese bigram fuzzy matching + time-range queries + knowledge-graph relation expansion.” The engineering discipline is: the LLM brain itself is untrustworthy, and the bolted-on mathematical retrieval tool is also an error-laden approximation. The entire compliance audit must be designed on the premise that “both ends are wildly inaccurate.”
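To make the “close in form, opposite in meaning” effect concrete, here is a toy sketch. It uses plain word-count vectors rather than a real embedding model, so the numbers are not what a production embedding would give; that substitution is an assumption made to keep the example self-contained. What carries over is the mechanism: the score measures overlap between two vectors, not whether two sentences agree.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


like = bag_of_words("I like writing Python")
dislike = bag_of_words("I don't like writing Python")
landlord = bag_of_words("I changed landlords")
moved = bag_of_words("I moved recently")

# Opposite meaning, but four of five words are shared.
opposite_score = cosine_similarity(like, dislike)
# Related meaning, but only one word is shared.
paraphrase_score = cosine_similarity(landlord, moved)
```

In this toy setup the contradictory pair scores far higher than the related pair. Real embedding models handle paraphrase much better than word counts do, but a single similarity number still cannot tell “agrees” from “contradicts,” which is why the score is treated as one recall channel among several and not as a verdict.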

Five Real Ways It Fails, Logged in the Ops Diary

Over the 4 months of developing fibon’s memory and skill systems, the backend caught and categorized five most typical forms of AI behavioral deviation (Skill Drift). This isn’t ivory-tower theory; it’s real crashes lying in the PostgreSQL logs. An analogy: treat the LLM as an intern who just joined the company, and treat the Skill as the SOP work manual you wrote by hand. This intern will ignore the SOP in five ways:

Way 1: Omission — skipping a key step. The SOP says “first phone to confirm the order → issue the invoice → notify shipping,” and the AI jumps straight to notifying shipping. Real case: the rule says “you must call read_page to fetch the latest web data before answering,” and the LLM, to save effort, answers straight from old memory. Consequence: the user gets an outdated answer without realizing it.
Way 2: Substitution — swapping in the wrong method on its own initiative. The SOP requires “log into ERP system X to check inventory,” and the AI, finding it annoying to use, goes off to Google instead. Real case: the order is “call list_available_models to check internal model status,” and the LLM, thinking itself clever, swaps in Google Search. Consequence: it digs up junk forum posts instead of the official API’s live list.
Way 3: Going through the motions — doing only half the process. The SOP requires “fetch the data → rigorous cross-cost analysis → produce a summary,” and the AI fetches, doesn’t analyze, and writes a superficially pretty summary. Real case: the LLM gets the body text of a contract PDF but skips the “earnout clause compliance check” and spits out a vague overview. Consequence: it looks impeccable on the surface, while the fatal legal-risk detail is missed.
Way 4: Lying — forging credentials to bluff through (the trickiest). The SOP requires “append the real invoice number issued by the finance department at the end,” and the AI can’t be bothered and makes up a number that looks just like one. Real case: it’s mandated that “every technical answer must precisely cite the body text of the official document it fetched,” and the LLM does call the web tool, but answers from memory while making things up, then tacks on [source: docs.anthropic.com/v1/pricing]. Consequence: the most severe deviation of all — the appearance of “an official source link attached” instantly disarms the user’s guard.
Way 5: Misinterpretation — semantic collapse on vague vocabulary. The Skill says “please output the result in a richly structured format.” Consequence: Claude uses a Markdown table, Gemini produces a long sprawl of bullet points, and some smaller models just return three dense blocks of text. The same instruction produces behavior that cannot be held to engineering-grade consistency across different model brains.
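Two of these deviations can be checked mechanically once the execution trace is recorded. The sketch below is an assumption about how such a check could look, not fibon's actual probe: `ToolCall`, `audit_trace`, and the shape of the trace are hypothetical. It flags an omission (Way 1) when a required tool never appears in the trace, and flags a citation when the cited URL was never actually fetched.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolCall:
    name: str                  # tool that was actually invoked
    url: Optional[str] = None  # page fetched by this call, if any


def audit_trace(
    required_tools: List[str],
    trace: List[ToolCall],
    cited_urls: List[str],
) -> List[str]:
    """Compare what the Skill demanded with what the trace shows happened."""
    findings = []
    called = {call.name for call in trace}
    for tool in required_tools:
        if tool not in called:
            findings.append(f"omission: required tool '{tool}' was never called")
    fetched = {call.url for call in trace if call.url}
    for url in cited_urls:
        if url not in fetched:
            findings.append(f"unverified citation: '{url}' was cited but never fetched")
    return findings
```

Note the limit of this check: it catches a citation that has no matching fetch, but not an answer that fetched the right page and then ignored its contents. That second case needs the answer text compared against the fetched body, which is beyond this sketch.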

Stepping onto the honesty podium again: this division into five failure modes (omission / substitution / going-through-the-motions / lying / misinterpretation) is purely a taxonomy I induced from my own intuition, not some industry standard. In the past year or two there have been papers internationally (AgentSpec, Agent Behavioral Contracts) also tackling AI behavioral deviation — and Agent Behavioral Contracts’ use of contract, runtime enforcement, and recovery is actually quite close to my setup. The difference is in orientation: they lean toward formalization and theoretical proof (DSLs, drift upper-bound theorems, probabilistic compliance), while I lean toward directly shippable code patches, splitting failures into five classes and welding a patch onto each. The reason I cling to this “five-way split” is that it’s very useful in engineering practice — each failure mode maps neatly onto a concrete defensive patch you can plant in the code.

Laying the Foundation of Defense — A “Unified Registry” for Four Tool Sources

I’ve spent a while explaining why the LLM can’t be trusted. But notice that the focus of the problem has quietly shifted from “knowledge errors” to “untrustworthy behavior” — it’s not that it doesn’t know the answer, it’s that it won’t run the process by the rules. Once you see this, the next step shouldn’t be to keep tweaking the Prompt and pleading with it to behave; it should be to build a mechanism that can audit its behavior. And the first step of auditing is to answer a most basic question: at this moment, exactly which tools exist in the system that the AI might call?

Inside fibon’s brain, tools flow in from four different dimensions:

Tool-source dimension	What is it?	Examples
Built-in	Tools written directly in code at the system’s lowest level.	`read_page`, `send_email`, `list_available_models`.
AI Skill	A pure-text Prompt-instruction SOP package, written by the user or downloaded from an open-source marketplace.	`[organize PDF papers and file them]`, `[auto-write Git commit messages]`.
MCP external tool	External tools enumerated by a Model Context Protocol server, the protocol that became widespread in 2025.	`navigate_page`, `click_element`, etc., provided by Playwright MCP.
Workflow	A composite the user assembles in the frontend by dragging multiple tools into a pipeline.	`[dependency health check] → [analyze vulnerabilities] → [write summary] → [email the report]`.

These four dimensions have different registration mechanisms and different permission logic, and each manages versions in its own way. Let them grow on their own, and the downstream compliance audit would need four separate branches.

One Master Table, Flattening Four Dimensions into One

To untie this knot, fibon designed tool_registry (the unified tool registry master table). Whether built-in code, a community Skill, an external MCP, or a composite workflow, everything that comes in is formatted into a single row of this table:

-- Real definition from V1__Init_Schema.sql (excerpt; full columns in the original migration)
CREATE TABLE tool_registry (
    id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name              VARCHAR(100) NOT NULL UNIQUE,
    display_name      VARCHAR(200) NOT NULL,
    description       TEXT NOT NULL,
    category_id       UUID REFERENCES tool_categories(id) ON DELETE SET NULL,
    provider          VARCHAR(20) NOT NULL DEFAULT 'builtin',  -- builtin | skill | mcp | workflow
    provider_ref      TEXT,                                    -- pointer back to the original dimension's definition
    schema_json       JSONB NOT NULL DEFAULT '{}',
    requires_approval BOOLEAN NOT NULL DEFAULT FALSE,          -- whether a human must manually click to approve
    risk_level        VARCHAR(20) NOT NULL DEFAULT 'low',      -- low | medium | high | critical
    availability      VARCHAR(20) NOT NULL DEFAULT 'always',   -- always | condition | manual | deprecated
    embedding         vector(1536),                            -- embedding used for semantic tool search (1536 dims is the dev-time Gemini default; switching models means adjusting the dimension to match)
    is_enabled        BOOLEAN NOT NULL DEFAULT TRUE
    -- …plus description_hash (tamper detection), provider_version, and other AI-SBOM provenance columns
);

Two companion sub-tables hang off it: tool_categories (classification labels, with 11 core domains pre-defined by the system) and agent_tool_permissions (a three-tier fallback permission matrix). You can set a hard permission for a “single specific tool” → if unset it falls back to the “category-group permission” → if still unset it falls back to that Agent’s global default (allow_all or deny_all).
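The three-tier fallback reads naturally as a short lookup chain. The following is a minimal sketch under stated assumptions: the function name, the dictionary-based rule storage, and the `"allow"` / `"deny"` rule values are hypothetical; only the order of the fallback and the `allow_all` / `deny_all` defaults come from the text.

```python
from typing import Dict, Optional


def resolve_permission(
    tool_name: str,
    category: str,
    tool_rules: Dict[str, str],      # per-tool rules: "allow" or "deny"
    category_rules: Dict[str, str],  # per-category rules: "allow" or "deny"
    agent_default: str,              # "allow_all" or "deny_all"
) -> bool:
    """Return True if the agent may use the tool."""
    # Tier 1: an explicit rule on this specific tool wins.
    decision: Optional[str] = tool_rules.get(tool_name)
    # Tier 2: otherwise fall back to the rule on the tool's category.
    if decision is None:
        decision = category_rules.get(category)
    # Tier 3: otherwise use the agent's global default.
    if decision is None:
        return agent_default == "allow_all"
    return decision == "allow"
```

The most specific rule always wins, so a single `deny` on one tool can carve an exception out of an otherwise allowed category without touching the category rule.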

The “flattening” philosophy for external MCP tools: one standard MCP server (such as Playwright) exposes more than a dozen highly privileged sub-tools through one entry point: navigate, click, screenshot, fill_form, execute_js, and so on. Plenty of frameworks just record one server URL and ask it only when it’s time to call. fibon does not take this black-box approach: every time it connects to a new MCP server, the core brain (Brain, fibon’s decision-making backend service) immediately sends it the MCP standard protocol’s tools/list request, retrieves the full tool list, and then writes each sub-tool into tool_registry as its own independent row, “flattened.” Two payoffs: fine-grained control (you can allow Playwright’s navigate and screenshot while blocking execute_js and fill_form, rather than switching the whole server on or off); and a normalized audit trail (external MCP and built-in tools share the same risk_level, log columns, and manual-approval gate).
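As a rough illustration of the flattening step, the sketch below turns a tools/list result into one registry row per sub-tool. The `tools`, `name`, `description`, and `inputSchema` fields follow the MCP tools/list result shape; everything else is an assumption made for illustration, including the function name, the `server__tool` naming scheme, and the choice of which registry columns to fill.

```python
from typing import Any, Dict, List


def flatten_mcp_tools(server_name: str, tools_list_result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one MCP server's tools/list result into one registry row per tool."""
    rows = []
    for tool in tools_list_result.get("tools", []):
        rows.append({
            # Hypothetical naming scheme to keep names unique across servers.
            "name": f"{server_name}__{tool['name']}",
            "display_name": tool["name"],
            "description": tool.get("description", ""),
            "provider": "mcp",
            "provider_ref": server_name,
            "schema_json": tool.get("inputSchema", {}),
            "risk_level": "low",  # third-party default until declared otherwise
            "requires_approval": False,
            "is_enabled": True,
        })
    return rows
```

Because each sub-tool becomes its own row, disabling `execute_js` is an update to one row's `is_enabled` or one permission entry, with no change to the server connection itself.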

Tool Classification

When each tool is written into tool_registry, it must be bound to a core category, so that coarse-grained rules can sever a whole class of higher-risk tools in one cut:

browser (browser control): open web pages, fetch live data, automate scraping (navigate_page, take_screenshot).
memory (memory operations): read/write state cards and event cards, extract entities (learn_preference, search_memory; facts are auto-carded by the ingest pipeline, and there’s no longer an explicit remember_fact).
scheduling: create, delete, query, and update calendars, alarms, and timed tasks (add_schedule, delete_schedule).
delegation (hierarchical collaboration): the routing tools by which the Butler and the assistants delegate and report (delegate_to_assistant, spawn_agent).
a2a (cross-product communication): communicate with other teams’ AI assistants following the Google A2A protocol (call_external_agent).
evolution (self-evolution / source-code modification): the highest-risk class, allowing a specific Agent to read, write, and modify fibon’s own Python source, or issue control commands to Docker (write_source_file, docker_service_control).
dynamic_entity (dynamic entity tables): allows the Butler to open a brand-new business data table in PostgreSQL for the user without restarting the system (create_dynamic_entity).
workflow (composite-flow engine): drives the automation pipelines the user pre-orchestrated (run_workflow).
onboarding (onboarding & key setup): does preference discovery and provider API-key connection tests when a new user logs in for the first time.
sandbox (sandboxed execution): execution-class tools that run unknown code in an isolated Docker container (execute_sandbox, currently exclusive to the coding assistant).
model_management: responsible for querying available models at runtime and dynamically hot-swapping the underlying model for the Butler/assistant (list_available_models, switch_model). These last two were actually seeded in later — tool_sync.py had been using them all along, but the original schema missed them, and they were only added to the default category table during the Butler-flow cleanup.

How Is a Tool’s Risk Level Assessed?

risk_level has four tiers: low / medium / high / critical, using a hybrid strategy of “hard-coded code defense + dynamic frontmatter.” Core built-in tools hard-code an explicit permission-matrix dictionary right in tool_sync.py (TOOL_RISK_LEVELS), with one and only one criterion: “how big is the real-world side effect after execution”:

critical: write_source_file (the AI can modify fibon’s own core code; one malicious line and the whole project is compromised).
high: docker_service_control (the AI has the power to stop containers and restart entire backend microservices).
medium: create_dynamic_entity, switch_model, run_workflow (real side effects, but the blast radius is confined to the user’s own account).
low: all other purely read-only, purely internal-query, purely computational tools (read_pdf, calculate_metrics).

Third-party Skills and MCP tools default to the lowest risk, low; but if a Skill developer actively annotates it in the settings block (YAML frontmatter) at the top of SKILL.md, or an MCP server declares it explicitly in its metadata, the backend will raise the level accordingly when flattening it into the registry (UPSERT).
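The hybrid strategy can be sketched as one lookup with two branches. The dictionary name TOOL_RISK_LEVELS and its entries come from the text above; the function `resolve_risk_level`, its parameters, and the handling of an unrecognized declared value (treated here as `low`) are assumptions for illustration.

```python
from typing import Optional

VALID_LEVELS = ("low", "medium", "high", "critical")

# Built-in tools: explicit, hand-maintained mapping (entries taken from the text).
TOOL_RISK_LEVELS = {
    "write_source_file": "critical",
    "docker_service_control": "high",
    "create_dynamic_entity": "medium",
    "switch_model": "medium",
    "run_workflow": "medium",
}


def resolve_risk_level(tool_name: str, provider: str, declared: Optional[str] = None) -> str:
    """Built-ins use the hard-coded table; Skills and MCP tools use their declared level."""
    if provider == "builtin":
        # Anything not listed is treated as read-only or internal, so low.
        return TOOL_RISK_LEVELS.get(tool_name, "low")
    if declared in VALID_LEVELS:
        # Level declared in SKILL.md frontmatter or MCP server metadata.
        return declared
    return "low"  # third-party default when nothing valid is declared
```

One consequence worth keeping in view: a third-party tool's level is only as honest as its author's declaration, which is why the permission matrix and the approval gate exist as separate checks.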

Welding the risk level to the manual-approval gate: honestly, there’s no magic rule here that “risk_level ≥ high automatically flips requires_approval.” What sits in tool_sync.py is a second hand-written explicit dictionary, TOOL_REQUIRES_APPROVAL, in which currently only write_source_file and docker_service_control are explicitly marked True, maintained independently of the TOOL_RISK_LEVELS dictionary. I deliberately didn’t wire up an automatic link, because “which tools require human approval” is a security-policy decision, and I would rather be forced to hand-write one more line in the dictionary every time I add a high-risk tool, leaving a Git record, than let an implicit rule make the decision for me when I’m not looking. Once a tool is marked requires_approval=true, no matter which AI wants to call it thereafter, the whole execution flow freezes before the API request goes out, until the frontend pops up an approval dialog and you press the button by hand to release it (see Chapter 5, “Human review in the loop, Human-in-the-loop”).
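A minimal sketch of that gate, assuming a hypothetical `execute_tool` wrapper and `ApprovalRequired` exception: the dictionary name and its two entries come from the text, while the control flow is only one plausible way to freeze a call before it goes out.

```python
from typing import Any, Callable

# Hand-maintained, deliberately independent of the risk-level table.
TOOL_REQUIRES_APPROVAL = {
    "write_source_file": True,
    "docker_service_control": True,
}


class ApprovalRequired(Exception):
    """Raised when a gated tool is called without a human sign-off."""


def execute_tool(tool_name: str, run: Callable[[], Any], approved_by_human: bool = False) -> Any:
    """Run the tool only if it is ungated or a human has approved this call."""
    if TOOL_REQUIRES_APPROVAL.get(tool_name, False) and not approved_by_human:
        # Stop before any side effect; the caller surfaces the approval dialog.
        raise ApprovalRequired(tool_name)
    return run()
```

The check sits in front of `run()`, so the gated tool's side effect cannot happen on the unapproved path, whatever the model claims about having asked permission.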

Breaking Through the Noise of Too Many Options — The Stratified RAG Algorithm for Tool Selection

With tool_registry and the embedding column inside it, we can finally solve the problem that gives every Agent framework a headache: “when the system has 50, or even 500, tools plugged in, how do you precisely pick out the small handful most relevant (fibon’s TOOL_RAG_TOP_K defaults to 8) to put into the precious System Prompt?” First, the common industry approaches:

Approach A: brute-force load everything. At startup, stuff the complete JSON Schema of every tool into the Prompt. The cost: 50 tools at 200 tokens each burns 10k tokens before you’ve said a word.
Approach B: two-stage dynamic loading. First stuff in “a one-line description of every tool” for the AI to choose from, then supply the full Schema after it chooses. The cost: 50 one-liners is still 2.5k tokens of bloat, and irrelevant tools interfere with the AI’s judgment.
Approach C: pure vector semantic retrieval. Turn the question into a vector and pick the top N. The cost: not stable enough (swap a synonym and the tool list changes), and no diversity guarantee — the top 5 might all be crammed into the same browser category.

All three bring two problems: insufficient stability, and the noise of mid-context information being ignored (Lost in the Middle). To break through this, fibon designed the stratified diversity recall algorithm (Stratified RAG).

Stratified RAG’s 5-Step Filter Flow

Hard database-state filter: pull tools where is_enabled = true and the availability condition is met (e.g., checking whether the MCP server behind the tool is currently healthy).
Three-tier permission-matrix decision: consult agent_tool_permissions: if the tool level (Tool-level) is set to deny, remove it → if unset, look at the category level (Category-level) → if neither is set, fall back to the Agent’s global default (deny_all annihilates it).
Role-function mutual-exclusion filter: the Butler removes report_to_butler (that’s for assistants to report with); assistants remove delegate_to_assistant. But to be clear, this layer of tool removal is only surface-level protection; what really stops assistants from spawning endlessly downward is the depth and round-count limits on delegation (an assistant’s max_delegation_rounds is forcibly zeroed within a sub-flow, and the default round-trip limit on delegation is 3 — see Chapter 2). Both mechanisms coexist.
Precise vector-similarity scoring: compute the Embedding similarity between the cleaned candidate tools and the user’s question, yielding a “tool ↔ question relevance score.”
Stratified (group-by-category) diversity filter: group the candidate tools by category, and pick only the highest-scoring one from each group as “that category’s representative” (eligible only if the raw semantic score is ≥ 0.1). The representative tools get first dibs on the Top-K slots (TOOL_RAG_TOP_K defaults to 8), ensuring the tools finally fed into the Prompt span multiple functional categories. After that first round, any remaining slots are filled by the next-highest scorers within each category in pure-semantic-score order.
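Step 2's three-tier fallback can be sketched as a small resolver. This is a hedged illustration: the function name, the "allow" / "deny" rule strings, and the "allow_all" default are assumptions, and the sketch also assumes that an explicit tool-level allow overrides a category-level deny, which the text does not spell out. Only the lookup order and deny_all come from the description above.

```python
from typing import Optional

ALLOW, DENY = "allow", "deny"  # assumed rule values


def resolve_permission(
    tool_rule: Optional[str],
    category_rule: Optional[str],
    agent_default: str,
) -> bool:
    """Return True if the tool stays in the candidate set.

    The most specific rule that is set wins: tool level first,
    then category level, then the agent's global default.
    """
    for rule in (tool_rule, category_rule):
        if rule is not None:
            return rule == ALLOW
    # Neither level is set: only a global deny_all removes the tool.
    return agent_default != "deny_all"
```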

A Live Walk-through: “Help me research this contract” (illustrative: for readability, the demo below uses 5 slots, while the actual default is 8; the tool names and categories in the table below are illustrative, and “fetch / parse external content” tools like `read_pdf` and `extract_text` are all filed here under the browser category)

Suppose Aaron issues a composite instruction: “Help me research this contract PDF, distill the key points — I need them for a meeting shortly.” The semantic-similarity ranking inside tool_registry looks like this (the table below is sorted by raw score, high to low):

Tool name	Category	Function	Semantic score	Filter result
`read_pdf`	browser	read PDF body text	0.82	selected (browser representative)
`extract_text`	browser	extract plain-text passages	0.79	selected (second-round fill-in)
`search_within_doc`	browser	keyword search within the document	0.74	removed (browser already has a representative)
`summarize_text`	builtin	summarize text content	0.71	selected (builtin representative)
`find_clause`	browser	locate contract clauses	0.68	removed (browser already has a representative)
`recall_facts`	memory	retrieve relevant memory	0.51	selected (memory representative)
`add_schedule`	scheduling	add a calendar entry	0.45	selected (scheduling representative)
`set_reminder`	scheduling	set a reminder alarm	0.42	removed (scheduling already has a representative)

In other words, two rounds: the first round lets the four categories (browser / builtin / memory / scheduling) each nominate their highest-scoring representative, taking 4 slots; the second round leaves the remaining 5th slot for the highest-scoring tool among all the rejects, extract_text, to fill in. The final selection is these 5.
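The two rounds can be sketched in a few lines of Python. This is a minimal illustration, not fibon's actual code: Candidate, stratified_select, and flat_top_k are hypothetical names, and the sketch assumes the 0.1 threshold applies only to category representatives and that the second round fills in pure score order. The scores are the illustrative ones from the table above.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    category: str
    score: float


def flat_top_k(cands: list[Candidate], top_k: int) -> list[str]:
    """Pure semantic retrieval: the top_k highest scores, nothing else."""
    ranked = sorted(cands, key=lambda c: c.score, reverse=True)
    return [c.name for c in ranked[:top_k]]


def stratified_select(
    cands: list[Candidate], top_k: int = 8, min_score: float = 0.1
) -> list[str]:
    ranked = sorted(cands, key=lambda c: c.score, reverse=True)
    # Round 1: the highest scorer of each category becomes its representative.
    seen: set[str] = set()
    selected: list[Candidate] = []
    for c in ranked:
        if c.category not in seen and c.score >= min_score:
            seen.add(c.category)
            selected.append(c)
    selected = selected[:top_k]
    # Round 2: fill any remaining slots with the best of the rest.
    chosen = {c.name for c in selected}
    for c in ranked:
        if len(selected) >= top_k:
            break
        if c.name not in chosen:
            selected.append(c)
            chosen.add(c.name)
    return [c.name for c in selected]


WALKTHROUGH = [
    Candidate("read_pdf", "browser", 0.82),
    Candidate("extract_text", "browser", 0.79),
    Candidate("search_within_doc", "browser", 0.74),
    Candidate("summarize_text", "builtin", 0.71),
    Candidate("find_clause", "browser", 0.68),
    Candidate("recall_facts", "memory", 0.51),
    Candidate("add_schedule", "scheduling", 0.45),
    Candidate("set_reminder", "scheduling", 0.42),
]
```

With top_k=5, stratified_select(WALKTHROUGH, top_k=5) returns the four representatives followed by extract_text, while flat_top_k(WALKTHROUGH, 5) returns four browser tools plus summarize_text.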

Contrasting two retrieval philosophies: with industry Approach C (pure semantic top 5), the LLM sees read_pdf + extract_text + search_within_doc + summarize_text + find_clause. Four of the top 5 are crammed into browser, diversity is lacking, and the user’s implicit intent in “I need them for a meeting shortly” gets drowned out. With fibon’s Stratified RAG, the LLM sees read_pdf (browser) + summarize_text (builtin) + recall_facts (memory) + add_schedule (scheduling) + extract_text (fill-in). add_schedule scores only 0.45 on pure semantics, but because it is the highest within the scheduling group, it earns a slot via the stratification mechanism. The AI simultaneously sees tools across four dimensions (“read the document, distill a summary, query memory, add a calendar entry”) and can write out a complete, continuous plan in one shot.

The trade-offs of Stratified RAG: the gains — token consumption is decoupled from the total tool count (whether 50 or 500,000, you only ever stuff in the Top-K most relevant, default 8), cross-domain diversity is guaranteed, and search stability improves markedly. The costs — heavy dependence on the underlying Embedding quality (write a poor tool description and the grouping can only pick representatives from poor candidates); the 11 categories are human-defined, so a future re-slicing means a small data migration; and in an extremely single-purpose task it sacrifices the “absolute correct answer” (the user says “I just want you to click these 20 buttons on this web page,” which really only needs browser, yet Stratified RAG stuffs a scheduling representative into the top 5, becoming noise). I accept the third cost because fibon’s positioning is clear: it’s a long-term-companion personal assistant, and 95%+ of daily instructions are emotional, composite, and cross-domain — in that scenario, diversity’s inclusiveness clearly beats flat RAG.

A Thought Exercise — Why Not Just Break Complex Tasks into Multi-Round ReAct?

Peers familiar with Agent frameworks will surely push back: “That contract-PDF-plus-calendar case — isn’t the industry-orthodox move to just split it into multi-round ReAct (Reasoning and Acting) and handle it elegantly? Round one has a single intent, so the vector naturally pulls up a browser tool to read the PDF; round two discovers the meeting constraint, so it naturally recalls a scheduling tool to write the calendar. Every step is focused, every step is pure RAG with a single intent. Why bother with such a heavy unified registry and Stratified RAG?”

This is a question with depth. Many frameworks (LangChain, CrewAI) are built exactly this way at the bottom, splitting composite tasks into multi-round ReAct loops. But fibon’s worldview is in structural conflict with that philosophy.

Cost comparison under concurrency: multi-round ReAct vs. single-round holistic planning:

Ops comparison	Industry mainstream: multi-round ReAct loop	fibon: Stratified RAG + single-round holistic planning
LLM core call count	2 round trips × (each one’s thinking + tool execution + synthesis) = ~4+ calls	1 comprehensive plan + 1 final integration = exactly 2 calls
System Prompt compute cost	every round re-pays the cloud to load the tool list and history	pay once, done in one shot, reliably hitting the cloud cache
Frontend user wait experience	must wait for the browser tool to return before the AI slowly thinks the next step (single-threaded, sequential)	asynchronous — the browser downloads the PDF while the calendar is processed in the background in parallel
Overall plan coherence	by the fifth round of conversation, the AI easily forgets the constraint from the opening	all tools, preferences, and constraints are thought through at once in the same context, so the plan stays coherent end to end

In a real personal-assistant scenario, suppose a user assigns 100 such composite tasks a week; the cumulative token-bill difference between multi-round ReAct and single-round holistic planning can reach 3 to 5 times.

The Core Design Intent: Gather All Information and Actions at Once

The goal behind Stratified RAG is a single sentence: “within the physical boundaries of engineering, let the AI behave like a senior architect, marshaling all the information and parallel actions the task needs in one go up front, instead of having it shuttle back and forth inefficiently with the LLM in the background.” This brings two clear benefits. First, concurrent (asynchronous) execution becomes possible: when fibon’s brain, in a single Prompt, sees read_pdf, summarize_text, recall_facts, and add_schedule at one glance, it launches Plan-Execute and automatically orchestrates the four actions’ causal dependencies into a directed acyclic graph (DAG):

           [ user inputs a composite assistant instruction ]
                                   │
                                   ▼
                        [ run Stratified RAG ]
          (one-shot precise recall of tools across 4 domains)
                                   │
                                   ▼
           [ Plan-Execute generates the DAG parallel graph ]
                                   │
              ┌────────────────────┴────────────────────┐
              ▼ (no dependency at all, run in parallel) ▼
  ┌────────────────────────┐                ┌────────────────────────┐
  │ action A: read_pdf     │                │ action B: add_schedule │
  └───────────┬────────────┘                └────────────────────────┘
              │ data persisted (Data Shared)
              ▼
  ┌────────────────────────┐
  │ action C: summarize    │
  └───────────┬────────────┘
              │ summary produced (Context Fed)
              ▼
  ┌────────────────────────┐
  │ action D: recall_facts │
  └────────────────────────┘

Action A (read_pdf) and action B (add_schedule) have no logical ordering dependency → handed to the async engine so both enter the Worker sandbox and run in parallel; action C (summarize_text) must wait for A’s PDF body text to persist; action D (recall_facts) must wait for C’s summary. Together with the Kotlin coroutine scheduler and Gateway scheduling engine from Chapters 2 and 6, the wait the user actually feels is merely “the time of the longest dependency chain, downloading the PDF” — the calendar was written in the background at the same time. This kind of concurrency is impossible in multi-round ReAct, because the “set a reminder” intent isn’t first seen by the multi-round ReAct brain until the second or third round, by which point the moment to parallelize is long gone.
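The dependency-driven scheduling described above can be sketched with Python's asyncio. This is only an analogy under stated assumptions: fibon's real scheduler is the Kotlin coroutine and Gateway machinery mentioned in the text, and PLAN, execute_dag, and fake_action are hypothetical names invented here. The dependency edges are the ones from the walkthrough.

```python
import asyncio
from typing import Awaitable, Callable

# node -> the nodes it must wait for (edges taken from the walkthrough)
PLAN: dict[str, list[str]] = {
    "read_pdf": [],
    "add_schedule": [],
    "summarize_text": ["read_pdf"],
    "recall_facts": ["summarize_text"],
}


async def execute_dag(
    plan: dict[str, list[str]],
    run_action: Callable[[str], Awaitable[str]],
) -> dict[str, str]:
    """Start every node at once; each node awaits its own dependencies."""
    tasks: dict[str, asyncio.Task] = {}

    async def run_node(name: str) -> str:
        # Block until every upstream node has finished, then run this one.
        await asyncio.gather(*(tasks[dep] for dep in plan[name]))
        return await run_action(name)

    for name in plan:
        tasks[name] = asyncio.ensure_future(run_node(name))
    results = await asyncio.gather(*tasks.values())
    return dict(zip(tasks, results))


async def fake_action(name: str, log: list[str]) -> str:
    """Stand-in for a real tool call; sleeps briefly and records its timing."""
    log.append(f"start {name}")
    await asyncio.sleep(0.01)
    log.append(f"done {name}")
    return f"{name}:ok"
```

Running asyncio.run(execute_dag(PLAN, lambda n: fake_action(n, log))) with an empty list log shows add_schedule starting before read_pdf finishes, while summarize_text and recall_facts wait their turn.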

The dividing line between the two philosophies: the philosophy of multi-round ReAct is “the LLM is essentially a conversational partner keeping a human company, prioritizing the continuity of interaction and willing to sacrifice efficiency, cost, and parallelism for it”; the philosophy of Stratified RAG + single-round holistic planning is “the LLM in production is a one-shot system planner, prioritizing efficiency, cost, and high concurrency to the max.” Neither is absolutely right or wrong — for a tutoring robot that helps kids with homework, or for e-commerce customer service, multi-round ReAct fits better. But fibon chose the latter from day one: the user treats fibon as a private secretary who keeps their life in good order. This is a choice made with an explicit scenario assumption, not a universal solution — but it’s the more rational engineering answer for a personal assistant.

Think again: do you really have to pick one of the two philosophies? I wrote it above as either/or to make the contrast clear; in real deployment they aren’t necessarily mutually exclusive. A natural compromise is: before entering Stratified RAG, place a lightweight “task-intent classifier (Intent Classifier)” that judges whether this is a “single-domain deep dig” or a “cross-domain composite task,” then dynamically adjusts the stratified-retrieval weight allocation accordingly: composite tasks keep strict cross-domain diversity, while single-domain tasks relax diversity and let the same category take more slots. This neatly patches the earlier edge case of “I just want to click these 20 buttons on this web page,” which doesn’t need a scheduling representative padding it out.

But the point of “think again” is in the second half: would this classifier itself become a new untrustworthy link? The whole thesis of this chapter is “don’t trust the LLM’s judgment,” and if the classifier is an LLM, it too will misjudge, see a composite task as single-domain, and end up cutting away the diversity it should have. So if you really build it, there are three preconditions: it must be cheap enough (it can’t burn more than it saves in tokens), its judgments must be auditable after the fact, and it must be fail-safe — i.e., when unsure it always falls back to full Stratified RAG, preferring to stuff in one unused tool rather than miss a key cross-domain dimension.
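The fail-safe precondition is the easiest of the three to pin down in code. The sketch below is hypothetical throughout, matching the fact that the classifier itself is only a hypothesis: IntentGuess, choose_retrieval_mode, the mode names, and the 0.9 threshold are all assumptions. The one property it is meant to show is that every uncertain path ends in full Stratified RAG.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntentGuess:
    label: str         # assumed labels: "single_domain" or "cross_domain"
    confidence: float  # 0.0 to 1.0


def choose_retrieval_mode(
    guess: Optional[IntentGuess], min_confidence: float = 0.9
) -> str:
    """Relax diversity only for a confident single-domain guess.

    A missing guess (classifier failed or timed out), an unknown label,
    or low confidence all fall back to full stratified retrieval.
    """
    if guess is None:
        return "stratified"
    if guess.label == "single_domain" and guess.confidence >= min_confidence:
        return "relaxed"
    return "stratified"
```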

I put this question here not because I already have an answer, but because it neatly connects the chapter’s two main threads: the efficiency of tool selection, and “how to safely use an untrustworthy component.” This is a direction worth validating later in fibon; for now it’s just a hypothesis.

Ops Visualized — The Real Flow of Skill Registration and MCP Auto-Discovery

Here are two paths laid bare: manually importing a .md Skill package (Path A), and an MCP server’s cold-start auto-discovery and flattening (Path B):

sequenceDiagram
  actor Admin as System Admin (Aaron)
  participant FE as Frontend (Astro/Vue)
  participant GW as Gateway
  participant BR as Core Brain (Brain Engine)
  participant MCP as External MCP Server
  participant DB as PostgreSQL Master DB
  rect rgb(238, 244, 250)
      Note over Admin,DB: Path A: manually import a .md Skill package
      Admin->>FE: upload skills.zip
      FE->>GW: POST /admin/skills/import-zip
      GW->>GW: parse YAML frontmatter<br/>+ global SHA-256 dedup
      GW->>DB: INSERT into candidate table (skill_import_candidates, status='PENDING')
      Admin->>FE: review and manually approve
      FE->>GW: PATCH /skills/{id}/approve
      GW->>BR: trigger the brain to generate the tool-description Embedding vector
      BR->>DB: UPSERT tool_registry (provider='skill') + persist vector
  end
  rect rgb(251, 243, 224)
      Note over Admin,DB: Path B: external MCP server cold-start auto-discovery
      Admin->>FE: enter and add an MCP Server URL
      FE->>GW: POST /admin/mcp-servers
      GW->>BR: order the brain to launch capability discovery discover(server_url)
      BR->>MCP: call the MCP standard protocol list_tools() (tools/list)
      MCP-->>BR: return the full tool definitions (name + schema + description)
      BR-->>GW: the brain's stripped-down, flattened list of individual tools
      GW->>DB: batch UPSERT tool_registry (provider='mcp', provider_ref='server:tool')
      Admin->>FE: approve specific discovered MCP individual tools
      FE->>GW: PATCH /skills/{id}/approve
  end
  rect rgb(240, 245, 232)
      Note over BR,MCP: Runtime: trust routing and the defensive net for MCP tools
      BR->>BR: the ToolRAG algorithm hits a certain MCP tool
      BR->>BR: look up the database security attribute, check trust_level
      alt judged trusted tool (Trusted: Anthropic/Microsoft official, or self-hosted)
          BR->>MCP: open the direct fastpath, the brain sends the tool call directly
      else judged untrusted third-party tool (Untrusted: unknown open-source community Marketplace)
          BR->>GW: intercept and reroute as gRPC ExecuteMcpToolCall to the relay layer
          Note right of GW: the Gateway dispatches an external Worker sandbox<br/>to relay the call inside the isolated network
      end
  end

Three architectural details worth mentioning: first, unified convergence — no matter how inconsistent the upstream tool sources are (Path A’s Markdown zip or Path B’s MCP network protocol), at the destination they all persist into the same tool_registry, so the downstream runtime audit code only needs to be written once, sparing a mountain of conditional branching (if-else). Second, the concrete realization of physical flattening — in Path B, the core brain (Brain in the diagram above) calls list_tools() against the MCP and gets back a whole string of tools, and the Gateway (Gateway above) splits them into independent entities and writes them one by one before persisting to the database. Third, the hand-off to Chapter 6’s “isolated sandbox (DMZ)” — at runtime, the moment trust_level reveals it’s an untrusted third party, the core brain is cut off from its direct MCP connection and instead routes the request to the isolated sandbox network over gRPC.
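The flattening and the trust routing can be sketched together. Treat this as a rough illustration: provider='mcp' and the 'server:tool' shape of provider_ref come from the diagram above, while RegistryRow, flatten_mcp_tools, route_call, the trust_level strings, and the route names are assumptions. The sketch also assumes that any unrecognized trust_level is sent to the sandbox relay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryRow:
    provider: str
    provider_ref: str
    name: str
    description: str
    trust_level: str


def flatten_mcp_tools(
    server_name: str, tools: list[dict], trust_level: str
) -> list[RegistryRow]:
    """Turn one list_tools() response into one registry row per tool."""
    return [
        RegistryRow(
            provider="mcp",
            provider_ref=f"{server_name}:{tool['name']}",
            name=tool["name"],
            description=tool.get("description", ""),
            trust_level=trust_level,
        )
        for tool in tools
    ]


def route_call(row: RegistryRow) -> str:
    """Only an explicitly trusted tool gets the direct connection."""
    if row.trust_level == "trusted":
        return "direct_fastpath"
    return "sandbox_relay"
```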

To be honest, this architecture was forced out of me by reality. When I wrote the first line of code in early 2026, it wasn’t in the original blueprint at all; it’s a patch forced out by chaos. In the mid-development stretch when I plugged in the fourth source (composite workflows), the backend permission system completely lost control: built-in tool permissions were hard-coded deep inside graph.py, Skill permissions were written in the settings block (YAML frontmatter) at the top of each file, external MCP permissions all depended on whatever state the server returned, and composite workflows didn’t even have a database column for write permissions. Four dimensions each going their own way — a user opening the frontend panel couldn’t even tell which tools this Agent actually had. One late night, staring at the tangled-up code, I realized “this fractured line of defense will never converge,” and I began a large refactor, cutting away the past few months’ workaround branches and folding all four dimensions into the one and only tool_registry. Chapter 8 fully revisits this “grown while building” story — the unified tool registry wasn’t a designed-from-the-start foundation; it was the “fifth cornerstone” forced out after the architecture took a real-world hit.

Three-Layer Skill Compliance Audit Architecture: Catch Violations On the Spot with Code

With the foundation (the unified tool registry) solid, we can finally tackle this chapter’s thorniest part head-on: “the AI intern again ignores the SOP, skips the tool, even forges the source — how does the backend catch it on the spot with code?” fibon’s answer is a three-layer compliance net, where every layer is guarded by code, not by AI self-discipline.

First Line of Defense: Turn the SOP into a Contract the Program Can Check Clause by Clause (Skill Contract)

For every third-party Skill registered into the system, the backend pairs a structured, strictly-typed JSON / YAML specification, called a “contract” in the code. Each clause is normalized into a check rule the program can make a 0/1 binary right-or-wrong judgment on:

{
  "skill_name": "fetch_anthropic_models",
  "contracts": {
    "must_call_tools": ["read_page"],
    "tool_call_min_frequency": 1,
    "enforced_url_prefix_whitelist": ["https://docs.anthropic.com"],
    "strict_citation_required": true,
    "prohibited_actions": ["reply_with_cutoff_memory_directly"]
  }
}

There’s no vague wiggle room like “please consult when appropriate” — everything is a hard boundary that errors out on mismatch.

Second Line of Defense: Compare Results at Runtime with a Probe (Runtime Probe)

In the one-thousandth of a second after the AI finishes executing and is about to output text, the backend’s compliance verifier (Runtime Probe) intercepts, takes the contract clauses, and compares them clause by clause against the AI’s just-finished execution trace (Trace Logs):

Hard contract clause	The AI’s real execution trace	Probe verdict
Must call `read_page` ≥ 1 time	tool-call count: 0 (skipped, output directly)	Violation
Fetched URL prefix must be on the Anthropic whitelist	URL-return record is empty	Violation
The final answer must turn on the Citation chain	the body text is found to contain a fabricated Markdown annotation	Violation

If any one clause fails, the verifier flags it before the text ever reaches the user, leaving the LLM no room to bluff through by pretending it complied.
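A clause-by-clause check of this kind can be sketched as follows. The contract keys are the ones from the JSON example earlier; ToolCall, url_allowed, check_contract, and the trace shape are hypothetical, and prohibited_actions is left unchecked because the text does not say how it is detected. The whitelist check compares scheme and host rather than doing a raw string-prefix match, which is a design choice of this sketch.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ToolCall:
    tool: str
    url: str = ""


def url_allowed(url: str, prefixes: list[str]) -> bool:
    """Match scheme and host exactly, then the path as a prefix."""
    u = urlsplit(url)
    for prefix in prefixes:
        p = urlsplit(prefix)
        if (u.scheme, u.netloc) == (p.scheme, p.netloc) and u.path.startswith(p.path):
            return True
    return False


def check_contract(
    contract: dict, trace: list[ToolCall], answer_has_citation: bool
) -> list[str]:
    """Return the names of violated clauses; an empty list means compliant."""
    rules = contract["contracts"]
    violations: list[str] = []
    min_calls = rules.get("tool_call_min_frequency", 1)
    for tool in rules.get("must_call_tools", []):
        if sum(1 for call in trace if call.tool == tool) < min_calls:
            violations.append(f"must_call_tools: {tool}")
    prefixes = rules.get("enforced_url_prefix_whitelist", [])
    if prefixes:
        urls = [call.url for call in trace if call.url]
        # An empty URL record counts as a violation, as in the table above.
        if not urls or not all(url_allowed(u, prefixes) for u in urls):
            violations.append("enforced_url_prefix_whitelist")
    if rules.get("strict_citation_required") and not answer_has_citation:
        violations.append("strict_citation_required")
    return violations
```

A raw startswith on the whole URL would accept a lookalike host such as docs.anthropic.com.evil.example, which is why the sketch parses the URL first.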

Third Line of Defense: Random Post-Hoc Audit (Post-Hoc Audit)

What if this AI gets better at finding loopholes and smoothly passes the second layer — it really did call read_page, and what it fetched really was Anthropic’s official site, but when synthesizing the answer it still fabricated from pre-cutoff old memory and forged a citation by randomly tagging some passage of the web page? fibon adds two more post-hoc lines of defense:

A. Citation tracer text comparator (Citation Tracer): at answer-finalization time, a text comparison is forced — comparing the key model names, prices, and date stamps in the LLM’s response against the raw HTML text it just fetched via read_page, at multiple points. If the AI’s answer says “the latest Sonnet 4.6 input price is $15 / $75,” and the comparator checks the raw HTML and finds 15 and 75 never appear in it at all, citation forgery is established on the spot and the answer is rejected outright.

Think again: how do you actually build this text comparator? “Take the numbers in the answer and search the original text” sounds like a one-line contains(), but the moment you start building it you find both directions go wrong. False negatives: the answer says $15, but the original might read 15.00, US$15, 15 dollars, or be buried in some cell of a table, and a literal search matches none of them. False positives: a technical document is packed with numbers, and 15 is almost certain to appear somewhere — a hit doesn’t mean it corresponds to “Sonnet 4.6’s input price”; it could just be a token count in another column.

So the comparator actually has to do three things. First, extraction: break the answer into “checkable atomic assertions,” i.e., (subject, attribute, value) triples, for example (Sonnet 4.6, input price, 15). Second, normalization: unify numbers, currency symbols, full-width/half-width forms, and thousands separators into the same format before comparing, eliminating “false negatives caused by pure format differences.” Third, contextual comparison: the program doesn’t just check “does 15 appear,” but checks “does 15 fall within the small window of the original text where Sonnet 4.6 and input also both appear,” using the subject and attribute to pin down the hit location and block false positives. Note this whole step is the program doing string comparison; it hands the verdict to no LLM. But it has a premise: it assumes the AI answered roughly in the original’s wording. The value itself (a number like 15, a model name, a date) is hard to rephrase, so it’s catchable; the trouble is when the anchors, the subject and attribute, get paraphrased by the AI (“Sonnet 4.6” written as “the latest Sonnet,” or “input” rendered in another language, such as the Chinese “輸入”), or a table is broken apart and reorganized into prose, in which case the window comparison may fail to find the corresponding location and misjudge. To plug this gap, you’d hang a layer of synonym/alias normalization on the anchors before comparing, which connects to the fundamental limit discussed in the next paragraph.
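The normalization and contextual-comparison steps can be sketched with the standard library alone. This is a simplified illustration under several assumptions: extraction into triples is taken as already done, the 120-character window is an arbitrary choice, only dollar signs are stripped, and normalize and assertion_supported are hypothetical names. It performs no alias normalization on the anchors, so it shares the paraphrase weakness described above.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold format differences so that only real mismatches remain."""
    text = unicodedata.normalize("NFKC", text).lower()  # full-width to half-width
    text = re.sub(r"(?:us)?\$", "", text)               # drop "$" and "US$"
    text = re.sub(r"(?<=\d),(?=\d{3})", "", text)       # 1,000 -> 1000
    text = re.sub(r"(\d+)\.0+(?!\d)(?!\.\d)", r"\1", text)  # 15.00 -> 15
    return text


def assertion_supported(
    subject: str, attribute: str, value: str, source: str, window: int = 120
) -> bool:
    """True if the value occurs near both anchors in the source text."""
    src = normalize(source)
    subj, attr, val = normalize(subject), normalize(attribute), normalize(value)
    if not val:
        return False
    # The value must not be part of a longer number such as 150 or 1.15.
    pattern = re.compile(rf"(?<![\d.]){re.escape(val)}(?!\d)(?!\.\d)")
    for match in pattern.finditer(src):
        context = src[max(0, match.start() - window): match.end() + window]
        if subj in context and attr in context:
            return True
    return False
```

A plain "15" in source check would accept any page that mentions 15 anywhere; here the hit only counts when the subject and attribute sit in the same window.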

There’s a temptation here: wouldn’t it be faster to just have a second LLM check it for you? But that amounts to using “the untrustworthy” to verify “the untrustworthy,” looping right back to the dead end from the start of this chapter. My compromise is: you can let the LLM do “localization” (mark which sentence of the answer corresponds to which passage of the original), but the final “right/wrong” must always be the program’s hard verdict via normalization plus contextual comparison; the LLM has no adjudication power. And let me be honest about the boundary: this only catches “hard facts that can be compared verbatim” (numbers, model names, date stamps); it’s powerless against paraphrased, summarized argumentative sentences — those can only be flagged as “cannot be auto-verified,” left to the post-hoc audit below or human review.

B. Late-night distributed reproducibility sampling (Reproducibility Sampling): at night when the system is under low load, an ops timer automatically samples 10% of completed conversations at random from the history store, throws them into isolated sandbox Workers to re-run, and compares whether the decision DAG and card trace match across the two runs. If a Skill, given the same input, produces plan A the first time but an inconsistent plan B on the re-run, it is immediately flagged as a stability hazard, automatically downgraded and locked, and a warning is sent to the admin.
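A bare-bones version of the sampling and comparison might look like the sketch below. The 10% rate comes from the text; everything else is an assumption, including the function names and the choice to represent a plan as a set of (from, to) edges so that edge order does not matter. The actual re-run inside sandbox Workers is out of scope here.

```python
import random
from typing import Iterable, Optional, Sequence


def sample_for_replay(
    conversation_ids: Sequence[int],
    rate: float = 0.10,
    rng: Optional[random.Random] = None,
) -> list[int]:
    """Pick roughly `rate` of the conversations, at least one if any exist."""
    if not conversation_ids:
        return []
    rng = rng or random.Random()
    k = max(1, round(len(conversation_ids) * rate))
    return rng.sample(list(conversation_ids), k)


def is_reproducible(
    first_run_edges: Iterable[tuple[str, str]],
    rerun_edges: Iterable[tuple[str, str]],
) -> bool:
    """Two plans match when they contain exactly the same dependency edges."""
    return frozenset(first_run_edges) == frozenset(rerun_edges)
```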

After Catching the AI Violating, How Do You Intervene Gently in Three Tiers?

Throwing a screen-filling crash-red error (Error 500) the moment non-compliance is found would be a terrible experience. Borrowing the spirit of graceful degradation and stepwise backoff from distributed systems, fibon designed a “three-tier gentle intervention and escalation strategy”:

[ Runtime Probe detects an AI behavioral violation ]
                   │
                   ▼
     [ 🟢 Tier 1: message warning (Prompt Injection) ]
   (inject a stern warning into the context, asking it to self-correct)
                   │
         ┌─────────┴─────────┐
         ▼ correction succeeds  ▼ still won't fix (second violation)
     [ pass ]      [ 🟡 Tier 2: forced binding (Tool Choice Lock) ]
                   (modify the API params, remove its option to not call the tool)
                              │
                    ┌─────────┴─────────┐
                    ▼ forced through      ▼ keeps resisting (third violation)
                [ pass ]     [ 🔴 Tier 3: flow lockdown (Degraded Translation) ]
                            (the brain's decision power is removed, downgraded to plain-text translation)

🟢 Tier 1: message warning. The Runtime Probe detects the LLM gave an answer without calling the tool; the backend intercepts this return and adds a top-priority system warning to the context history: “[System Compliance Warning]: your just-completed execution trace seriously violates this Skill’s contract. The contract explicitly requires you to call read_page first. Immediately overturn that response where you skipped the step, consult the tool again, and re-answer.” Once added, it’s thrown back for a re-run. In practice, over 80% of top-tier LLMs can self-correct on the second round under this first-tier warning — the AI isn’t lying out of malice, its probabilistic brain just slid into the lazy shortcut, and one clear reminder via a system message is usually enough to pull it back.

🟡 Tier 2: forced binding (Tool Choice Lock). If this AI is stubborn, or too deeply affected by a malicious prompt injection, and still won’t call the tool after receiving the first-tier warning, the backend triggers the second tier: when issuing the LLM API call, it uses the provider’s official tool_choice parameter to pin the model to one specific tool. In Anthropic’s Messages API the value is {"type": "tool", "name": "read_page"}; OpenAI’s Chat Completions API expresses the same thing as {"type": "function", "function": {"name": "read_page"}}. Under this constraint, the LLM loses the option to output chat text directly; the only legal output this round is the JSON parameter input for read_page. The success rate of this “force the tool call” step is 100%, but be clear about the scope: what it guarantees is “this round will definitely call the tool”; whether the call’s parameters are right, and whether the returned result is right, is still guarded by later lines of defense like the Citation Tracer.

🔴 Tier 3: full flow lockdown. In extreme adversarial scenarios (or when the third-party Skill contract itself has a bug), if the AI still tries to evade even when forcibly bound by tool_choice — the backend triggers the last resort and removes the brain’s decision power: the backend Python itself executes read_page to fetch the web page, persists the HTML, then hands the organized text to the LLM with an explicit instruction: “All your decision authority has been frozen; you are forbidden from expressing any subjective opinion. Your only task is to faithfully, word for word, translate and organize the text the backend fetched for you into a fluent natural-language summary. Just complete it.” At this point the LLM degenerates into a pure translation tool, losing any room to free-wheel or fabricate content.
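The escalation ladder itself is a small state machine. In the sketch below the tier order follows the three tiers described above, and the dictionary returned by forced_tool_choice is the Anthropic Messages API shape for forcing one named tool; the Tier enum, next_tier, and the idea of keying the tier on a per-request violation counter are assumptions made for illustration.

```python
from enum import Enum


class Tier(Enum):
    PASS = 0              # no violation, nothing to do
    WARN = 1              # tier 1: inject a compliance warning and re-run
    TOOL_CHOICE_LOCK = 2  # tier 2: force the required tool via tool_choice
    DEGRADED = 3          # tier 3: backend runs the tool, LLM only rewrites


def next_tier(violation_count: int) -> Tier:
    """Map the number of violations so far to the intervention to apply."""
    if violation_count <= 0:
        return Tier.PASS
    return Tier(min(violation_count, Tier.DEGRADED.value))


def forced_tool_choice(tool_name: str) -> dict:
    """tool_choice value that forces one specific tool (Anthropic shape)."""
    return {"type": "tool", "name": tool_name}
```

Capping at DEGRADED means a fourth or fifth violation does not invent a new tier; the flow simply stays locked down.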

Security-minded peers will push back: “Since the third-tier lockdown achieves 100% compliance, why not just block all tools at the highest level from the very start? Why give the AI so much freedom and write a first and second tier?” The answer is: locking it down from the start sacrifices the LLM’s most precious “cognitive flexibility and subjective agency” along with it. When the user suddenly jumps off the SOP and casually follows up with an open-ended question the manual never anticipated (for example, “By the way, the model you just looked up is pretty pricey — could you analyze the business costs behind it from a senior engineer’s perspective?”), a locked-down translation flow leaves the AI only able to reply “Sorry, the contract restricts me to translating the document body text,” and the “thoughtful personal assistant” image instantly collapses into a rigid script. So fibon insists on the “Minimum Intervention First” principle: defuse 80% of everyday deviations with the gentlest message warning, and keep the strongest, most flexibility-suppressing forced means (tiers 2 and 3) as the last resort tucked away, enabled only in genuine malicious attacks or extreme situations. Chapter 7 lays out the full nuance of this trade-off.

The Real Core Value of This Layer of Work

With the three lines of defense taken apart one by one, the value of this layer of work only truly surfaces when you wire them together. The “hardcoded model name” error from the opening — the one I had to fix by watching the logs myself — becomes, in production, a silent and complete interception that the user won’t even notice. The same question gets guarded all the way through, like this:

You ask “Help me look up the latest model name Anthropic has released on its official site?” → the Gateway triggers the “fetch latest models” Skill → the LLM, out of probabilistic habit, skips the tool and answers from old memory → within a thousandth of a second the Runtime Probe catches this omission violation → the program intercepts the return and injects a tier-1 warning into the context, requiring the brain to back up and re-run → the LLM corrects, calls read_page, goes to docs.anthropic.com, and fetches the latest HTML → the Citation Tracer comparison passes, confirming the answer matches the official web page’s body text → the frontend smoothly renders the correct answer. The entire defensive flow is completely invisible and seamless in the frontend user’s eyes.

The greatest value of this layer of work is not “hoping the LLM becomes smarter and more obedient,” but calmly facing reality, admitting that the LLM is by nature a statistical machine that will be lazy, will err, and will misinterpret, and then using rigorous engineering discipline to seal off, from the outside, as many of those error possibilities as we can. In this wave of open-source Agent fervor, most of the discussion still revolves around the number of bolted-on tools, the bustle of MCP marketplaces, and the wow factor of Multi-Agent demos, with relatively few people turning back to ask one question: “Once the AI gets these genuinely destructive tool permissions, is it actually using them by the rules underneath?” fibon wants to spend its energy on this less-traveled but equally important compliance line.

White-Box Honesty — The Parts Not Yet Done

Reporting the real progress faithfully (updated June 2026): the core code responsible for this line of defense is fully developed (13 logical modules, totaling 14 Python files including the package export file, and 2,285 lines of defensive code), and all of it passes local unit tests. At the time of this chapter’s first draft, it was still in the final sprint of “weaving the audit probe into the LangGraph dynamic-graph execution flow,” but that line is now formally connected: graph.py’s should_continue routing now always routes the agent’s terminal output (the round where it no longer calls a tool) into skill_compliance_node before deciding the next step. But to stay honest to the end: the master switch for forced interception (enforcement), SKILL_CONTRACT_ENFORCEMENT_ENABLED, still defaults to false, and the probe first runs in “observe only, log, don’t intercept” observation mode. And there’s another safety catch — an independent flag SKILL_CONTRACT_OBSERVE_ONLY also defaults to true, so even if someone turns enforcement on, as long as this one isn’t off the probe still only records without intercepting; it only actually intercepts when enforcement is on, observe-only is off, and the contract has been manually approved, all three at once. Only once enough real-world data on the false-positive rate of violations accumulates will these switches be flipped toward interception. The wiring being complete is an engineering fact; not actively intercepting by default is a prudent decision — these are two separate things to state separately.
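The three-condition gate reads most clearly as a single boolean expression. In this sketch the two flag names and their defaults come from the paragraph above; probe_mode and the "intercept" / "observe" return values are hypothetical.

```python
# Defaults as described in the text: enforcement off, observe-only on.
SKILL_CONTRACT_ENFORCEMENT_ENABLED = False
SKILL_CONTRACT_OBSERVE_ONLY = True


def probe_mode(
    enforcement_enabled: bool = SKILL_CONTRACT_ENFORCEMENT_ENABLED,
    observe_only: bool = SKILL_CONTRACT_OBSERVE_ONLY,
    contract_approved: bool = False,
) -> str:
    """Intercept only when all three conditions hold; otherwise just log."""
    if enforcement_enabled and not observe_only and contract_approved:
        return "intercept"
    return "observe"
```

Flipping enforcement alone changes nothing in this sketch, which mirrors the safety catch described above.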

Separately, some flaws lie at the logical level and can’t be patched by any external engineering — in the end a human has to be the gatekeeper:

If the Skill instruction itself is badly written: when the SOP written by the user or the community is itself vague (“please notify the customer appropriately at a suitable time”), the contract auditor can only check compliance against that same vague wording. Code cannot rescue semantic ambiguity.
If the upstream Skill dispatcher routes the request wrong: the user asks “what’s the weather in Taipei today,” but the Butler misjudges and routes it to the “fetch Anthropic model list” Skill. The verifier will still dutifully check whether the run called Anthropic’s official API. That isn’t the auditor’s fault; it’s a failure of the upstream dispatching brain.
Whether the Skill’s core design is any good: in the product architecture, should precise memory be distilled from [click events] or from [dwell engagement time]? This kind of hard product intuition and business trade-off is something code can never decide for you. The soul and boundaries of a product remain the responsibility of humans.

But even with these limits, fibon’s contract audit has already narrowed the scope of LLM failures in production: from an unpredictable black box down to a “compliance-violation boundary” that humans can adjudicate easily, by eye and with code. For a personal assistant meant to go live and accompany users over the long haul, that narrowing is already enough.

Implementation Details

Implementation details 1: The Trusted vs. Untrusted trust-tiering isolation (DMZ) defense for MCP resources (for engineers)

In fibon’s worldview, the dizzying array of external MCP servers on the internet is managed with a two-tier trust scheme (Trust Multi-Tiering):

🟢 Trusted MCP (fully trusted tier): tools the user explicitly authorized in the backend, self-hosted private endpoints, or trusted tools from official vendors like Microsoft/Anthropic that have passed rigorous community source-code audits. Loaded directly into the core brain process, taking the Fastpath (direct connection).
🔴 Untrusted MCP (unknown-risk tier): tools of unknown origin from open-source third-party communities, or downloaded from some unknown Marketplace. When one is called, the brain gets no direct API connection to it; every tool call is wrapped by the Gateway into a gRPC call and sent to a sandbox Worker running independently in the background.

This tiering addresses a fundamental security hazard that has existed since the MCP protocol became widespread in 2025. An MCP server essentially grants external code the power to make arbitrary network requests and to read and write local files on your computer, so loading every MCP of unknown origin as Trusted is like flinging open a back door to attackers. In practice, the tiering is driven by the mcp_servers.trust_level column in PostgreSQL (the user ticks it manually in the backend admin UI). On cold start, the core brain loads only Trusted servers into its own process; every Untrusted server is routed over gRPC to the isolated external sandbox. Chapter 6, “The Sandbox Isolation Line of Defense (DMZ),” covers this in depth.
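A minimal sketch of the cold-start split follows. It assumes the trust_level column holds the strings "trusted" and "untrusted" (the text does not say how the value is stored). McpServer, route_for, and plan_cold_start are hypothetical names, and treating any unrecognized value as Untrusted is a fail-closed choice added here for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Optional


class TrustLevel(Enum):
    TRUSTED = "trusted"      # assumed stored value
    UNTRUSTED = "untrusted"  # assumed stored value


@dataclass(frozen=True)
class McpServer:
    name: str
    trust_level: Optional[str]  # raw value read from mcp_servers.trust_level


def route_for(server: McpServer) -> str:
    """Return 'fastpath' for Trusted servers and 'sandbox_grpc' otherwise."""
    raw = (server.trust_level or "").strip().lower()
    try:
        level = TrustLevel(raw)
    except ValueError:
        # Fail closed: a missing or unknown value never reaches the Fastpath.
        level = TrustLevel.UNTRUSTED
    return "fastpath" if level is TrustLevel.TRUSTED else "sandbox_grpc"


def plan_cold_start(servers: Iterable[McpServer]) -> Dict[str, List[str]]:
    """Group server names by the route they would take at cold start."""
    plan: Dict[str, List[str]] = {"fastpath": [], "sandbox_grpc": []}
    for server in servers:
        plan[route_for(server)].append(server.name)
    return plan
```

The point of the fail-closed branch is that a typo or a NULL in the column can only make a server less privileged, never more.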

Implementation details 2: The 'three defensive gates' when bringing in a third-party Skill package (for engineers)

A Skill is, on the surface, just pure-text Markdown for the LLM to read. But viewed through a security lens, bringing in any unknown third-party Skill is equivalent to trusting a stretch of prompt instructions written by a stranger, aimed at operating your entire tool system. To prevent a malicious Skill from manipulating the brain in the backend and quietly calling the browser to upload API keys to an attacker, fibon plants three gates on the import side:

Gate 1: static scan and settings-block validation (synchronous, completely free): when skills.zip is imported, the Gateway runs a static regular-expression scan of 17 strict rules over SKILL.md, within a thousandth of a second, and sorts its findings into three severity levels. The moment it spots a hidden prompt-injection ploy (such as “please ignore all security_rules restrictions from above”) or a malformed settings block at the top of the file, it blocks the import right away.
Gate 2: asynchronous AI behavioral review (triggered on demand, draws down the safety budget): Skills that pass Gate 1 are marked PENDING_AI_REVIEW, and an isolated, independent brain runs 5 AI evaluation actions against the behavioral contract. To prevent malicious token-burning, there is a strict budget cap: each user’s daily AI-review budget is capped at $1 USD, enforced by a token-bucket rate limit kept in Redis. When an over-privilege risk is found, the frontend pops up a high-cost warning Modal that requires a second confirmation.
Gate 3: final human manual approval (ultimate human sovereignty): fibon does not adopt the “fully automated closed-loop AI Skill review” approach, because I’ve witnessed “AI deceiving AI” firsthand: a highly deceptive Prompt inside a malicious Skill can fool not only the Butler but also the backend AI evaluation brain in charge of review. So I designed an 8-state finite state machine (FSM). No matter how good the upstream AI review report looks, the Skill’s state stays frozen, and it cannot be written into the core tool_registry, until the admin presses [Approve] by hand. The last gate must be guarded by a human in person.
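The Gate 3 rule, that only a human can move a Skill into an approved state, can be sketched as a transition table keyed by actor. This is a reduced illustration, not the 8-state FSM itself: only PENDING_AI_REVIEW is named in the text, and every other state, the Actor enum, and the function names are hypothetical.

```python
from enum import Enum


class State(Enum):
    PENDING_AI_REVIEW = "PENDING_AI_REVIEW"            # named in the text
    PENDING_HUMAN_APPROVAL = "PENDING_HUMAN_APPROVAL"  # hypothetical
    APPROVED = "APPROVED"                              # hypothetical
    REJECTED = "REJECTED"                              # hypothetical


class Actor(Enum):
    AI_REVIEWER = "ai_reviewer"
    ADMIN = "admin"


# (current state, target state) -> the only actor allowed to make that move.
ALLOWED = {
    (State.PENDING_AI_REVIEW, State.PENDING_HUMAN_APPROVAL): Actor.AI_REVIEWER,
    (State.PENDING_HUMAN_APPROVAL, State.APPROVED): Actor.ADMIN,
    (State.PENDING_HUMAN_APPROVAL, State.REJECTED): Actor.ADMIN,
}


class TransitionError(Exception):
    """Raised when a move is not in the table or the actor is not allowed."""


def transition(current: State, target: State, actor: Actor) -> State:
    """Return the new state, or raise if the move or the actor is illegal."""
    required = ALLOWED.get((current, target))
    if required is None:
        raise TransitionError(f"{current.name} -> {target.name} is not a legal move")
    if actor is not required:
        raise TransitionError(
            f"{target.name} requires {required.value}, got {actor.value}"
        )
    return target


def can_register(state: State) -> bool:
    """Only an APPROVED Skill may be written into the tool registry."""
    return state is State.APPROVED
```

Because the table contains no edge into APPROVED that an AI actor may take, a glowing review report cannot unlock registration; the restriction sits in the data structure, not in a prompt.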

Implementation details 3: The three ADR blueprints for resource review (proposed) (for engineers)

Although both “the unified tool registry” and “the three-gate Skill import” are already in place, this line of defense still has three gaps to fill. Each is written up as a Proposed (proposal-stage) Architecture Decision Record (ADR), listed here to show the community clearly where the evolution is headed:

ADR-021 (a unified review gate for multi-form external resources)
Pain point: Gates 1/2/3 are currently bound only to the Skill resource type, so when the system plugs in a brand-new external MCP server a security gap appears (its tool descriptions never went through those 17 regex scans, and the Docker image it pulls was never SHA-256 verified).
Solution: refactor the skill_import_candidates candidate table into a multi-form-oriented external_resource_candidates, adding a resource_type ('skill' / 'mcp_server' / 'workflow'), so that future external MCPs also run through the three gates.
ADR-023 (a structurally secure typed JSON specification)
Pain point: Gate 2’s LLM malicious scan (malicious_scan) has a bottleneck. Precision is very high (1.0, meaning no good Skill is falsely flagged), but recall is low (only 0.536 on a preliminary local benchmark), so nearly half of the malicious samples in that benchmark slipped through. An ill-intentioned Skill can hide a prompt injection inside long stretches of natural language in SKILL.md, in convoluted sentences that Gate 1’s regex can’t catch and that Gate 2’s LLM often misses amid the long prose.
Solution: remove the freedom to write big stretches of natural language in SKILL.md, change all Skill specs into a three-layer, strongly-typed JSON structure, validate with a Schema, and force third-party developers to spell out every implicit side effect — with the goal of raising malicious_scan recall from 0.536 to above 0.65 (ADR-023 Phase 3’s go-live threshold).
ADR-020 (dynamic cold-start scheduling for external MCPs, “start only when used”)
Pain point: memory cost is a real headache for operations. Starting one Playwright MCP consumes 1GB; add Context7’s 256MB and 5 standby Docker sandboxes, and the system’s idle hardware overhead reaches 3GB+ all day long, a burden for developers with limited resources who want to fully self-host fibon on an aging laptop.
Solution: introduce a “start only when used” cold-start scheduler in the Gateway that keeps all MCPs powered down by default. Only when the Stratified RAG scoring judges that the current conversation round actually hits a tool does it asynchronously start that tool’s Docker container in the background, and any MCP that stays idle for more than 15 minutes is powered down to free memory. The cost is that after the user has been silent for 15 minutes, the first round after they type again waits an extra 10–15 seconds for the container cold start. In exchange for roughly 3GB of memory saved, that wait is a reasonable trade.
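The idle-timeout half of ADR-020 can be sketched as simple bookkeeping. The proposal places the scheduler in the Gateway; Python is used here only for consistency with the other examples. The 15-minute threshold comes from the text, IdleScheduler and its methods are hypothetical names, and actually starting or stopping Docker containers is deliberately left out.

```python
import time
from typing import Callable, Dict, List

IDLE_TIMEOUT_SECONDS = 15 * 60  # 15 minutes, per the ADR-020 proposal


class IdleScheduler:
    """Tracks when each MCP was last used; starts and stops nothing itself."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so tests can control time
        self._last_used: Dict[str, float] = {}

    def touch(self, name: str) -> bool:
        """Record a tool hit; return True if the caller must cold-start it."""
        needs_cold_start = name not in self._last_used
        self._last_used[name] = self._clock()
        return needs_cold_start

    def reap(self) -> List[str]:
        """Forget and return the MCPs idle for longer than the timeout."""
        now = self._clock()
        expired = [
            name
            for name, last in self._last_used.items()
            if now - last > IDLE_TIMEOUT_SECONDS
        ]
        for name in expired:
            del self._last_used[name]
        return sorted(expired)
```

A caller would run reap() periodically and stop whatever it returns; the next touch() for that name then returns True again, which is where the extra cold-start wait described above comes from.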

In the next chapter, we tackle head-on the boldest theme in this whole log: if fibon judges that the current code isn’t good enough, is it truly allowed to rewrite its own source code in the background (not just Brain’s Python, but also the Gateway’s Kotlin, the Worker’s TypeScript, and the frontend’s Vue), achieving “AI self-evolution (Self-Evolution)”? I’ll explain how, while letting it rewrite itself, we plant in the code a human safety net that keeps the Butler firmly tethered.