Field Notes

Runaway Sub-Agents: The June 2 Claude Outage and the Lesson of the Infinite Loop

A bug that made sub-agents multiply exponentially knocked Claude out for nearly six hours. fibon's delegation-round cap and multi-vendor design are built for exactly this kind of runaway, but there is one piece I haven't built yet.

📅 2026-06-02 ⏱ 13 min 📖 Chapters 2, 8 🔬 Deep Dives A

Quick summary: On June 2, 2026, Claude went down across the board for nearly six hours. The media attributed the outage to a bug in Claude Code’s sub-agent system: sub-agents multiplying exponentially, falling into an infinite loop, and burning tokens out of control. fibon’s Butler/Assistant delegation has a hard “max 3 rounds” cap built for exactly this runaway; its multi-vendor design answers “don’t put all your eggs in one AI basket.” The note ends by honestly flagging the piece fibon still lacks.

Skip this if: you depend on no single LLM vendor and write no self-delegating agents.

What happened that day

From around 06:00 UTC on June 2, 2026, Claude began going offline at scale. The status page started logging elevated errors on Opus 4.6, updated to “Identified” (root cause found, fix deploying) at 06:39, deployed a fix around 10:42 and entered monitoring, and marked it Resolved at 11:49 — nearly six hours start to finish.

The blast radius was wide: Opus 4.6 and Sonnet 4.6, Claude’s web and mobile apps, Claude Code and the CLI, the Claude API, and the developer console were all hit. Enterprise API callers ran into a wall of 500 and 529 errors. When TechRadar tested it, Sonnet 4.6 spun forever on “gathering my thoughts.”

The root cause, as broadly reported by the media, goes like this: Claude Code’s sub-agent system, designed to split large programming tasks into parallel sub-processes, hit a bug that made those sub-agents multiply exponentially and fall into an infinite loop, spiking token consumption and draining hours’ or days’ worth of user quota in minutes. Anthropic afterward reset quotas on affected Pro/Max accounts (refunding the over-burned tokens), an action consistent with the “token over-burn” attribution.

Why “infinite loop” is a classic agent-architecture trap

Even though the vendor has not confirmed the details, the failure class itself, “self-delegating sub-agents multiplying out of control,” is one anyone building multi-agent systems should carry in their head. When an agent can spawn another agent, and spawning has no hard depth or count cap, you’ve buried a recursion bomb: A delegates to B, B feels it needs help and delegates to C, and some link decides “not done yet” and spawns again. LLMs are inherently bad at reliably judging “enough,” so any sliver of ambiguity in the termination condition keeps the chain going, and once each agent can spawn more than one child, the total grows exponentially with depth.
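To put rough numbers on the recursion bomb, here is a minimal sketch. It assumes every agent spawns the same number of children; `fan_out` and `depth` are illustrative parameters, not figures from the incident.

```python
def total_agents(fan_out: int, depth: int) -> int:
    """Count agents in a delegation tree where every agent spawns
    `fan_out` children, down to `depth` levels below the root."""
    # Level 0 is the root agent; level k holds fan_out**k agents.
    return sum(fan_out ** level for level in range(depth + 1))


if __name__ == "__main__":
    # A pure chain (fan_out=1) grows linearly; any fan_out > 1 grows exponentially.
    for fan_out in (1, 3):
        for depth in (3, 10):
            print(f"fan_out={fan_out} depth={depth}: {total_agents(fan_out, depth)} agents")
```

With a fan-out of 3, capping depth at 3 keeps the tree at 40 agents; letting it run to depth 10 produces 88,573. Under these assumptions, the cap is what separates a bounded cost from an unbounded one.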

And an agent’s runaway loop is costlier than a traditional program’s: a traditional infinite loop burns CPU; an agent’s burns real money in tokens. That connects it to the “spending with no brakes” of my last two notes — what runs away isn’t just compute, it’s the bill.

What this means for fibon

This one lands right on fibon’s multi-agent design, and fibon’s choice here is to assume from the start that delegation will run away.

fibon’s “Butler/Assistant” hierarchy (Chapter 2) has several hard boundaries. The Butler can delegate tasks to Assistants, but delegation rounds have a cap, defaulting to 3, tracked by a delegation_rounds table in the database; past that, the Butler is forced to take over and stop the back-and-forth. max_delegation_rounds is a field in the agent config, not a matter of begging the LLM via prompt to “please don’t delegate infinitely” — it’s a code-layer hard limit. This is precisely the structural defense against the “sub-agent exponential multiplication” failure class: you may recurse, but recursion has a ceiling.
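A code-layer cap of this kind might look roughly like the sketch below. This is not fibon's actual implementation: `DelegationTracker`, `DelegationCapReached`, `run_task`, and `wants_more_help` are hypothetical names, and an in-memory counter stands in for the delegation_rounds table.

```python
from dataclasses import dataclass


class DelegationCapReached(Exception):
    """Raised when the Butler must stop delegating and take over."""


@dataclass
class DelegationTracker:
    # Mirrors the max_delegation_rounds config field; 3 is the default from the text.
    max_delegation_rounds: int = 3
    # Stand-in for the delegation_rounds table; a real system would persist this.
    rounds_used: int = 0

    def record_round(self) -> None:
        """Call before each Butler -> Assistant hand-off."""
        if self.rounds_used >= self.max_delegation_rounds:
            raise DelegationCapReached(
                f"cap of {self.max_delegation_rounds} rounds reached"
            )
        self.rounds_used += 1


def run_task(tracker: DelegationTracker, wants_more_help) -> str:
    """Delegate while the (hypothetical) model asks for help, up to the cap."""
    try:
        while wants_more_help():
            tracker.record_round()
    except DelegationCapReached:
        # The check lives in code, so no prompt wording can talk its way past it.
        return "butler_takeover"
    return "completed"
```

Even if the model always answers “not done yet,” `run_task` returns `"butler_takeover"` after three rounds, because the limit is enforced outside the LLM's judgment.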

The second mirror is vendor dependency. In its post-mortem commentary, Thoughtworks named the real lesson: single-vendor dependency = single point of failure. Hardcoding one LLM’s endpoint into a product is, in 2026, a business-continuity risk. fibon’s Brain is multi-LLM-provider by design (llm_factory supports Anthropic / OpenAI / Google and others), with models switchable at the session layer. This wasn’t built for this outage, but it’s exactly the answer the outage discussion calls for — when one goes down, there’s a way out.
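For illustration, here is one way a “way out” could look in code. The note only says models are switchable at the session layer, so automatic fallback is an assumption on my part, and none of the names below (`complete_with_fallback`, `Provider`, `AllProvidersFailed`) come from llm_factory.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every configured provider has errored."""


# A provider is a (name, callable) pair; the callable takes a prompt and returns text.
Provider = tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first that succeeds."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in practice, catch each SDK's specific error types
            errors.append(f"{name}: {exc}")
    # Every provider failed; surface all the reasons at once.
    raise AllProvidersFailed("; ".join(errors))
```

The ordering of `providers` encodes preference: the first entry is the default, and later entries are only tried when earlier ones raise.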

But facing this incident, I have to honestly flag a piece fibon hasn’t finished:

fibon’s circuit breaker today only guards gRPC call failures (Brain→Worker has exponential backoff + a failure_threshold=5 breaker), not cost runaway. That is, the delegation-round cap stops “infinite delegation,” but a different form of runaway loop (say, one agent repeatedly calling an expensive tool within a single round) would not make today’s fibon hit the brakes just because “this hour burned too many tokens.” The most painful part of the Claude outage — tokens drained in minutes — maps to a global spend circuit breaker, exactly the thing I flagged in earlier notes that ADR-010 has a prototype for but hasn’t generalized. The delegation cap is a brake on the “count” axis; the spend breaker is a brake on the “amount” axis. I’ve only built the first.
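To make the “amount” axis concrete, here is a minimal sketch of a spend breaker based on a sliding time window. It is not the ADR-010 prototype, whose design this note does not describe; `SpendBreaker`, `max_tokens`, and `window_seconds` are hypothetical names chosen for the example.

```python
import time
from collections import deque


class SpendBreaker:
    """Refuses new spend once tokens used inside a sliding window would exceed a budget."""

    def __init__(self, max_tokens: int, window_seconds: float, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._events = deque()  # (timestamp, tokens) pairs, oldest first
        self._total = 0

    def _evict_expired(self, now: float) -> None:
        # Drop spend records that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, tokens = self._events.popleft()
            self._total -= tokens

    def allow(self, tokens: int) -> bool:
        """Record the spend and return True if it fits the budget; otherwise return False."""
        now = self._clock()
        self._evict_expired(now)
        if self._total + tokens > self.max_tokens:
            return False  # caller should stop or escalate instead of calling the model
        self._events.append((now, tokens))
        self._total += tokens
        return True
```

A caller would check `breaker.allow(estimated_tokens)` before each model or tool call. Unlike the delegation cap, this trips on total spend regardless of which round or agent the tokens came from.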

The real takeaway from this outage is the flip side of Thoughtworks’ line: AI tools should amplify engineers’ capabilities, not become a structural crutch. When Claude went down and countless developers’ workflows froze instantly, what got exposed wasn’t just Anthropic’s bug — it was an entire industry resting too much weight on a single AI while treating it with resilience standards far below those for a database. What fibon can do is, on its own plot of land, wire up all three brakes — delegation, vendor, and cost. Two are wired up so far.

Sources