Field Notes

Zombie Caches and Stolen Keys: A Teardown of Two Runaway AI Bills

Reverse-engineering how Google's billing system broke from the shape of a BigQuery export — and an honest audit of which defenses fibon has built, and which one is still missing

📅 2026-06-07 ⏱ 14 min 📖 Chapters 4, 6 🔬 Deep Dives A, C, D

Quick summary: two runaway Gemini API bills from the first half of 2026 — “deleted caches that kept billing” and “a stolen key that burned $82,000 in 48 hours.” Both look like exploding invoices; the failure mechanisms are completely different. What amplified the damage, though, is one and the same structural flaw: cloud billing is an open-loop system. The note ends, as always, with fibon: which bug classes have no surface to attach to here, and which defense we still haven’t built.

Skip this if: you don’t use any pay-as-you-go LLM API and have no curiosity about what a metering pipeline’s internals look like.

Incident one: the deleted cache that kept billing

On June 7, 2026, Brazilian developer Danilo Oliveira posted an SOS on the Google AI Developers Forum. His system ran analysis jobs using Gemini 3 Flash’s context caching. On the afternoon of June 6 he noticed the bill was wrong: after shutting down the script that created the caches and confirming via the official REST API that the cache list was completely empty, a billing SKU called “cached text storage token hours” kept charging him over 1,000 Brazilian reais per hour. By the early hours of June 7, the cumulative bill hit R$17,847 (several thousand US dollars). His last-resort tourniquet: disabling the Gemini API service for the entire Google Cloud project.

He did something valuable for everyone: he exported his billing data to BigQuery and posted the hour-by-hour breakdown. The shape of that data is more honest than any prose. It has three phases:

The first two days (June 3 – midday June 5): a steady 4–5M token·hours, 20–30 reais per hour — the baseline of his script running normally.
The runaway phase (from the afternoon of June 5): usage starts compounding, climbing all the way to 200 million token·hours per hour.
The frozen phase (after killing the script on June 6): the hourly billed quantity locks at exactly 200.7142M token·hours — identical to four decimal places, charged like clockwork every hour, until he pulled the plug on the entire API.

Reverse-engineering the failure from the bill’s shape

To read this data you first need the billing model of explicit context caching. You upload a large block of text (say, a long document) as a cache; subsequent requests reference it instead of re-sending it. The price is a storage fee: cached token count × hours stored. Note the essential difference from a normal API call: a call is a one-shot event, while a cache is a stateful cloud resource that bills continuously — a rented storage unit with a meter that runs every hour until you move out.

So under normal operation, deleting a cache (or its TTL expiring) should stop the meter. From here on is my speculation — but the three-phase bill shape is nearly impossible to explain by any mechanism other than this one:

The resource plane and the billing plane are two separately-governed states. When you call the cachedContents list/delete API, you operate on the resource plane’s registry; billing runs on a different pipeline — periodically snapshotting “total tokens currently in storage” and multiplying by hours. At some point, deletion and expiry events stopped propagating to the billing plane:

the runaway phase = the script was still creating new caches, but the old ones never disappeared from the billing plane, so the stock kept accumulating;
the frozen phase = with the script off, nothing new was added; the zombie stock froze in place and became a fixed-amount hourly perpetual-motion charging machine;
and the cruelest part: the user-facing API showed an empty list — the state you can see is clean, while the state being billed is invisible to you, and undeletable.

What gives the speculation real footing: this wasn’t the first time. In March 2026, another developer, Liz2k, reported the exact same pattern — she created five test caches with a 5-second TTL, the list query came back empty all day, yet her bill showed 3.4 million “storage hours,” then burned a flat $36 every day after. She called it the “infinite zombie cache.” She, too, ended up disabling the entire API — and then observed a crucial detail: about three days after the disable, the bill was retroactively corrected down to the true figures. In other words, reconciliation logic exists — but it apparently only triggers when the API is forcibly severed. The same bug class has publicly detonated at least twice within three months.

Incident two: the stolen key — $82,000 in 48 hours

The second incident dates back to February, but it only shows its full shape next to incident one. A three-person team in Mexico had their Google Cloud API key leaked. Between February 11 and 12, the thief used it to hammer Gemini 3 Pro image and text generation, racking up $82,314 in 48 hours — against the team’s normal monthly spend of $180. When they appealed to Google, they got the cloud industry’s standard answer: the “Shared Responsibility Model” — the platform protects the platform; the key is your problem.

A typical month is about $180; a stolen key rang up $82,314 in 48 hours, roughly 457x — 48-hour bill on a stolen key vs. a typical month's spend 資料來源：The Register / Tom's Hardware (2026-03)

The key leak was, of course, the team’s lapse. But what let $180 become $82,314 with nothing tapping the brakes along the way is structural:

GCP’s Gemini API has no hard spending cap. A Budget Alert notifies you; it doesn’t stop anything — it’s an observability tool, not a control. Compare the prepaid-credit models at OpenAI and Anthropic: when the balance hits zero, service stops. Naturally capped.
Billing signals lag. Billing exports can trail reality by 24 hours or more. By the time you see the anomaly, the money is gone.
Google API keys start with AIza in a fixed format, trivially harvested by scanners crawling public repos and frontend bundles. These keys were never designed to be high-value authentication credentials — until Gemini turned them into something directly convertible into money.

The common root cause: billing is open-loop

Two incidents — one a provider-side state machine breaking, the other a client-side credential failure — superficially unrelated. But what amplified both disasters a hundredfold is the same structural flaw:

There is no closed loop between the rate of spending and the authorization to spend.

Billing is an asynchronous, eventually-consistent aggregation pipeline. Every signal you can get — dashboards, budget alerts, BigQuery exports — lags by hours to days. And inside that lag window, no mechanism automatically connects “anomalous spend rate” back to “stop authorizing spend.” Control theory calls this an open-loop system: the throttle is pressed, but no sensor feeds back to the wheel. The victims of incident one and incident two were both left with the same manual brake: ripping out the entire API service.

What this means for fibon

Per this section’s convention, we end at home: can fibon withstand these two bug classes? The honest answer comes in three layers.

Layer one: the zombie-cache bug class has no surface to attach to in fibon — but that’s luck plus selection, not foresight. fibon’s prompt cache strategy (Deep Dive C has the full teardown) uses per-request cache_control breakpoints with a 5-minute TTL on Anthropic, and automatic prefix caching on OpenAI and Google. None of these mechanisms carries a separate storage billing SKU — fibon never holds any “stateful cloud resource that bills continuously,” so the failure type “an undeletable zombie resource” has nothing to attach to. When I chose automatic prefix caching over explicit caching, the reasons were engineering simplicity and sufficiency — not a premonition of this incident. The conclusion preceded the correct justification. Noted for the record; no credit claimed.

Layer two: fibon keeps an independent ledger outside the provider — that’s the capital for detection. fibon’s observability layer (Deep Dive A) already writes every LLM call’s token usage — including cache hits — into its own metrics. That means “what I believe I used” exists as a record that doesn’t depend on the provider. The victim of incident one needed three days and a manual BigQuery dig to spot the anomaly; with a daily “own ledger vs. provider bill” reconciliation job, this kind of thing pages you in hour two. The difference is R$50 versus R$18,000.

Layer three: key protection is a defense fibon’s architecture already built. fibon’s API keys live only on the server side; the frontend never touches them. The Worker that runs untrusted code is confined to an isolated network, with the design goal written in black and white: even if compromised, it cannot reach the API keys (Chapter 6). Incident two’s failure mode — keys sitting in frontend code or a public repo — has no path to occur under this architecture.

To close, four lines of defense for anyone using pay-as-you-go APIs, sorted by value for money:

A billing-layer firewall — put high-risk workloads in a separate cloud project, and wire budget alerts to a function that detaches billing automatically (GCP’s own docs describe this pattern; it is the only true hard cap).
Keys never leave the server — frontends and public repos only ever see a proxy endpoint.
Keep your own ledger — record every call’s usage yourself and reconcile against the provider’s bill; don’t outsource “knowing what you spent” to the billing system.
Avoid stateful, continuously-billed features — unless you truly need a “rented storage unit” service like explicit caching, use per-request alternatives. No state, no zombies.

You can never prevent a provider-side bug. But the blast radius is yours to draw.