Field Notes

Frameworks Amplify Execution, Not Direction: Reading DeepMind's From AGI to ASI

DeepMind argues that even if model capability stalls at human level, a hundred million AGI instances would "squeeze out" an ASI. After a stretch of building agent infrastructure, I keep landing on a nearer question — could AGI itself just be "a model that reasons correctly + a good framework"?

📅 2026-06-10 ✏️ Updated 2026-06-21 ⏱ 13 min 📖 Chapters 3, 4 🔬 Deep Dives F

Quick summary: DeepMind’s From AGI to ASI is a “map-drawing” report: it runs no new experiments and trains no new models, but lays out the possible ways machine intelligence might keep growing after human-level AGI — three tiers of intelligence, four pathways, a string of bottlenecks, plus one theoretical ceiling (AIXI). Its counterintuitive bet: even if a single model’s capability stays frozen at human level, as long as compute keeps growing, simply cloning lots of AGIs, sharing memory, and accelerating thought will “squeeze out” an ASI at the collective level. I use the report to think through a nearer question: could AGI itself just be “a model that reasons correctly + a good enough agent framework”?

Skip this if: you only want to know what the report says — the first section, “What the report actually says,” covers that. The rest is my extension from an agent-infrastructure point of view.

What the report actually says

First, its framing: the report doesn’t ask “when will AGI arrive”; it asks “what happens after AGI.” It assumes human-level AGI already exists, then works out which paths machine intelligence might take toward superintelligence (ASI), how fast, and what could block it.

And at heart it’s “drawing a map,” not “running experiments”: it spreads the possible trajectories out and labels them clearly, but it contains no new experiments, data, or models. What it produces is three things: a taxonomy, a theoretical reference point, and a list of open research questions. Even the numbers it cites (effective compute growing ~10× a year, scaling laws, high-quality human text running out) come from other people’s papers, not this one.

It arranges intelligence on a continuum, marked by three points: AGI (median-human across most cognitive tasks); ASI (superhuman on almost every task; the bar is set high: reliably beating the output of “tens of thousands of well-coordinated experts, restricted to a 2010 technology base, working on a single problem for ten straight years,” so a single-domain superhuman system like AlphaFold does not count as ASI); and the theoretical ceiling, Universal AI (Hutter’s AIXI framework, the one part of the report with real math behind it). All three are anchored on the Legg-Hutter intelligence score, so there’s no need to draw an exact threshold; it’s enough that “there’s a clear gap between AGI and ASI.”

The core claim, in one sentence: even if model capability stays frozen at human level, as long as compute keeps growing, superintelligence still gets squeezed out. A hundred million human-level AIs, through lossless replication, high-speed communication, and 100× thinking speed, cross the ASI line as a collective. The report didn’t dream this up out of nothing: Bostrom long ago split superintelligence into “faster than humans,” “more than humans,” and “smarter than humans,” and the report bets on the first two: no need to be smarter, just faster and more numerous.
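To get a feel for the scale of that claim, here is a back-of-envelope sketch, not a calculation from the report. The instance count and speedup are the figures quoted above; `BENCHMARK_EXPERTS = 50_000` is my own assumed stand-in for “tens of thousands.” It counts raw thinking time only: it ignores coordination overhead, and a median-human AGI hour is not an expert hour.

```python
# Back-of-envelope only: each number is either quoted in the text above
# or an explicitly labeled assumption. Coordination overhead is ignored.

AGI_INSTANCES = 100_000_000   # "a hundred million human-level AIs"
SPEEDUP = 100                 # "100x thinking speed"

# The ASI bar: "tens of thousands" of experts working for ten years.
# 50_000 is an assumed stand-in for "tens of thousands".
BENCHMARK_EXPERTS = 50_000
BENCHMARK_YEARS = 10


def subjective_years_per_calendar_year(instances: int, speedup: int) -> int:
    """Human-equivalent thinking-years produced in one calendar year."""
    return instances * speedup


def calendar_days_to_match_benchmark() -> float:
    """Calendar days for the collective to log the benchmark's raw person-years."""
    benchmark_person_years = BENCHMARK_EXPERTS * BENCHMARK_YEARS
    per_year = subjective_years_per_calendar_year(AGI_INSTANCES, SPEEDUP)
    return 365 * benchmark_person_years / per_year


if __name__ == "__main__":
    print(subjective_years_per_calendar_year(AGI_INSTANCES, SPEEDUP))
    print(calendar_days_to_match_benchmark())
```

On raw quantity the collective clears the benchmark in well under a day, which is why the interesting questions are about coordination and quality rather than headcount.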

Why can digital intelligence stack up like this? The report lists six innate advantages it has over carbon-based life, all of which scale with compute in ways biology can’t match proportionally: fast input/output, accelerable internal thinking, large working memory, the ability to migrate across different hardware, lossless copying (even a “lifetime of experience” can be duplicated perfectly), and high-bandwidth experience sharing (AIs of the same lineage can even exchange learning signals directly).

It lists four pathways from AGI to ASI: scaling up (bigger models, data, compute), a new AI paradigm, AI recursively improving itself (RSI), and ASI emerging from the interactions of a large crowd of AIs. The four aren’t mutually exclusive; they run in parallel and can even compound like interest. But only the first, scaling, has historical data you can use to predict it.

Along the way it catalogs six possible bottlenecks: the data wall, economics and resources, the current paradigm not being enough, research getting harder, the abstraction barrier, and deliberate human slowdown. The report says it can offer countermeasures for the others; the truly hard one is the “abstraction barrier” (more on that below). And as for “will these bottlenecks actually block the way,” it closes almost every one with “this is still an open research question,” so the report has hardly a single firmly committed conclusion; every judgment is heavily hedged: “can’t be ruled out,” “low confidence,” “seems plausible.”

It also stresses something the hype tends to drown out: an ASI is neither omniscient nor omnipotent. It’s still bound by hard physical and mathematical limits like the speed of light, Landauer’s principle (the energy floor for computation), P vs NP, and Gödel incompleteness, so there’s no guarantee it could cure aging or precisely simulate the entire planet.

In the end it bets on two more-likely outcomes, both flagged “low confidence”: either progress stalls before AGI, or it goes from AGI to a “weak ASI” fairly smoothly. It gives no timeline and doesn’t claim an intelligence explosion is inevitable; it only says the possibility of “cruising past AGI into ASI territory within the next decade or two cannot be easily dismissed.”

How to judge it fairly: for something that hasn’t happened and has no historical precedent, you simply can’t get empirical data; you can only draw a map of possibilities. The value of this kind of report isn’t in “what it discovered” but in “the shared vocabulary it gives everyone” (terms like abstraction barrier and multi-agent scaling laws now have names you can argue about), and in its weight as an official statement from DeepMind plus Legg/Hutter, which steers where the whole field looks next and where the boundaries of the conversation get drawn. Read it as an “agenda-setting document,” not a “research result,” and the assessment comes out a lot fairer.

A point that’s easy to misread: superhuman ≠ AGI

Quite a few Chinese-language outlets and finance channels slapped a lurid headline on this: “AGI is dead; the ASI threshold is a hundred million ordinary people.” The headline distorts, as usual: the report doesn’t say AGI is unimportant; it says AGI is a starting point, not an endpoint.

Hidden here is a definition detail people often get wrong: AGI is about “how all-round it is,” not “how high the peak goes.” The report measures intelligence by its “average across all tasks”: what matters is whether the weakest links all reach an ordinary person’s level, not how high the single strongest skill climbs. By that definition, “superhuman in some respects but not yet AGI” makes perfect sense. Today’s models are exactly that: they’ve long surpassed humans in breadth of knowledge and real-time translation across many languages, yet they keep stumbling on things ordinary people do easily: learning from past experience and not repeating the same mistake, holding steady judgment across a long task, and knowing what they don’t know. Those weak spots drag the average down, so even with a stunning peak in some areas, overall they’re still stuck below the AGI line. The problem isn’t that they aren’t strong enough; it’s that they aren’t all-round enough.

This leads to a counterintuitive corollary: the first AGI, at birth, won’t look like a “digital average person.” Because when its weakest links are brought up to an ordinary person’s level, the parts that were already superhuman don’t get flattened down with them; they stay superhuman. So the real shape of the first AGI will be a system that’s passing on the floor but superhuman at the ceiling, with a big spread between its highs and lows.
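The floor-versus-peak argument can be sketched with a toy capability profile. All scores are hypothetical, chosen only to show the shape of the argument, and `HUMAN_LEVEL`, `is_all_round`, and `raise_floor` are names I made up for illustration, not anything from the report.

```python
from statistics import mean

HUMAN_LEVEL = 1.0  # assumed unit: an ordinary person's level on a task

# Hypothetical scores for a present-day model: two peaks, three weak spots.
today = {
    "breadth_of_knowledge": 1.6,
    "multilingual_translation": 1.4,
    "learning_from_mistakes": 0.3,
    "long_task_judgment": 0.4,
    "knowing_what_it_doesnt_know": 0.5,
}


def is_all_round(profile: dict[str, float]) -> bool:
    """The 'how all-round' test: even the weakest skill reaches human level."""
    return min(profile.values()) >= HUMAN_LEVEL


def raise_floor(profile: dict[str, float]) -> dict[str, float]:
    """Lift only the sub-human skills to human level; leave the peaks alone."""
    return {task: max(score, HUMAN_LEVEL) for task, score in profile.items()}


first_agi = raise_floor(today)

if __name__ == "__main__":
    print(is_all_round(today), round(mean(today.values()), 2))
    print(is_all_round(first_agi), max(first_agi.values()))
```

Raising the floor changes the minimum and the mean but not the maximum, so the resulting profile is “passing on the floor, superhuman at the ceiling.”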

My extension: frameworks amplify execution, not direction

After a stretch of building agent infrastructure, an intuition surfaced: AGI may not be that far off — give it a model that reasons correctly plus a good enough agent framework, and that might be it. This intuition is half right, and it’s the underrated half.

In recent years, capability gains have come increasingly from the surrounding framework (the harness), not the base model itself. Multi-step planning, self-correction, tool use, remembering things across conversations: these so-called “AGI behaviors” are fundamentally a matter of “drawing out ability the model already has but runs unreliably,” not “the model genuinely can’t do it.” What a framework does is push up what a fixed model can do reliably: it squeezes 60-point ability into 95-point reliability. And reliability is itself enormous value.

The abstraction barrier: the report’s sharpest wall, the one it never actually tackled

The problem is the premise. “A model that reasons correctly” — those few words smuggle in the hardest part of the whole argument.

This is exactly the question behind the report’s sharpest wall, which it calls the abstraction barrier: feed an AI every human text from antiquity up to Newton’s era; could it come up with general relativity on its own? The report’s verdict is almost certainly not, because the AI would be missing the most basic “concept parts” like calculus and gravity. A framework can orchestrate and draw out existing ability, but it can’t grow a kind of capability the base model never had. If “reasons correctly” quietly includes “can come up with new concepts that never appeared in the human literature,” then the premise has already assumed away the hardest problem, and of course the rest is “just a framework.”

In Chapter 4 I put a finer version of this: train an LLM only on text available up to 1543 (the year Copernicus published), and would it say “the Earth orbits the Sun”? That example actually tests something different from the report’s. Heliocentrism wasn’t a brand-new concept in 1543: Aristarchus had proposed it back in the 3rd century BC; it was simply suppressed into a minority view. So the truth was already in the training data, just unpopular, which downgrades the problem from “inventing a new concept from nothing” to “picking the right one among a few existing rival claims.” Harsher still: in 1543, the data alone wasn’t yet enough to decide who was right (geocentrism with “epicycles” explained the observed planetary positions no worse than heliocentrism; heliocentrism only truly won out with Tycho, Kepler, Galileo, and Newton). So an honest model ought to say “the current evidence can’t decide yet,” which pushes the flaw deeper: from “saying something wrong” to “how confident it is doesn’t line up with what the evidence actually supports,” i.e. the calibration problem.
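One way to make “confidence should match the evidence” concrete is a proper scoring rule. The sketch below uses expected log loss, a standard choice that is minimized when the stated probability equals the probability the evidence supports. Treating the 1543 evidence as an even split (`EVIDENCE_SUPPORTS = 0.5`) is purely my assumption for illustration.

```python
import math


def expected_log_loss(stated: float, supported: float) -> float:
    """Expected log loss of stating probability `stated` for a claim
    when the evidence supports probability `supported`."""
    return -(supported * math.log(stated)
             + (1 - supported) * math.log(1 - stated))


# Assumption for illustration: the 1543 evidence is an even split.
EVIDENCE_SUPPORTS = 0.5

honest = expected_log_loss(0.5, EVIDENCE_SUPPORTS)          # "can't decide yet"
overconfident = expected_log_loss(0.95, EVIDENCE_SUPPORTS)  # confident heliocentrist

if __name__ == "__main__":
    print(honest, overconfident)
```

Under this rule the overconfident answer scores worse than the honest one even though heliocentrism later turned out to be right: the penalty is for claiming more than the evidence of the time could support.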

The most interesting thing about the abstraction barrier is that it’s actually testable, yet the report leaves the test undone. The clean way isn’t to literally haul out 1543’s old books: the volume can’t reach a modern model’s scale, training from scratch on so little makes the model dumb to begin with, the originals are mostly Latin, and the cutoff date is fuzzy. If such a model fails to derive the concept, you can’t tell whether there’s a real wall or simply too little data.

The more workable approach is to build a small artificial world: set your own rules, hide some concept (say a conserved quantity or a symmetry) in the data and never state it, give only raw observations, then see whether the model can force it out on problems that “can only be solved if you’ve grasped that concept.” And you have to split the question into two layers to test it cleanly: (A) can it form a new concept on its own (a limit of reasoning), and (B) even if it can think of one, verifying a new concept needs new physical observation (the limit of “having to go measure the world yourself”).

The strongest claim (“it can never form any new concept”) is the easiest to overturn with a single counterexample; a milder version lets you draw a “how much capability grows with scale” curve; but the frontier version, whether that wall stands at the real frontier of science, basically can’t be settled with today’s systems, which is exactly why the report left it as an open problem instead of running the experiment. That’s the pattern throughout: the parts that could become experiments are nearly all parked as “future work.”
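A minimal sketch of what such a small artificial world might look like, under assumptions that are entirely mine: the world is one-dimensional two-body collisions, and the hidden concept is momentum conservation. Training rows are all perfectly elastic; the probes use partly inelastic collisions, where a memorized elastic formula breaks but conservation still holds. The harness only addresses layer (A), and the two solvers are reference points for where a real model’s score would sit, not models themselves.

```python
import random


def collide(m1, v1, m2, v2, restitution=1.0):
    """World rule, never shown to the model: a 1D two-body collision."""
    p = m1 * v1 + m2 * v2          # the hidden conserved quantity
    total = m1 + m2
    v1_after = (p + m2 * restitution * (v2 - v1)) / total
    v2_after = (p + m1 * restitution * (v1 - v2)) / total
    return v1_after, v2_after


def sample_event(rng, restitution=1.0):
    """One raw observation: masses, velocities before, velocities after."""
    m1, m2 = rng.uniform(1, 5), rng.uniform(1, 5)
    v1, v2 = rng.uniform(-3, 3), rng.uniform(-3, 3)
    return (m1, v1, m2, v2, *collide(m1, v1, m2, v2, restitution))


def training_observations(n, seed=0):
    """Elastic collisions only, as raw numbers; nothing names the concept."""
    rng = random.Random(seed)
    return [sample_event(rng) for _ in range(n)]


def probe_accuracy(solver, n=200, seed=1, tol=1e-6):
    """Probes use inelastic collisions: given v1_after, predict v2_after."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        e = rng.uniform(0.2, 0.8)
        m1, v1, m2, v2, v1_after, v2_after = sample_event(rng, restitution=e)
        hits += abs(solver(m1, v1, m2, v2, v1_after) - v2_after) < tol
    return hits / n


def with_concept(m1, v1, m2, v2, v1_after):
    """Upper reference: a solver that has grasped momentum conservation."""
    return (m1 * v1 + m2 * v2 - m1 * v1_after) / m2


def memorized_elastic_rule(m1, v1, m2, v2, v1_after):
    """Lower reference: fits every training row, ignores the new clue."""
    return collide(m1, v1, m2, v2)[1]


if __name__ == "__main__":
    print(probe_accuracy(with_concept), probe_accuracy(memorized_elastic_rule))
```

A model trained only on `training_observations` would be dropped in as the `solver`. The hard design question, which this sketch does not settle, is making sure no shortcut other than the hidden concept can solve the probes.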

Does “LLM + Agent = AGI” hold? It depends which ruler you use

To judge the proposition “LLM + Agent = AGI,” you first have to fix the ruler. The good news: every serious AGI definition asks “what can it do,” not “what must it be made of,” and none of them says “it has to be an LLM.” So the proposition isn’t ruled out at the definitional level. More to the point: when the report describes “the current approach,” its very definition bundles in pre-training + post-training + test-time scaling + the surrounding framework (scaffolding) + tool use. In other words, “LLM + Agent” isn’t some alternative architecture; it is the very thing the report means when it asks “is the current approach enough to reach AGI.”

But the answer depends on which ruler you pick:

The practical / economic ruler (Morris’s Levels of AGI “reaching the median of skilled workers,” or “able to do most of a remote worker’s cognitive work”): the proposition holds. The missing pieces (reliability, cross-session memory, the ability to act on its own) are mostly things the framework layer, or an evolution of the current approach, can fill in; no need to tear everything down.
The theoretical / ARC ruler (the “average across all tasks,” or “learning a brand-new task from just a few examples”): the proposition doesn’t hold. Because what these two rulers reward is precisely “forming new concepts on its own” and “generalizing from very few examples,” exactly the hole the abstraction barrier says a framework can’t fill.

So the real call isn’t “is my proposition right” but “which ruler do I use.” And that comes back to fibon’s purpose: I’m building a personal assistant and practical automation, and what I care about is “can it reliably do most of a human’s cognitive work,” not “can it reinvent general relativity.” By the ruler the product actually cares about, this proposition isn’t just plausible; it holds up.

What this means for fibon

This is directly tied to fibon’s design. fibon sells “cross-session memory” as a core feature, but what it actually does is simulate continual learning with retrieval: state cards, event cards, and five-channel retrieval pull the past back into the current conversation rather than truly updating the model’s weights. Whether “pulling it back via retrieval” is the same as “actually having learned it” is an open question, and it happens to be the very thing I deal with every day.
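To make “simulating continual learning with retrieval” concrete, here is a deliberately tiny sketch. It is not fibon’s implementation: `Card`, `retrieve`, and `build_prompt` are hypothetical names, and it uses a single word-overlap channel where the real system has five. The point is only that nothing about the model changes; past text is selected and pasted in front of the new message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    kind: str   # "state" or "event", echoing the card types named above
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a stand-in for real text matching."""
    return set(text.lower().split())


def retrieve(cards: list[Card], query: str, k: int = 2) -> list[Card]:
    """Single-channel retrieval: rank cards by word overlap with the query."""
    q = tokens(query)
    scored = [(len(q & tokens(c.text)), c) for c in cards]
    matches = [s for s in scored if s[0] > 0]
    ranked = sorted(matches, key=lambda s: s[0], reverse=True)
    return [card for _, card in ranked[:k]]


def build_prompt(cards: list[Card], user_message: str) -> str:
    """'Memory' is just retrieved text pasted ahead of the new message."""
    recalled = retrieve(cards, user_message)
    memory = "\n".join(f"[{c.kind}] {c.text}" for c in recalled)
    return f"Relevant memory:\n{memory}\n\nUser: {user_message}"


memory_store = [
    Card("state", "user prefers replies in short bullet points"),
    Card("event", "deploy script failed last week because of a missing env file"),
    Card("event", "user booked a dentist appointment"),
]

prompt = build_prompt(memory_store, "why did the deploy script fail")

if __name__ == "__main__":
    print(prompt)
```

Whatever the model “remembers” here lives in `memory_store`, outside the weights, which is exactly where the retrieval-versus-learning question comes from.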

One layer deeper: the report lists “AI recursively improving itself (RSI)” as the most powerful of the four paths, while fencing it in with a pile of barriers. That’s the same question Deep Dive F (“Self-evolution and the RSI position”) works through: capability can grow by sheer quantity, but can direction — what to reason about, what counts as correct, what’s worth doing — also emerge from quantity alone?

I have to flag honestly what I haven’t worked out. fibon’s bet is that “the human anchor is structurally indispensable”: the evaluator is human taste written into code, and the Approval Gate (human sign-off before important actions) isn’t a crutch but a deliberately retained external signal. But I can’t prove that bet is right. The interesting thing is that the earlier idea — “hide a concept and see whether the system can force it out on its own” — is the same thing as my setup of “the evaluator = a signal deliberately hidden so the system can’t optimize it away”: a hidden concept is just a held-back evaluation target the system isn’t allowed to peek at. Maybe direction can emerge from some kind of self-play, in which case my “human-in-the-loop” design is just redundant insurance. My current read: a framework can amplify “execution” without limit, but “direction” needs an anchor poured in from outside. That’s a conviction, not a proven conclusion.
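The Approval Gate idea can be sketched in a few lines. This is a hypothetical shape, not fibon’s code: `run_with_approval_gate` and its parameters are names invented for illustration. What it shows is that the `approve` signal is passed in from outside the system, so the system cannot produce or optimize it on its own.

```python
from typing import Callable


def run_with_approval_gate(
    action: Callable[[], str],
    description: str,
    important: bool,
    approve: Callable[[str], bool],
) -> str:
    """Run `action`, but ask the external approver first if it is important."""
    if important and not approve(description):
        return f"blocked: {description}"
    return action()


if __name__ == "__main__":
    # In real use `approve` would prompt a person; here it is a stub that refuses.
    result = run_with_approval_gate(
        action=lambda: "email sent",
        description="send email to the whole team",
        important=True,
        approve=lambda description: False,
    )
    print(result)
```

Unimportant actions skip the gate entirely, so the human signal is spent only where direction matters, not on routine execution.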

My tentative conclusion: “a model that reasons correctly + a good framework” really does point at the right road to AGI-in-practice, and the framework is underrated. But the same sentence also outlines where that road hits a wall: the abstraction barrier blocks “entirely new kinds of capability,” and without an external anchor, a system ends up with execution but no direction.

We may not be short on models. We’re short on anchors.