Apple’s The Illusion of Thinking went viral this weekend, not because it revealed some unknown flaw, but because it rigorously formalized something many of us suspected.
Today’s reasoning models don’t truly reason. At least, not in the compositional, generalizable way humans do.
What the paper shows
The researchers tested reasoning models on structured puzzle tasks (Tower of Hanoi, River Crossing, Blocks World, etc.) that humans typically solve by learning general strategies rather than memorizing examples.
They found:
- Models perform well on simple tasks.
- With explicit fine-tuning (RL or supervised learning), performance improves in moderately complex scenarios.
- But accuracy sharply collapses as complexity increases.
- Even when fine-tuned on examples generated by correct algorithms, models fail to generalize to harder, unseen problems. They don’t internalize the underlying principles or structures.
In other words, reasoning models stitch together patterns. They don’t build causal or compositional models of how the world works.
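To make the complexity axis concrete, here is a minimal sketch (my own illustration, not the paper's evaluation harness) of why these puzzles get hard so quickly: the optimal Tower of Hanoi solution for n disks takes 2^n − 1 moves, so every added disk roughly doubles the length of the plan a model has to produce without a single error.

```python
def hanoi_moves(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list:
    """Return the optimal move sequence (2**n - 1 moves) for n disks."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # move n-1 disks out of the way
        + [(source, target)]                         # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # stack the n-1 disks back on top
    )

# The ground-truth plan doubles (plus one) with every added disk --
# this is the complexity axis along which the paper reports accuracy collapse.
for n in range(3, 11):
    print(n, len(hanoi_moves(n)))  # 3 -> 7, 4 -> 15, ..., 10 -> 1023
```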
Why this still matters
There’s a crucial insight here. Beyond compute, data, and parameters, progress increasingly depends on human ingenuity: the reasoning and insights encoded into training data. By “human ingenuity,” I mean humanity’s continuous expansion and development of technique in the sense described by Jacques Ellul: the systematic methods and insights we use to solve problems and master our environment.
Transformer-based models may asymptote toward the limit of human ingenuity as captured in their data, but they’ll struggle to surpass it without fundamentally new architectures or learning paradigms.
Every time humanity masters a new technique or skill, we can quickly encode it into data and fine-tune transformer models to replicate and scale it. This means humanity’s core focus can shift toward discovery, creativity, and progress. As we master something new, we rapidly automate it.
Companies like Scale AI, Mercor, and others already help encode human ingenuity into datasets that the large labs use to train their reasoning models.
But Apple’s paper signals that this alone won’t create true reasoning systems.
An alternative path: interactive, experiential learning
There’s another exciting and still under-explored direction: interactive, embodied learning.
Humans develop reasoning by interacting with the world:
- Testing hypotheses
- Observing outcomes
- Updating causal models continuously
Today’s reasoning models don’t do this. They are:
- Pre-trained on static data
- Refined with human feedback and reinforcement learning
- Largely frozen after deployment
They don’t learn from ongoing interaction in a consistent, causal world.
This is why approaches that incorporate world models, interactive agents, and simulated environments are becoming increasingly interesting.
Instead of passively scaling static datasets, we can scale the complexity and realism of environments where AI agents continuously learn, test hypotheses, and build richer internal representations.
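As a rough illustration of that loop, here is a purely hypothetical sketch (the ToyWorld and ExperientialAgent classes are made up for this post, not any lab's actual system): the agent acts, observes the outcome, and revises its internal estimates continuously, so what it knows comes from interaction with a consistent environment rather than from a frozen dataset.

```python
import random

class ToyWorld:
    """Stand-in for a simulated environment with consistent hidden dynamics."""
    def __init__(self) -> None:
        self.state = 0

    def step(self, action: int) -> tuple:
        # Hidden causal rule the agent must discover: the rewarding action
        # depends on the parity of the current state.
        reward = 1.0 if action == self.state % 2 else 0.0
        self.state = random.randint(0, 9)  # move on to a new situation
        return self.state, reward

class ExperientialAgent:
    """Learns from its own interaction rather than from a static corpus."""
    def __init__(self) -> None:
        # Estimated value of each (state, action) pair, filled in by experience.
        self.q = {(s, a): 0.0 for s in range(10) for a in (0, 1)}

    def act(self, state: int) -> int:
        if random.random() < 0.1:                              # explore: test a hypothesis
            return random.choice((0, 1))
        return max((0, 1), key=lambda a: self.q[(state, a)])   # exploit current belief

    def update(self, state: int, action: int, reward: float) -> None:
        # Observe the outcome and nudge the belief toward it.
        self.q[(state, action)] += 0.1 * (reward - self.q[(state, action)])

world, agent = ToyWorld(), ExperientialAgent()
state = world.state
for _ in range(5000):          # the loop never has to stop at "deployment"
    action = agent.act(state)
    next_state, reward = world.step(action)
    agent.update(state, action, reward)
    state = next_state

# After enough interaction the agent should have recovered the hidden parity rule.
print([max((0, 1), key=lambda a: agent.q[(s, a)]) for s in range(10)])
```

The point of the toy is the shape of the loop, not the algorithm: the environment, not a human-curated dataset, is what supplies the learning signal, and scaling means making that environment richer.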
The Bitter Lesson and why experiential learning aligns with it
Some might argue this conflicts with The Bitter Lesson: the idea that major AI breakthroughs have come not from explicitly encoded knowledge, but from scaling computation and data.
But I think experiential learning actually aligns with this lesson perfectly.
Instead of handcrafting solutions, we’re scaling richer, interactive environments. Agents learn autonomously, acquiring causal knowledge through massive experience. It’s another axis of scale: interaction rather than imitation.
Something often missed in discussions of Sutton’s essay is that both scaling data/compute and explicitly encoding knowledge ultimately capture human insights; neither approach guarantees abilities that genuinely emerge beyond human performance. For models to truly exceed human reasoning, we need approaches analogous to AlphaGo’s. AlphaGo outperformed humans not because it imitated human players alone, but because it continuously interacted with a simulated environment, discovering deeper causal patterns through extensive self-play and experiential learning.
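As a schematic of that training pattern (this is not AlphaGo’s actual algorithm, just a toy with the same shape), here is a tabular self-play learner for a tiny take-1-or-2 Nim game: the only data it ever sees is generated by playing against itself inside the simulator, so its ceiling is set by the game’s rules rather than by human examples.

```python
import random

def self_play_nim(pile_size: int = 10, episodes: int = 20000) -> dict:
    """Toy self-play loop: one policy plays both sides and learns purely
    from game outcomes generated inside the simulator."""
    # q[(pile, take)] = estimated chance that taking `take` stones from `pile` wins.
    q = {(p, t): 0.5 for p in range(1, pile_size + 1) for t in (1, 2)}

    def choose(pile: int) -> int:
        moves = [t for t in (1, 2) if t <= pile]
        if random.random() < 0.1:                      # keep exploring
            return random.choice(moves)
        return max(moves, key=lambda t: q[(pile, t)])  # play the current best guess

    for _ in range(episodes):
        pile, history, player = pile_size, [], 0
        while pile > 0:
            move = choose(pile)
            history.append((player, pile, move))
            pile -= move
            player ^= 1
        winner = history[-1][0]                        # whoever takes the last stone wins
        for mover, p, move in history:
            outcome = 1.0 if mover == winner else 0.0
            q[(p, move)] += 0.05 * (outcome - q[(p, move)])
    return q

q = self_play_nim()
# The learned policy should prefer moves that leave a multiple of 3,
# e.g. taking 1 from a pile of 10 (leaving 9) over taking 2 (leaving 8).
print(q[(10, 1)], q[(10, 2)])
```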
Conclusion
Apple’s paper provides timely clarity on the path towards AGI:
- Matching human ingenuity, even if it’s an asymptote, is a huge leap forward.
- Every new human insight can be rapidly automated at scale.
- Humans can focus more exclusively on discovery and innovation.
Truly breaking past this limit and building genuinely reasoning systems will likely require new model architectures and, most importantly, better learning paradigms: a shift from passive imitation to active, experiential, world-driven learning. Apple’s paper is a reminder that transformer-based foundation models, as currently trained, probably won’t get us to AGI on their own; moving beyond them means giving models a way to build causal, compositional reasoning through rich interaction with the world.
In the meantime, matching and scaling human ingenuity might be the most practical and valuable frontier we can pursue. Let’s make every new human discovery instantly more powerful.