The Missing Flywheel in Agent Building
Why language models keep getting better while AI agents remain stubbornly brittle, and how to fix it with structural separation of concerns.
There's a reason language models keep getting better while AI agents remain stubbornly brittle. It's not about capability. It's about feedback loops.
Let me walk you through this.
Premise 1: Flywheels drive compounding improvement
A flywheel is a self-reinforcing cycle. Usage generates data. Data improves the system. A better system attracts more usage. Repeat.
This is how modern AI models improve. When you chat with a language model, your interactions (the thumbs up, the thumbs down, the regenerated responses) become training signals. These signals feed into reward models. The reward models update the weights. The weights change permanently. The model gets better.
This isn't theoretical. RLHF doubled GPT-4's accuracy on adversarial questions. A 1.3B parameter model trained with human feedback outperformed a 175B parameter model without it. The flywheel works.
Premise 2: Agents don't have this flywheel
An AI agent is more than a model. It's an assembled system (prompts, chains, tool integrations, memory, orchestration logic) that performs tasks autonomously.
When an agent fails, what happens? The failure gets logged. An engineer reads the trace. They hypothesize what went wrong. They tweak a prompt. They run some tests. They hope it helped.
Notice what's missing: the loop never closes automatically. There's no mechanism that takes an observed failure and converts it into a systematic improvement that compounds over time. The model's weights don't update. The system doesn't learn from its mistakes.
Observability tools have matured rapidly. Most production agents now have detailed tracing. We can see exactly what happened, step by step, when things go wrong.
But observability answers "what happened?" It doesn't answer "how do we fix it systematically?" That part still requires a human in the loop, every single time.
Premise 3: Prompt engineering doesn't converge
The intuitive response to agent failures is to improve the prompt. At first, this works well. The first few hours of prompt refinement can yield dramatic gains. Moving a system from 40% to 70% accuracy feels like magic.
Then you hit the wall.
The next twenty hours of work might add 5%. The forty hours after that? Maybe 1%. The returns don't just diminish. They collapse.
Worse, you encounter the whack-a-mole problem. You fix one failure mode, and a new one appears somewhere else. The prompt that handles edge case A now fumbles edge case B, which worked fine before. You're not climbing a hill. You're playing an endless game where fixing one thing breaks another.
This isn't a failure of engineering skill. It's a structural limitation. Natural language prompts are a fundamentally different interface than gradient descent on model weights. Changes don't compound. Improvements don't accumulate. Every fix is local and temporary.
Premise 4: Multi-step agents have compounding error rates
Here's the math that makes this devastating.
If each step in an agent workflow has 95% reliability (which is optimistic), then over 10 steps your end-to-end success rate drops to about 60%. Over 20 steps, it's 36%.
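That 60% and 36% are just per-step reliability raised to the number of steps. Here's a quick sanity check, a minimal sketch in plain Python using the same 95% figure:

# End-to-end success over n independent steps at per-step reliability p is p ** n.
p = 0.95
for n in (5, 10, 20, 50):
    print(f"{n:>2} steps at {p:.0%} per step -> {p ** n:.0%} end-to-end")

# Output:
#  5 steps at 95% per step -> 77% end-to-end
# 10 steps at 95% per step -> 60% end-to-end
# 20 steps at 95% per step -> 36% end-to-end
# 50 steps at 95% per step -> 8% end-to-end

Even at 99% per step, a 20-step workflow completes only about 82% of the time.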
Agents don't fail because of one catastrophic error. They fail because small errors cascade. A subtle misinterpretation in step 2 doesn't cause a crash until step 15. By then, the failure looks like it happened at step 15, so that's what gets fixed. The actual root cause remains hidden.
This is why teams report getting stuck at 70-80% reliability. The last 20% isn't twice as hard as the first 80%. It's exponentially harder, because you're fighting compounding probabilities across an entire workflow.
Premise 5: Benchmarks don't reflect production reality
You might think: surely we can measure progress objectively? Run the agent against a benchmark, track the score, optimize toward better numbers.
The benchmarks are broken.
Analysis of popular agent benchmarks found that a third of test cases have solutions directly provided in the problem description. Models can copy rather than solve. Another third have test coverage so weak that incorrect solutions pass anyway. When stricter versions of these benchmarks were introduced, top-performing agents dropped from 70%+ to around 23%.
Benchmarks evaluate single-round interactions with complete information. Production agents face adversarial users, ambiguous instructions, evolving requirements, and failure modes that benchmark designers never imagined. A benchmark score tells you almost nothing about whether an agent will work reliably at scale.
This is why most production teams still rely primarily on human evaluation. The automated metrics don't capture what matters.
The structural asymmetry
Let me make this concrete.
How model improvement works:
flowchart TB
A[User Feedback] --> B[Reward Model]
B --> C[Weight Updates]
C --> D[Better Model]
D --> E[More Usage]
E --> A
style A fill:#10b981,stroke:#059669,color:#fff
style B fill:#10b981,stroke:#059669,color:#fff
style C fill:#10b981,stroke:#059669,color:#fff
style D fill:#10b981,stroke:#059669,color:#fff
style E fill:#10b981,stroke:#059669,color:#fff

The loop is closed. Improvements compound through training cycles. Each iteration builds on the last.
How agent improvement works:
flowchart TB
A[Observe Failure] --> B[Manually Analyze]
B --> C[Hypothesize Fix]
C --> D[Tweak Prompt]
D --> E[Test]
E --> F[Hope It Worked]
F -.-> X{{"❌ No Learning"}}
X -.-> G[New Regression]
G -.-> A
style A fill:#ef4444,stroke:#dc2626,color:#fff
style B fill:#ef4444,stroke:#dc2626,color:#fff
style C fill:#ef4444,stroke:#dc2626,color:#fff
style D fill:#ef4444,stroke:#dc2626,color:#fff
style E fill:#ef4444,stroke:#dc2626,color:#fff
style F fill:#ef4444,stroke:#dc2626,color:#fff
style X fill:#1f2937,stroke:#ef4444,color:#fff,stroke-width:3px
style G fill:#ef4444,stroke:#dc2626,color:#fff

The loop is open. Every cycle requires human interpretation. Changes don't compound. They just shift the problem around.
This asymmetry explains a puzzling statistic: 95% of enterprise AI projects stall before reaching production. It's not that the models aren't capable enough. It's that there's no flywheel to carry agents from "works in demo" to "works reliably at scale."
Closing the loop
The flywheel breaks because we can't isolate what went wrong.
When a prompt-based agent fails, the failure is somewhere in a wall of text. Was it the instructions? The examples? The phrasing? The model's interpretation? You can't know. So you tweak the whole thing and hope.
This is why fixes don't compound. You're not building on a foundation. You're rearranging sand.
The fix is structural: separate what the LLM is good at from what code is good at.
LLMs excel at understanding intent, handling ambiguity, and working with natural language. They're unreliable at execution. Code is the opposite: rigid with language, but deterministic and repeatable once written.
The insight: don't ask the LLM to do things. Ask it to decide what should be done. Then execute that decision with deterministic code.
This means defining your agent's capabilities as discrete, testable operations. Call them primitives. The LLM's job is to figure out which primitives to call and in what order. The primitives' job is to execute reliably.
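Here's a minimal sketch of that division of labor, assuming a hypothetical order-support agent. The primitive names are illustrative, and a canned plan stands in for the actual model call; the point is only the structure: the LLM produces a plan, plain code executes it.

from dataclasses import dataclass
from typing import Any, Callable

# Each primitive is a small, deterministic, independently testable operation.
@dataclass
class Primitive:
    name: str
    description: str
    run: Callable[..., Any]

PRIMITIVES: dict[str, Primitive] = {}

def primitive(name: str, description: str):
    """Register a function as a primitive the planner is allowed to use."""
    def register(fn):
        PRIMITIVES[name] = Primitive(name, description, fn)
        return fn
    return register

@primitive("lookup_order", "Fetch an order record by id")
def lookup_order(order_id: str) -> dict:
    return {"id": order_id, "status": "shipped"}  # stand-in for a real database or API call

@primitive("draft_reply", "Draft a status update for the customer")
def draft_reply(status: str) -> str:
    return f"Your order is currently {status}."

def plan(task: str) -> list[dict]:
    """The LLM's only job: decide which primitives to call, in what order, with what arguments.
    A canned plan stands in for the model call here; in practice you'd send the task plus the
    primitive catalog to the model and parse a JSON plan back."""
    return [
        {"primitive": "lookup_order", "args": {"order_id": "A-123"}},
        {"primitive": "draft_reply", "args": {"status": "shipped"}},
    ]

def execute(steps: list[dict]) -> list[Any]:
    # Execution is plain code: no model in the loop, so every step is deterministic and loggable.
    return [PRIMITIVES[s["primitive"]].run(**s["args"]) for s in steps]

print(execute(plan("Where is order A-123?")))

Each primitive can be unit-tested in isolation, and a bad outcome is now attributable: either the plan was wrong or a primitive was missing.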
Now the loop can close:
- You benchmark the agent. Some tasks fail.
- You look at the failures. They're attributable. Either the LLM chose the wrong primitives, or a needed primitive is missing, or the LLM didn't understand how to decompose the problem.
- If a primitive is missing, you add it. If the LLM chose wrong, you add an example showing the correct pattern.
- You re-test. The improvement is verified. Previous capabilities don't regress because they're built on tested code.
- Repeat.
This is the flywheel: observe gap → attribute cause → add component → verify improvement → compound.
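Concretely, that loop can be as unglamorous as a growing regression suite over the planner. Continuing the hypothetical sketch above (plan is the planner from that sketch):

# Every diagnosed failure becomes a permanent test case: task in, expected primitive sequence out.
REGRESSION_CASES = [
    ("Where is order A-123?", ["lookup_order", "draft_reply"]),
    # When a new failure is attributed, append a case here instead of rewording the whole prompt.
]

def run_regressions() -> int:
    failures = 0
    for task, expected in REGRESSION_CASES:
        chosen = [step["primitive"] for step in plan(task)]
        if chosen != expected:
            failures += 1
            print(f"FAIL: {task!r} expected {expected}, got {chosen}")
    print(f"{len(REGRESSION_CASES) - failures}/{len(REGRESSION_CASES)} cases passing")
    return failures

run_regressions()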
Each fix is additive. Each primitive you add becomes part of the foundation. Each example you provide teaches a pattern the agent can generalize. The system gets better in ways you can measure and trust.
The wall of text becomes a structured program. Vibes-based development becomes engineering.
What changes
With this structure, the properties we need for a flywheel emerge naturally:
Attributable failures. When something breaks, you can trace it to a specific component. The LLM misunderstood the intent. Or it planned the wrong sequence. Or a primitive is missing. Each diagnosis points to a specific fix.
Additive improvements. Adding a new primitive doesn't risk breaking existing ones. Adding a new decomposition example doesn't corrupt previous patterns. You're building up, not reshuffling.
Reliable benchmarks. Because execution is deterministic, you can run the same test suite repeatedly and get consistent results. You can measure whether a change actually helped.
Compounding progress. Each capability you add becomes available for the LLM to compose with everything else. Ten primitives can combine in hundreds of ways. Twenty primitives, in thousands. The system's power grows faster than the effort you put in.
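The rough combinatorics behind that claim (counting ordered pairs and triples of distinct primitives, just to show the shape of the growth):

from math import perm  # available since Python 3.8

for n in (10, 20):
    print(f"{n} primitives: {perm(n, 2)} ordered pairs, {perm(n, 3)} ordered triples")

# 10 primitives: 90 ordered pairs, 720 ordered triples
# 20 primitives: 380 ordered pairs, 6840 ordered triples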
This is what model training has that agent development lacks: a way to turn observed failures into verified improvements that compound over time.
The bottom line
The agent flywheel problem isn't about better tools or smarter prompts. It's a structural gap between how models improve and how agents don't.
Models have closed loops that compound. Agents have open loops that require constant human intervention.
The answer isn't to abandon probabilistic intelligence. It's to compose it with deterministic structure correctly. Let the LLM do what it's good at: understanding intent, handling ambiguity, reasoning about novel situations. Let code do what it's good at: executing reliably, maintaining state, producing consistent results.
When you separate these concerns, failures become attributable, fixes become additive, and improvements compound. The loop closes. The flywheel spins.
That's how we get from "works in demo" to "works at scale."