
Closing the Flywheel in Practice

A hands-on walkthrough showing how primitives and decompositions create the compounding improvement loop that prompt engineering lacks.

OpenSymbolicAI Team · January 28, 2026 · 6 min read

Tags: tutorial, agents, primitives, decompositions, flywheel

In The Missing Flywheel in Agent Building, we explained why prompt-based agents don't improve the way language models do. The loop never closes. Fixes don't compound.

This post shows the flywheel in action. We'll take a unit converter agent, watch it fail, diagnose exactly what's missing, add the fix, and verify the improvement. You'll see firsthand how structural separation creates attributable failures and additive improvements.

The setup

We have a unit converter agent built with OpenSymbolicAI. It knows how to convert between common units: gallons, quarts, pints, cups, milliliters, liters. Each conversion is a primitive: a discrete, testable function that the LLM can invoke.

```python
@primitive(read_only=True)
def gallons_to_quarts(self, gallons: float) -> float:
    """Convert gallons to quarts."""
    return gallons * 4

@primitive(read_only=True)
def quarts_to_pints(self, quarts: float) -> float:
    """Convert quarts to pints."""
    return quarts * 2

# ... more primitives for the full conversion chain
```

When you ask the agent to convert 20 gallons to liters, it chains these primitives together automatically:

```text
gallons → quarts → pints → cups → ml → liters
```

The LLM figures out the path. The primitives execute reliably. This works.
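To see what that chain does numerically, here is a back-of-the-envelope sketch in plain Python. These are standalone functions standing in for the agent's primitives, not the OpenSymbolicAI API; the factors assume US liquid measure (1 cup = 236.588 ml).

```python
# Plain-function stand-ins for the primitives (US liquid measure assumed).
def gallons_to_quarts(g): return g * 4
def quarts_to_pints(q): return q * 2
def pints_to_cups(p): return p * 2
def cups_to_ml(c): return c * 236.588   # 1 US cup ≈ 236.588 ml
def ml_to_liters(m): return m / 1000

def gallons_to_liters(gallons):
    # The agent discovers this chain on its own; here we spell it out by hand.
    cups = pints_to_cups(quarts_to_pints(gallons_to_quarts(gallons)))
    return ml_to_liters(cups_to_ml(cups))

print(round(gallons_to_liters(20), 2))  # 75.71
```

Each hop is trivially testable in isolation, which is exactly what makes the later failures easy to attribute.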

Iteration 1: The first failure

Now we push the agent somewhere it hasn't been.

```python
response = agent.run("Convert 1 hogshead to liters")
```

It fails. The agent doesn't know what a hogshead is.

Here's where the flywheel matters. In a prompt-based agent, this failure would be buried in a wall of text. You'd wonder: Did the model misunderstand the question? Did it hallucinate a wrong conversion? Is the prompt confusing?

With primitives, the diagnosis is instant: the agent lacks hogshead primitives. The failure is attributable to a specific missing component.

The fix

We add the missing primitives:

```python
@primitive(read_only=True)
def gallons_to_hogsheads(self, gallons: float) -> float:
    """Convert gallons to hogsheads."""
    return gallons / 63

@primitive(read_only=True)
def hogsheads_to_gallons(self, hogsheads: float) -> float:
    """Convert hogsheads to gallons."""
    return hogsheads * 63
```

Run it again. The agent now chains:

```text
hogshead → gallons → quarts → pints → cups → ml → liters
```

The fix is verified. Previous conversions still work because we added to the system, not modified a fragile prompt. The improvement is additive.
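The verification can be sketched the same way as before: plain Python mirroring the chain, not the library API. The only new fact is the hogshead factor (1 hogshead = 63 US gallons); the rest reuses the existing chain.

```python
# Verifying the new chain with plain-function stand-ins (US liquid measure).
def hogsheads_to_gallons(h): return h * 63
def gallons_to_quarts(g): return g * 4
def quarts_to_pints(q): return q * 2
def pints_to_cups(p): return p * 2
def cups_to_ml(c): return c * 236.588
def ml_to_liters(m): return m / 1000

def hogsheads_to_liters(h):
    # The same chain the agent finds: hogshead -> gallons -> ... -> liters
    gallons = hogsheads_to_gallons(h)
    cups = pints_to_cups(quarts_to_pints(gallons_to_quarts(gallons)))
    return ml_to_liters(cups_to_ml(cups))

print(round(hogsheads_to_liters(1), 1))  # 238.5
```

Note that nothing in the existing chain changed; the new unit simply plugs into it.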

Flywheel turn 1 complete: Observe gap → attribute cause → add component → verify improvement.

Iteration 2: A different kind of failure

Let's try something more complex:

```python
response = agent.run(
    "Convert 3 cups of milk to liters and separately convert 2 beer pints to teaspoons"
)
```

The agent struggles. It might complete one conversion and forget the other. Or conflate them. Or produce malformed output.

Again, diagnosis is clear. The agent has all the primitives it needs. The LLM understands each conversion individually. What's missing is an understanding of how to decompose a multi-part request into parallel operations.

This isn't a prompt problem. It's a pattern problem. The agent hasn't seen an example of handling multiple conversions in one query.

The fix

We add a decomposition, an example that teaches the pattern:

```python
@decomposition(
    intent="Convert 4 tablespoons of honey to milliliters and 2 quarts of juice to cups",
    expanded_intent="Convert 4 tablespoons to cups then to milliliters for honey; "
                    "convert 2 quarts to pints then to cups for juice. "
                    "Return both results in a dictionary with labels and units.",
)
def _dual_conversion(self) -> dict:
    # Convert 4 tablespoons of honey to milliliters
    cups_from_tbsp = self.tbsp_to_cups(4)
    honey_ml = self.cups_to_ml(cups_from_tbsp)

    # Convert 2 quarts of juice to cups
    pints_from_quarts = self.quarts_to_pints(2)
    juice_cups = self.pints_to_cups(pints_from_quarts)

    return {
        "honey": {"value": honey_ml, "unit": "milliliters"},
        "juice": {"value": juice_cups, "unit": "cups"},
    }
```

This decomposition teaches the agent a pattern: when you see multiple conversions in one request, handle them separately and return structured results.

Run the original query again. Now the agent understands how to structure the response. It applies the pattern from the example to the new situation.
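For the milk-and-beer query specifically, the pattern the agent applies works out to something like this sketch, written as plain Python rather than the library API (factors assume US measure: 1 cup = 236.588 ml, 1 cup = 48 teaspoons):

```python
# Plain-Python sketch of the decomposed query (US measure assumed).
def cups_to_ml(c): return c * 236.588
def ml_to_liters(m): return m / 1000
def pints_to_cups(p): return p * 2
def cups_to_tsp(c): return c * 48  # 16 tbsp/cup * 3 tsp/tbsp

def dual_conversion():
    # Part 1: 3 cups of milk -> liters
    milk_liters = ml_to_liters(cups_to_ml(3))
    # Part 2: 2 pints of beer -> teaspoons
    beer_tsp = cups_to_tsp(pints_to_cups(2))
    # Structured result, mirroring the decomposition's output shape.
    return {
        "milk": {"value": round(milk_liters, 3), "unit": "liters"},
        "beer": {"value": beer_tsp, "unit": "teaspoons"},
    }

print(dual_conversion())
```

The key point is the shape of the answer, handled parts kept separate and labeled, which the decomposition example taught.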

Flywheel turn 2 complete: Observe gap → attribute cause → add component → verify improvement.

What we've built

After two iterations, the agent is meaningfully better:

| Iteration | Failure | Diagnosis | Fix | Result |
| --- | --- | --- | --- | --- |
| 1 | Can't convert hogsheads | Missing primitives | Added `hogsheads_to_gallons`, `gallons_to_hogsheads` | New unit supported |
| 2 | Can't handle multiple conversions | Missing pattern | Added decomposition example | Multi-conversion queries work |

Notice what didn't happen:

  • We didn't tweak a prompt and hope
  • We didn't cause regressions in working functionality
  • We didn't spend hours debugging ambiguous failures

Each fix was surgical. Each improvement was verified. Each addition became part of the foundation for future capabilities.

The compounding effect

Here's where it gets powerful.

Those two hogshead primitives we added? They now compose with everything else. The agent can convert hogsheads to milliliters, to cups, to quarts. Any path through the conversion graph works. Two primitives unlocked dozens of new capabilities.
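One way to picture "any path through the conversion graph works": given only pairwise primitives, a breadth-first search finds a route between any two connected units. This is an illustrative sketch of the idea, not the library's internals; the graph edges and factors are assumptions (US liquid measure).

```python
from collections import deque

# Directed conversion graph: unit -> {neighbor: multiplication factor}.
# Illustrative edges only; US liquid measure assumed.
FACTORS = {
    "hogshead": {"gallon": 63},
    "gallon": {"hogshead": 1 / 63, "quart": 4},
    "quart": {"gallon": 1 / 4, "pint": 2},
    "pint": {"quart": 1 / 2, "cup": 2},
    "cup": {"pint": 1 / 2, "ml": 236.588},
    "ml": {"cup": 1 / 236.588, "liter": 1 / 1000},
    "liter": {"ml": 1000},
}

def convert(value, src, dst):
    """BFS over the unit graph, multiplying factors along the first path found."""
    queue = deque([(src, value)])
    seen = {src}
    while queue:
        unit, amount = queue.popleft()
        if unit == dst:
            return amount
        for nxt, factor in FACTORS[unit].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, amount * factor))
    raise ValueError(f"no path from {src} to {dst}")

print(round(convert(1, "hogshead", "liter"), 1))  # 238.5
```

Adding the two hogshead edges made every unit in the graph reachable from hogsheads, which is why two primitives unlock so many conversions at once.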

That decomposition pattern we added? The agent now generalizes it. Ask for three conversions, or four. Ask for conversions with different output formats. The pattern transfers.

This is the flywheel spinning. Each component you add multiplies the system's capabilities rather than merely adding to them. Ten primitives combine in hundreds of ways; twenty primitives, in thousands.

With prompt engineering, doubling your prompt length doesn't double capabilities. It often makes things worse. With primitives and decompositions, more components mean more combinations, and the LLM navigates the combinations for you.

The closed loop

```mermaid
flowchart TB
    A[Run benchmark] --> B{Failures?}
    B -->|Yes| C[Diagnose]
    C --> D{What's missing?}
    D -->|Primitive| E[Add primitive]
    D -->|Pattern| F[Add decomposition]
    E --> G[Re-test]
    F --> G
    G --> H[Verify fix]
    H --> I[No regression]
    I --> A
    B -->|No| J[Ship it]

    style A fill:#10b981,stroke:#059669,color:#fff
    style B fill:#10b981,stroke:#059669,color:#fff
    style C fill:#10b981,stroke:#059669,color:#fff
    style D fill:#10b981,stroke:#059669,color:#fff
    style E fill:#10b981,stroke:#059669,color:#fff
    style F fill:#10b981,stroke:#059669,color:#fff
    style G fill:#10b981,stroke:#059669,color:#fff
    style H fill:#10b981,stroke:#059669,color:#fff
    style I fill:#10b981,stroke:#059669,color:#fff
    style J fill:#10b981,stroke:#059669,color:#fff
```

This is engineering, not alchemy. Failures have causes. Causes have fixes. Fixes are verified. Nothing regresses.

The loop is closed.

Try it yourself

The full unit converter example is available in our examples repository. Clone it, run through the steps, and experience the flywheel firsthand.

```bash
git clone https://github.com/OpenSymbolicAI/examples-py.git
cd examples-py/examples/unit_converter
uv run python unit_converter/main.py
```

Break it. Fix it. Watch improvements compound.

That's how agents should work.