Illustration Part 3: Cost & Reliability
Tool-calling costs 2.3x more and fails 20% of the time. Here's the math.
In Part 1, we saw attention loss. In Part 2, we saw token accumulation. Now let's look at what this means for production systems.
## Latency
| Metric | Behaviour | Tool-Calling |
|---|---|---|
| Total time (5 queries) | 8.16s | 17.93s |
| Average per query | 1.63s | 3.59s |
| Fastest query | 0.85s | 1.95s |
| Slowest query | 2.52s | 7.99s |
The tool-calling approach is 2.2x slower on average. The worst case (Query 1 with 11 calls) took 8 seconds for a simple factual question.
Each LLM call has fixed latency overhead:
- Network round-trip
- Queue time
- Model loading (for some providers)
- Token generation
With behaviour programming, you pay this overhead once for planning, then execute locally. With tool-calling, you pay it on every decision.
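To make that concrete, here is a rough back-of-the-envelope latency model; the per-call overhead and generation figures are illustrative assumptions, not measurements from this test.

```python
# Rough model: total latency ≈ LLM calls × (fixed per-call overhead + generation time),
# plus any local execution. The 0.5s and 0.3s figures are illustrative assumptions.
PER_CALL_OVERHEAD_S = 0.5   # network round-trip, queueing, model loading
GENERATION_S = 0.3          # token generation per call

def estimated_latency(num_llm_calls: int, local_execution_s: float = 0.1) -> float:
    return num_llm_calls * (PER_CALL_OVERHEAD_S + GENERATION_S) + local_execution_s

print(estimated_latency(1))    # plan once, then execute locally
print(estimated_latency(11))   # eleven decision calls, as in Query 1
```

The exact numbers don't matter; what matters is that the fixed overhead multiplies with the number of LLM calls, so the approach that makes one call per query scales very differently from the one that makes four to eleven.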
## Cost
At typical API pricing (~$0.01 per 1K input tokens, ~$0.03 per 1K output tokens):
| Approach | Input Tokens | Output Tokens | Estimated Cost |
|---|---|---|---|
| Behaviour | 15,954 | 5,245 | ~$0.32 |
| Tool-Calling | 41,826 | 10,718 | ~$0.74 |
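These estimates follow directly from the token counts; a quick sanity check in Python, using the assumed rates above:

```python
# Assumed rates from above: ~$0.01 per 1K input tokens, ~$0.03 per 1K output tokens.
INPUT_RATE = 0.01 / 1000
OUTPUT_RATE = 0.03 / 1000

def session_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

print(round(session_cost(15_954, 5_245), 2))    # 0.32  (behaviour)
print(round(session_cost(41_826, 10_718), 2))   # 0.74  (tool-calling)
```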
Tool-calling costs 2.3x more for the same 5 queries. Projecting that 5-query session to scale:
| Scale | Behaviour | Tool-Calling | Difference |
|---|---|---|---|
| 1,000 sessions/day | $320 | $740 | $420/day |
| 30 days | $9,600 | $22,200 | $12,600/month |
| Annual (12 × 30 days) | $115,200 | $266,400 | $151,200/year |
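The projections are straight multiplication of the per-session estimates; a minimal sketch, assuming 1,000 five-query sessions per day and a 30-day month:

```python
# Scale the per-session estimates (assumed: 1,000 sessions/day, 30-day month, 12 months).
behaviour, tool = 0.32, 0.74            # estimated $ per five-query session
for per_session in (behaviour, tool):
    daily = 1_000 * per_session
    print(round(daily), round(30 * daily), round(12 * 30 * daily))
# 320 9600 115200   (behaviour)
# 740 22200 266400  (tool-calling)
```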
## Reliability
### Error Rates
In our test run:
- Behaviour: 0 errors in 5 queries
- Tool-calling: 1 error in 5 queries (20% failure rate)
The tool-calling error was an API-level failure triggered by the model outputting JSON when instructed to use a different format. Despite explicit instructions ("NEVER output JSON"), the model reverted to JSON after several turns of context accumulation.
This is the attention loss problem manifesting as failures.
### Consistency
The behaviour approach produces consistent execution patterns:
| Query Type | Execution Pattern | Calls |
|---|---|---|
| Simple questions | retrieve → extract | 2 |
| Complex questions | retrieve → rerank → extract | 10 |
| Comparisons | retrieve(A) → retrieve(B) → compare | 2 |
The tool-calling approach produces unpredictable patterns:
- Same simple question → anywhere from 4 to 11 calls
- Strategies ignored or applied inconsistently
- No guarantee the model follows its own plan
## Why Behaviour Programming Works
The key insight: separate what to do from when to do it.
The LLM's job is pattern matching. Given a query, which decomposition example is most similar? The matched example provides an executable template. Python fills in the parameters and runs the plan.
```python
@decomposition(
    intent="What is machine learning?",
    expanded_intent="Simple factual query: retrieve, combine, extract"
)
def _simple_qa(self) -> str:
    docs = self.retrieve("machine learning definition", k=3)
    context = self.combine_contexts(docs)
    return self.extract_answer(context, "What is machine learning?")
```

When a user asks "What is artificial intelligence?", the LLM recognizes the pattern, generates equivalent code, and execution happens in Python. The LLM is never asked "what should I do next?" mid-execution.
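For illustration, the generated plan for the new query might look like the following sketch, reusing the same helpers as the template above; the exact code the LLM emits will differ.

```python
def _generated_plan(self) -> str:
    # Hypothetical plan matched from the _simple_qa template above
    # (illustrative; not verbatim model output).
    docs = self.retrieve("artificial intelligence definition", k=3)
    context = self.combine_contexts(docs)
    return self.extract_answer(context, "What is artificial intelligence?")
```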
## Compiler vs Orchestrator
Tool-calling asks the LLM to be a runtime orchestrator: making decisions with incomplete information, managing state across turns, staying focused through growing context.
Behaviour programming asks the LLM to be a compiler: understanding intent, matching patterns, generating a complete plan upfront. Execution is deterministic.
The compiler role plays to LLM strengths (pattern matching, language understanding). The orchestrator role fights against LLM weaknesses (attention limits, context management, consistency).
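A minimal sketch of the two control flows; `llm_call`, `run_tool`, `generate_plan`, and `execute_plan` are placeholder callables standing in for real implementations.

```python
from typing import Callable

# Orchestrator loop (tool-calling): the LLM re-reads the growing transcript
# and makes a fresh decision on every step.
def run_orchestrated(query: str, llm_call: Callable, run_tool: Callable, max_steps: int = 10):
    messages = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        decision = llm_call(messages)                          # one round-trip per decision
        if decision["type"] == "final_answer":
            return decision["content"]
        result = run_tool(decision["tool"], decision["args"])
        messages.append({"role": "tool", "content": result})   # context keeps growing
    return None

# Compiler flow (behaviour programming): one planning call, then deterministic
# local execution with no further LLM decisions.
def run_compiled(query: str, generate_plan: Callable, execute_plan: Callable):
    plan = generate_plan(query)    # single LLM call: match a pattern, emit a plan
    return execute_plan(plan)      # plain Python from here on
```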
## Conclusion
This illustration demonstrates three failure modes of tool-calling at scale:
- Attention Loss: Instructions forgotten as context grows
- Token Waste: 2-8x overhead from re-reading everything
- Inconsistency: Same queries produce wildly different patterns
Behaviour programming addresses all three:
- Plan once → instructions processed once
- Execute in Python → no context accumulation
- Match patterns → similar queries get similar treatment
The 60% token reduction and 55% speed improvement matter. But the real win is reliability: agents that follow instructions and behave predictably.
LLM attention is precious. Don't waste it on orchestration.
Series: Part 1: Attention Loss | Part 2: Token Economics | Part 3: Cost & Reliability ← you are here
Learn more: Behaviour Programming vs Tool Calling | Why Attention Is Precious
See the code: RAG Agent Example