Illustration Part 3: Cost & Reliability
Tool-calling costs 2.3x more and fails 20% of the time. Here's the math.
In Part 1, we saw attention loss. In Part 2, we saw token accumulation. Now let's look at what this means for production systems.
## Latency
| Metric | Behaviour | Tool-Calling |
|---|---|---|
| Total time (5 queries) | 8.16s | 17.93s |
| Average per query | 1.63s | 3.59s |
| Fastest query | 0.85s | 1.95s |
| Slowest query | 2.52s | 7.99s |
The tool-calling approach is 2.2x slower on average. The worst case (Query 1 with 11 calls) took 8 seconds for a simple factual question.
Each LLM call has fixed latency overhead:
- Network round-trip
- Queue time
- Model loading (for some providers)
- Token generation
With behaviour programming, you pay this overhead once for planning, then execute locally. With tool-calling, you pay it on every decision.
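To make that concrete, here is a rough back-of-the-envelope latency model; the per-call overhead and generation figures are illustrative assumptions, not measurements from this test.

```python
# Rough model: total latency ≈ LLM calls × (fixed per-call overhead + generation time),
# plus any local execution. The 0.5s and 0.3s figures are illustrative assumptions.
PER_CALL_OVERHEAD_S = 0.5   # network round-trip, queueing, model loading
GENERATION_S = 0.3          # token generation per call

def estimated_latency(num_llm_calls: int, local_execution_s: float = 0.1) -> float:
    return num_llm_calls * (PER_CALL_OVERHEAD_S + GENERATION_S) + local_execution_s

print(estimated_latency(1))    # plan once, then execute locally
print(estimated_latency(11))   # eleven decision calls, as in Query 1
```

The exact numbers don't matter; what matters is that the fixed overhead multiplies with the number of LLM calls, so the approach that makes one call per query scales very differently from the one that makes four to eleven.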
## Cost
At typical API pricing (~$0.01 per 1K input tokens, ~$0.03 per 1K output tokens):
| Approach | Input Tokens | Output Tokens | Estimated Cost |
|---|---|---|---|
| Behaviour | 15,954 | 5,245 | ~$0.32 |
| Tool-Calling | 41,826 | 10,718 | ~$0.74 |
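These estimates follow directly from the token counts; a quick sanity check in Python, using the assumed rates above:

```python
# Assumed rates from above: ~$0.01 per 1K input tokens, ~$0.03 per 1K output tokens.
INPUT_RATE = 0.01 / 1000
OUTPUT_RATE = 0.03 / 1000

def session_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

print(round(session_cost(15_954, 5_245), 2))    # 0.32  (behaviour)
print(round(session_cost(41_826, 10_718), 2))   # 0.74  (tool-calling)
```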
Tool-calling costs 2.3x more for the same 5 queries. Projecting that 5-query session to scale:
| Scale | Behaviour | Tool-Calling | Difference |
|---|---|---|---|
| 1,000 sessions/day | $320 | $740 | $420/day |
| 30 days | $9,600 | $22,200 | $12,600/month |
| Annual (12 × 30 days) | $115,200 | $266,400 | $151,200/year |
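The projections are straight multiplication of the per-session estimates; a minimal sketch, assuming 1,000 five-query sessions per day and a 30-day month:

```python
# Scale the per-session estimates (assumed: 1,000 sessions/day, 30-day month, 12 months).
behaviour, tool = 0.32, 0.74            # estimated $ per five-query session
for per_session in (behaviour, tool):
    daily = 1_000 * per_session
    print(round(daily), round(30 * daily), round(12 * 30 * daily))
# 320 9600 115200   (behaviour)
# 740 22200 266400  (tool-calling)
```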
## Reliability
### Error Rates
In our test run:
- Behaviour: 0 errors in 5 queries
- Tool-calling: 1 error in 5 queries (20% failure rate)
The tool-calling error was an API-level failure triggered by the model outputting JSON when instructed to use a different format. Despite explicit instructions ("NEVER output JSON"), the model reverted to JSON after several turns of context accumulation.
This is the attention loss problem manifesting as failures.
### Consistency
The behaviour approach produces consistent execution patterns:
| Query Type | Execution Pattern | Calls |
|---|---|---|
| Simple questions | retrieve → extract | 2 |
| Complex questions | retrieve → rerank → extract | 10 |
| Comparisons | retrieve(A) → retrieve(B) → compare | 2 |
The tool-calling approach produces unpredictable patterns:
- Same simple question → anywhere from 4 to 11 calls
- Strategies ignored or applied inconsistently
- No guarantee the model follows its own plan
## Why Behaviour Programming Works
The key insight: separate what to do from when to do it.
The LLM's job is pattern matching. Given a query, which decomposition example is most similar? The matched example provides an executable template. Python fills in the parameters and runs the plan.
```python
@decomposition(
    intent="What is machine learning?",
    expanded_intent="Simple factual query: retrieve, combine, extract"
)
def _simple_qa(self) -> str:
    docs = self.retrieve("machine learning definition", k=3)
    context = self.combine_contexts(docs)
    return self.extract_answer(context, "What is machine learning?")
```

When a user asks "What is artificial intelligence?", the LLM recognizes the pattern, generates equivalent code, and execution happens in Python. The LLM is never asked "what should I do next?" mid-execution.
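For illustration, the generated plan for the new query might look like the following sketch, reusing the same helpers as the template above; the exact code the LLM emits will differ.

```python
def _generated_plan(self) -> str:
    # Hypothetical plan matched from the _simple_qa template above
    # (illustrative; not verbatim model output).
    docs = self.retrieve("artificial intelligence definition", k=3)
    context = self.combine_contexts(docs)
    return self.extract_answer(context, "What is artificial intelligence?")
```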
## Compiler vs Orchestrator
Tool-calling asks the LLM to be a runtime orchestrator: making decisions with incomplete information, managing state across turns, staying focused through growing context.
Behaviour programming asks the LLM to be a compiler: understanding intent, matching patterns, generating a complete plan upfront. Execution is deterministic.
The compiler role plays to LLM strengths (pattern matching, language understanding). The orchestrator role fights against LLM weaknesses (attention limits, context management, consistency).
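A minimal sketch of the two control flows; `llm_call`, `run_tool`, `generate_plan`, and `execute_plan` are placeholder callables standing in for real implementations.

```python
from typing import Callable

# Orchestrator loop (tool-calling): the LLM re-reads the growing transcript
# and makes a fresh decision on every step.
def run_orchestrated(query: str, llm_call: Callable, run_tool: Callable, max_steps: int = 10):
    messages = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        decision = llm_call(messages)                          # one round-trip per decision
        if decision["type"] == "final_answer":
            return decision["content"]
        result = run_tool(decision["tool"], decision["args"])
        messages.append({"role": "tool", "content": result})   # context keeps growing
    return None

# Compiler flow (behaviour programming): one planning call, then deterministic
# local execution with no further LLM decisions.
def run_compiled(query: str, generate_plan: Callable, execute_plan: Callable):
    plan = generate_plan(query)    # single LLM call: match a pattern, emit a plan
    return execute_plan(plan)      # plain Python from here on
```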
## Conclusion
This illustration demonstrates three failure modes of tool-calling at scale:
- Attention Loss: Instructions forgotten as context grows
- Token Waste: 2-8x overhead from re-reading everything
- Inconsistency: Same queries produce wildly different patterns
Behaviour programming addresses all three:
- Plan once → instructions processed once
- Execute in Python → no context accumulation
- Match patterns → similar queries get similar treatment
The 60% token reduction and 55% speed improvement matter. But the real win is reliability: agents that follow instructions and behave predictably.
LLM attention is precious. Don't waste it on orchestration.
Series: Part 1: Attention Loss | Part 2: Token Economics | Part 3: Cost & Reliability ← you are here
Learn more: Behaviour Programming vs Tool Calling | Why Attention Is Precious
See the code: RAG Agent Example