
Illustration Part 3: Cost & Reliability

Tool-calling costs 2.3x more and fails 20% of the time. Here's the math.

OpenSymbolicAI Team · February 1, 2026 · 4 min read
agents · illustration · performance · RAG · behaviour-programming

In Part 1, we saw attention loss. In Part 2, we saw token accumulation. Now let's look at what this means for production systems.

Latency

| Metric | Behaviour | Tool-Calling |
|---|---|---|
| Total time (5 queries) | 8.16s | 17.93s |
| Average per query | 1.63s | 3.59s |
| Fastest query | 0.85s | 1.95s |
| Slowest query | 2.52s | 7.99s |

The tool-calling approach is 2.2x slower on average. The worst case (Query 1 with 11 calls) took 8 seconds for a simple factual question.

Each LLM call has fixed latency overhead:

  • Network round-trip
  • Queue time
  • Model loading (for some providers)
  • Token generation

With behaviour programming, you pay this overhead once for planning, then execute locally. With tool-calling, you pay it on every decision.
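
To see how this compounds, here's a back-of-envelope model. The per-call constants are illustrative assumptions, not measurements from our test run:

```python
# Back-of-envelope latency model. Both constants are assumed figures
# for illustration, not measurements from the test run above.
PER_CALL_OVERHEAD_S = 0.3  # assumed network round-trip + queue time
PER_CALL_GEN_S = 0.5       # assumed token-generation time per call

def behaviour_latency(local_exec_s: float = 0.1) -> float:
    """One planning call up front, then local execution."""
    return (PER_CALL_OVERHEAD_S + PER_CALL_GEN_S) + local_exec_s

def tool_calling_latency(n_calls: int) -> float:
    """The fixed overhead is paid again on every decision."""
    return n_calls * (PER_CALL_OVERHEAD_S + PER_CALL_GEN_S)

print(f"{behaviour_latency():.1f}s")       # 0.9s: one round-trip, total
print(f"{tool_calling_latency(11):.1f}s")  # 8.8s: eleven round-trips
```

The exact constants don't matter. What matters is that tool-calling latency grows linearly with the number of decisions, and behaviour programming's doesn't.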

Cost

At typical API pricing (~$0.01 per 1K input tokens, ~$0.03 per 1K output tokens):

| Approach | Input Tokens | Output Tokens | Estimated Cost |
|---|---|---|---|
| Behaviour | 15,954 | 5,245 | ~$0.32 |
| Tool-Calling | 41,826 | 10,718 | ~$0.74 |

Tool-calling costs 2.3x more for the same 5 queries. At scale:

| Scale | Behaviour | Tool-Calling | Difference |
|---|---|---|---|
| 5,000 queries/day | $320 | $740 | $420/day |
| 30 days | $9,600 | $22,200 | $12,600/month |
| Annual (12 × 30 days) | $115,200 | $266,400 | $151,200/year |
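
These figures fall out of a few lines of arithmetic. A sketch that reproduces them from the measured token counts (the table above scales the rounded per-run costs of $0.32 and $0.74, so its figures are slightly higher):

```python
# Reproduce the cost tables from the measured token counts.
PRICE_IN = 0.01 / 1000    # ~$0.01 per 1K input tokens
PRICE_OUT = 0.03 / 1000   # ~$0.03 per 1K output tokens

def run_cost(tokens_in: int, tokens_out: int) -> float:
    """Cost of one 5-query test run."""
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

behaviour = run_cost(15_954, 5_245)      # ~$0.32
tool_calling = run_cost(41_826, 10_718)  # ~$0.74

runs_per_day = 5_000 // 5  # 5,000 queries/day, 5 queries per run
for name, cost in (("behaviour", behaviour), ("tool-calling", tool_calling)):
    daily = cost * runs_per_day
    print(f"{name}: ${daily:,.0f}/day  ${daily * 30:,.0f}/month  "
          f"${daily * 360:,.0f}/year")
```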

Reliability

Error Rates

In our test run:

  • Behaviour: 0 errors in 5 queries
  • Tool-calling: 1 error in 5 queries (20% failure rate)

The tool-calling error was an API-level failure triggered by the model outputting JSON when instructed to use a different format. Despite explicit instructions ("NEVER output JSON"), the model reverted to JSON after several turns of context accumulation.

This is the attention loss problem from Part 1 manifesting as a hard failure.

Consistency

The behaviour approach produces consistent execution patterns:

| Query Type | Execution Pattern | Calls |
|---|---|---|
| Simple questions | retrieve → extract | 2 |
| Complex questions | retrieve → rerank → extract | 10 |
| Comparisons | retrieve(A) → retrieve(B) → compare | 2 |
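
Each row corresponds to a pinned-down plan. For example, the comparison pattern could be written as its own decomposition. A sketch using the `@decomposition` decorator shown later in this post; `self.compare` is a hypothetical primitive standing in for the table's compare step:

```python
@decomposition(
    intent="How does X compare to Y?",
    expanded_intent="Comparison: retrieve each side, then compare"
)
def _comparison(self) -> str:
    # retrieve(A) → retrieve(B) → compare, always in this order
    docs_a = self.retrieve("X overview", k=3)
    docs_b = self.retrieve("Y overview", k=3)
    return self.compare(docs_a, docs_b, "How does X compare to Y?")
```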

The tool-calling approach produces unpredictable patterns:

  • Same simple question → anywhere from 4 to 11 calls
  • Strategies ignored or applied inconsistently
  • No guarantee the model follows its own plan

Why Behaviour Programming Works

The key insight: separate what to do from when to do it.

The LLM's job is pattern matching. Given a query, which decomposition example is most similar? The matched example provides an executable template. Python fills in the parameters and runs the plan.

```python
@decomposition(
    intent="What is machine learning?",
    expanded_intent="Simple factual query: retrieve, combine, extract"
)
def _simple_qa(self) -> str:
    # A fixed three-step plan; no LLM decision between steps.
    docs = self.retrieve("machine learning definition", k=3)
    context = self.combine_contexts(docs)
    return self.extract_answer(context, "What is machine learning?")
```

When a user asks "What is artificial intelligence?", the LLM recognizes the pattern, generates equivalent code, and execution happens in Python. The LLM is never asked "what should I do next?" mid-execution.
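
What might that generated code look like? A hypothetical sketch; the actual output depends on the model, and `agent` stands in for the executing instance:

```python
# Hypothetical plan the LLM might emit for the new query. Same shape
# as the matched template; only the parameters differ.
docs = agent.retrieve("artificial intelligence definition", k=3)
context = agent.combine_contexts(docs)
answer = agent.extract_answer(context, "What is artificial intelligence?")
```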

Compiler vs Orchestrator

Tool-calling asks the LLM to be a runtime orchestrator: making decisions with incomplete information, managing state across turns, staying focused through growing context.

Behaviour programming asks the LLM to be a compiler: understanding intent, matching patterns, generating a complete plan upfront. Execution is deterministic.

The compiler role plays to LLM strengths (pattern matching, language understanding). The orchestrator role fights against LLM weaknesses (attention limits, context management, consistency).
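
Stripped down to control flow, the contrast looks like this. A minimal sketch: `llm_next_step`, `llm_compile_plan`, and `run_step` are hypothetical stand-ins, not a real API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    action: str
    is_final: bool = False
    answer: str = ""

def orchestrator(query: str,
                 llm_next_step: Callable[[List[str]], Step],
                 run_step: Callable[[Step], str]) -> str:
    """Tool-calling: one LLM round-trip per decision, each one
    re-reading the whole growing history."""
    history = [query]
    while True:
        step = llm_next_step(history)   # the LLM decides mid-execution
        if step.is_final:
            return step.answer
        history.append(run_step(step))  # context grows every turn

def compiler(query: str,
             llm_compile_plan: Callable[[str], List[Step]],
             run_step: Callable[[Step], str]) -> str:
    """Behaviour programming: one LLM call emits the full plan,
    then Python executes it with no further model involvement."""
    plan = llm_compile_plan(query)      # the only LLM round-trip
    result = ""
    for step in plan:
        result = run_step(step)         # deterministic local execution
    return result
```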

Conclusion

This illustration demonstrates three failure modes of tool-calling at scale:

  1. Attention Loss: Instructions forgotten as context grows
  2. Token Waste: 2-8x overhead from re-reading everything
  3. Inconsistency: Same queries produce wildly different patterns

Behaviour programming addresses all three:

  • Plan once → instructions processed once
  • Execute in Python → no context accumulation
  • Match patterns → similar queries get similar treatment

The 60% token reduction and 55% speed improvement matter. But the real win is reliability: agents that follow instructions and behave predictably.

LLM attention is precious. Don't waste it on orchestration.


Series: Part 1: Attention Loss | Part 2: Token Economics | Part 3: Cost & Reliability ← you are here

Learn more: Behaviour Programming vs Tool Calling | Why Attention Is Precious

See the code: RAG Agent Example