
Illustration Part 1: The Attention Loss Problem

We gave a tool-calling agent detailed instructions. It ignored them. Here's why.

OpenSymbolicAI Team · February 1, 2026 · 4 min read

agents · illustration · performance · RAG · behaviour-programming

We compared two approaches to building RAG agents:

  • Behaviour Programming: Plan once, execute in Python
  • Tool-Calling: LLM decides after each tool result
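To make the contrast concrete, here is a minimal sketch of the two control flows, assuming placeholder `llm` and `tools` callables rather than any particular framework API:

```python
# Minimal sketch of the two control flows. `llm` is any text-in/text-out model
# call and `tools` maps tool names to plain Python functions; both are
# placeholders, not the actual framework API.

INSTRUCTIONS = "..."  # ~1,000 tokens of rules, query types, and guidance

def tool_calling_agent(question, llm, tools, max_turns=15):
    """Tool-calling: the LLM re-reads the whole transcript after every tool result."""
    transcript = [INSTRUCTIONS, f"Question: {question}"]
    for _ in range(max_turns):
        action = llm("\n".join(transcript))          # one planning call per turn
        if action.startswith("final_answer:"):
            return action.removeprefix("final_answer:").strip()
        transcript.append(tools[action](question))   # tool output grows the context

def behaviour_agent(question, llm, tools):
    """Behaviour programming: one planning call, then Python runs the steps."""
    plan = llm(INSTRUCTIONS + f"\nQuestion: {question}")  # e.g. "retrieve -> extract_answer"
    result = question
    for step in plan.split("->"):
        result = tools[step.strip()](result)         # no further planning calls
    return result
```

The detail that matters is where the loop lives: in the tool-calling version the LLM sits inside the loop, in the behaviour version it sits outside it. The measured results: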
| Metric | Behaviour | Tool-Calling | Difference |
|---|---|---|---|
| Total Tokens | 21,199 | 52,544 | 60% reduction |
| Planning Calls | 5 | 17 | 71% reduction |
| Total Time | 8.16s | 17.93s | 55% faster |
| Avg Tokens/Query | 4,240 | 10,509 | 2.5x fewer |

But the numbers only tell part of the story. The more damning finding is attention loss.

What We Observed

The tool-calling agent was given detailed instructions with query classification, chain-of-thought guidance, and common mistakes to avoid. Despite this, it consistently failed to follow its own instructions.

Query 1: "What is machine learning?"

This is a Type 1 Simple Factual Question. The instructions clearly state:

Strategy: retrieve(k=3) → extract_answer → final_answer

What each agent actually did:

| Agent | Execution Path | LLM Calls | Tokens |
|---|---|---|---|
| Behaviour | retrieve → extract_answer | 2 | 2,862 |
| Tool-calling | retrieve → extract_answer → retrieve → extract_answer → extract_answer → extract_answer → final_answer | 11 | 22,895 |

The tool-calling agent made 11 calls for a simple factual question. It looped through retrieve and extract_answer multiple times, unable to recognize it already had sufficient information. This is a 5.5x overhead in calls and 8x overhead in tokens.
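For comparison, the behaviour agent's two LLM calls are the planning call plus a single extraction call over the retrieved passages. A hypothetical sketch of that fixed pipeline (the `vector_store` helper is assumed, not the framework's real API):

```python
def answer_simple_factual(question, llm, vector_store):
    """Type 1 plan executed as plain Python: retrieve(k=3) -> extract_answer.
    Together with the one planning call that chose this path, that is 2 LLM calls."""
    docs = vector_store.search(question, k=3)   # retrieval itself spends no LLM tokens
    context = "\n\n".join(docs)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```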

Query 4: "What is deep learning and how does it relate to neural networks?"

This is a Type 2 Complex Technical Question. The instructions state:

Strategy: retrieve(k=5) → extract_answer (may need multiple) → final_answer

⚠️ IMPORTANT: For technical questions, you may need to retrieve more documents

| Agent | Execution Path | LLM Calls | Tokens |
|---|---|---|---|
| Behaviour | retrieve(k=8) → rerank(8 docs) → extract_answer | 10 | 5,625 |
| Tool-calling | retrieve → extract_answer → final_answer | 4 | 8,865 |

Here the tool-calling agent did the opposite: it under-executed, skipping the thorough processing the instructions called for. The behaviour agent used reranking to ensure high-quality results; the tool-calling agent rushed to an answer.
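One plausible accounting for the behaviour agent's 10 LLM calls is one planning call, one relevance-scoring call per retrieved document during reranking, and one extraction call. A sketch under that assumption, again with placeholder `llm` and `vector_store` helpers:

```python
def answer_complex_technical(question, llm, vector_store):
    """Type 2 plan: retrieve(k=8) -> rerank(8 docs) -> extract_answer.
    1 planning call + 8 scoring calls + 1 extraction call = 10 LLM calls."""
    docs = vector_store.search(question, k=8)
    scored = []
    for doc in docs:                           # small, isolated scoring call per document
        score = llm(f"Rate 0-10 how relevant this passage is to: {question}\n\n{doc}")
        scored.append((float(score), doc))
    top = [doc for _, doc in sorted(scored, key=lambda s: s[0], reverse=True)[:5]]
    return llm("Answer using only these passages:\n\n" + "\n\n".join(top)
               + f"\n\nQuestion: {question}")
```

Each of those calls sees only its own small prompt, which is why more calls did not mean more context per call.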

Why This Happens

[Figure: Token waste visualization showing how instructions get buried as context grows]

The tool-calling prompt was ~1,000 tokens of instructions, including:

  • 5 critical rules
  • 5 query type classifications with strategies
  • 6-step chain-of-thought process
  • 6 common mistakes to avoid
  • Response format requirements

On the first call, the model sees all of this. By the third or fourth call, it's also processing:

  • The full instruction set (again)
  • The original question
  • All previous tool calls and their results
  • Retrieved document content (thousands of tokens)

The instructions get buried. The model loses track of which query type it classified, what strategy it planned, and whether it's done. It either loops unnecessarily or exits prematurely.

This is not a prompt engineering problem. Adding more instructions makes it worse. The fundamental issue is that tool-calling requires the LLM to re-read everything on every turn, and attention degrades as context grows.
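A back-of-the-envelope illustration of the burial effect, using rough token figures assumed for illustration rather than measured from the run above:

```python
# Rough prompt-growth model for the tool-calling loop. All token figures are
# illustrative assumptions, not measurements from the benchmark above.
INSTRUCTION_TOKENS = 1_000   # rules, query types, chain-of-thought steps, common mistakes
QUESTION_TOKENS = 20
TOOL_RESULT_TOKENS = 1_500   # retrieved document content dominates quickly

for turn in range(1, 6):
    prompt = INSTRUCTION_TOKENS + QUESTION_TOKENS + (turn - 1) * TOOL_RESULT_TOKENS
    share = INSTRUCTION_TOKENS / prompt
    print(f"turn {turn}: ~{prompt:,} prompt tokens, instructions are {share:.0%} of them")
# turn 1: instructions are ~98% of the prompt; by turn 5 they are ~14%.
```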

The Core Insight

Behaviour programming avoids this entirely. The LLM plans once, when context is small and instructions are fresh. Then Python executes the plan without further LLM involvement.

[Figure: How behaviour programming preserves attention through isolated calls]

The instructions are processed exactly once. They can't get buried because there's no accumulating context to bury them in.


Next: Part 2: Token Economics, where the tokens actually go

Series: Part 1: Attention Loss ← you are here | Part 2: Token Economics | Part 3: Cost & Reliability