
LLM Attention Is Precious: Why ReAct Wastes It

A visual comparison of token usage between ReAct (Reasoning+Acting) and PlanAndExecute approaches.

OpenSymbolicAI Team · February 1, 2026 · 6 min read
agents · architecture · performance · tool-calling · RAG · ReAct

Every token in the context window competes for the model's attention. The more tokens, the more diluted that attention becomes. The dominant agent pattern, ReAct, floods the context with data the LLM has already seen, forcing it to re-read the same documents over and over.

What is ReAct?#

ReAct (Reasoning + Acting) is the standard pattern behind most "tool calling" agents. Introduced by Yao et al., it works through a Thought-Action-Observation loop:

  1. Thought: The LLM reasons about what to do next (Chain of Thought)
  2. Action: The LLM calls a tool
  3. Observation: The tool result goes back into the context
  4. Repeat: The LLM sees everything again and reasons about the next step
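The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real framework: `fake_llm` and the single `retrieve` tool are stand-ins, and "tokens" are approximated by word count. The key line is the one that re-reads the whole context on every turn.

```python
# Minimal sketch of the ReAct Thought-Action-Observation loop.
# fake_llm and tools are illustrative stubs, not a real model or API.

def fake_llm(context: str) -> str:
    """Stand-in for a model call: picks the next step from the full context."""
    if "OBSERVATION" not in context:
        return "THOUGHT: need docs\nACTION: retrieve"
    return "THOUGHT: done\nFINAL: answer"

tools = {"retrieve": lambda: "5 documents about the topic"}

def react(question: str) -> tuple[str, int]:
    context = f"USER: {question}"
    tokens_processed = 0
    while True:
        tokens_processed += len(context.split())  # LLM re-reads EVERYTHING each turn
        step = fake_llm(context)
        if "FINAL:" in step:
            return step.split("FINAL:")[1].strip(), tokens_processed
        action = step.split("ACTION:")[1].strip()
        observation = tools[action]()             # result goes straight back into context
        context += f"\n{step}\nOBSERVATION: {observation}"

answer, total = react("Compare ML and DL")
```

Every observation is appended to `context`, so each later iteration pays for it again. That accumulation is the subject of the rest of this post.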

This is what LangChain agents, OpenAI function calling loops, and most agent frameworks implement. The problem? Each iteration requires the LLM to re-read the entire conversation history.

Let's see why with a simple example.

The Task#

Ask a RAG agent to compare two topics:

"Compare machine learning and deep learning"

This needs four steps:

  1. Get documents about machine learning
  2. Get documents about deep learning
  3. Combine them
  4. Write the comparison

ReAct: The Thought-Action-Observation Loop#

With ReAct, every step goes through the LLM. Each iteration includes all previous results because the model needs to reason about what to do next.

┌─────────────────────────────────────────────────────────┐
│  ITERATION 1                                            │
│                                                         │
│  THOUGHT: "I need ML docs first"                        │
│  LLM reads:                                             │
│    • System prompt ────────────────────── 2,500 tokens  │
│    • Tool definitions ───────────────────── 500 tokens  │
│    • User question ──────────────────────────10 tokens  │
│                                           ────────────  │
│                                            3,010 tokens │
│                                                         │
│  ACTION: retrieve("machine learning") ───── 100 tokens  │
│  OBSERVATION: 5 documents ───────────────  2,500 tokens │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  ITERATION 2                                            │
│                                                         │
│  THOUGHT: "Now I need DL docs"                          │
│  LLM reads:                                             │
│    • Everything from iteration 1 ────────  3,010 tokens │
│    • Previous thought + action ────────────  100 tokens │
│    • ML documents (again!) ──────────────  2,500 tokens │ ← waste
│                                           ────────────  │
│                                            5,610 tokens │
│                                                         │
│  ACTION: retrieve("deep learning") ──────── 100 tokens  │
│  OBSERVATION: 5 documents ───────────────  2,500 tokens │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  ITERATION 3                                            │
│                                                         │
│  THOUGHT: "Combine the contexts"                        │
│  LLM reads:                                             │
│    • Everything from iterations 1-2 ─────  5,610 tokens │
│    • Previous thought + action ────────────  100 tokens │
│    • DL documents (again!) ──────────────  2,500 tokens │ ← waste
│                                           ────────────  │
│                                            8,210 tokens │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  ITERATION 4                                            │
│                                                         │
│  THOUGHT: "Compare the topics"                          │
│  LLM reads:                                             │
│    • Everything from iterations 1-3 ─────  8,210 tokens │
│    • Previous thought + action ────────────  100 tokens │
│    • Combined context (again!) ──────────  1,500 tokens │ ← waste
│                                           ────────────  │
│                                            9,810 tokens │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  ITERATION 5                                            │
│                                                         │
│  THOUGHT: "I can now provide the answer"                │
│  LLM reads:                                             │
│    • Everything from iterations 1-4 ────  10,410 tokens │
│                                                         │
│  FINAL ANSWER ───────────────────────────── 300 tokens  │
└─────────────────────────────────────────────────────────┘

TOTAL TOKENS THROUGH LLM: ~37,000

The ML documents (2,500 tokens) are read 4 times. Three of those reads are pure waste: 7,500 tokens the LLM didn't need to see.
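The cumulative growth is simple arithmetic. The increments below are read straight off the diagram above: each turn adds the previous thought/action plus the new observation, and the LLM re-reads the whole context every time.

```python
# Cumulative context growth, per the diagram: each iteration re-reads the
# entire history plus whatever the previous turn appended.
increments = [3_010, 2_600, 2_600, 1_600, 600]  # new tokens entering context each turn

reads, context = [], 0
for inc in increments:
    context += inc
    reads.append(context)      # ReAct sends the full context through the LLM

print(reads)                   # per-iteration read sizes
print(sum(reads))              # total tokens through the LLM: ~37,000
```

The read sizes come out to 3,010 / 5,610 / 8,210 / 9,810 / 10,410, summing to 37,050. Note that total cost is the *sum* of the reads, not the final context size: that is why re-reading, not storage, is the expensive part.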

PlanAndExecute: The LLM Plans, Python Executes#

With PlanAndExecute, the LLM makes a plan once. Then Python runs it.

┌─────────────────────────────────────────────────────────┐
│  PLANNING CALL                                          │
│                                                         │
│  LLM reads:                                             │
│    • Function signatures ────────────────── 600 tokens  │
│    • Example patterns ─────────────────────  400 tokens │
│    • User question ──────────────────────────10 tokens  │
│                                           ────────────  │
│                                            1,010 tokens │ ← small!
│                                                         │
│  LLM outputs the plan:                                  │
│                                                         │
│    ml_docs = retrieve("machine learning")               │
│    dl_docs = retrieve("deep learning")                  │
│    ml_ctx = combine(ml_docs)                            │
│    dl_ctx = combine(dl_docs)                            │
│    return compare(ml_ctx, dl_ctx)        ─── 150 tokens │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  PYTHON EXECUTION (no LLM!)                             │
│                                                         │
│  ml_docs = [...] ──────────────────  stays in Python    │
│  dl_docs = [...] ──────────────────  stays in Python    │
│  ml_ctx = "..." ───────────────────  stays in Python    │
│  dl_ctx = "..." ───────────────────  stays in Python    │
│                                                         │
│                                            0 LLM tokens │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  COMPARE PRIMITIVE (uses LLM internally)                │
│                                                         │
│  LLM reads:                                             │
│    • ML context ───────────────────────── 1,500 tokens  │
│    • DL context ───────────────────────── 1,500 tokens  │
│    • "Compare these" ────────────────────── 100 tokens  │
│                                           ────────────  │
│                                            3,100 tokens │
│                                                         │
│  LLM outputs: comparison ────────────────── 500 tokens  │
└─────────────────────────────────────────────────────────┘

TOTAL TOKENS THROUGH LLM: ~5,000

Documents are fetched once and stay in Python memory. The LLM only sees them when it needs to actually compare them.
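The split can be sketched as follows. This is a toy illustration of the pattern, not the actual implementation: `plan_llm`, `retrieve`, `combine`, and `compare` are stub names, and in a real system the planning call and the `compare` primitive would each hit a model. Everything between them is plain Python.

```python
# Sketch of PlanAndExecute: one small planning call, then Python runs the plan.
# plan_llm, retrieve, combine, and compare are illustrative stubs.

def plan_llm(question: str) -> str:
    """Planning call: sees only function signatures + the question, returns code."""
    return (
        'ml_docs = retrieve("machine learning")\n'
        'dl_docs = retrieve("deep learning")\n'
        'result = compare(combine(ml_docs), combine(dl_docs))'
    )

def retrieve(topic: str) -> list[str]:
    return [f"doc about {topic}"] * 5     # documents stay in Python memory

def combine(docs: list[str]) -> str:
    return "\n".join(docs)                # no LLM tokens spent here

def compare(a: str, b: str) -> str:
    # The only other LLM call: it sees just the two contexts, nothing else.
    return f"comparison of {len(a)} vs {len(b)} chars of context"

plan = plan_llm("Compare machine learning and deep learning")
scope: dict = {"retrieve": retrieve, "combine": combine, "compare": compare}
exec(plan, scope)                         # Python executes the plan: 0 LLM tokens
answer = scope["result"]
```

The intermediate variables (`ml_docs`, `dl_docs`, the combined contexts) never pass through a model. Only `compare` pays for tokens, and only for the data it actually needs. (A production system would of course sandbox or validate the generated plan before executing it.)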

The Difference#

|                  | ReAct   | PlanAndExecute   |
|------------------|---------|------------------|
| LLM iterations   | 5       | 2                |
| Tokens processed | ~37,000 | ~5,000           |
| Savings          | baseline | ~7x fewer tokens |

Why It Matters#

The gap grows with complexity:

  • More steps = more re-reading in ReAct
  • Bigger documents = more wasted tokens per re-read
  • Multiple agents = each agent re-reads everything

ReAct treats the LLM as a reasoning engine that must see all data to decide what to do next. Each "thought" requires the full context.

PlanAndExecute treats the LLM as a planner. It figures out what to do, then gets out of the way while Python handles how.

Send logic to the LLM. Keep data in Python.

The Ripple Effects#

The 7x token reduction isn't just about attention. It cascades:

Cost. You pay per token. 37,000 tokens vs 5,000 tokens is 7x cheaper per query. At scale, this is the difference between viable and not.

Latency. Each LLM call adds network round-trip time. 5 calls vs 2 calls means ~60% less waiting. Plus, smaller contexts process faster.

Reliability. Longer contexts increase the chance of the model losing track, hallucinating, or ignoring instructions buried in the middle. Shorter, focused contexts fail less.

Context ceiling. ReAct hits the context limit faster. A 10-step workflow with large documents can exhaust 100k tokens. PlanAndExecute stays small regardless of data size.

Parallelism. Python can run independent primitives in parallel. ReAct is sequential by design: each thought depends on observing the previous action's result.
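Because the plan declares its data dependencies up front, Python can see that the two retrievals are independent and overlap them. A sketch with a stubbed `retrieve` (in practice this would be a network-bound call):

```python
# Independent plan steps can run concurrently; retrieve is a stand-in
# for a network-bound retrieval call.
from concurrent.futures import ThreadPoolExecutor

def retrieve(topic: str) -> list[str]:
    return [f"doc about {topic}"] * 5

with ThreadPoolExecutor() as pool:
    # Both retrievals are independent in the plan, so submit them together.
    ml_future = pool.submit(retrieve, "machine learning")
    dl_future = pool.submit(retrieve, "deep learning")
    ml_docs, dl_docs = ml_future.result(), dl_future.result()
```

A ReAct loop cannot do this by construction: it has to observe the machine-learning documents before it even decides to fetch the deep-learning ones.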

The attention problem is the root cause. Everything else follows.


Learn more: Behaviour Programming vs Tool Calling | Illustration: Part 1, Part 2, Part 3

See the code: RAG Agent Example