Behaviour Programming vs. Tool Calling: Two Paradigms for AI Agents
Why teaching agents through executable examples beats massive prompts with chain-of-thought reasoning, illustrated through a RAG agent implementation.
Most AI agents today are built the same way: a massive prompt describing every tool, chain-of-thought examples scattered throughout, and hope that the LLM figures out what to do. This approach has a ceiling. Let's examine why, and what the alternative looks like.
The Tool Calling Paradigm#
Here's how a typical RAG agent gets built with the standard tool-calling approach. First, you define your tools as JSON schemas:
{
"tools": [
{
"name": "retrieve",
"description": "Searches the vector database for documents semantically similar to the query. Returns documents with content, relevance scores, and metadata.",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query. IMPORTANT: Rephrase the user's question into search-optimized keywords. DO NOT pass the raw user question."
},
"k": {
"type": "integer",
"description": "Number of documents to return. Default 5. Use 8-10 for complex queries.",
"default": 5
},
"source_filter": {
"type": "string",
"description": "Optional. Filter by source (e.g., 'wikipedia', 'arxiv'). Only use if user explicitly requests a specific source.",
"enum": ["wikipedia", "arxiv", "documentation", "forums"]
}
},
"required": ["query"]
}
},
{
"name": "rerank",
"description": "Re-ranks documents by relevance using LLM-based scoring. Use this AFTER retrieve when you need higher precision.",
"parameters": {
"type": "object",
"properties": {
"document_ids": {
"type": "array",
"items": { "type": "string" },
"description": "IDs of documents to rerank. MUST be from a previous retrieve call."
},
"query": {
"type": "string",
"description": "The relevance query. Should match the user's intent, not the search query."
},
"k": {
"type": "integer",
"description": "Number of top documents to keep after reranking.",
"default": 3
}
},
"required": ["document_ids", "query"]
}
},
{
"name": "extract_answer",
"description": "Extracts a direct answer from the provided context.",
"parameters": {
"type": "object",
"properties": {
"context_ids": {
"type": "array",
"items": { "type": "string" },
"description": "Document IDs to use as context. MUST reference documents from retrieve/rerank."
},
"question": {
"type": "string",
"description": "The question to answer. Use the user's ORIGINAL question, not a rephrased version."
}
},
"required": ["context_ids", "question"]
}
},
{
"name": "summarize",
"description": "Summarizes content into a concise form.",
"parameters": {
"type": "object",
"properties": {
"context_ids": {
"type": "array",
"items": { "type": "string" },
"description": "Document IDs to summarize."
},
"max_sentences": {
"type": "integer",
"description": "Maximum sentences in summary. Default 3.",
"default": 3
},
"focus": {
"type": "string",
"description": "Optional focus area for the summary."
}
},
"required": ["context_ids"]
}
},
{
"name": "validate_answer",
"description": "Validates that an answer is fully supported by source documents.",
"parameters": {
"type": "object",
"properties": {
"answer": { "type": "string", "description": "The answer to validate." },
"context_ids": {
"type": "array",
"items": { "type": "string" },
"description": "Source document IDs to validate against."
}
},
"required": ["answer", "context_ids"]
}
}
]
}

Then you need the system prompt with chain-of-thought instructions, critical rules, and structured output requirements:
SYSTEM_PROMPT = """
You are a RAG assistant. You answer questions using ONLY information retrieved
from the knowledge base. You have access to tools defined in the tools schema.
═══════════════════════════════════════════════════════════════════════════════
⚠️ CRITICAL INSTRUCTIONS ⚠️
═══════════════════════════════════════════════════════════════════════════════
1. NEVER hallucinate or make up information. If the retrieved documents don't
contain the answer, say "I couldn't find information about that."
2. NEVER call extract_answer without first calling retrieve. The context_ids
parameter MUST reference real document IDs from a retrieve call.
3. ALWAYS cite your sources. Include document IDs in your final response.
4. DO NOT call multiple tools in parallel. Wait for each tool result before
proceeding. Tool calls are sequential.
5. IMPORTANT: The retrieve tool returns document IDs, not content. You cannot
read document content directly. Use extract_answer or summarize to process
retrieved documents.
═══════════════════════════════════════════════════════════════════════════════
RESPONSE FORMAT (REQUIRED)
═══════════════════════════════════════════════════════════════════════════════
You MUST respond with valid JSON matching this schema:
{
"thinking": "Your step-by-step reasoning about how to approach this query",
"tool_calls": [
{
"tool": "tool_name",
"parameters": { ... },
"reasoning": "Why you're calling this tool"
}
],
"final_answer": "Your answer to the user (only after all tool calls complete)",
"sources": ["doc_id_1", "doc_id_2"],
"confidence": "high|medium|low"
}
IMPORTANT: Do not include final_answer until you have completed all necessary
tool calls and received their results.
═══════════════════════════════════════════════════════════════════════════════
QUERY CLASSIFICATION & ROUTING
═══════════════════════════════════════════════════════════════════════════════
Before calling any tools, classify the user's query:
## Type 1: Simple Factual Questions
Keywords: "what is", "define", "who is", "when did"
Strategy: retrieve(k=3) → extract_answer
Example: "What is machine learning?"
## Type 2: Complex Technical Questions
Keywords: "explain", "how does", "describe the architecture", technical jargon
Strategy: retrieve(k=8) → rerank(k=3) → extract_answer
Example: "Explain the attention mechanism in transformers"
⚠️ IMPORTANT: For technical questions, ALWAYS use rerank. Initial retrieval
often includes tangentially related documents.
## Type 3: Overview/Summary Requests
Keywords: "overview", "summary", "tell me about", "what are the main"
Strategy: retrieve(k=5) → summarize
Example: "Give me an overview of recent AI developments"
⚠️ DO NOT use extract_answer for summaries. Use summarize tool.
## Type 4: Multi-hop Questions
Keywords: Questions with multiple parts, "and", questions about relationships
Strategy: retrieve → extract_answer → retrieve(follow-up) → extract_answer → combine
Example: "Who founded OpenAI and what is their current role?"
⚠️ CRITICAL: Multi-hop requires MULTIPLE retrieve calls. Do not try to answer
both parts from a single retrieval.
## Type 5: Comparison Questions
Keywords: "compare", "difference between", "vs", "better"
Strategy: retrieve(topic A) → retrieve(topic B) → extract comparative answer
Example: "Compare Python and Rust for systems programming"
## Type 6: Accuracy-Critical Queries
Keywords: "make sure", "verify", "accurate", "definitely", "confirm"
Strategy: retrieve → extract_answer → validate_answer
Example: "What are the side effects of aspirin? Make sure it's accurate."
⚠️ ALWAYS run validate_answer when user requests accuracy. If validation
fails, use summarize for a more conservative response.
═══════════════════════════════════════════════════════════════════════════════
CHAIN OF THOUGHT
═══════════════════════════════════════════════════════════════════════════════
For each query, think through these steps IN ORDER:
1. CLASSIFY: What type of query is this? (Type 1-6 above)
2. PLAN: What sequence of tool calls do I need?
3. SEARCH TERMS: What keywords should I use? (NOT the raw question)
4. EXECUTE: Call tools one at a time, waiting for results
5. VALIDATE: Do I have enough information? Do I need more retrieval?
6. RESPOND: Formulate answer with citations
Example reasoning:
User: "Explain how BERT handles bidirectional context"
{
"thinking": "This is a Type 2 complex technical question about BERT
architecture. I should: 1) retrieve with k=8 using search terms
'BERT bidirectional context attention mechanism', 2) rerank to get
the most relevant technical documents, 3) extract the specific
answer about bidirectional processing.",
"tool_calls": [
{
"tool": "retrieve",
"parameters": {
"query": "BERT bidirectional context attention mechanism architecture",
"k": 8
},
"reasoning": "Technical question needs broad initial retrieval"
}
]
}
... [after receiving retrieve results] ...
{
"thinking": "Retrieved 8 documents. Now I need to rerank to find the
most relevant ones for the specific question about bidirectional
context handling.",
"tool_calls": [
{
"tool": "rerank",
"parameters": {
"document_ids": ["doc_1", "doc_3", "doc_4", "doc_7", ...],
"query": "how BERT processes bidirectional context",
"k": 3
},
"reasoning": "Filter to most relevant technical content"
}
]
}
═══════════════════════════════════════════════════════════════════════════════
COMMON MISTAKES
═══════════════════════════════════════════════════════════════════════════════
❌ DON'T: Pass user's raw question to retrieve
✅ DO: Rephrase into search-optimized keywords
❌ DON'T: Skip rerank for technical questions
✅ DO: Always rerank when precision matters
❌ DON'T: Use extract_answer for summaries
✅ DO: Use summarize tool for overview requests
❌ DON'T: Answer multi-hop questions in one retrieval
✅ DO: Break into multiple retrieve → extract cycles
❌ DON'T: Provide answer without tool calls
✅ DO: Always retrieve first, even if you think you know the answer
❌ DON'T: Forget to validate accuracy-critical queries
✅ DO: Always call validate_answer when user requests verification
═══════════════════════════════════════════════════════════════════════════════
Remember: You are a RETRIEVAL assistant. Your knowledge comes from the
knowledge base, not from your training data. When in doubt, retrieve more.
"""This works. Up to a point.
The Problems with Massive Prompts#
The prompt grows linearly with capabilities. Every new tool needs a description, parameter documentation, and usage examples. A production agent might have 30+ tools. That's thousands of tokens of instructions before you even get to the user's question.
Chain-of-thought is unreliable at the edges. The LLM follows patterns well when the input matches your examples. But what about inputs that fall between categories? Is "Tell me about quantum computing, but only from academic sources" a filtered retrieval, or regular retrieval plus validation? The prompt doesn't say, so the LLM guesses.
Failures are unattributable. When the agent picks the wrong strategy, where did it go wrong? Was the tool description unclear? Was the chain-of-thought instruction missing a case? Did an example teach the wrong pattern? You're debugging a wall of text.
Improvements don't compound. You fix one failure by adding an example. This makes the prompt longer. Longer prompts are harder for the model to follow reliably. The next failure might be caused by the fix you just added. You're playing whack-a-mole.
The Behaviour Programming Paradigm#
OpenSymbolicAI takes a different approach. Instead of describing tools in prose and reasoning in chain-of-thought examples, you define primitives as code and teach decomposition patterns through executable examples.
Here's what the same RAG agent looks like (full example: examples-py/rag_agent):
class RAGAgent(PlanExecute):
    # PRIMITIVES: What the agent can do
    @primitive(read_only=True)
    def retrieve(self, query: str, k: int = 5) -> list[Document]:
        """
        Retrieve top-k documents semantically similar to the query.
        Returns documents with content, relevance scores, and metadata.
        """
        return self.retriever.query(query, k=k)

    @primitive(read_only=True)
    def rerank(self, documents: list[Document], query: str, k: int = 3) -> list[Document]:
        """
        Rerank documents by relevance using LLM scoring.
        Use this when initial retrieval may include less relevant results.
        """
        # Implementation details...
        return top_k_documents

    @primitive(read_only=True)
    def extract_answer(self, context: str, question: str) -> str:
        """
        Extract a direct answer from context.
        Use this after retrieving and combining relevant documents.
        """
        # Implementation details...
        return answer

So far, similar to tool definitions. The difference is in how we teach the agent when to use these primitives:
    # DECOMPOSITIONS: How to solve different types of problems
    @decomposition(
        intent="What is machine learning?",
        expanded_intent="Simple factual query: retrieve relevant documents, "
                        "combine into context, extract answer directly",
    )
    def _simple_qa(self) -> str:
        """Basic RAG: retrieve -> combine -> extract"""
        docs = self.retrieve("machine learning definition basics", k=3)
        context = self.combine_contexts(docs)
        return self.extract_answer(context, "What is machine learning?")

    @decomposition(
        intent="Explain the architectural innovations in transformer models",
        expanded_intent="Complex technical query needing high relevance: "
                        "retrieve many documents, rerank for best matches",
    )
    def _reranked_qa(self) -> str:
        """Reranked RAG: retrieve(many) -> rerank -> combine -> extract"""
        docs = self.retrieve("transformer architecture innovations", k=8)
        top_docs = self.rerank(docs, "transformer model innovations", k=3)
        context = self.combine_contexts(top_docs)
        return self.extract_answer(context, "What are the architectural innovations?")

Notice what's different: the examples are executable code, not prose.
Why This Matters#
Decompositions are testable. Each @decomposition method can be run independently. You can verify that _simple_qa() produces the right answer, that _reranked_qa() actually improves relevance. The examples aren't just documentation; they're tests.
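For instance, a decomposition can sit directly in your test suite. The sketch below is hypothetical: it assumes RAGAgent can be constructed with an injected retriever and that Document accepts a content field, and it introduces a FakeRetriever stub; none of these appear in the original example, and extract_answer would still call the underlying model unless you stub it too.

# Minimal sketch of running a decomposition as a test. Assumptions: the
# RAGAgent(retriever=...) constructor, Document(content=...), and the
# FakeRetriever stub are hypothetical; extract_answer still hits the LLM
# unless you stub it as well.
class FakeRetriever:
    """Returns canned documents so retrieval is deterministic."""
    def __init__(self, docs: list[Document]):
        self.docs = docs

    def query(self, query: str, k: int = 5) -> list[Document]:
        return self.docs[:k]


def test_simple_qa_answers_from_retrieved_context():
    docs = [Document(content="Machine learning is the study of algorithms "
                             "that improve from data.")]
    agent = RAGAgent(retriever=FakeRetriever(docs))
    answer = agent._simple_qa()  # a decomposition is just a method call
    assert "data" in answer.lower()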
The LLM learns composition patterns. When the agent sees a new query like "Explain attention mechanisms in detail," it doesn't pattern-match against text. It recognizes the structure: complex technical query → needs reranking. The decomposition examples teach the mapping from intent to execution pattern.
Failures are attributable. If the agent uses the wrong strategy, you can see exactly which decomposition pattern it matched (or failed to match). If a primitive fails, you know which primitive. The separation between "what strategy to use" and "how to execute the strategy" makes debugging tractable.
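In practice that means you can reproduce a failure at the layer where it happened, for example by exercising a single primitive in isolation. A hypothetical debugging session, assuming a constructed agent instance and that documents expose score and metadata fields (the exact attribute names are assumptions):

# Hypothetical debugging session: call one primitive directly to see whether
# the failure is in retrieval itself or in the strategy that used it.
# (doc.score and doc.metadata are assumed attribute names.)
docs = agent.retrieve("transformer architecture innovations", k=8)
for doc in docs:
    print(doc.score, doc.metadata.get("source"))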
Improvements compound. Adding a new decomposition example doesn't make existing examples worse. Adding a new primitive doesn't change how existing primitives work. Each capability you add becomes part of a growing library that the agent can compose.
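To make that concrete, the ambiguous query from earlier ("Tell me about quantum computing, but only from academic sources") can be pinned down by adding one decomposition, leaving the existing ones untouched. This sketch assumes the retrieve primitive also accepts the optional source_filter argument that the JSON-schema version had:

    # Hypothetical new decomposition: source-restricted retrieval. Assumes
    # retrieve() accepts a source_filter argument (present in the JSON schema
    # earlier, not in the primitive shown above).
    @decomposition(
        intent="Tell me about quantum computing, but only from academic sources",
        expanded_intent="Source-restricted query: retrieve with an explicit "
                        "source filter, combine, and extract the answer",
    )
    def _filtered_qa(self) -> str:
        docs = self.retrieve("quantum computing fundamentals", k=5,
                             source_filter="arxiv")
        context = self.combine_contexts(docs)
        return self.extract_answer(context, "What is quantum computing?")

Nothing else changes: the other decompositions keep working exactly as before, which is what compounding means in practice.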
A Concrete Comparison#
Let's see how both approaches handle a multi-hop question: "Who created Python and what company does the creator work for now?"
Tool Calling Approach#
The prompt needs to explain multi-hop reasoning in prose:
For multi-hop questions (questions requiring multiple pieces of info):
- Call retrieve for the first part
- Call extract_answer for partial result
- Formulate a follow-up query based on what's missing
- Call retrieve again with the follow-up
- Combine both answers

The LLM must:
- Recognize this is a multi-hop question
- Remember the multi-hop instructions from the prompt
- Generate the right sequence of tool calls
- Keep track of intermediate state across calls
- Combine results appropriately
Each step is a chance for error.
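And the prompt is only half the system. Around it, your application still has to parse the model's JSON, dispatch each tool call, and feed results back in. The sketch below is illustrative only: call_llm, TOOL_IMPLS, and the message format are placeholders, not any specific provider's API.

# Rough sketch of the orchestration loop around the tool-calling prompt.
# call_llm and TOOL_IMPLS are placeholders: wire in your model client and
# the actual retrieve/rerank/extract_answer/... implementations.
import json
from typing import Callable

TOOL_IMPLS: dict[str, Callable] = {}  # e.g. {"retrieve": retrieve, ...}

def call_llm(messages: list[dict]) -> str:
    """Placeholder for the model client; returns the assistant's raw reply."""
    raise NotImplementedError

def run_agent(user_question: str, max_steps: int = 8) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_question}]
    for _ in range(max_steps):
        reply = json.loads(call_llm(messages))        # hope it is valid JSON
        if reply.get("final_answer"):
            return reply["final_answer"]
        for call in reply.get("tool_calls", []):
            impl = TOOL_IMPLS[call["tool"]]           # hope the tool name exists
            result = impl(**call["parameters"])       # hope the parameters match
            messages.append({"role": "user",          # message format varies by API
                             "content": json.dumps({"tool": call["tool"],
                                                    "result": str(result)})})
    return "I couldn't find information about that."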
Behaviour Programming Approach#
    @decomposition(
        intent="Who created Python and what company does the creator work for now?",
        expanded_intent="Multi-hop query requiring chained reasoning: first retrieve "
                        "to find the creator, then generate follow-up query about their "
                        "current work, retrieve again, and aggregate both answers",
    )
    def _multi_hop_qa(self) -> str:
        # First hop: find the creator
        docs1 = self.retrieve("Python programming language creator", k=3)
        context1 = self.combine_contexts(docs1)
        creator_info = self.extract_answer(context1, "Who created Python?")

        # Generate follow-up query
        followup = self.generate_followup_query(
            "Who created Python and what company does the creator work for now?",
            creator_info,
        )

        # Second hop: find current work
        docs2 = self.retrieve(followup, k=3)
        context2 = self.combine_contexts(docs2)
        work_info = self.extract_answer(context2, followup)

        # Aggregate answers
        return self.aggregate_answers(
            [creator_info, work_info],
            "Who created Python and what company does the creator work for now?",
        )

The LLM sees an executable example of multi-hop reasoning. It doesn't need to infer the pattern from prose instructions. The structure is explicit in the code.
When a similar question arrives ("Who founded OpenAI and what are they working on now?"), the agent recognizes the multi-hop pattern and executes the same structure with different parameters.
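Sketched as code, the plan it composes for that question is essentially the method above with new arguments. This is illustrative only: it shows roughly what the agent generates at runtime, not a decomposition you would write yourself.

    # Illustrative only: roughly the plan the agent composes for the new
    # question, reusing the same primitives with different arguments.
    def _openai_multi_hop(self) -> str:
        question = "Who founded OpenAI and what are they working on now?"

        docs1 = self.retrieve("OpenAI founders", k=3)
        founder_info = self.extract_answer(self.combine_contexts(docs1),
                                           "Who founded OpenAI?")

        followup = self.generate_followup_query(question, founder_info)

        docs2 = self.retrieve(followup, k=3)
        work_info = self.extract_answer(self.combine_contexts(docs2), followup)

        return self.aggregate_answers([founder_info, work_info], question)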
The Fundamental Difference#
Tool calling with chain-of-thought treats the LLM as a text-completion engine that needs to be told everything in natural language.
Behaviour programming treats the LLM as a pattern-matching engine that learns from examples of executable behavior.
The prompt-based approach says: "Here's how to think about different types of problems. Good luck!"
The behaviour-based approach says: "Here are solved examples. Match the pattern and compose the primitives."
This is why one approach hits a ceiling and the other compounds.
When Each Approach Makes Sense#
Tool calling with massive prompts works well for:
- Simple agents with few tools (under 5)
- One-shot tasks without complex routing
- Prototyping before you know the patterns
Behaviour programming works well for:
- Production agents that need reliability
- Complex workflows with multiple strategies
- Systems that need to improve over time
- Teams that want testable, maintainable agents
The Engineering Insight#
The shift from tool calling to behaviour programming mirrors a pattern we've seen before in software engineering: moving from imperative to declarative.
Early web development meant writing step-by-step DOM manipulation. Modern frameworks declare the desired state and let the runtime figure out the steps.
Early agent development means writing step-by-step chain-of-thought. Behaviour programming declares the desired patterns and lets the LLM figure out which pattern applies.
The result is the same: cleaner abstractions, better testability, and systems that actually improve over time.
The wall of text becomes structured code. Vibes-based prompting becomes engineering.
See the numbers: Illustration Part 1, Part 2, Part 3. 60% fewer tokens, 55% faster, zero errors.
Want to see behaviour programming in action? Check out the complete RAG agent example with six different retrieval strategies defined as decompositions.