
# The Anatomy of PlanExecute: Why It Is What It Is

A deep dive into the design decisions behind OpenSymbolicAI's core blueprint: why plans are code, why execution is sandboxed, and why the planner only gets called once.

OpenSymbolicAI Team · February 19, 2026 · 15 min read

Tags: architecture, PlanExecute, agents, design

PlanExecute is the core blueprint of OpenSymbolicAI. Every agent you build, from a calculator to a RAG system, is a subclass of PlanExecute.

This post explains why every design decision in PlanExecute exists. Not what it does, but why it is what it is.

## The Core Loop

PlanExecute has exactly three phases:

```text
┌───────────────────────────────────────────────────────────┐
│  1. PLAN                                                  │
│     LLM generates code (assignment statements)            │
│     from primitive signatures + decomposition examples    │
│     ONE call. No loop.                                    │
└────────────────┬──────────────────────────────────────────┘
                 │
                 ▼
┌───────────────────────────────────────────────────────────┐
│  2. VALIDATE                                              │
│     AST parsing rejects anything dangerous                │
│     No loops, no imports, no eval, no private access      │
│     Allowlist, not blocklist                              │
└────────────────┬──────────────────────────────────────────┘
                 │
                 ▼
┌───────────────────────────────────────────────────────────┐
│  3. EXECUTE                                               │
│     Runtime evaluates each statement step-by-step         │
│     Full tracing: args, results, timing, state            │
│     LLM use within primitives is engineer-controlled      │
└───────────────────────────────────────────────────────────┘
```

Every decision in PlanExecute traces back to keeping this loop clean, safe, and observable.

## Decision 1: Plans Are Code

Plans are not JSON. Not YAML. Not natural language instructions. They are code: assignment statements that call primitives and compose their results.

```python
# A plan generated by PlanExecute
ml_docs = retrieve(query="machine learning fundamentals", k=5)
dl_docs = retrieve(query="deep learning architectures", k=5)
combined = combine_contexts(documents=ml_docs + dl_docs)
answer = extract_answer(context=combined, question="Compare ML and DL")
```

Why code?

Because code is the only format that is simultaneously:

  1. Executable. Plans don't need a custom interpreter. The host language runtime handles it.
  2. Composable. The output of one function is the input to the next, naturally.
  3. Parseable. Standard AST tooling gives us a complete syntax tree. We can validate, analyze, and transform plans before running them.
  4. Precise. No ambiguity. retrieve(query="ML", k=5) means exactly one thing.
  5. Testable. You can run a plan against mock primitives and assert on the output. Plans are programs, and programs have test suites.
  6. Debuggable. Step through a plan line by line. Inspect every variable. Standard debugging workflows apply.
  7. Versionable. Plans are text. They go into git. You can diff two plans, review changes, and track how agent behavior evolves over time.
  8. Familiar. Developers already know how to read, write, and review code.

JSON tool-call schemas force you to build a routing layer. Natural language plans force you to build a parser. Code plans need neither. The language runtime handles execution, and the standard library handles validation.
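Because a plan is ordinary code, the testability claim above is literal: a plan string can be executed against mock primitives and asserted on like any other program. A minimal sketch, with the mock functions below standing in for real primitive implementations (they are not OpenSymbolicAI APIs):

```python
# Hypothetical mock primitives mirroring the example plan above.
def retrieve(query: str, k: int = 5) -> list[str]:
    return [f"doc about {query}"] * k

def combine_contexts(documents: list[str]) -> str:
    return "\n".join(documents)

def extract_answer(context: str, question: str) -> str:
    return f"Answer to {question!r} from {len(context)} chars of context"

plan = '''
ml_docs = retrieve(query="machine learning fundamentals", k=5)
dl_docs = retrieve(query="deep learning architectures", k=5)
combined = combine_contexts(documents=ml_docs + dl_docs)
answer = extract_answer(context=combined, question="Compare ML and DL")
'''

# The plan runs in a bare namespace containing only the mocks.
namespace = {
    "retrieve": retrieve,
    "combine_contexts": combine_contexts,
    "extract_answer": extract_answer,
}
exec(plan, {"__builtins__": {}}, namespace)
print(namespace["answer"])
```

Swap the mocks for real primitives and the same plan runs unchanged; that is what makes plans first-class test subjects.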

When the LLM generates code, it's not "describing what to do." It's writing a program. LLMs are better at writing programs than describing programs.

The current implementation uses Python, but the architecture is language-agnostic. The same plan-validate-execute loop works in any language with an AST parser and a controlled execution environment.

## Decision 2: Primitives Are Decorated Methods

You define what the agent can do using the @primitive decorator:

```python
from datetime import date

class DateAgent(PlanExecute):
    @primitive(read_only=True)
    def today(self) -> str:
        """Get today's date in ISO format."""
        return date.today().isoformat()

    @primitive(read_only=True)
    def days_between(self, start: str, end: str) -> int:
        """Calculate the number of days between two ISO dates."""
        d1 = date.fromisoformat(start)
        d2 = date.fromisoformat(end)
        return (d2 - d1).days
```

Why decorators instead of tool schemas?

Because a decorated method is the implementation. In tool-calling frameworks, the tool schema and the tool implementation are separate artifacts that can drift apart. The description says one thing, the code does another, and the LLM gets confused.

With @primitive, the function signature is the schema. The docstring is the description. The type hints are the parameter types. There's one source of truth, and it's the code itself.

The read_only flag is not an afterthought. It's a security boundary. Read-only primitives execute freely. Mutations require approval. This distinction is enforced by the execution engine, not by prompting. More on this below.
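To make "the signature is the schema" concrete, here is a hypothetical sketch of what a `@primitive`-style decorator could capture; the real decorator's internals are not shown in this post, so the details are illustrative:

```python
import inspect
from typing import Callable

def primitive(read_only: bool = True) -> Callable:
    """Illustrative decorator: the schema is derived from the function itself."""
    def wrap(fn: Callable) -> Callable:
        fn.__primitive__ = {
            "name": fn.__name__,
            "signature": str(inspect.signature(fn)),  # schema = the signature
            "description": inspect.getdoc(fn),        # description = the docstring
            "read_only": read_only,                   # security boundary flag
        }
        return fn
    return wrap

@primitive(read_only=True)
def days_between(start: str, end: str) -> int:
    """Calculate the number of days between two ISO dates."""
    ...

print(days_between.__primitive__["signature"])
# → (start: str, end: str) -> int
```

Because the metadata is computed from the function, signature, description, and implementation cannot drift apart.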

## Decision 3: Decompositions Teach Through Code

Decompositions are executable examples that teach the LLM how to compose primitives:

```python
@decomposition(
    intent="How many days until Christmas 2025?",
    expanded_intent="Get today's date, then calculate days between today and the target date",
)
def _days_until_christmas(self) -> int:
    today = self.today()
    result = self.days_between(start=today, end="2025-12-25")
    return result
```

Why executable examples instead of chain-of-thought prompts?

Three reasons.

Decompositions are testable. You can run _days_until_christmas() and verify it returns the right number. If your example is wrong, you'll know before the agent ever sees it. Chain-of-thought examples are documentation; decompositions are test cases.

Decompositions are extracted automatically. PlanExecute uses reflection and AST parsing to pull the function body at runtime. You write code; the prompt assembles itself. No manual synchronization between examples and signatures.

Decompositions compose. The LLM sees the structural pattern: "get a value, pass it to the next function, return the result." When a new query like "How many days until my birthday?" arrives, the model doesn't match text. It matches the composition pattern and adapts.

The intent parameter tells the LLM when to apply this pattern. The expanded_intent explains why this composition makes sense. The code shows how. Intent, rationale, implementation, all in one artifact.
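A rough sketch of the extraction mechanism described above, using `inspect` and `ast`; the actual OpenSymbolicAI code may differ:

```python
import ast
import inspect
import textwrap

def _days_until_christmas(self) -> int:
    today = self.today()
    result = self.days_between(start=today, end="2025-12-25")
    return result

# inspect pulls the raw source; ast gives us the structured body,
# which can then be dropped into the planner prompt verbatim.
source = textwrap.dedent(inspect.getsource(_days_until_christmas))
func = ast.parse(source).body[0]
body = "\n".join(ast.unparse(stmt) for stmt in func.body)
print(body)
```

The function stays runnable (and therefore testable), while the prompt text is derived from it automatically.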

## Decision 4: The Planner Gets Called Once

PlanExecute calls the LLM exactly once per task to generate the plan. After that, the execution engine runs each statement through the primitives the engineer has built. Any LLM usage during execution happens inside those primitives, controlled by the developer's code, not the agent loop.

This is the single most important design decision.

ReAct calls the LLM at every step. Each iteration re-reads the full context: system prompt, tool definitions, all previous thoughts, all previous observations. A 5-step workflow reads documents 4+ times. Tokens grow quadratically. Cost, latency, and failure rate all grow with it.

PlanExecute separates planning from execution. The planner sees only function signatures, decomposition examples, and the task. It outputs a program. The runtime executes the program. Data flows through primitives. If a primitive like extract_answer needs an LLM internally, the engineer's implementation decides exactly what data to send and how.

```text
ReAct (5 steps):    ~37,000 tokens through the planner
PlanExecute:         ~1,000 tokens through the planner
                     + targeted LLM calls within primitives (engineer-controlled)
```

The difference: in ReAct, the agent loop controls data flow. Every piece of data passes through the context window because the LLM needs to reason about what to do next. In PlanExecute, the code controls data flow. The planner never sees the data. Primitives use LLMs surgically, sending only what's needed for their specific task.

But cost savings aren't the only reason. The other reason is determinism. Once the plan is generated, execution follows the code. The same plan always produces the same result (given the same primitive implementations). You can replay it. You can debug it. You can test it. You can checkpoint it and resume it on a different machine. None of this is possible when the agent loop is making decisions at every step.

## Decision 5: Validation Uses the AST, Not Regex

Before any plan runs, PlanExecute validates it using the language's AST module. Not regex. Not string matching. The actual abstract syntax tree.

```python
# These are REJECTED at the AST level:
if x > 5:            # No conditionals
for item in items:   # No loops
import os            # No imports
exec("code")         # No dynamic execution
x._private           # No private attribute access
unknown_func()       # No unlisted function calls
```

Why AST-level validation?

Because regex can be tricked. A clever LLM output might slip a dangerous operation past a text-based filter. The AST cannot be fooled. It represents the actual parse tree of the code. If the tree contains an import node, the code imports something. Period.

PlanExecute validates three things:

  1. Every top-level statement is an assignment. No bare expressions, no side-effect-only statements.
  2. No disallowed syntax nodes. Conditionals, loops, try/except, function definitions, imports, raise, assert, delete. All rejected.
  3. Every function call targets a known primitive or safe builtin. If it's not in the allowlist, it doesn't exist.

This is an allowlist model. The validator doesn't try to enumerate what's dangerous. It enumerates what's allowed and rejects everything else. The default-deny posture means new attack vectors don't work unless they somehow fit within an assignment calling a blessed function.
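As an illustration of the default-deny posture, a condensed validator might look like the sketch below; the real validator is more thorough, and `ALLOWED_CALLS` is a placeholder allowlist:

```python
import ast

ALLOWED_CALLS = {"retrieve", "combine_contexts", "extract_answer", "len"}
FORBIDDEN = (ast.If, ast.For, ast.While, ast.Try, ast.Import, ast.ImportFrom,
             ast.FunctionDef, ast.Raise, ast.Assert, ast.Delete)

def validate(plan: str) -> list[str]:
    """Return a list of validation errors; an empty list means the plan passes."""
    errors = []
    tree = ast.parse(plan)
    for stmt in tree.body:
        if not isinstance(stmt, ast.Assign):
            errors.append("top-level statement is not an assignment")
    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN):
            errors.append(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Call):
            name = getattr(node.func, "id", None)  # None for attribute calls
            if name not in ALLOWED_CALLS:
                errors.append(f"unknown function call: {name}")
        if isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            errors.append(f"private attribute access: {node.attr}")
    return errors

print(validate('x = retrieve(query="ml", k=5)'))  # → []
print(validate("import os"))
```

Note that nothing here enumerates attacks; anything outside "assignment calling an allowed name" is rejected by construction.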

## Decision 6: Execution Is Sandboxed

Even after validation, the plan runs in a stripped environment:

```python
exec(
    compiled_statement,
    {"__builtins__": {}},   # Empty builtins, nothing available
    namespace,              # Only primitives + safe builtins + user variables
)
```

The execution namespace contains exactly:

  • The registered primitives
  • An allowlist of safe builtins (len, range, str, list, dict, etc.)
  • Variables from previous statements in the same plan

Nothing else. No filesystem access. No network access. No dynamic evaluation. These don't exist in the execution universe. Even if a plan somehow passed validation containing a dangerous call, it would fail at runtime because the capability isn't in the namespace.

This is defense in depth. Validation catches problems statically. The sandbox catches them at runtime. Both layers have to be defeated for anything dangerous to execute.
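Putting the sandbox together, a stripped-down statement runner might look like this sketch; `SAFE_BUILTINS` and `run_plan` are illustrative names, not the framework's API:

```python
# Only these builtins exist inside the execution universe.
SAFE_BUILTINS = {"len": len, "range": range, "str": str, "list": list, "dict": dict}

def run_plan(plan: str, seed_vars: dict) -> dict:
    """Execute each statement with empty __builtins__ and an explicit namespace."""
    namespace = {**SAFE_BUILTINS, **seed_vars}
    for statement in plan.strip().splitlines():
        code = compile(statement, "<plan>", "exec")
        # No filesystem, network, or dynamic evaluation is reachable here.
        exec(code, {"__builtins__": {}}, namespace)
    return namespace

ns = run_plan("n = len(docs)\nmsg = str(n)", {"docs": ["a", "b", "c"]})
print(ns["msg"])  # → 3
```

Even `open` or `__import__` would raise `NameError` here: the capability simply is not in scope.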

## Decision 7: Every Step Is Traced

Each statement execution produces an ExecutionStep with complete observability:

```python
ExecutionStep(
    step_number=2,
    statement='answer = extract_answer(context=combined, question="Compare ML and DL")',
    variable_name="answer",
    primitive_called="extract_answer",
    args={
        "context": ArgumentValue(
            expression="combined",
            resolved_value="Machine learning is...[full text]...",
            variable_reference="combined",
        ),
        "question": ArgumentValue(
            expression='"Compare ML and DL"',
            resolved_value="Compare ML and DL",
            variable_reference=None,
        ),
    },
    namespace_before={"ml_docs": [...], "dl_docs": [...], "combined": "..."},
    namespace_after={..., "answer": "Machine learning encompasses..."},
    result_type="str",
    result_value="Machine learning encompasses...",
    time_seconds=0.342,
    success=True,
)
```

Why this level of tracing?

Because agents fail. And when they fail, you need to know exactly what happened. Not "the agent returned an error," but "step 2 called extract_answer with context=combined (which resolved to 'Machine learning is...') and question='Compare ML and DL', and it failed because..."

The trace captures:

  • What was called. The primitive name.
  • What was passed. Both the expression (combined) and the resolved value (the actual data).
  • The full state. Namespace before and after each step.
  • Timing. How long each step took.
  • Success or failure. With the error message if it failed.

This makes debugging concrete. When a plan produces wrong results, you can walk through the trace step by step, see every intermediate value, and pinpoint where things diverged.

For regulated industries, this trace is the audit trail. Every action the agent took, with what data, producing what result. Compliance teams can inspect it. Security teams can search it. No ambiguity.

## Decision 8: Mutations Are Gated

Primitives are tagged as either read-only or mutation:

```python
@primitive(read_only=True)     # Executes freely
def search(self, query: str) -> list[Document]: ...

@primitive(read_only=False)    # Requires approval
def delete_document(self, doc_id: str) -> bool: ...
```

When a plan calls a mutation, PlanExecute can pause execution, yield a checkpoint, and wait for approval before continuing.

Why build this into the execution engine?

Because the alternative is hoping the LLM respects your prompt saying "don't delete things without asking." That's not a security model; that's a suggestion.

PlanExecute enforces the distinction structurally. The on_mutation hook fires before execution. The checkpoint system pauses at the mutation boundary. No code path exists where a mutation executes without passing through the gate.

The hook receives full context (what primitive, what arguments) and returns either None (approve) or a reason string (reject):

```python
def approval_policy(ctx: MutationHookContext) -> str | None:
    if ctx.method_name == "delete_document":
        return "Document deletion requires manual approval"
    return None  # Allow other mutations
```

This is the human-in-the-loop pattern, implemented as code, not as a prompt.
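Structural enforcement can be sketched in a few lines. Everything here is illustrative: it assumes primitives carry a `read_only` flag (as a `@primitive`-style decorator might set) and a hook with the approve/reject contract described above:

```python
def call_primitive(fn, hook, **kwargs):
    """Run a primitive, but route mutations through the approval hook first."""
    meta = getattr(fn, "__primitive__", {"read_only": True})
    if not meta["read_only"]:
        reason = hook(fn.__name__, kwargs)  # fires BEFORE execution
        if reason is not None:
            raise PermissionError(f"Mutation rejected: {reason}")
    return fn(**kwargs)

# Hypothetical mutation primitive with the flag attached.
def delete_document(doc_id: str) -> bool:
    return True
delete_document.__primitive__ = {"read_only": False}

def hook(name, args):
    if name == "delete_document":
        return "Document deletion requires manual approval"
    return None

try:
    call_primitive(delete_document, hook, doc_id="42")
except PermissionError as e:
    print(e)  # → Mutation rejected: Document deletion requires manual approval
```

The point is that there is no code path from plan to mutation that bypasses the gate; the prompt never enters into it.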

## Decision 9: Plans Can Be Retried

When a plan fails validation, PlanExecute can retry with feedback:

```text
Attempt 1:  LLM generates plan → validation fails ("for loop not allowed")

            feedback = "for loop not allowed"

Attempt 2:  LLM generates plan with feedback → validation passes → execute
```

Why retry instead of failing immediately?

Because LLMs are stochastic. A model might produce a valid plan 90% of the time. The 10% failure case is often a trivial fix: the model used a loop when it should have used a different approach, or prefixed a call with self. out of habit. Feeding the validation error back as feedback usually fixes it in one retry.

Each attempt is recorded as a PlanAttempt with the full prompt, response, and validation error. This gives you visibility into the planning process, not just the final plan.

## Decision 10: Multi-Turn State Persists in Variables

PlanExecute supports multi-turn conversations where variables persist between runs:

```python
config = PlanExecuteConfig(multi_turn=True)
agent = Calculator(llm=llm, config=config)

# Turn 1: result = add(2, 3)  → 5
r1 = agent.run("Add 2 and 3")

# Turn 2: final = multiply(result, 10)  → 50
# 'result' from turn 1 is available in the namespace
r2 = agent.run("Multiply the result by 10")
```

Why persist state in variables instead of in the prompt?

Because variables don't get re-tokenized. If turn 1 retrieves 10,000 tokens of documents, those documents exist as an object in memory. Turn 2 can reference them by variable name. The planner sees "you have a variable called documents", not the documents themselves.

Conversation history (the task, the plan, and whether it succeeded) is included in the prompt so the LLM has conversational context. But the data stays in variables. This is the Symbolic Firewall applied to conversation state.
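The mechanics can be sketched in a few lines; `MultiTurnNamespace` and its methods are illustrative, not the framework's actual classes:

```python
class MultiTurnNamespace:
    def __init__(self):
        self.variables: dict[str, object] = {}

    def run_turn(self, plan: str) -> None:
        # Variables from previous turns are already in scope; only a
        # small allowlist of builtins is exposed.
        exec(plan, {"__builtins__": {"len": len}}, self.variables)

    def summary_for_planner(self) -> str:
        # The planner sees names and types, never the data itself.
        return ", ".join(f"{k}: {type(v).__name__}"
                         for k, v in self.variables.items())

ns = MultiTurnNamespace()
ns.run_turn("documents = ['10k tokens of retrieved text...'] * 100")
ns.run_turn("count = len(documents)")
print(ns.summary_for_planner())  # → documents: list, count: int
```

However large `documents` grows, the planner's view of it stays a few tokens wide.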

## Decision 11: Checkpoints Enable Distribution

PlanExecute supports checkpoint-based execution for distributed systems:

```python
for checkpoint in agent.execute_stepwise("process the data"):
    store.save(checkpoint)  # Persist to database

    if checkpoint.status == CheckpointStatus.AWAITING_APPROVAL:
        break  # Wait for human approval

# Later, possibly on a different machine:
checkpoint = store.load(checkpoint_id)
for checkpoint in agent.resume_from_checkpoint(checkpoint, approve_mutation=True):
    store.save(checkpoint)
```

Why build checkpointing into the execution engine?

Because real-world agent execution is rarely synchronous. A plan might take minutes. A mutation might need human approval from someone who isn't online. The server might restart mid-execution.

Because plans are code and state is a dictionary of variables, the full execution state can be serialized, stored, and resumed. The checkpoint captures:

  • Which step we're on
  • The full namespace (serialized)
  • All completed steps
  • Any pending mutation awaiting approval
  • Which worker is executing

This makes PlanExecute viable for production systems where "run to completion in one shot" isn't realistic.

## The Extension Point: DesignExecute

PlanExecute deliberately forbids loops, conditionals, and exception handling. This is a feature, not a limitation. Simple plans are easier to validate, trace, and debug.

But some problems genuinely need control flow. A shopping cart with a variable number of items needs a loop. Price thresholds need conditionals. For these cases, DesignExecute extends PlanExecute:

```python
class ShoppingCart(DesignExecute):
    @primitive(read_only=True)
    def lookup_price(self, item: str) -> float: ...

    @primitive(read_only=True)
    def apply_discount(self, price: float, percent: float) -> float: ...
```

DesignExecute allows for, while, if/elif/else, and try/except, but injects loop guards via AST transformation to prevent infinite loops:

```python
# LLM generates:
for item in items:
    process(item)

# DesignExecute transforms it to:
__loop_guard_1__ = 0
for item in items:
    __loop_guard_1__ += 1
    if __loop_guard_1__ > 1000:
        raise RuntimeError("Loop limit exceeded")
    process(item)
```
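Guard injection of this kind can be sketched with `ast.NodeTransformer`; the real DesignExecute transform is presumably more careful (while loops, nesting, configurable limits), but the shape is the same:

```python
import ast

LOOP_LIMIT = 1000

class LoopGuard(ast.NodeTransformer):
    """Insert a counter before each for loop and a limit check inside it."""

    def __init__(self):
        self.counter = 0

    def visit_For(self, node: ast.For):
        self.generic_visit(node)  # handle nested loops first
        self.counter += 1
        name = f"__loop_guard_{self.counter}__"
        init = ast.parse(f"{name} = 0").body[0]
        guard = ast.parse(
            f"{name} += 1\n"
            f"if {name} > {LOOP_LIMIT}:\n"
            f"    raise RuntimeError('Loop limit exceeded')"
        ).body
        node.body = guard + node.body
        return [init, node]  # counter init lands just before the loop

source = "total = 0\nfor item in items:\n    total = total + item"
tree = ast.fix_missing_locations(LoopGuard().visit(ast.parse(source)))
ns = {"items": [1, 2, 3]}
exec(compile(tree, "<plan>", "exec"),
     {"__builtins__": {"RuntimeError": RuntimeError}}, ns)
print(ns["total"])  # → 6
```

The transformed code still runs inside the same sandboxed namespace; the guard just makes runaway loops impossible.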

The hierarchy is deliberate:

|                | PlanExecute     | DesignExecute   |
| -------------- | --------------- | --------------- |
| Assignments    | Yes             | Yes             |
| Function calls | Primitives only | Primitives only |
| Loops          | No              | Yes (guarded)   |
| Conditionals   | No              | Yes             |
| Try/except     | No              | Yes             |
| Complexity     | Minimal         | Controlled      |

Start with PlanExecute. Move to DesignExecute only when you need control flow. The constraint is the feature.

## Putting It All Together

Here's a complete agent that demonstrates every concept:

```python
from opensymbolicai import PlanExecute, primitive, decomposition

class ResearchAgent(PlanExecute):

    @primitive(read_only=True)
    def search(self, query: str, k: int = 5) -> list[dict]:
        """Search the knowledge base for relevant documents."""
        return self.vector_store.query(query, k=k)

    @primitive(read_only=True)
    def summarize(self, documents: list[dict], focus: str) -> str:
        """Summarize documents with a specific focus."""
        # The engineer decides what data reaches the LLM here
        return self.llm.summarize(documents, focus=focus)

    @primitive(read_only=False)
    def save_report(self, content: str, title: str) -> str:
        """Save a research report. Returns the report ID."""
        return self.db.save(title=title, content=content)

    @decomposition(
        intent="Research quantum computing and save a summary",
        expanded_intent="Search for documents, summarize with focus, save as report",
    )
    def _research_and_save(self) -> str:
        docs = self.search(query="quantum computing recent advances", k=8)
        summary = self.summarize(documents=docs, focus="practical applications")
        report_id = self.save_report(content=summary, title="Quantum Computing Report")
        return report_id
```

When agent.run("Research neural architecture search and save a report") is called:

  1. Plan: The planner LLM sees search, summarize, and save_report signatures plus the decomposition example. It generates:

     ```python
     docs = search(query="neural architecture search methods", k=8)
     summary = summarize(documents=docs, focus="recent breakthroughs")
     report_id = save_report(content=summary, title="NAS Report")
     ```
  2. Validate: The AST parser confirms: three assignments, three primitive calls, no disallowed syntax. Pass.

  3. Execute: The runtime evaluates each statement through the engineer's primitives. search returns documents that stay in the docs variable, never sent to the planner. summarize uses an LLM internally, but the engineer's code decides what data it sees and how. save_report triggers the mutation gate. If require_mutation_approval is set, execution pauses for approval. Every step is traced.

  4. Result: OrchestrationResult returns with success=True, result="report-id-123", the full execution trace, plan attempt history, and token metrics.

The planner was called once. Data reached the LLM only when the engineer's code sent it, inside summarize, not inside the agent loop. Every step is auditable. Mutations were gated. The plan is replayable.

That's why PlanExecute is what it is.


Read more: Behaviour Programming vs Tool Calling | LLM Attention Is Precious | Secure by Design

See the code: OpenSymbolicAI Core | Examples