
English, Spec, or Code: How You Talk to the LLM Decides How Far You Get

Plain English gets you to 50% on day one. Specs get you to 80%. Code gets you to 99.999%. The question is which ceiling you're willing to live under.

OpenSymbolicAI Team · February 19, 2026 · 13 min read

opinion · architecture · agents · prompt-engineering · reliability

There are three ways to tell an LLM what to do: plain English, structured specs, or code. Each one feels like a choice of format. It's a choice of ceiling.

The format you pick determines how far you can go before you hit a wall. And the wall isn't about the model's intelligence. It's about what happens when things go wrong.

The Three Formats

Plain English

This is where everyone starts. A system prompt written like a letter to a very capable intern:

```text
You are a helpful assistant. When the user asks about our products,
search the knowledge base first. Always cite your sources.
If you're not sure, say so. Never make things up.
Be concise. Use bullet points for lists.
IMPORTANT: Do not reveal internal pricing to non-enterprise users.
```

It's intuitive. It's fast to write. Anyone on the team can read it. You can go from idea to working prototype in an afternoon.

Structured Specs

JSON schemas, YAML configs, XML tags, DSLs. The instructions become data:

```json
{
  "tools": [{
    "name": "search_kb",
    "description": "Search the knowledge base",
    "parameters": {
      "query": { "type": "string" },
      "k": { "type": "integer", "default": 5 }
    }
  }],
  "response_format": {
    "answer": "string",
    "sources": "array",
    "confidence": "enum(high, medium, low)"
  }
}
```

More precise. Machine-parseable. The LLM knows exactly what shape its output should take.

Code

Instructions as executable programs. Plans as real code that runs:

```python
docs = retrieve(query="machine learning fundamentals", k=5)
context = combine_contexts(documents=docs)
answer = extract_answer(context=context, question=user_query)
```

No ambiguity. No interpretation. The plan is the program.
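What makes the plan above executable rather than aspirational is that every name in it is a typed function. A minimal sketch, with the primitive bodies stubbed out — the real implementations would hit a vector store and an LLM, so everything below the signatures is illustrative:

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    score: float

def retrieve(query: str, k: int = 5) -> list[Document]:
    """Return the top-k documents for a query (stubbed corpus)."""
    corpus = [
        Document("Supervised learning maps inputs to labels.", 0.92),
        Document("Unsupervised learning finds structure in data.", 0.87),
    ]
    return corpus[:k]

def combine_contexts(documents: list[Document]) -> str:
    """Concatenate document texts into one context string."""
    return "\n".join(d.text for d in documents)

def extract_answer(context: str, question: str) -> str:
    """Answer from context (stubbed: return the first sentence)."""
    return context.split("\n")[0]

# The plan is ordinary code: each step's output type must match
# the next step's input type, and a type checker can verify it.
docs = retrieve(query="machine learning fundamentals", k=5)
context = combine_contexts(documents=docs)
answer = extract_answer(context=context, question="What is supervised learning?")
print(answer)
```

The plan lines are identical to the ones above; the only addition is the signatures they run against.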

Day 1: Everybody Looks the Same

Plain English wins on day one.

You write a prompt. You test it on three examples. It works beautifully. You show it to your manager. They're impressed. You ship it.

This is the 50% moment. Half of all inputs work. The demo is stunning. The pitch deck writes itself.

Specs get you slightly further. The structured output means fewer formatting errors. The JSON schema means the LLM's responses are at least shaped correctly, even when the content is wrong. Call it 60%.

Code doesn't look impressive on day one. You have to define functions. Write type signatures. Build decomposition examples. This used to take a week where the other approaches took a day. In 2026, the gap is hours, not days. LLMs write boilerplate primitives and test scaffolds faster than you can write a detailed system prompt.

On day one, engineering looks like overhead. But the overhead is shrinking fast.

Week 2: The Divergence

Two weeks in, the English prompt has grown. Someone added a section about edge cases. Someone else added a "CRITICAL" warning about a failure mode that hit production. A third person added examples for a new query type.

The prompt is now 3,000 tokens of accumulated fixes:

```text
═══════════════════════════════════════════
⚠️  CRITICAL INSTRUCTIONS  ⚠️
═══════════════════════════════════════════

1. NEVER hallucinate or make up information
2. ALWAYS cite your sources
3. DO NOT call extract_answer without first calling retrieve
4. IMPORTANT: For technical questions, ALWAYS use rerank
5. ⚠️ CRITICAL: Multi-hop requires MULTIPLE retrieve calls

═══════════════════════════════════════════
COMMON MISTAKES (DO NOT MAKE THESE)
═══════════════════════════════════════════

❌ DON'T: Pass user's raw question to retrieve
✅ DO: Rephrase into search-optimized keywords

❌ DON'T: Skip rerank for technical questions
✅ DO: Always rerank when precision matters
```

You recognize this. Every team that has built an agent recognizes this. The prompt becomes a sedimentary record of every failure, every edge case, every panicked production fix. The all-caps warnings are scar tissue.

And the reliability curve has flatlined. You're at 70%. Maybe 75% on a good day.

You add another warning. It doesn't help. You add an example. It fixes one case and breaks two others. You're playing whack-a-mole with a wall of text.

This is the ceiling.

Why English Hits a Ceiling

The reason is structural, not intellectual.

Natural language is ambiguous. "Always cite your sources." Does that mean inline citations? Footnotes? A sources section? What if there's only one source? What if the answer synthesizes across five? The LLM gets to interpret, and it interprets differently every time.

Natural language is unordered. Instructions at the top of the prompt compete with instructions at the bottom for the model's attention. LLMs exhibit a "lost in the middle" effect: instructions buried in long contexts get ignored. Your careful edge-case handling in paragraph 47 might as well not exist.

Natural language is untestable. You can't write a unit test for "be concise." You can't assert that "never make things up" is being followed. You can only run examples and eyeball the output.

And natural language fixes don't compose. When you add a new instruction to handle edge case A, it doesn't integrate with existing instructions. It competes with them. The model has to weigh "always cite sources" against "be concise" against "use bullet points" against forty other directives. The more instructions you add, the less reliably any single one is followed.

This is why prompt engineering doesn't converge. You're not climbing a hill. You're rearranging sand.

And the math is unforgiving. If each step in a multi-step agent has 95% reliability (optimistic for English prompts), a 20-step workflow has a 36% end-to-end success rate. Even at 99% per step, 20 steps gives you 82%. The errors compound multiplicatively. The only way to reach 99.999% on a multi-step workflow is to make each step deterministic. English can't do that. Code can.
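The arithmetic is easy to check. With per-step success probability p and n independent steps, end-to-end success is p to the n:

```python
# End-to-end success of an n-step workflow where every step
# succeeds independently with probability p.
def end_to_end(p: float, n: int) -> float:
    return p ** n

print(round(end_to_end(0.95, 20), 2))  # ≈ 0.36 — 20 steps at 95% per step
print(round(end_to_end(0.99, 20), 2))  # ≈ 0.82 — 20 steps at 99% per step
print(end_to_end(1.0, 20))             # 1.0 — deterministic steps don't decay
```

Only p = 1 survives multiplication. That's the entire case for deterministic steps.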

Why Specs Hit a Higher Ceiling

Specs improve on English in one important way: they remove ambiguity from structure.

When you define a JSON schema for the output, the LLM knows exactly what shape to produce. No more "sometimes it returns markdown, sometimes it returns JSON, sometimes it returns a haiku." The schema constrains the format.

OpenAI's Structured Outputs achieves 100% schema adherence in strict mode. That's real. BAML's type definitions use 60% fewer tokens than JSON schemas with zero loss of information, and achieve a 0% structural failure rate where JSON schemas fail 6% of the time.

But specs only constrain format, not behavior. The JSON might be perfectly shaped and completely wrong. The function call might match the schema exactly and still be the wrong function to call. The parameters might be valid types and still be nonsensical values.

Specs raise the floor. They don't raise the ceiling on correctness.
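A quick way to see the floor/ceiling gap: a minimal shape check (a stand-in for strict schema validation, not any particular validator) happily accepts a perfectly formed, factually wrong answer:

```python
import json

# Verify keys and value types only — exactly what a schema can see.
EXPECTED = {"answer": str, "sources": list, "confidence": str}

def matches_shape(payload: str) -> bool:
    try:
        obj = json.loads(payload)
    except json.JSONDecodeError:
        return False
    return (set(obj) == set(EXPECTED)
            and all(isinstance(obj[k], t) for k, t in EXPECTED.items()))

# Perfectly shaped, confidently wrong: the schema cannot tell.
response = '{"answer": "The Eiffel Tower is in Berlin.", "sources": [], "confidence": "high"}'
print(matches_shape(response))  # True — the format passes; the content is false
```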

There's a deeper problem. Format restrictions degrade reasoning by 10-15%. Force the model to think in JSON and it gets measurably dumber. The structure you gain comes at the cost of the intelligence you need.

The workaround is "think freely, then structure." Let the model reason in natural language, then extract the answer into a schema. But now you're back to English for the part that matters most, and using specs only for the easy part.
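The two-stage pattern can be sketched in a few lines. The `FINAL_JSON` sentinel and the canned model output below are assumptions for illustration, not any particular API:

```python
import json
import re

# "Think freely, then structure": the model reasons in prose,
# then emits a structured answer we extract from the tail.
model_output = """The user asks about refund windows. The policy doc says
30 days for unopened items and 14 for opened ones. The item was opened.

FINAL_JSON: {"answer": "14 days", "confidence": "high"}"""

def extract_structured(text: str) -> dict:
    match = re.search(r"FINAL_JSON:\s*(\{.*\})", text, re.DOTALL)
    if match is None:
        raise ValueError("no structured answer found")
    return json.loads(match.group(1))

print(extract_structured(model_output))
```

The reasoning stays free-form; only the final extraction is constrained — which is the article's point: the part that matters most is still English.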

Why Code Breaks Through

Code doesn't just constrain format or behavior. It eliminates the interpretation step entirely.

When the plan is docs = retrieve(query="ML fundamentals", k=5), there's nothing to interpret. retrieve is a function. "ML fundamentals" is a string. 5 is an integer. The plan runs.

But the power of code isn't just precision. It's the properties that come with it:

Testable. You can write a test suite for a code plan. Mock the primitives, run the plan, assert on the output. If the test passes, the plan is correct. If it fails, you know which step failed and why.

Composable. The output of one function feeds into the next. This isn't a convention you hope the LLM follows. It's how programming languages work. Composition is guaranteed by the runtime.

Debuggable. Step through the plan line by line. Inspect every variable. See the exact state at every point. Standard debugging, because it is standard code.

Versionable. Plans are text files. They go into git. You can diff two plans, review changes, trace how agent behavior evolved over a release cycle.

Validatable. Parse the AST. Reject loops, imports, and dangerous operations before anything runs. An allowlist of what's permitted, not a blocklist of what's forbidden. You can't do this with English. You can barely do this with specs.

Attributable. When a code plan fails, you know exactly where. Step 3 called extract_answer with the wrong context. Not "somewhere in the prompt, something went wrong." A specific function, with specific arguments, producing a specific wrong result.
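The allowlist idea can be sketched with Python's standard `ast` module. The permitted node types and function names here are illustrative choices, not a prescribed policy:

```python
import ast

# Allowlist validation: parse a plan, accept only the node types a
# straight-line plan needs, and only calls to known primitives.
ALLOWED_CALLS = {"retrieve", "combine_contexts", "extract_answer"}
ALLOWED_NODES = (ast.Module, ast.Assign, ast.Expr, ast.Call,
                 ast.Name, ast.Constant, ast.keyword,
                 ast.Load, ast.Store)

def validate_plan(source: str) -> bool:
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Loops, imports, attribute access, etc. are simply not
        # in the allowlist, so they are rejected before anything runs.
        if not isinstance(node, ALLOWED_NODES):
            return False
        if isinstance(node, ast.Call):
            if not (isinstance(node.func, ast.Name)
                    and node.func.id in ALLOWED_CALLS):
                return False
    return True

print(validate_plan('docs = retrieve(query="ML", k=5)'))   # True
print(validate_plan('import os\nos.system("rm -rf /")'))   # False
```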

This is why code breaks through the ceiling. Not because code is a better format. Because code gives you engineering tools. Tests, debugging, version control, static analysis. The same tools that make traditional software reliable.
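The "mock the primitives, run the plan, assert on the output" loop above is ordinary `unittest.mock` territory. The primitives here are stand-ins; only the testing pattern is the point:

```python
from unittest.mock import patch

def retrieve(query, k=5):
    raise NotImplementedError  # real retriever not needed under test

def extract_answer(context, question):
    raise NotImplementedError

def plan(user_query):
    # The plan under test: retrieval output feeds extraction input.
    docs = retrieve(query=user_query, k=5)
    return extract_answer(context=" ".join(docs), question=user_query)

# Mock the primitives, run the plan, assert on the wiring.
with patch(__name__ + ".retrieve", return_value=["doc a", "doc b"]) as r, \
     patch(__name__ + ".extract_answer", return_value="42") as e:
    result = plan("meaning of life")

r.assert_called_once_with(query="meaning of life", k=5)
e.assert_called_once_with(context="doc a doc b", question="meaning of life")
print(result)
```

If the plan misroutes data — skips retrieval, passes the wrong argument — the assertions name the exact step that broke.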

"But Code Takes Longer"

This used to be true. It's not anymore.

Writing code in 2026 is cheap. The same LLMs that power your agents also write your primitives, your test suites, your type signatures. Scaffolding a typed function with a docstring takes seconds. The cost of writing code has collapsed. The cost of not writing code, debugging a wall of English at 3 AM, has not.

The old trade-off was: English is fast, code is slow, so start with English. The new reality is: English is fast, code is also fast, and only code compounds. The argument for starting with plain English prompts was always about speed. That argument is gone.

English asymptotes around 70%. More prompt tweaking doesn't help. You've hit the structural limit of natural language as an instruction format.

Specs raise the ceiling to maybe 80%. You get structural guarantees. But the reasoning, the routing, the "what should I do with this input" is still English under the hood.

Code doesn't plateau. Each primitive you add, each decomposition you write, each test you run, they compound. The ceiling keeps rising because you're building on engineering foundations, not on vibes.

The Numbers Are Brutal

The industry data tells the story of English-first approaches at scale:

  • 95% of enterprise AI pilots fail to reach production (MIT, 2025)
  • Only 5% of professionals surveyed have AI agents live in production (Cleanlab, 2025)
  • ~1% of enterprises have deployed agents beyond pilot stage
  • Simple CRM tasks fail up to 75% of the time when agents attempt them repeatedly (Superface, 2025)
  • The average demo-to-production timeline is 4 months of grinding through the last 30% (Rasa)
  • Gartner predicts 40%+ of agentic AI projects will be canceled by end of 2027

And better models don't fix this. A Princeton study from February 2026 decomposed agent reliability into four dimensions (consistency, robustness, predictability, safety) and found that all four are independent of raw capability. A more capable model is not automatically a more reliable one. Even frontier models show only "modest but not dependable improvements" in reliability. OpenAI's Operator, the best-funded agent in the world, hits 38.1% on OSWorld. GPT-5.2 still hallucinates on 6.2% of queries.

These aren't model failures. The models are capable. These are structural failures. Teams built on English prompts, hit the ceiling, spent months trying to push through, and gave up. The market agrees: Temporal raised $300M at a $5B valuation last week to solve agent reliability through durable execution infrastructure. Not better prompts. Code.

Who Wants What

The format question maps to a stakeholder question:

The Hacker

Wants: Ship something by Friday. Impress investors. Get to MVP.

Chooses: English. Maybe specs for output formatting.

Gets: 50% on day one. A great demo. Investor excitement. Then four months of diminishing returns trying to make it production-ready. And with AI writing code as fast as it writes English, the "speed advantage" of plain prompts has evaporated anyway.

The hacker isn't wrong. For prototyping, English is correct. The mistake is believing the prototype's architecture will scale.

The Enterprise Buyer

Wants: Audit trails. Compliance. Explainability. SOX, GDPR, HIPAA.

Needs: Code.

Because when the compliance officer asks "why did the agent do that?", you need a better answer than "the prompt said to." You need a trace. Step 3 called validate_claim with claim_id=47291, which returned status=rejected because coverage_amount exceeded policy_limit. Every step recorded. Every decision attributable.

You cannot build this on English prompts. You can barely build this on specs. You need executable plans with full tracing.
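What such a trace might look like as data — the field names and the `validate_claim` example follow the text above, and both are purely illustrative:

```python
import json
from dataclasses import dataclass, asdict

# One record per executed step: function, arguments, result.
@dataclass
class StepTrace:
    step: int
    function: str
    arguments: dict
    result: dict

trace = [
    StepTrace(step=3, function="validate_claim",
              arguments={"claim_id": 47291},
              result={"status": "rejected",
                      "reason": "coverage_amount exceeded policy_limit"}),
]

# Every step serializes to an auditable log line.
for record in trace:
    print(json.dumps(asdict(record)))
```

When the compliance officer asks "why?", the answer is a row in this log, not a guess about a prompt.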

The Researcher

Wants: Flexibility. Quick iteration. Try wild ideas.

Chooses: Whatever lets them move fastest. Often plain English, sometimes DSPy's compiled approach.

Gets: Insight into what's possible. Not a production system.

This is fine. Research and production have different goals. The problem is when research prototypes become production architectures by accident.

The Production Engineer

Wants: Reliability. Observability. Debugging. Rollback.

Needs: Code.

Because at 3 AM when the agent is failing on 30% of requests, you need to know which 30% and why. Not "the prompt might be confusing" but "the classify_intent primitive is returning billing for shipping queries because the embedding similarity threshold is too low."

Attributable failures. Concrete fixes. No regressions. This is what production means.

The Real Question

The debate between English, specs, and code is a proxy for a deeper question:

Do you want an application that gets to 50% on day one and plateaus at 70% forever? Or do you want an engineered system that reaches 99.999%?

Both answers are valid in different contexts.

If you're exploring a problem space, English is right. You don't know what you're building yet. The speed matters more than the ceiling.

If you're building a demo for a pitch, English is right. You need to show the possibility, not the production system.

But if you're building something that real users depend on, something that handles money, makes decisions, processes sensitive data, runs at scale, then the answer is obvious.

You need the ceiling to keep rising. You need failures to be attributable. You need fixes to compound. You need tests, version control, static analysis, debugging, audit trails.

You need code.

The Synthesis

The smartest teams don't pick one format. They compose them:

English for what LLMs are best at: understanding user intent, resolving ambiguity, handling the messy, unpredictable surface of human language.

Specs for what needs structural guarantees: output formats, parameter types, schema validation.

Code for what needs to be reliable: execution plans, control flow, data routing, business logic.

The LLM reads English to understand what the user wants. It writes code to express a plan. The code executes through typed, tested primitives. The output conforms to a spec.

Each format is used where it's strongest. None is asked to do what it can't.

This is what PlanExecute does. The planner LLM receives function signatures (specs) and decomposition examples (code) to understand the available operations. It receives the user query (English) to understand intent. It outputs a plan (code) that executes through sandboxed primitives. Every step is traced, validated, and gated.

English handles intent. Code handles execution. The gap between them is where engineering lives.

The Bottom Line

The era of prompt-driven behavior isn't ending because prompts are bad. Prompts alone aren't enough.

Andrej Karpathy coined "vibe coding" and the industry spent a year discovering its limits. Then he coined "agentic engineering." Even the person who named the vibes-first approach realized that engineering is what gets you from demo to production.

The question isn't which format to use. The question is which ceiling you're willing to live under.

If you can't test it, debug it, and trust it, it's not software. It's a demo.

Build software.


Read more: The Missing Flywheel in Agent Building | Behaviour Programming vs Tool Calling | The Anatomy of PlanExecute

See the code: OpenSymbolicAI Core | Examples