TravelPlanner Benchmark: 97.9% on 1,000 Tasks Where GPT-4 Gets 0.6%
OpenSymbolicAI achieves near-perfect scores on all 1,225 TravelPlanner tasks, outperforms LangChain and CrewAI head-to-head, and maps the model landscape across 11 LLMs and 4 providers.
The TravelPlanner benchmark (ICML 2024) tests whether AI agents can produce realistic, constraint-satisfying travel itineraries. Even GPT-4 achieves only a 0.6% final pass rate on this benchmark.
OpenSymbolicAI achieves 100% on train, 99.4% on validation, and 97.9% on the full 1,000-task test set. Near-perfect scores on every commonsense and hard constraint check, zero errors, 100% delivery rate. We also ran a head-to-head framework comparison against LangChain and CrewAI, and tested 11 models across 4 providers to map which LLMs actually work for multi-constraint planning.
The results weren't close.
Full-Scale Results#
We ran every split: train (45 tasks), validation (180 tasks), and the full 1,000-task test set.
| Split | Tasks | Delivery | Commonsense | Hard Constraints | Final Pass | Avg Time |
|---|---|---|---|---|---|---|
| Train | 45 | 100% | 100% | 100% | 100% | 52.6s |
| Validation | 180 | 100% | 99.4% | 100% | 99.4% | 55.5s |
| Test | 1,000 | 100% | 97.9% | 100% | 97.9% | 52.4s |
All hard constraint checks pass at 100% across all 1,225 tasks. The only misses are commonsense constraints: a handful of edge cases in city routing and restaurant diversity.
vs Published Baselines#
Results from the TravelPlanner paper on the validation split:
| Method | Delivery | Commonsense | Hard | Final Pass |
|---|---|---|---|---|
| GPT-3.5-Turbo | 100% | 2.9% | 1.7% | 0.6% |
| GPT-4 | 100% | 6.4% | 3.7% | 0.6% |
| GPT-4-Turbo | 99.4% | 11.7% | 4.6% | 4.4% |
| Gemini 1.5 Pro | 98.3% | 7.8% | 4.5% | 3.9% |
| OpenSymbolicAI | 100% | 99.4% | 100% | 99.4% |
The published baselines all use direct prompting or basic ReAct agents. The best published result (GPT-4-Turbo at 4.4%) trails OpenSymbolicAI's 99.4% on the same split by a factor of roughly 23. The framework matters more than the model.
Framework Comparison: The Headline Numbers#
Three frameworks. Same model. Same tools. Same evaluation. Only the framework differs. 45 tasks from the train split (15 easy + 15 medium + 15 hard).
| | OpenSymbolicAI | LangChain | CrewAI |
|---|---|---|---|
| Pass Rate | 100% | 77.8% | 73.3% |
| Tokens / Task | 13,936 | 43,801 | 81,331 |
| LLM Calls / Task | 2.3 | 13.5 | 39.6 |
| Cost / Passing Task | $0.013 | $0.051 | $0.100 |
| Avg Latency | 47s | 73s | 124s |
OpenSymbolicAI passes every task. The others don't, and they burn through more tokens trying.
Reliability Under Pressure#
Pass rates don't tell the full story. What matters is how they change as tasks get harder.
OpenSymbolicAI stays at 100% regardless of difficulty. LangChain drops from 93.3% on easy tasks to 66.7% on hard tasks. CrewAI drops from 80% to 60%. The harder the task, the wider the gap.
Hard tasks here mean multi-city trips with budget, cuisine, room type, and transportation constraints layered together. OpenSymbolicAI doesn't miss one.
This is the pattern we described in LLM Attention Is Precious: as context grows, tool-calling frameworks lose track of their instructions. The more turns in the conversation, the more the original task gets buried. Complexity doesn't just make tasks harder. It makes frameworks forget what they were doing.
The Multiplier Effect#
The efficiency gap isn't incremental. It's multiplicative.
vs LangChain#
| Dimension | OpenSymbolicAI | LangChain | Factor |
|---|---|---|---|
| Pass Rate | 100% | 77.8% | 1.3x higher |
| Tokens / Task | 13,936 | 43,801 | 3.1x fewer |
| LLM Calls / Task | 2.3 | 13.5 | 5.9x fewer |
| Cost / Passing Task | $0.013 | $0.051 | 4.1x cheaper |
| Latency | 47s | 73s | 1.5x faster |
vs CrewAI#
| Dimension | OpenSymbolicAI | CrewAI | Factor |
|---|---|---|---|
| Pass Rate | 100% | 73.3% | 1.4x higher |
| Tokens / Task | 13,936 | 81,331 | 5.8x fewer |
| LLM Calls / Task | 2.3 | 39.6 | 17x fewer |
| Cost / Passing Task | $0.013 | $0.100 | 8x cheaper |
| Latency | 47s | 124s | 2.6x faster |
At scale, these factors compound. For 45 tasks, OpenSymbolicAI costs $0.56 total. LangChain costs $1.77. CrewAI costs $3.29. Multiply by thousands of production requests per day and the difference is material.
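The compounding is simple arithmetic. A quick sketch scaling the measured 45-task totals to a hypothetical production volume; the 10,000 requests/day figure is an illustrative assumption, not a measurement:

```python
# Scale the measured 45-task run totals (from the comparison above) to a
# hypothetical production volume. 10,000 requests/day is an assumption
# chosen for illustration, not a measured figure.
totals_45_tasks = {"OpenSymbolicAI": 0.56, "LangChain": 1.77, "CrewAI": 3.29}
requests_per_day = 10_000

for framework, total in totals_45_tasks.items():
    per_task = total / 45
    daily = per_task * requests_per_day
    print(f"{framework}: ${per_task:.4f}/task -> ${daily:,.0f}/day, "
          f"${daily * 365:,.0f}/year")
```

At that volume, a few cents per task separates a modest line item from a six-figure annual bill.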
Where the Tokens Go#
The token breakdown tells the story of why the gap exists.
OpenSymbolicAI spends 3,546 tokens on retrieval and 10,390 on plan assembly. It gathers data in one shot, then spends the bulk of its budget on the part that matters: assembling a correct plan.
LangChain spends 40,822 tokens on retrieval and only 2,979 on assembly. CrewAI spends 74,733 on retrieval and 6,598 on assembly. Both frameworks burn most of their token budget just gathering data, not doing productive work.
OpenSymbolicAI's retrieval phase is 11.5x more efficient than LangChain's and 21x more efficient than CrewAI's. It gathers the same data with a fraction of the tokens because one LLM call generates Python code that makes all required API calls in parallel. No multi-turn ReAct loops. No agent-to-agent handoffs.
Why the Gap Exists#
The three frameworks use different patterns for the same task.
OpenSymbolicAI uses code generation. One LLM call produces Python that makes all 15 API calls at once. A second call assembles the plan from the retrieved data. Total: 2-3 LLM calls. Failure mode: deterministic. Code either runs or throws a traceable error.
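To make the pattern concrete, here is a minimal sketch of the kind of retrieval script a single LLM call might emit. The `search_*` functions, cities, and dates are hypothetical stand-ins, not OpenSymbolicAI's actual API; the point is the shape: all calls dispatched in parallel, with missing data failing loudly rather than silently.

```python
# Hypothetical sketch of LLM-generated retrieval code. The search_*
# functions are illustrative stubs, not the real tool API.
from concurrent.futures import ThreadPoolExecutor

def search_flights(origin, dest, date):
    return [{"flight": f"{origin}->{dest}", "date": date, "price": 120}]

def search_restaurants(city, cuisine):
    return [{"name": f"{cuisine} place in {city}", "cuisine": cuisine}]

def search_hotels(city, room_type):
    return [{"name": f"{room_type} hotel in {city}", "room_type": room_type}]

def retrieve_all():
    # Every required lookup is declared up front and dispatched at once.
    calls = [
        (search_flights, ("Seattle", "Portland", "2024-03-01")),
        (search_flights, ("Portland", "Seattle", "2024-03-03")),
        (search_restaurants, ("Portland", "Mexican")),
        (search_restaurants, ("Portland", "Italian")),
        (search_hotels, ("Portland", "entire room")),
    ]
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fn, *args) for fn, args in calls]
        results = [f.result() for f in futures]  # re-raises any call's error
    # Fail fast: an empty result set means the plan cannot be assembled.
    for (fn, args), rows in zip(calls, results):
        if not rows:
            raise ValueError(f"{fn.__name__}{args} returned no data")
    return results

data = retrieve_all()
```

One script, one round trip to the LLM, every API call issued concurrently: that is the structural difference from a loop that pays an LLM turn per lookup.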
LangChain uses a ReAct agent loop. The agent decides one tool call per turn, observes the result, then decides the next. It loops until it thinks it has enough data, then produces a plan. Total: 10-15 LLM calls. Failure mode: the agent stops early, missing data leads to incomplete plans.
CrewAI uses a multi-agent crew. A researcher agent iterates with tools, then hands off a text summary to a planner agent. Total: 25-40 LLM calls. Failure mode: the researcher misses tools entirely, and the planner hallucinates entity names from the incomplete text.
The core insight: one LLM call generating code that makes 15 API calls is more efficient than 15 separate LLM turns each making one API call. The ReAct pattern and multi-agent delegation both pay a per-turn overhead that compounds with task complexity. This is behaviour programming in action.
Constraint Satisfaction#
The TravelPlanner benchmark evaluates 13 individual constraints across two categories. On the 45-task comparison set, OpenSymbolicAI achieves perfect scores on every single one.
| Category | OpenSymbolicAI | LangChain | CrewAI |
|---|---|---|---|
| Commonsense (8 checks), macro | 100% | 86.7% | 82.2% |
| Commonsense (8 checks), micro avg | 100% | 98.1% | 97.8% |
| Hard Constraints (5 checks), macro | 100% | 91.1% | 91.1% |
| Hard Constraints (5 checks), micro avg | 100% | 95.6% | 95.6% |
LangChain and CrewAI achieve high micro-averages (most individual checks pass), but their macro rates drop, meaning when a plan fails it often fails multiple checks at once. This is the cascade effect of incomplete data retrieval: a single missing restaurant search leads to hallucinated entity names, which fails both the "within sandbox" and "diverse restaurants" checks.
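The macro/micro distinction, and why cascades hurt the macro rate, can be made concrete with a toy example (the check results below are made up, not the benchmark's data):

```python
# Made-up per-plan results: each row is one plan, each value one
# constraint check. A single bad retrieval tends to fail several
# checks in the same plan (the cascade effect).
results = [
    [True, True, True, True],    # clean plan: all checks pass
    [True, True, True, True],
    [False, False, True, True],  # cascade: one plan, two failed checks
]

# Micro: fraction of individual checks that pass, across all plans.
checks = [c for plan in results for c in plan]
micro = sum(checks) / len(checks)

# Macro: fraction of plans where *every* check passes -- no partial credit.
macro = sum(all(plan) for plan in results) / len(results)

print(f"micro={micro:.2f} macro={macro:.2f}")  # micro stays high, macro drops
```

Concentrating failures in a few plans keeps the micro rate high while the macro rate, the one that decides whether a plan ships, falls much faster.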
OpenSymbolicAI's code generation approach avoids this entirely. The generated Python code either retrieves all required data or throws an error. There's no partial success. No silent omissions.
Model Landscape: Which LLMs Work?#
We ran 11 models across 4 providers on the hardest TravelPlanner tasks. No code changes, no prompt tweaks, no per-model tuning. Swap the model name in the config and run.
| Model | Pass Rate | Cost/Task | Latency | LLM Calls |
|---|---|---|---|---|
| Llama 3.3 70B (Groq) | 100% | $0.006 | 4.3s | 2.1 |
| GPT-OSS-120B (Fireworks) | 100% | $0.013 | 47.6s | 2.5 |
| GPT-4.1 Mini (OpenAI) | 100% | $0.014 | 93.2s | 3.7 |
| GPT-4.1 (OpenAI) | 100% | $0.024 | 10.9s | 3.4 |
| GPT-4o (OpenAI) | 100% | $0.026 | 6.2s | 2.2 |
| Kimi K2.5 | 100% | $0.035 | 224.0s | 3.1 |
| Claude Sonnet 4 (Anthropic) | 100% | $0.043 | 19.7s | 2.0 |
| Llama 4 Scout | 93.3% | $0.003 | 7.3s | 2.1 |
| Mixtral 8x22B | 53.3% | $0.014 | 160.3s | 7.8 |
| Qwen3 32B | 20.0% | $0.009 | 593.0s | 27.2 |
| GPT-OSS-20B | 13.3% | $0.011 | 519.0s | 29.7 |
Seven models pass everything. Four don't. The line between them is sharper than you'd expect.
It's Not About Code Generation#
The models that fail aren't bad at code. OpenSymbolicAI's calculator benchmark runs Qwen3 1.7B, a model 40x smaller, and it hits 100% on 120 math tasks. Small models write correct Python when the task fits in their head.
TravelPlanner is different. A hard task means searching flights across multiple cities and dates, finding restaurants matching cuisines, tracking a running budget, and assembling a day-by-day plan that respects transportation constraints. The model needs to hold all constraints in working memory while writing 30-50 lines of Python. That's where smaller models fall apart. Not in code generation, but in sustained multi-constraint reasoning.
Qwen3 32B and GPT-OSS-20B average nearly 30 LLM calls per task. They spend 10 minutes going in circles on something Llama 3.3 70B solves in 4 seconds.
The Sweet Spot#
| Model | Provider | Cost/Task | Latency | Notes |
|---|---|---|---|---|
| Llama 3.3 70B | Groq | $0.006 | 4.3s | Cheapest, fastest, 100% |
| GPT-OSS-120B | Fireworks | $0.013 | 47.6s | Solid default |
| GPT-4.1 Mini | OpenAI | $0.014 | 93.2s | Passes everything but slower |
Llama 3.3 70B on Groq is the cheapest model tested, the fastest by a wide margin, and it passes every hard task. Groq's LPU hardware makes 70B inference very fast: 4.3 seconds for a multi-constraint travel plan that takes GPT-4o six seconds and Claude Sonnet twenty.
Frontier models earn their price on open-ended tasks where the model needs to handle ambiguity. When the framework provides structure (clear tool APIs, explicit constraints, deterministic execution), that extra reasoning capacity goes unused. Once a model clears the complexity threshold for your task, you're choosing between infrastructure, not intelligence levels.
The Problem: TravelPlanner#
TravelPlanner is a benchmark introduced at ICML 2024 by the OSU NLP Group. It tests whether AI agents can produce realistic, constraint-satisfying travel itineraries, not just plausible-sounding ones.
Each task specifies an origin city, destination cities, travel dates, and a set of constraints: budget limits, cuisine preferences, room types, transportation modes. The agent must search for flights, restaurants, hotels, and attractions, then assemble a day-by-day plan that satisfies all constraints simultaneously.
The evaluation is strict: 13 constraint checks, split between 8 commonsense (reasonable meal times, valid transportation between cities, etc.) and 5 hard constraints (budget, room type, cuisine diversity, etc.). A plan passes only if it satisfies every check. No partial credit.
This makes TravelPlanner a test of agent reliability, not just capability. Any framework can produce a plausible travel plan. The question is whether it produces a correct one, every time, even on hard tasks with many constraints.
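To make "no partial credit" concrete, here is a toy validator in the spirit of the official checks. The real evaluator lives in the TravelPlanner repo; these two checks are simplified stand-ins for the budget and cuisine-diversity constraints:

```python
# Toy versions of two hard constraints. Simplified stand-ins for the
# official TravelPlanner evaluator, for illustration only.
def within_budget(plan, budget):
    return sum(item["cost"] for item in plan["items"]) <= budget

def diverse_cuisines(plan):
    cuisines = [i["cuisine"] for i in plan["items"] if "cuisine" in i]
    return len(cuisines) == len(set(cuisines))  # no cuisine repeated

def passes(plan, budget):
    # A plan passes only if every check passes -- no partial credit.
    return all([within_budget(plan, budget), diverse_cuisines(plan)])

plan = {"items": [
    {"cost": 120},                          # flight
    {"cost": 80, "cuisine": "Mexican"},
    {"cost": 60, "cuisine": "Italian"},
]}
print(passes(plan, budget=300))  # True: under budget, no repeated cuisine
```

Swap one restaurant so both are Mexican, or drop the budget below $260, and the whole plan fails, regardless of how many other checks it satisfies.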
Methodology#
Full-Scale Benchmark#
- Model: `gpt-oss-120b` via Fireworks AI
- Dataset: All TravelPlanner splits: train (45), validation (180), test (1,000)
- Evaluation: Official TravelPlanner constraint checks (8 commonsense + 5 hard)
- Error handling: Rate-limited tasks (HTTP 429) excluded from averages; every non-error task for frontier models passed
Framework Comparison#
- Model: `gpt-oss-120b` via Fireworks AI (same for all frameworks)
- Dataset: Full TravelPlanner train split, 45 tasks (15 easy + 15 medium + 15 hard)
- Shared infrastructure: All frameworks use the same `ReferenceDatabase`, search primitives, and evaluation pipeline
- Post-processing: All frameworks share the same deterministic field-filling step. The comparison isolates the framework's LLM interaction pattern
- Parallelism: 10 concurrent workers per framework
Model Landscape#
- Models: 11 models across Fireworks AI, Groq, Anthropic, and OpenAI
- Dataset: Hard difficulty, 15 tasks from train split
- Cost: Calculated from actual token counts at published API pricing
Reproduce It#
```shell
# Framework comparison
uv sync --extra langchain --extra crewai
uv run travelplanner-compare \
  --frameworks opensymbolicai,langchain,crewai \
  --model gpt-oss-120b --provider fireworks \
  --split train -p 10

# Full benchmark (all 1,000 test tasks)
uv run travelplanner-bench \
  --model gpt-oss-120b --provider fireworks \
  --split test --parallel 5

# Model landscape (swap provider and model)
uv run travelplanner-bench \
  --model llama-3.3-70b --provider groq \
  --level hard --split train
```

The Bottom Line#
The TravelPlanner results aren't surprising if you've followed the argument through this blog:
- LLM attention is precious: every additional turn dilutes the model's focus on the original instructions.
- Behaviour programming beats tool calling: code generation eliminates the per-turn overhead of ReAct loops.
- The prompt spectrum matters: code plans are testable, debuggable, and composable in ways English and specs cannot match.
The benchmark puts numbers on what the architecture predicts. When you eliminate unnecessary LLM turns, you get better results with fewer tokens. When your execution plan is code, failures are traceable and fixable. When retrieval is a single parallelized code block instead of a sequential agent loop, you gather more data more reliably.
97.9% on 1,000 tasks where GPT-4 gets 0.6%. Seven models pass at 100%, including a $0.006/task open-source option. Up to 6x fewer tokens and up to 8x cheaper than LangChain and CrewAI. The framework matters more than the model.
Read more: Behaviour Programming vs Tool Calling | LLM Attention Is Precious | The Anatomy of PlanExecute
See the code: OpenSymbolicAI Core | Benchmark Source