TravelPlanner Benchmark: 97.9% on 1,000 Tasks Where GPT-4 Gets 0.6%
OpenSymbolicAI achieves near-perfect scores on all 1,225 TravelPlanner tasks, outperforms LangChain and CrewAI head-to-head, and maps the model landscape across 11 LLMs and 4 providers.
The TravelPlanner benchmark (ICML 2024) tests whether AI agents can produce realistic, constraint-satisfying travel itineraries. Even GPT-4 achieves only a 0.6% final pass rate on this benchmark.
OpenSymbolicAI achieves 100% on train, 99.4% on validation, and 97.9% on the full 1,000-task test set. Near-perfect scores on every commonsense and hard constraint check, zero errors, 100% delivery rate. We also ran a head-to-head framework comparison against LangChain and CrewAI, and tested 11 models across 4 providers to map which LLMs actually work for multi-constraint planning.
The results weren't close.
Full-Scale Results#
We ran every split: train (45 tasks), validation (180 tasks), and the full 1,000-task test set.
| Split | Tasks | Delivery | Commonsense | Hard Constraints | Final Pass | Avg Time |
|---|---|---|---|---|---|---|
| Train | 45 | 100% | 100% | 100% | 100% | 52.6s |
| Validation | 180 | 100% | 99.4% | 100% | 99.4% | 55.5s |
| Test | 1,000 | 100% | 97.9% | 100% | 97.9% | 52.4s |
All hard constraint checks pass at 100% across all 1,225 tasks. The only misses are commonsense constraints: a handful of edge cases in city routing and restaurant diversity.
vs Published Baselines#
Results from the TravelPlanner paper on the validation split:
| Method | Delivery | Commonsense | Hard | Final Pass |
|---|---|---|---|---|
| GPT-3.5-Turbo | 100% | 2.9% | 1.7% | 0.6% |
| GPT-4 | 100% | 6.4% | 3.7% | 0.6% |
| GPT-4-Turbo | 99.4% | 11.7% | 4.6% | 4.4% |
| Gemini 1.5 Pro | 98.3% | 7.8% | 4.5% | 3.9% |
| OpenSymbolicAI | 100% | 99.4% | 100% | 99.4% |
The published baselines all use direct prompting or basic ReAct agents. The best published result (GPT-4-Turbo at 4.4%) trails OpenSymbolicAI's 99.4% on the same split by a factor of roughly 23. The framework matters more than the model.
Framework Comparison: The Headline Numbers#
Three frameworks. Same model. Same tools. Same evaluation. Only the framework differs. 45 tasks from the train split (15 easy + 15 medium + 15 hard).
| | OpenSymbolicAI | LangChain | CrewAI |
|---|---|---|---|
| Pass Rate | 100% | 77.8% | 73.3% |
| Tokens / Task | 13,936 | 43,801 | 81,331 |
| LLM Calls / Task | 2.3 | 13.5 | 39.6 |
| Cost / Passing Task | $0.013 | $0.051 | $0.100 |
| Avg Latency | 47s | 73s | 124s |
OpenSymbolicAI passes every task. The others don't, and they burn through more tokens trying.
Reliability Under Pressure#
Pass rates don't tell the full story. What matters is how they change as tasks get harder.
OpenSymbolicAI stays at 100% regardless of difficulty. LangChain drops from 93.3% on easy tasks to 66.7% on hard tasks. CrewAI drops from 80% to 60%. The harder the task, the wider the gap.
Hard tasks here mean multi-city trips with budget, cuisine, room type, and transportation constraints layered together. OpenSymbolicAI doesn't miss one.
This is the pattern we described in LLM Attention Is Precious: as context grows, tool-calling frameworks lose track of their instructions. The more turns in the conversation, the more the original task gets buried. Complexity doesn't just make tasks harder. It makes frameworks forget what they were doing.
The Multiplier Effect#
The efficiency gap isn't incremental. It's multiplicative.
vs LangChain#
| Dimension | OpenSymbolicAI | LangChain | Factor |
|---|---|---|---|
| Pass Rate | 100% | 77.8% | 1.3x higher |
| Tokens / Task | 13,936 | 43,801 | 3.1x fewer |
| LLM Calls / Task | 2.3 | 13.5 | 5.9x fewer |
| Cost / Passing Task | $0.013 | $0.051 | 4.1x cheaper |
| Latency | 47s | 73s | 1.5x faster |
vs CrewAI#
| Dimension | OpenSymbolicAI | CrewAI | Factor |
|---|---|---|---|
| Pass Rate | 100% | 73.3% | 1.4x higher |
| Tokens / Task | 13,936 | 81,331 | 5.8x fewer |
| LLM Calls / Task | 2.3 | 39.6 | 17x fewer |
| Cost / Passing Task | $0.013 | $0.100 | 8x cheaper |
| Latency | 47s | 124s | 2.6x faster |
At scale, these factors compound. For 45 tasks, OpenSymbolicAI costs $0.56 total. LangChain costs $1.77. CrewAI costs $3.29. Multiply by thousands of production requests per day and the difference is material.
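The compounding is simple arithmetic. A quick sketch scaling the measured 45-task totals to a hypothetical production volume; the 10,000 requests/day figure is an illustrative assumption, not a measurement:

```python
# Scale the measured 45-task run totals (from the comparison above) to a
# hypothetical production volume. 10,000 requests/day is an assumption
# chosen for illustration, not a measured figure.
totals_45_tasks = {"OpenSymbolicAI": 0.56, "LangChain": 1.77, "CrewAI": 3.29}
requests_per_day = 10_000

for framework, total in totals_45_tasks.items():
    per_task = total / 45
    daily = per_task * requests_per_day
    print(f"{framework}: ${per_task:.4f}/task -> ${daily:,.0f}/day, "
          f"${daily * 365:,.0f}/year")
```

At that volume, a few cents per task separates a modest line item from a six-figure annual bill.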
Where the Tokens Go#
The token breakdown tells the story of why the gap exists.
OpenSymbolicAI spends 3,546 tokens on retrieval and 10,390 on plan assembly. It gathers data in one shot, then spends the bulk of its budget on the part that matters: assembling a correct plan.
LangChain spends 40,822 tokens on retrieval and only 2,979 on assembly. CrewAI spends 74,733 on retrieval and 6,598 on assembly. Both frameworks burn most of their token budget just gathering data, not doing productive work.
OpenSymbolicAI's retrieval phase is 11.5x more efficient than LangChain's and 21x more efficient than CrewAI's. It gathers the same data with a fraction of the tokens because one LLM call generates Python code that makes all required API calls in parallel. No multi-turn ReAct loops. No agent-to-agent handoffs.
Why the Gap Exists#
The three frameworks use different patterns for the same task.
OpenSymbolicAI uses code generation. One LLM call produces Python that makes all 15 API calls at once. A second call assembles the plan from the retrieved data. Total: 2-3 LLM calls. Failure mode: deterministic. Code either runs or throws a traceable error.
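To make the pattern concrete, here is a minimal sketch of the kind of retrieval script a single LLM call might emit. The `search_*` functions, cities, and dates are hypothetical stand-ins, not OpenSymbolicAI's actual API; the point is the shape: all calls dispatched in parallel, with missing data failing loudly rather than silently.

```python
# Hypothetical sketch of LLM-generated retrieval code. The search_*
# functions are illustrative stubs, not the real tool API.
from concurrent.futures import ThreadPoolExecutor

def search_flights(origin, dest, date):
    return [{"flight": f"{origin}->{dest}", "date": date, "price": 120}]

def search_restaurants(city, cuisine):
    return [{"name": f"{cuisine} place in {city}", "cuisine": cuisine}]

def search_hotels(city, room_type):
    return [{"name": f"{room_type} hotel in {city}", "room_type": room_type}]

def retrieve_all():
    # Every required lookup is declared up front and dispatched at once.
    calls = [
        (search_flights, ("Seattle", "Portland", "2024-03-01")),
        (search_flights, ("Portland", "Seattle", "2024-03-03")),
        (search_restaurants, ("Portland", "Mexican")),
        (search_restaurants, ("Portland", "Italian")),
        (search_hotels, ("Portland", "entire room")),
    ]
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fn, *args) for fn, args in calls]
        results = [f.result() for f in futures]  # re-raises any call's error
    # Fail fast: an empty result set means the plan cannot be assembled.
    for (fn, args), rows in zip(calls, results):
        if not rows:
            raise ValueError(f"{fn.__name__}{args} returned no data")
    return results

data = retrieve_all()
```

One script, one round trip to the LLM, every API call issued concurrently: that is the structural difference from a loop that pays an LLM turn per lookup.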
LangChain uses a ReAct agent loop. The agent decides one tool call per turn, observes the result, then decides the next. It loops until it thinks it has enough data, then produces a plan. Total: 10-15 LLM calls. Failure mode: the agent stops early, missing data leads to incomplete plans.
CrewAI uses a multi-agent crew. A researcher agent iterates with tools, then hands off a text summary to a planner agent. Total: 25-40 LLM calls. Failure mode: the researcher misses tools entirely, and the planner hallucinates entity names from the incomplete text.
The core insight: one LLM call generating code that makes 15 API calls is more efficient than 15 separate LLM turns each making one API call. The ReAct pattern and multi-agent delegation both pay a per-turn overhead that compounds with task complexity. This is behaviour programming in action.
Constraint Satisfaction#
The TravelPlanner benchmark evaluates 13 individual constraints across two categories. On the 45-task comparison set, OpenSymbolicAI achieves perfect scores on every single one.
| Category | OpenSymbolicAI | LangChain | CrewAI |
|---|---|---|---|
| Commonsense (8 checks), macro | 100% | 86.7% | 82.2% |
| Commonsense (8 checks), micro avg | 100% | 98.1% | 97.8% |
| Hard Constraints (5 checks), macro | 100% | 91.1% | 91.1% |
| Hard Constraints (5 checks), micro avg | 100% | 95.6% | 95.6% |
LangChain and CrewAI achieve high micro-averages (most individual checks pass), but their macro rates drop, meaning when a plan fails it often fails multiple checks at once. This is the cascade effect of incomplete data retrieval: a single missing restaurant search leads to hallucinated entity names, which fails both the "within sandbox" and "diverse restaurants" checks.
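The macro/micro distinction, and why cascades hurt the macro rate, can be made concrete with a toy example (the check results below are made up, not the benchmark's data):

```python
# Made-up per-plan results: each row is one plan, each value one
# constraint check. A single bad retrieval tends to fail several
# checks in the same plan (the cascade effect).
results = [
    [True, True, True, True],    # clean plan: all checks pass
    [True, True, True, True],
    [False, False, True, True],  # cascade: one plan, two failed checks
]

# Micro: fraction of individual checks that pass, across all plans.
checks = [c for plan in results for c in plan]
micro = sum(checks) / len(checks)

# Macro: fraction of plans where *every* check passes -- no partial credit.
macro = sum(all(plan) for plan in results) / len(results)

print(f"micro={micro:.2f} macro={macro:.2f}")  # micro stays high, macro drops
```

Concentrating failures in a few plans keeps the micro rate high while the macro rate, the one that decides whether a plan ships, falls much faster.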
OpenSymbolicAI's code generation approach avoids this entirely. The generated Python code either retrieves all required data or throws an error. There's no partial success. No silent omissions.
Model Landscape: Which LLMs Work?#
We ran 11 models across 4 providers on the hardest TravelPlanner tasks. No code changes, no prompt tweaks, no per-model tuning. Swap the model name in the config and run.
| Model | Pass Rate | Cost/Task | Latency | LLM Calls |
|---|---|---|---|---|
| Llama 3.3 70B (Groq) | 100% | $0.006 | 4.3s | 2.1 |
| GPT-OSS-120B (Fireworks) | 100% | $0.013 | 47.6s | 2.5 |
| GPT-4.1 Mini (OpenAI) | 100% | $0.014 | 93.2s | 3.7 |
| GPT-4.1 (OpenAI) | 100% | $0.024 | 10.9s | 3.4 |
| GPT-4o (OpenAI) | 100% | $0.026 | 6.2s | 2.2 |
| Kimi K2.5 | 100% | $0.035 | 224.0s | 3.1 |
| Claude Sonnet 4 (Anthropic) | 100% | $0.043 | 19.7s | 2.0 |
| Llama 4 Scout | 93.3% | $0.003 | 7.3s | 2.1 |
| Mixtral 8x22B | 53.3% | $0.014 | 160.3s | 7.8 |
| Qwen3 32B | 20.0% | $0.009 | 593.0s | 27.2 |
| GPT-OSS-20B | 13.3% | $0.011 | 519.0s | 29.7 |
Seven models pass everything. Four don't. The line between them is sharper than you'd expect.
It's Not About Code Generation#
The models that fail aren't bad at code. OpenSymbolicAI's calculator benchmark runs Qwen3 1.7B, a model 40x smaller, and it hits 100% on 120 math tasks. Small models write correct Python when the task fits in their head.
TravelPlanner is different. A hard task means searching flights across multiple cities and dates, finding restaurants matching cuisines, tracking a running budget, and assembling a day-by-day plan that respects transportation constraints. The model needs to hold all constraints in working memory while writing 30-50 lines of Python. That's where smaller models fall apart. Not in code generation, but in sustained multi-constraint reasoning.
Qwen3 32B and GPT-OSS-20B average nearly 30 LLM calls per task. They spend 10 minutes going in circles on something Llama 3.3 70B solves in 4 seconds.
The Sweet Spot#
| Model | Provider | Cost/Task | Latency | Notes |
|---|---|---|---|---|
| Llama 3.3 70B | Groq | $0.006 | 4.3s | Cheapest, fastest, 100% |
| GPT-OSS-120B | Fireworks | $0.013 | 47.6s | Solid default |
| GPT-4.1 Mini | OpenAI | $0.014 | 93.2s | Passes everything but slower |
Llama 3.3 70B on Groq is the cheapest model tested, the fastest by a wide margin, and it passes every hard task. Groq's LPU hardware makes 70B inference very fast: 4.3 seconds for a multi-constraint travel plan that takes GPT-4o six seconds and Claude Sonnet twenty.
Frontier models earn their price on open-ended tasks where the model needs to handle ambiguity. When the framework provides structure (clear tool APIs, explicit constraints, deterministic execution), that extra reasoning capacity goes unused. Once a model clears the complexity threshold for your task, you're choosing between infrastructure, not intelligence levels.
The Problem: TravelPlanner#
TravelPlanner is a benchmark introduced at ICML 2024 by the OSU NLP Group. It tests whether AI agents can produce realistic, constraint-satisfying travel itineraries, not just plausible-sounding ones.
Each task specifies an origin city, destination cities, travel dates, and a set of constraints: budget limits, cuisine preferences, room types, transportation modes. The agent must search for flights, restaurants, hotels, and attractions, then assemble a day-by-day plan that satisfies all constraints simultaneously.
The evaluation is strict: 13 constraint checks, split between 8 commonsense (reasonable meal times, valid transportation between cities, etc.) and 5 hard constraints (budget, room type, cuisine diversity, etc.). A plan passes only if it satisfies every check. No partial credit.
This makes TravelPlanner a test of agent reliability, not just capability. Any framework can produce a plausible travel plan. The question is whether it produces a correct one, every time, even on hard tasks with many constraints.
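To make "no partial credit" concrete, here is a toy validator in the spirit of the official checks. The real evaluator lives in the TravelPlanner repo; these two checks are simplified stand-ins for the budget and cuisine-diversity constraints:

```python
# Toy versions of two hard constraints. Simplified stand-ins for the
# official TravelPlanner evaluator, for illustration only.
def within_budget(plan, budget):
    return sum(item["cost"] for item in plan["items"]) <= budget

def diverse_cuisines(plan):
    cuisines = [i["cuisine"] for i in plan["items"] if "cuisine" in i]
    return len(cuisines) == len(set(cuisines))  # no cuisine repeated

def passes(plan, budget):
    # A plan passes only if every check passes -- no partial credit.
    return all([within_budget(plan, budget), diverse_cuisines(plan)])

plan = {"items": [
    {"cost": 120},                          # flight
    {"cost": 80, "cuisine": "Mexican"},
    {"cost": 60, "cuisine": "Italian"},
]}
print(passes(plan, budget=300))  # True: under budget, no repeated cuisine
```

Swap one restaurant so both are Mexican, or drop the budget below $260, and the whole plan fails, regardless of how many other checks it satisfies.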
Methodology#
Full-Scale Benchmark#
- Model: `gpt-oss-120b` via Fireworks AI
- Dataset: All TravelPlanner splits: train (45), validation (180), test (1,000)
- Evaluation: Official TravelPlanner constraint checks (8 commonsense + 5 hard)
- Error handling: Rate-limited tasks (HTTP 429) excluded from averages; every non-error task for frontier models passed
Framework Comparison#
- Model: `gpt-oss-120b` via Fireworks AI (same for all frameworks)
- Dataset: Full TravelPlanner train split, 45 tasks (15 easy + 15 medium + 15 hard)
- Shared infrastructure: All frameworks use the same `ReferenceDatabase`, search primitives, and evaluation pipeline
- Post-processing: All frameworks share the same deterministic field-filling step. The comparison isolates the framework's LLM interaction pattern
- Parallelism: 10 concurrent workers per framework
Model Landscape#
- Models: 11 models across Fireworks AI, Groq, Anthropic, and OpenAI
- Dataset: Hard difficulty, 15 tasks from train split
- Cost: Calculated from actual token counts at published API pricing
Reproduce It#
```shell
# Framework comparison
uv sync --extra langchain --extra crewai
uv run travelplanner-compare \
  --frameworks opensymbolicai,langchain,crewai \
  --model gpt-oss-120b --provider fireworks \
  --split train -p 10

# Full benchmark (all 1,000 test tasks)
uv run travelplanner-bench \
  --model gpt-oss-120b --provider fireworks \
  --split test --parallel 5

# Model landscape (swap provider and model)
uv run travelplanner-bench \
  --model llama-3.3-70b --provider groq \
  --level hard --split train
```

The Bottom Line#
The TravelPlanner results aren't surprising if you've followed the argument through this blog:
- LLM attention is precious: every additional turn dilutes the model's focus on the original instructions.
- Behaviour programming beats tool calling: code generation eliminates the per-turn overhead of ReAct loops.
- The prompt spectrum matters: code plans are testable, debuggable, and composable in ways English and specs cannot match.
The benchmark puts numbers on what the architecture predicts. When you eliminate unnecessary LLM turns, you get better results with fewer tokens. When your execution plan is code, failures are traceable and fixable. When retrieval is a single parallelized code block instead of a sequential agent loop, you gather more data more reliably.
97.9% on 1,000 tasks where GPT-4 gets 0.6%. Seven models pass at 100%, including a $0.006/task open-source option. Up to 6x fewer tokens and up to 8x cheaper than LangChain and CrewAI. The framework matters more than the model.
Read more: Behaviour Programming vs Tool Calling | LLM Attention Is Precious | The Anatomy of PlanExecute
See the code: OpenSymbolicAI Core | Benchmark Source