# Change Everything, Change Nothing: MultiHopRAG in Python and C#
We swapped the language, the vector store, the code executor, and the type system. Accuracy moved by 0.9pp. The framework is the invariant, not the infrastructure.
We changed the programming language, the vector store, the code execution engine, the type system, the concurrency model, and the metadata extraction strategy. Then we ran the same 2,556-query benchmark.
Accuracy moved by 0.9 percentage points.
| Metric | Python | C# | Delta |
|---|---|---|---|
| Overall Accuracy | 82.9% | 83.8% | +0.9pp |
| Goals Achieved | 99.6% | 100% | +0.4pp |
| Avg Iterations | 1.9 | 1.4 | -26% |
Both runs use the same model (gpt-oss-120b via Fireworks AI) and the same embeddings (nomic-embed-text). Same model, same framework abstractions, different everything else.
Both outperform every published baseline on MultiHopRAG. The best prior method (IRCoT + RAG) reaches 75.0%. OpenSymbolicAI reaches roughly 83% regardless of what sits underneath.
What matters is the framework: the abstractions that structure how the LLM reasons about the problem. Change everything below that layer and the accuracy holds.
## What MultiHopRAG Tests
MultiHopRAG evaluates whether a system can answer questions that require connecting facts across multiple documents. The dataset contains 609 news articles and 2,556 queries split into four types:
- Inference (32%): Connect facts across articles to identify a person, event, or outcome
- Comparison (33%): Compare claims between named news sources
- Temporal (23%): Assess consistency across different time periods
- Null (12%): Recognize when the corpus lacks sufficient information
Single-hop retrieval fails here. A question like "Who was the individual associated with cryptocurrency who was found guilty?" requires retrieving articles about the crypto industry and a trial, extracting a specific name, and cross-referencing across sources. The system needs to plan its retrieval, not just execute it.
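The shape of that plan can be sketched as a two-hop loop. This is a toy illustration, not the framework's API: `search` is a naive keyword retriever and `extract_name` stands in for an LLM-backed entity extractor.

```python
# Hypothetical sketch: why two hops beat one. `search` and `extract_name`
# are stand-ins for a real retriever and an LLM-backed extractor.
def search(query, corpus):
    """Naive keyword retriever: return docs sharing any word with the query."""
    words = set(query.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]

def two_hop(question, corpus, extract_name):
    hop1 = search(question, corpus)    # hop 1: broad retrieval on the question
    name = extract_name(hop1)          # bridge fact (e.g. a person's name)
    hop2 = search(name, corpus)        # hop 2: targeted retrieval on the name
    return hop1 + hop2                 # combined evidence for synthesis
```

The second hop reaches documents that share no vocabulary with the original question, which is exactly where single-hop retrieval stops short.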
## Per-Type Breakdown
| Query Type | Python | C# | Best Prior (IRCoT + RAG) |
|---|---|---|---|
| Inference | 88.0% | 91.3% | 80.1% |
| Comparison | 78.2% | 78.4% | 66.2% |
| Temporal | 76.5% | 74.4% | 60.4% |
| Null | 94.7% | 97.3% | 93.5% |
Both implementations hold above 74% on every query type. Prior methods collapse on comparison and temporal queries where multi-hop reasoning is required. OpenSymbolicAI does not.
## Everything That Changed
The two implementations share almost nothing at the infrastructure level.
| Component | Python | C# |
|---|---|---|
| Language | Python 3.12 | C# / .NET 10 |
| Vector Store | ChromaDB (external process) | LiteDB (embedded, single-file) |
| Code Execution | AST sanitizer + exec() | Roslyn scripting + PlanValidator |
| Metadata | Runtime introspection | Source generators at compile time |
| Type System | Pydantic GoalContext | Generic GoalSeeking<MultiHopContext> |
| Concurrency | Synchronous | async/await throughout |
| Packaging | pyproject.toml + uv | NuGet + dotnet CLI |
This is not swapping Flask for FastAPI. This is replacing the entire technology stack: runtime, database, execution model, type system, build system. If infrastructure determined correctness, these two implementations would produce different results. They don't.
## Everything That Stayed the Same
| Abstraction | Role |
|---|---|
| Model | gpt-oss-120b via Fireworks AI |
| Embeddings | nomic-embed-text via Fireworks AI |
| GoalSeeking blueprint | Iterative plan-execute-evaluate loop |
| 10 primitives | The LLM's toolkit for retrieval and reasoning |
| 7 decompositions | Few-shot patterns teaching multi-hop composition |
| Introspection boundary | Converts raw execution into structured context |
| Static evaluator | Checks: sufficient evidence + answer ready? |
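The control flow these abstractions describe can be sketched as a plan-execute-evaluate loop. All names below (`goal_seeking`, `update`, the dict-shaped context) are illustrative stand-ins, not the framework's actual API:

```python
# Minimal sketch of the GoalSeeking loop; names are illustrative.
def goal_seeking(question, plan, execute, evaluate, max_iterations=5):
    context = {"evidence": [], "answer": None, "sufficient": False}
    for iteration in range(1, max_iterations + 1):
        code = plan(question, context)      # LLM drafts an executable plan
        result = execute(code)              # run it against the primitives
        context = update(context, result)   # introspection boundary
        if evaluate(context):               # static check passed: done
            return context["answer"], iteration
    return context["answer"], max_iterations

def update(context, result):
    # Fold raw execution results into structured context fields.
    return {"evidence": context["evidence"] + result.get("evidence", []),
            "answer": result.get("answer", context["answer"]),
            "sufficient": result.get("sufficient", context["sufficient"])}

def evaluate(context):
    # Static evaluator: sufficient evidence AND an answer ready.
    return context["sufficient"] and context["answer"] is not None
```

The evaluator is deliberately dumb: it never calls the model, it just inspects two structured fields. That is what keeps the loop's stopping condition language-independent.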
Primitives define the LLM's toolkit. Both implementations expose the same 10: Retrieve, RetrieveByCategory, RetrieveBySource, RetrieveFiltered, ExtractEvidence, IdentifyEntities, GenerateNextQuery, SynthesizeAnswer, AssessSufficiency, and CombineContexts. Different syntax, same contract:
In Python:

```python
@primitive(read_only=True)
def retrieve(self, query: str, k: int = 10) -> list[Document]:
    return self.retriever.query(query, k=k)
```

In C#:
```csharp
[Primitive(ReadOnly = true)]
public Task<List<Document>> Retrieve(string query, int k = 10)
    => _retriever.QueryAsync(query, k);
```

Decompositions teach the LLM how to compose those primitives. Both implementations provide the same 7 few-shot patterns: two-hop inference, source comparison, sufficiency check, consistency comparison, cross-source entity resolution, temporal source comparison, and yes/no temporal consistency. The logic is the same. Only the syntax changes.
The introspection boundary (UpdateContext) converts raw execution results into structured context: evidence pieces, entities found, queries tried, sufficiency status, current answer. The planner and evaluator never see raw traces. They see structured fields. This abstraction between "what happened" and "what we know" is language-independent.
## Code Generation Across Languages
The LLM generates executable plans in both Python and C#. The plans are structurally identical.
For a two-hop inference query, the generated Python plan:

```python
docs = self.retrieve("cryptocurrency trial guilty verdict", k=5)
evidence = self.extract_evidence(self.combine_contexts(docs), question)
docs2 = self.retrieve(self.generate_next_query(question, evidence), k=5)
answer = self.synthesize_answer(question, evidence + self.combine_contexts(docs2))
self.assess_sufficiency(question, answer)
```

The generated C# plan for the same query type:
```csharp
var docs = await Retrieve("cryptocurrency trial guilty verdict", k: 5);
var evidence = await ExtractEvidence(CombineContexts(docs), question);
var docs2 = await Retrieve(await GenerateNextQuery(question, evidence), k: 5);
var answer = await SynthesizeAnswer(question, evidence + "\n" + CombineContexts(docs2));
await AssessSufficiency(question, answer);
```

Same logic. Same flow. `self.retrieve` becomes `await Retrieve`. The framework provides the structure. The LLM fills in the syntax.
## vs Published Baselines
Results on the full 2,556-query dataset:
| Method | Overall | Inference | Comparison | Temporal | Null |
|---|---|---|---|---|---|
| RAG (BM25) | 38.0% | 44.2% | 33.9% | 26.8% | 55.1% |
| RAG (Dense) | 41.7% | 47.8% | 37.0% | 29.7% | 60.5% |
| IRCoT + RAG | 75.0% | 80.1% | 66.2% | 60.4% | 93.5% |
| Community-GraphRAG | 72.4% | 79.1% | 66.0% | 61.2% | 83.4% |
| OpenSymbolicAI (Python) | 82.9% | 88.0% | 78.2% | 76.5% | 94.7% |
| OpenSymbolicAI (C#) | 83.8% | 91.3% | 78.4% | 74.4% | 97.3% |
Both implementations beat the best prior method by 8-9 percentage points overall. The gap is widest on comparison queries (+12pp) and temporal queries (+14-16pp), the two types that require genuine multi-hop reasoning.
## The Invariant
Most agent frameworks are inseparable from their language. The abstractions live in the runtime, and if you change the language, you start from scratch.
OpenSymbolicAI's abstractions are language-independent. They structure how the LLM plans and reasons, not how the code executes. Swap the entire stack underneath and correctness holds. Teams can build agents in the language their stack already uses. Python for data science and ML pipelines. C# for enterprise .NET backends. Same patterns, same accuracy.
## Reproduce It
```shell
# Python
git clone https://github.com/OpenSymbolicAI/benchmark-py-MultiHopRAG
cd benchmark-py-MultiHopRAG
uv sync && python setup_data.py
uv run multihop-rag --demo
```

```shell
# C#
git clone https://github.com/OpenSymbolicAI/benchmark-cs-MultiHopRAG
cd benchmark-cs-MultiHopRAG
dotnet run -- --setup-data
dotnet run -- --demo
```

Read more: TravelPlanner: 97.9% Where GPT-4 Gets 0.6% | Behavior Programming vs Tool Calling | LLM Attention Is Precious
See the code: Python Benchmark | C# Benchmark