
Change Everything, Change Nothing: MultiHopRAG in Python and C#

We swapped the language, the vector store, the code executor, and the type system. Accuracy moved by 0.9pp. The framework is the invariant, not the infrastructure.

OpenSymbolicAI Team · March 23, 2026 · 6 min read

Tags: benchmark, cross-language, code-generation, agents, reliability

We changed the programming language, the vector store, the code execution engine, the type system, the concurrency model, and the metadata extraction strategy. Then we ran the same 2,556-query benchmark.

Accuracy moved by 0.9 percentage points.

| Metric | Python | C# | Delta |
| --- | --- | --- | --- |
| Overall Accuracy | 82.9% | 83.8% | +0.9pp |
| Goals Achieved | 99.6% | 100% | +0.4pp |
| Avg Iterations | 1.9 | 1.4 | -26% |

Both runs use the same model (gpt-oss-120b via Fireworks AI) and the same embeddings (nomic-embed-text). Same model, same framework abstractions, different everything else.

Both outperform every published baseline on MultiHopRAG. The best prior method (IRCoT + RAG) reaches 75.0%. OpenSymbolicAI reaches 83% regardless of what sits underneath.

What matters is the framework: the abstractions that structure how the LLM reasons about the problem. Change everything below that layer and the accuracy holds.

*Figure: what changed vs. what didn't. Infrastructure differs completely, framework abstractions are identical, accuracy is the same.*

## What MultiHopRAG Tests

MultiHopRAG evaluates whether a system can answer questions that require connecting facts across multiple documents. The dataset contains 609 news articles and 2,556 queries split into four types:

  • Inference (32%): Connect facts across articles to identify a person, event, or outcome
  • Comparison (33%): Compare claims between named news sources
  • Temporal (23%): Assess consistency across different time periods
  • Null (12%): Recognize when the corpus lacks sufficient information

Single-hop retrieval fails here. A question like "Who was the individual associated with cryptocurrency who was found guilty?" requires retrieving articles about the crypto industry and a trial, extracting a specific name, and cross-referencing across sources. The system needs to plan its retrieval, not just execute it.

## Per-Type Breakdown

| Query Type | Python | C# | Best Prior (IRCoT + RAG) |
| --- | --- | --- | --- |
| Inference | 88.0% | 91.3% | 80.1% |
| Comparison | 78.2% | 78.4% | 66.2% |
| Temporal | 76.5% | 74.4% | 60.4% |
| Null | 94.7% | 97.3% | 93.5% |

Both implementations hold above 74% on every query type. Prior methods collapse on comparison and temporal queries where multi-hop reasoning is required. OpenSymbolicAI does not.
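As a sanity check, the per-type accuracies roughly reproduce the reported overall numbers when weighted by each query type's share of the dataset. The shares in the post are rounded, so the weighted figures are approximate:

```python
# Query-type shares from the dataset description (rounded in the post).
shares = {"inference": 0.32, "comparison": 0.33, "temporal": 0.23, "null": 0.12}

py = {"inference": 88.0, "comparison": 78.2, "temporal": 76.5, "null": 94.7}
cs = {"inference": 91.3, "comparison": 78.4, "temporal": 74.4, "null": 97.3}

def weighted_overall(per_type):
    """Overall accuracy as a share-weighted average of per-type accuracy."""
    return sum(shares[t] * acc for t, acc in per_type.items())

print(f"Python: {weighted_overall(py):.1f}%")  # close to the reported 82.9%
print(f"C#:     {weighted_overall(cs):.1f}%")  # close to the reported 83.8%
```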

## Everything That Changed

The two implementations share almost nothing at the infrastructure level.

| Component | Python | C# |
| --- | --- | --- |
| Language | Python 3.12 | C# / .NET 10 |
| Vector Store | ChromaDB (external process) | LiteDB (embedded, single-file) |
| Code Execution | AST sanitizer + `exec()` | Roslyn scripting + PlanValidator |
| Metadata | Runtime introspection | Source generators at compile time |
| Type System | Pydantic `GoalContext` | Generic `GoalSeeking<MultiHopContext>` |
| Concurrency | Synchronous | async/await throughout |
| Packaging | pyproject.toml + uv | NuGet + dotnet CLI |

This is not swapping Flask for FastAPI. This is replacing the entire technology stack: runtime, database, execution model, type system, build system. If infrastructure determined correctness, these two implementations would produce different results. They don't.
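To make the "AST sanitizer + exec()" row concrete, here is a minimal sketch of what validating a generated plan before execution can look like. This is illustrative only, not the project's actual validator: the banned-node list and the `sanitize`/`run_plan` helpers are assumptions:

```python
import ast

# Node types a generated plan should never contain (illustrative deny list).
BANNED = (ast.Import, ast.ImportFrom, ast.Global, ast.Nonlocal)

def sanitize(plan_source: str) -> ast.Module:
    """Parse a generated plan and reject disallowed constructs."""
    tree = ast.parse(plan_source)
    for node in ast.walk(tree):
        if isinstance(node, BANNED):
            raise ValueError(f"disallowed node: {type(node).__name__}")
        # Block dunder attribute access like obj.__class__
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError(f"disallowed attribute: {node.attr}")
    return tree

def run_plan(plan_source: str, namespace: dict) -> None:
    """Sanitize, compile, and execute a plan in a restricted namespace."""
    tree = sanitize(plan_source)
    exec(compile(tree, "<plan>", "exec"), namespace)

# The plan only sees the names the framework puts in its namespace.
ns = {"__builtins__": {}, "results": []}
run_plan("results.append('ok')", ns)
```

The C# side plays the same role with Roslyn scripting and its PlanValidator: the generated code is checked against an allowlist before it runs.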

## Everything That Stayed the Same

| Abstraction | Role |
| --- | --- |
| Model | gpt-oss-120b via Fireworks AI |
| Embeddings | nomic-embed-text via Fireworks AI |
| GoalSeeking blueprint | Iterative plan-execute-evaluate loop |
| 10 primitives | The LLM's toolkit for retrieval and reasoning |
| 7 decompositions | Few-shot patterns teaching multi-hop composition |
| Introspection boundary | Converts raw execution into structured context |
| Static evaluator | Checks: sufficient evidence + answer ready? |
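The GoalSeeking blueprint can be pictured as a short loop. The sketch below is a minimal illustration, not the framework's actual API; the `plan`, `execute`, `update_context`, and `evaluate` callables are assumptions standing in for the real components:

```python
def goal_seeking(question, plan, execute, update_context, evaluate, max_iters=10):
    """Minimal plan-execute-evaluate loop (names are illustrative assumptions)."""
    context = {"evidence": [], "answer": None, "sufficient": False}
    for iteration in range(1, max_iters + 1):
        code = plan(question, context)             # LLM writes an executable plan
        trace = execute(code)                      # sandboxed run (exec() / Roslyn)
        context = update_context(context, trace)   # introspection boundary
        if evaluate(question, context):            # static check: evidence + answer ready?
            return context, iteration
    return context, max_iters

# Illustrative stubs standing in for the LLM, the executor, and the evaluator:
ctx, iters = goal_seeking(
    "Who was found guilty?",
    plan=lambda q, c: "<generated plan>",
    execute=lambda code: {"answer": "entity X", "sufficient": True},
    update_context=lambda c, trace: {**c, **trace},
    evaluate=lambda q, c: c["sufficient"] and c["answer"] is not None,
)
```

Nothing in this loop depends on the language underneath: the C# implementation runs the same cycle with async primitives.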

Primitives define the LLM's toolkit. Both implementations expose the same 10: Retrieve, RetrieveByCategory, RetrieveBySource, RetrieveFiltered, ExtractEvidence, IdentifyEntities, GenerateNextQuery, SynthesizeAnswer, AssessSufficiency, and CombineContexts. Different syntax, same contract:

In Python:

```python
@primitive(read_only=True)
def retrieve(self, query: str, k: int = 10) -> list[Document]:
    return self.retriever.query(query, k=k)
```

In C#:

```csharp
[Primitive(ReadOnly = true)]
public Task<List<Document>> Retrieve(string query, int k = 10)
    => _retriever.QueryAsync(query, k);
```

Decompositions teach the LLM how to compose those primitives. Both implementations provide the same 7 few-shot patterns: two-hop inference, source comparison, sufficiency check, consistency comparison, cross-source entity resolution, temporal source comparison, and yes/no temporal consistency. The logic is the same. Only the syntax changes.
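A decomposition is essentially a few-shot exemplar pairing a query pattern with a plan template. A minimal sketch of how the two-hop inference pattern might be represented as prompt data follows; the dictionary structure and the `render_few_shot` helper are assumptions for illustration, not the framework's actual format:

```python
# Hypothetical representation of one decomposition as few-shot prompt data.
TWO_HOP_INFERENCE = {
    "name": "two-hop inference",
    "query_pattern": "Identify a person/event by connecting facts across articles",
    "plan_template": "\n".join([
        'docs = self.retrieve("<first-hop query>", k=5)',
        "evidence = self.extract_evidence(self.combine_contexts(docs), question)",
        "docs2 = self.retrieve(self.generate_next_query(question, evidence), k=5)",
        "answer = self.synthesize_answer(question, evidence + self.combine_contexts(docs2))",
        "self.assess_sufficiency(question, answer)",
    ]),
}

def render_few_shot(decompositions):
    """Concatenate decompositions into a few-shot prompt section."""
    return "\n\n".join(
        f"# Pattern: {d['name']}\n# Matches: {d['query_pattern']}\n{d['plan_template']}"
        for d in decompositions
    )

prompt = render_few_shot([TWO_HOP_INFERENCE])
```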

The introspection boundary (UpdateContext) converts raw execution results into structured context: evidence pieces, entities found, queries tried, sufficiency status, current answer. The planner and evaluator never see raw traces. They see structured fields. This abstraction between "what happened" and "what we know" is language-independent.

## Code Generation Across Languages

The LLM generates executable plans in both Python and C#. The plans are structurally identical.

For a two-hop inference query, the generated Python plan:

```python
docs = self.retrieve("cryptocurrency trial guilty verdict", k=5)
evidence = self.extract_evidence(self.combine_contexts(docs), question)
docs2 = self.retrieve(self.generate_next_query(question, evidence), k=5)
answer = self.synthesize_answer(question, evidence + self.combine_contexts(docs2))
self.assess_sufficiency(question, answer)
```

The generated C# plan for the same query type:

```csharp
var docs = await Retrieve("cryptocurrency trial guilty verdict", k: 5);
var evidence = await ExtractEvidence(CombineContexts(docs), question);
var docs2 = await Retrieve(await GenerateNextQuery(question, evidence), k: 5);
var answer = await SynthesizeAnswer(question, evidence + "\n" + CombineContexts(docs2));
await AssessSufficiency(question, answer);
```

Same logic. Same flow. `self.retrieve` becomes `await Retrieve`. The framework provides the structure. The LLM fills in the syntax.

## vs Published Baselines

Results on the full 2,556-query dataset:

| Method | Overall | Inference | Comparison | Temporal | Null |
| --- | --- | --- | --- | --- | --- |
| RAG (BM25) | 38.0% | 44.2% | 33.9% | 26.8% | 55.1% |
| RAG (Dense) | 41.7% | 47.8% | 37.0% | 29.7% | 60.5% |
| IRCoT + RAG | 75.0% | 80.1% | 66.2% | 60.4% | 93.5% |
| Community-GraphRAG | 72.4% | 79.1% | 66.0% | 61.2% | 83.4% |
| OpenSymbolicAI (Python) | 82.9% | 88.0% | 78.2% | 76.5% | 94.7% |
| OpenSymbolicAI (C#) | 83.8% | 91.3% | 78.4% | 74.4% | 97.3% |

Both implementations beat the best prior method by 8-9 percentage points overall. The gap is widest on comparison queries (+12pp) and temporal queries (+14-16pp), the two types that require genuine multi-hop reasoning.

## The Invariant

Most agent frameworks are inseparable from their language. The abstractions live in the runtime, and if you change the language, you start from scratch.

OpenSymbolicAI's abstractions are language-independent. They structure how the LLM plans and reasons, not how the code executes. Swap the entire stack underneath and correctness holds. Teams can build agents in the language their stack already uses. Python for data science and ML pipelines. C# for enterprise .NET backends. Same patterns, same accuracy.

## Reproduce It

```bash
# Python
git clone https://github.com/OpenSymbolicAI/benchmark-py-MultiHopRAG
cd benchmark-py-MultiHopRAG
uv sync && python setup_data.py
uv run multihop-rag --demo

# C#
git clone https://github.com/OpenSymbolicAI/benchmark-cs-MultiHopRAG
cd benchmark-cs-MultiHopRAG
dotnet run -- --setup-data
dotnet run -- --demo
```

Read more: TravelPlanner: 97.9% Where GPT-4 Gets 0.6% | Behavior Programming vs Tool Calling | LLM Attention Is Precious

See the code: Python Benchmark | C# Benchmark