# Change Everything, Change Nothing: MultiHopRAG in Python and C#
We swapped the language, the vector store, the code executor, and the type system. Accuracy moved by 0.9pp. The framework is the invariant, not the infrastructure.
We changed the programming language, the vector store, the code execution engine, the type system, the concurrency model, and the metadata extraction strategy. Then we ran the same 2,556-query benchmark.
Accuracy moved by 0.9 percentage points.
| Metric | Python | C# | Delta |
|---|---|---|---|
| Overall Accuracy | 82.9% | 83.8% | +0.9pp |
| Goals Achieved | 99.6% | 100% | +0.4pp |
| Avg Iterations | 1.9 | 1.4 | -26% |
Both runs use the same model (gpt-oss-120b via Fireworks AI) and the same embeddings (nomic-embed-text). Same model, same framework abstractions, different everything else.
Both outperform every published baseline on MultiHopRAG. The best prior method (IRCoT + RAG) reaches 75.0%. OpenSymbolicAI reaches roughly 83% regardless of what sits underneath.
What matters is the framework: the abstractions that structure how the LLM reasons about the problem. Change everything below that layer and the accuracy holds.
## What MultiHopRAG Tests
MultiHopRAG evaluates whether a system can answer questions that require connecting facts across multiple documents. The dataset contains 609 news articles and 2,556 queries split into four types:
- Inference (32%): Connect facts across articles to identify a person, event, or outcome
- Comparison (33%): Compare claims between named news sources
- Temporal (23%): Assess consistency across different time periods
- Null (12%): Recognize when the corpus lacks sufficient information
Single-hop retrieval fails here. A question like "Who was the individual associated with cryptocurrency who was found guilty?" requires retrieving articles about the crypto industry and a trial, extracting a specific name, and cross-referencing across sources. The system needs to plan its retrieval, not just execute it.
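The shape of that plan can be sketched as a two-hop loop. This is a toy illustration, not the framework's API: `search` is a naive keyword retriever and `extract_name` stands in for an LLM-backed entity extractor.

```python
# Hypothetical sketch: why two hops beat one. `search` and `extract_name`
# are stand-ins for a real retriever and an LLM-backed extractor.
def search(query, corpus):
    """Naive keyword retriever: return docs sharing any word with the query."""
    words = set(query.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]

def two_hop(question, corpus, extract_name):
    hop1 = search(question, corpus)    # hop 1: broad retrieval on the question
    name = extract_name(hop1)          # bridge fact (e.g. a person's name)
    hop2 = search(name, corpus)        # hop 2: targeted retrieval on the name
    return hop1 + hop2                 # combined evidence for synthesis
```

The second hop reaches documents that share no vocabulary with the original question, which is exactly where single-hop retrieval stops short.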
## Per-Type Breakdown
| Query Type | Python | C# | Best Prior (IRCoT + RAG) |
|---|---|---|---|
| Inference | 88.0% | 91.3% | 80.1% |
| Comparison | 78.2% | 78.4% | 66.2% |
| Temporal | 76.5% | 74.4% | 60.4% |
| Null | 94.7% | 97.3% | 93.5% |
Both implementations hold above 74% on every query type. Prior methods collapse on comparison and temporal queries where multi-hop reasoning is required. OpenSymbolicAI does not.
## Everything That Changed
The two implementations share almost nothing at the infrastructure level.
| Component | Python | C# |
|---|---|---|
| Language | Python 3.12 | C# / .NET 10 |
| Vector Store | ChromaDB (external process) | LiteDB (embedded, single-file) |
| Code Execution | AST sanitizer + exec() | Roslyn scripting + PlanValidator |
| Metadata | Runtime introspection | Source generators at compile time |
| Type System | Pydantic GoalContext | Generic GoalSeeking<MultiHopContext> |
| Concurrency | Synchronous | async/await throughout |
| Packaging | pyproject.toml + uv | NuGet + dotnet CLI |
This is not swapping Flask for FastAPI. This is replacing the entire technology stack: runtime, database, execution model, type system, build system. If infrastructure determined correctness, these two implementations would produce different results. They don't.
## Everything That Stayed the Same
| Abstraction | Role |
|---|---|
| Model | gpt-oss-120b via Fireworks AI |
| Embeddings | nomic-embed-text via Fireworks AI |
| GoalSeeking blueprint | Iterative plan-execute-evaluate loop |
| 10 primitives | The LLM's toolkit for retrieval and reasoning |
| 7 decompositions | Few-shot patterns teaching multi-hop composition |
| Introspection boundary | Converts raw execution into structured context |
| Static evaluator | Checks: sufficient evidence + answer ready? |
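The control flow these abstractions describe can be sketched as a plan-execute-evaluate loop. All names below (`goal_seeking`, `update`, the dict-shaped context) are illustrative stand-ins, not the framework's actual API:

```python
# Minimal sketch of the GoalSeeking loop; names are illustrative.
def goal_seeking(question, plan, execute, evaluate, max_iterations=5):
    context = {"evidence": [], "answer": None, "sufficient": False}
    for iteration in range(1, max_iterations + 1):
        code = plan(question, context)      # LLM drafts an executable plan
        result = execute(code)              # run it against the primitives
        context = update(context, result)   # introspection boundary
        if evaluate(context):               # static check passed: done
            return context["answer"], iteration
    return context["answer"], max_iterations

def update(context, result):
    # Fold raw execution results into structured context fields.
    return {"evidence": context["evidence"] + result.get("evidence", []),
            "answer": result.get("answer", context["answer"]),
            "sufficient": result.get("sufficient", context["sufficient"])}

def evaluate(context):
    # Static evaluator: sufficient evidence AND an answer ready.
    return context["sufficient"] and context["answer"] is not None
```

The evaluator is deliberately dumb: it never calls the model, it just inspects two structured fields. That is what keeps the loop's stopping condition language-independent.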
Primitives define the LLM's toolkit. Both implementations expose the same 10: Retrieve, RetrieveByCategory, RetrieveBySource, RetrieveFiltered, ExtractEvidence, IdentifyEntities, GenerateNextQuery, SynthesizeAnswer, AssessSufficiency, and CombineContexts. Different syntax, same contract:
In Python:

```python
@primitive(read_only=True)
def retrieve(self, query: str, k: int = 10) -> list[Document]:
    return self.retriever.query(query, k=k)
```

In C#:
```csharp
[Primitive(ReadOnly = true)]
public Task<List<Document>> Retrieve(string query, int k = 10)
    => _retriever.QueryAsync(query, k);
```

Decompositions teach the LLM how to compose those primitives. Both implementations provide the same 7 few-shot patterns: two-hop inference, source comparison, sufficiency check, consistency comparison, cross-source entity resolution, temporal source comparison, and yes/no temporal consistency. The logic is the same. Only the syntax changes.
The introspection boundary (UpdateContext) converts raw execution results into structured context: evidence pieces, entities found, queries tried, sufficiency status, current answer. The planner and evaluator never see raw traces. They see structured fields. This abstraction between "what happened" and "what we know" is language-independent.
## Code Generation Across Languages
The LLM generates executable plans in both Python and C#. The plans are structurally identical.
For a two-hop inference query, the generated Python plan:

```python
docs = self.retrieve("cryptocurrency trial guilty verdict", k=5)
evidence = self.extract_evidence(self.combine_contexts(docs), question)
docs2 = self.retrieve(self.generate_next_query(question, evidence), k=5)
answer = self.synthesize_answer(question, evidence + self.combine_contexts(docs2))
self.assess_sufficiency(question, answer)
```

The generated C# plan for the same query type:
```csharp
var docs = await Retrieve("cryptocurrency trial guilty verdict", k: 5);
var evidence = await ExtractEvidence(CombineContexts(docs), question);
var docs2 = await Retrieve(await GenerateNextQuery(question, evidence), k: 5);
var answer = await SynthesizeAnswer(question, evidence + "\n" + CombineContexts(docs2));
await AssessSufficiency(question, answer);
```

Same logic. Same flow. `self.retrieve` becomes `await Retrieve`. The framework provides the structure. The LLM fills in the syntax.
## vs Published Baselines
Results on the full 2,556-query dataset:
| Method | Overall | Inference | Comparison | Temporal | Null |
|---|---|---|---|---|---|
| RAG (BM25) | 38.0% | 44.2% | 33.9% | 26.8% | 55.1% |
| RAG (Dense) | 41.7% | 47.8% | 37.0% | 29.7% | 60.5% |
| IRCoT + RAG | 75.0% | 80.1% | 66.2% | 60.4% | 93.5% |
| Community-GraphRAG | 72.4% | 79.1% | 66.0% | 61.2% | 83.4% |
| OpenSymbolicAI (Python) | 82.9% | 88.0% | 78.2% | 76.5% | 94.7% |
| OpenSymbolicAI (C#) | 83.8% | 91.3% | 78.4% | 74.4% | 97.3% |
Both implementations beat the best prior method by 8-9 percentage points overall. The gap is widest on comparison queries (+12pp) and temporal queries (+14-16pp), the two types that require genuine multi-hop reasoning.
## The Invariant
Most agent frameworks are inseparable from their language. The abstractions live in the runtime, and if you change the language, you start from scratch.
OpenSymbolicAI's abstractions are language-independent. They structure how the LLM plans and reasons, not how the code executes. Swap the entire stack underneath and correctness holds. Teams can build agents in the language their stack already uses. Python for data science and ML pipelines. C# for enterprise .NET backends. Same patterns, same accuracy.
## Reproduce It
```shell
# Python
git clone https://github.com/OpenSymbolicAI/benchmark-py-MultiHopRAG
cd benchmark-py-MultiHopRAG
uv sync && python setup_data.py
uv run multihop-rag --demo
```

```shell
# C#
git clone https://github.com/OpenSymbolicAI/benchmark-cs-MultiHopRAG
cd benchmark-cs-MultiHopRAG
dotnet run -- --setup-data
dotnet run -- --demo
```

Read more: TravelPlanner: 97.9% Where GPT-4 Gets 0.6% | Behavior Programming vs Tool Calling | LLM Attention Is Precious
See the code: Python Benchmark | C# Benchmark