Third Language, Same Result: MultiHopRAG in Go

Go joins Python and C# on the MultiHopRAG benchmark. Different runtime, different vector store, single static binary. Accuracy: 81.6%. The framework holds.

OpenSymbolicAI Team | March 24, 2026 | 4 min read
Tags: benchmark, cross-language, code-generation, agents, reliability, go

We added a third language. Go compiles to a single static binary, uses goroutines instead of OS threads, and embeds its vector store in-process. No separate runtime to install, no database server, no external dependencies at deploy time.

Accuracy: 81.6%. Within 2.2 percentage points of Python and C#.

| Metric | Python | C# | Go | Delta (max) |
| --- | --- | --- | --- | --- |
| Overall Accuracy | 82.9% | 83.8% | 81.6% | 2.2pp |
| Avg Iterations | 1.9 | 1.4 | 2.6 | 1.2 |

All three use the same model (gpt-oss-120b via Fireworks AI) and the same embeddings (nomic-embed-text). Same framework abstractions. Different everything else.

All three outperform every published baseline on MultiHopRAG.

Per-Type Breakdown

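| Query Type | Python | C# | Go | Best Prior (IRCoT + RAG) |
| --- | --- | --- | --- | --- |
| Inference | 88.0% | 91.3% | 87.7% | 80.1% |
| Comparison | 78.2% | 78.4% | 73.9% | 66.2% |
| Temporal | 76.5% | 74.4% | 75.0% | 60.4% |
| Null | 94.7% | 97.3% | 99.3% | 93.5% |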

Go scores highest on null queries at 99.3%, misclassifying only 2 out of 301. On temporal queries it lands between Python and C#. The gap is on comparison queries, where metadata filtering limitations in the embedded vector store make source-specific retrieval harder.

What Changed This Time

| Component | Python | C# | Go |
| --- | --- | --- | --- |
| Language | Python 3.12 | C# / .NET 10 | Go 1.22 |
| Vector Store | ChromaDB (external) | LiteDB (embedded) | chromem-go (embedded) |
| Code Execution | AST sanitizer + exec() | Roslyn scripting | AST interpreter (core-go) |
| Metadata | Runtime introspection | Source generators | go generate directives |
| Type System | Pydantic GoalContext | Generic GoalSeeking<T> | Go structs + interfaces |
| Concurrency | Synchronous | async/await | Goroutines + semaphore |
| Packaging | pyproject.toml + uv | NuGet + dotnet CLI | go.mod -> single binary |

Three languages. Three runtimes. Three vector stores. Three concurrency models. The framework abstractions (10 primitives, the introspection boundary, the goal-seeking loop) remain identical. The decompositions are intentionally different: Go uses 13 phased patterns where Python and C# use 7. More on that below.

What's Different About Go

Single binary deployment. go build -o multihop-rag . produces one executable. No Python environment, no .NET runtime, no database server. The vector store is embedded in the process and persists to disk.

Goroutine parallelism. The benchmark runs 10 parallel workers using a semaphore pattern. Each worker gets its own LLM client and agent instance with no shared state.
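
A minimal sketch of the semaphore pattern described above, using only the standard library. The `Agent` type, its `Answer` method, and `RunAll` are hypothetical stand-ins and not the benchmark's actual code; this sketch also creates a fresh agent per question rather than per long-lived worker, which is a simplification.

```go
package main

import (
	"fmt"
	"sync"
)

// Agent is a hypothetical stand-in for a per-worker agent plus LLM client.
type Agent struct{ id int }

// Answer is a placeholder for the real retrieve/extract/synthesize pipeline.
func (a *Agent) Answer(q string) string { return "answer to: " + q }

// RunAll answers every question with at most maxWorkers goroutines in flight.
func RunAll(questions []string, maxWorkers int) []string {
	results := make([]string, len(questions))
	sem := make(chan struct{}, maxWorkers) // buffered channel as counting semaphore
	var wg sync.WaitGroup

	for i, q := range questions {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks while maxWorkers are busy
		go func(i int, q string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done
			agent := &Agent{id: i}   // own instance, no shared state
			results[i] = agent.Answer(q) // each goroutine writes only its own index
		}(i, q)
	}
	wg.Wait()
	return results
}

func main() {
	out := RunAll([]string{"q1", "q2", "q3"}, 10)
	fmt.Println(len(out), out[0])
}
```

Because each goroutine writes to a distinct slice index and owns its agent, no mutex is needed around the results.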

Phased decompositions. The Go implementation uses 13 decomposition patterns (vs 7 in Python/C#), split into hop-1 (retrieve + extract + assess) and hop-2 (assess + synthesize). This forces the agent to spread work across iterations rather than cramming everything into one plan. The trade-off: higher average iterations (2.6 vs 1.4-1.9) but more thorough evidence gathering.
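
One way to picture the phase split is a loop that runs a hop-1 plan per iteration until the evidence is judged sufficient, then a single hop-2 plan to synthesize. This is an illustrative sketch only: `State`, `Phase`, and `Run` are hypothetical names, and the real framework's loop and its 13 patterns are not shown in this post.

```go
package main

import "fmt"

// State is a hypothetical goal context carried between iterations.
type State struct {
	Evidence   []string
	Sufficient bool
	Answer     string
	Iterations int
}

// Phase is one plan executed within a single iteration.
type Phase func(s *State)

// Run executes hop1 until the evidence is sufficient or only one iteration
// of the budget remains, then runs hop2 once to produce the answer.
func Run(hop1, hop2 Phase, maxIter int) State {
	var s State
	for s.Iterations < maxIter-1 && !s.Sufficient {
		hop1(&s)
		s.Iterations++
	}
	hop2(&s) // synthesize with whatever evidence was gathered
	s.Iterations++
	return s
}

func main() {
	// Stub phases: hop-1 gathers one piece of evidence per iteration and
	// declares sufficiency at two pieces; hop-2 joins them into an answer.
	hop1 := func(s *State) {
		s.Evidence = append(s.Evidence, fmt.Sprintf("fact %d", len(s.Evidence)+1))
		s.Sufficient = len(s.Evidence) >= 2
	}
	hop2 := func(s *State) { s.Answer = fmt.Sprint(s.Evidence) }

	final := Run(hop1, hop2, 5)
	fmt.Println(final.Iterations, final.Answer)
}
```

Separating the phases guarantees at least two iterations per question in this sketch, which is consistent with a higher average iteration count than a single-plan approach.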

Metadata filtering limits. chromem-go supports exact-match metadata filters only. Date range queries require post-filtering in Go code, while ChromaDB handles them natively. This likely accounts for the comparison query gap, where source-specific retrieval is less precise with exact-match constraints.
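
The post-filtering step can be sketched as follows, assuming retrieved documents carry string-only metadata. The `Doc` type, the `published_at` key, its `YYYY-MM-DD` format, and the sample data are all assumptions for illustration; this does not call chromem-go itself.

```go
package main

import (
	"fmt"
	"time"
)

// Doc is a hypothetical retrieved chunk with string-only metadata.
type Doc struct {
	ID       string
	Metadata map[string]string
}

// matchExact mimics an exact-match metadata filter: every key in where
// must be present on the document with an identical value.
func matchExact(d Doc, where map[string]string) bool {
	for k, v := range where {
		if d.Metadata[k] != v {
			return false
		}
	}
	return true
}

// FilterByDateRange applies the exact-match filter, then keeps only docs
// whose "published_at" date (assumed key and format) lies in [from, to].
func FilterByDateRange(docs []Doc, where map[string]string, from, to time.Time) []Doc {
	var kept []Doc
	for _, d := range docs {
		if !matchExact(d, where) {
			continue
		}
		t, err := time.Parse("2006-01-02", d.Metadata["published_at"])
		if err != nil {
			continue // skip docs with a missing or malformed date
		}
		if t.Before(from) || t.After(to) {
			continue
		}
		kept = append(kept, d)
	}
	return kept
}

// mustDate parses a YYYY-MM-DD string and panics on bad input.
func mustDate(s string) time.Time {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		panic(err)
	}
	return t
}

// sampleDocs returns made-up documents for the demo.
func sampleDocs() []Doc {
	return []Doc{
		{ID: "a", Metadata: map[string]string{"source": "TechCrunch", "published_at": "2023-10-05"}},
		{ID: "b", Metadata: map[string]string{"source": "TechCrunch", "published_at": "2023-12-01"}},
		{ID: "c", Metadata: map[string]string{"source": "The Verge", "published_at": "2023-10-10"}},
		{ID: "d", Metadata: map[string]string{"source": "TechCrunch", "published_at": "unknown"}},
	}
}

func main() {
	where := map[string]string{"source": "TechCrunch"}
	hits := FilterByDateRange(sampleDocs(), where, mustDate("2023-10-01"), mustDate("2023-10-31"))
	for _, d := range hits {
		fmt.Println(d.ID, d.Metadata["published_at"])
	}
}
```

One consequence of filtering after retrieval is that the top-k results may shrink below k, so a caller would typically over-fetch before applying the range check.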

Code Generation Across Three Languages

The same query, planned in three languages:

```python
# Python
docs = self.retrieve("cryptocurrency trial guilty verdict", k=5)
evidence = self.extract_evidence(self.combine_contexts(docs), question)
self.assess_sufficiency(question, evidence)
```

```csharp
// C#
var docs = await Retrieve("cryptocurrency trial guilty verdict", k: 5);
var evidence = await ExtractEvidence(CombineContexts(docs), question);
await AssessSufficiency(question, evidence);
```

```go
// Go
docs := self.Retrieve("cryptocurrency trial guilty verdict", 5)
evidence := self.ExtractEvidence(self.CombineContexts(docs), question)
self.AssessSufficiency(question, evidence)
```

Same logic. Same flow. The LLM adapts the syntax to the target language. The framework provides the structure.
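
Since the Go plan is handled by an AST interpreter rather than compiled, a natural first step is to parse the generated statements and reject any call that is not a known primitive. The sketch below does that with the standard `go/parser` and `go/ast` packages. It is not core-go's implementation: `CheckPlan` is a hypothetical helper, and the allowlist contains only the four primitives visible in the plan above.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// allowed holds the primitive names seen in the example plan; the full
// set used by the framework is not listed here.
var allowed = map[string]bool{
	"Retrieve":          true,
	"CombineContexts":   true,
	"ExtractEvidence":   true,
	"AssessSufficiency": true,
}

// CheckPlan parses plan statements and returns an error for any call
// that is not of the form self.<allowed primitive>(...).
func CheckPlan(plan string) error {
	// Wrap the statements in a function so they parse as a Go file.
	src := "package plan\nfunc run() {\n" + plan + "\n}"
	file, err := parser.ParseFile(token.NewFileSet(), "plan.go", src, 0)
	if err != nil {
		return fmt.Errorf("parse plan: %w", err)
	}
	var bad error
	ast.Inspect(file, func(n ast.Node) bool {
		if bad != nil {
			return false // stop descending after the first violation
		}
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok {
			bad = fmt.Errorf("call is not a method on self")
			return false
		}
		recv, ok := sel.X.(*ast.Ident)
		if !ok || recv.Name != "self" || !allowed[sel.Sel.Name] {
			bad = fmt.Errorf("call %q is not an allowed primitive", sel.Sel.Name)
			return false
		}
		return true // keep walking to check nested calls in the arguments
	})
	return bad
}

func main() {
	plan := `docs := self.Retrieve("cryptocurrency trial guilty verdict", 5)
evidence := self.ExtractEvidence(self.CombineContexts(docs), question)
self.AssessSufficiency(question, evidence)`
	fmt.Println("plan:", CheckPlan(plan))
	fmt.Println("unsafe:", CheckPlan(`os.Exit(1)`))
}
```

The parser only checks syntax, so undefined names such as `self` and `question` are fine at this stage; binding them to real values would be the interpreter's job.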

The Pattern Holds

Three languages is no longer a pair. It's a pattern. Python for ML pipelines. C# for enterprise backends. Go for infrastructure and CLIs. Each team uses its own stack. Each team gets 80%+ accuracy on a benchmark where the best prior method tops out at 75%.

The framework is the invariant.

Reproduce It

```bash
# Go
git clone https://github.com/OpenSymbolicAI/benchmark-go-MultiHopRAG
cd benchmark-go-MultiHopRAG
go build -o multihop-rag .
./multihop-rag --setup-data
./multihop-rag --demo
```

Read more: MultiHopRAG in Python and C# | TravelPlanner: 97.9% Where GPT-4 Gets 0.6% | Behavior Programming vs Tool Calling

See the code: Go Benchmark | Python Benchmark | C# Benchmark