Custom LLM with a response cache
Subclass LLM to connect to any provider. Add a cache so repeated prompts are served from memory instead of making another network call.
Before you start
Every tutorial so far passed a model name string and let the framework
pick the provider. This tutorial shows what happens one level down: how
to subclass LLM, implement the one method the framework requires, and
plug in a cache so repeated prompts skip the network call entirely.
Why subclass LLM?
The built-in Ollama integration covers most cases. You subclass LLM
when you need to connect to a provider the framework does not support,
add request-level logic (logging, auth headers, retries with custom
backoff), or wire in a cache backed by Redis, SQLite, or any other store.
The interface is narrow on purpose: implement _generate_impl and the
rest of the framework works as before.
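
As one illustration of request-level logic, a retry helper with exponential backoff might look like the sketch below. The names retry_with_backoff, max_attempts, base_delay, and sleep are hypothetical and not part of the framework; a subclass could call a helper like this from inside _generate_impl.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on OSError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except OSError:
            # urllib.error.URLError subclasses OSError, so network failures land here.
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    raise ValueError("max_attempts must be at least 1")
```

The sleep parameter exists only so the delay can be replaced in tests; in normal use the default time.sleep applies.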
The custom LLM
Subclass LLM and implement one method. The base class calls
_generate_impl only on a cache miss; retries and cache storage are
handled for you.
# my_ollama.py
import json
import urllib.request

from opensymbolicai.llm import LLM, LLMConfig, LLMResponse, TokenUsage


class MyOllamaLLM(LLM):
    _DISPLAY_NAME = "MyOllama"  # name shown in logs and error messages

    def __init__(self, model: str, cache=None) -> None:
        super().__init__(LLMConfig(provider="ollama", model=model), cache=cache)

    def _generate_impl(self, prompt: str, **kwargs) -> LLMResponse:
        # Build the JSON body for the local Ollama server; extra keyword
        # arguments are forwarded as top-level request fields.
        payload = json.dumps({
            "model": self.config.model,
            "prompt": prompt,
            "stream": False,
            **kwargs,
        }).encode()
        request = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as resp:
            data = json.loads(resp.read())
        # Map the raw response onto the framework's LLMResponse type.
        return LLMResponse(
            text=data["response"],
            usage=TokenUsage(
                input_tokens=data.get("prompt_eval_count", 0),
                output_tokens=data.get("eval_count", 0),
            ),
            provider="my-ollama",
            model=self.config.model,
        )

_generate_impl makes the HTTP call and maps the raw response to
LLMResponse. That is the entire integration. The base class wraps it
with retry logic and cache lookup.
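
The cache-then-delegate flow described above can be pictured with a toy base class. This is a minimal sketch, not the framework's code: SketchLLM, EchoLLM, the public generate method, and the tuple cache key are all hypothetical stand-ins, and retries are left out.

```python
from abc import ABC, abstractmethod


class SketchLLM(ABC):
    """Hypothetical stand-in for the framework's LLM base class."""

    def __init__(self, model: str) -> None:
        self.model = model
        self._cache: dict[tuple[str, str], str] = {}

    def generate(self, prompt: str) -> str:
        key = (self.model, prompt)  # simplified key, for illustration only
        if key in self._cache:
            return self._cache[key]  # cache hit: the subclass is not called
        text = self._generate_impl(prompt)  # cache miss: delegate to the subclass
        self._cache[key] = text
        return text

    @abstractmethod
    def _generate_impl(self, prompt: str) -> str:
        """Subclasses make the actual provider call here."""


class EchoLLM(SketchLLM):
    """Toy provider that counts how often it is really called."""

    def __init__(self) -> None:
        super().__init__(model="echo")
        self.calls = 0

    def _generate_impl(self, prompt: str) -> str:
        self.calls += 1
        return prompt.upper()


llm = EchoLLM()
first = llm.generate("hello")
second = llm.generate("hello")  # served from the cache
print(first, second, llm.calls)
```

The second call returns the stored text without reaching _generate_impl, so the call counter stays at 1.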
The cache
Subclass LLMCache and implement four methods: get, set, delete,
clear. InMemoryCache stores responses in a plain dict and counts
hits and misses:
# my_ollama.py (continued)
from opensymbolicai.llm import CacheEntry, LLMCache


class InMemoryCache(LLMCache):
    def __init__(self) -> None:
        self._store: dict[str, CacheEntry] = {}
        self._hits = 0
        self._misses = 0

    @property
    def hits(self) -> int:
        return self._hits

    @property
    def misses(self) -> int:
        return self._misses

    def reset(self) -> None:
        # Zero the counters without touching the stored entries.
        self._hits = 0
        self._misses = 0

    def get(self, key: str) -> CacheEntry | None:
        # Every lookup is counted as either a hit or a miss.
        entry = self._store.get(key)
        if entry is not None:
            self._hits += 1
        else:
            self._misses += 1
        return entry

    def set(self, key: str, entry: CacheEntry) -> None:
        self._store[key] = entry

    def delete(self, key: str) -> bool:
        # True only if the key was present.
        return self._store.pop(key, None) is not None

    def clear(self) -> None:
        self._store.clear()

Wire the two together and pass the result to any agent:
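# main.py
from my_ollama import InMemoryCache, MyOllamaLLM

# CalculatorAgent is assumed to be defined or imported as in the earlier tutorials.
cache = InMemoryCache()
llm = MyOllamaLLM(model="qwen2.5-coder:7b", cache=cache)
agent = CalculatorAgent(llm=llm)

The agent does not need to know a custom LLM is in use. Any agent that
takes a model name string also accepts an LLM instance via llm=.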
Run it
uv add opensymbolicai-core
ollama pull qwen2.5-coder:7b
uv run main.py

main.py asks three calculator questions twice. The first run goes to
Ollama; the second is served from the in-memory cache.
Sample output
Run 1 (cold cache)
----------------------------------------
Q: What is 12 multiplied by 8?
Plan : result = multiply(12, 8)
return result
Answer : 96
Q: What is 144 divided by 12?
Plan : result = divide(144, 12)
return result
Answer : 12.0
Q: What is 50 plus 37?
Plan : result = add(50, 37)
return result
Answer : 87
Cache — hits: 0 misses: 3
Run 2 (warm cache)
----------------------------------------
Q: What is 12 multiplied by 8?
Plan : result = multiply(12, 8)
return result
Answer : 96
Q: What is 144 divided by 12?
Plan : result = divide(144, 12)
return result
Answer : 12.0
Q: What is 50 plus 37?
Plan : result = add(50, 37)
return result
Answer : 87
Cache — hits: 3 misses: 0

What to notice
_generate_impl is never called in run 2. The base class computes a
SHA-256 key from the model config and the prompt. On the second run each
prompt matches a stored key and the cached LLMResponse is returned
directly. The Ollama server receives no requests.
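
The key derivation can be sketched with the standard library. The function name cache_key, the dict-shaped config, and the JSON serialization are assumptions made for illustration; the framework's exact hashing input is not shown here.

```python
import hashlib
import json


def cache_key(model_config: dict, prompt: str) -> str:
    """Hash the config and prompt into a fixed-length hex key."""
    # sort_keys makes the JSON text independent of dict insertion order,
    # so equal configs always hash to the same key.
    material = json.dumps({"config": model_config, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


config = {"provider": "ollama", "model": "qwen2.5-coder:7b"}
reordered = {"model": "qwen2.5-coder:7b", "provider": "ollama"}

key_a = cache_key(config, "What is 12 multiplied by 8?")
key_b = cache_key(reordered, "What is 12 multiplied by 8?")  # same config, same prompt
key_c = cache_key(config, "What is 50 plus 37?")  # different prompt
```

Here key_a and key_b are identical while key_c differs. Under this scheme, changing the model or the prompt in any way yields a new key and therefore a cache miss.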
The cache is shared across agent instances. main.py creates a fresh
CalculatorAgent for each question but passes the same llm object.
Because the LLM holds the cache, all three agents share the same
response store. Caching at the LLM level is more useful than caching at
the agent level: every agent that uses the same LLM instance benefits.
_DISPLAY_NAME appears in logs. Setting it to "MyOllama" means
trace output and error messages say MyOllama rather than the default
class name. A small detail, but useful when running multiple LLM backends
in the same process.
The integration surface is narrow by design. Two things to implement:
| What you implement | What you get for free |
|---|---|
| _generate_impl | retries, cache lookup and store |
| LLMCache.get/set/delete/clear | called automatically by every LLM |
Swapping InMemoryCache for a Redis-backed implementation means changing
the class and nothing else. The agent, the LLM class, and the framework
all stay the same.
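
As a rough sketch of what a different backing store involves, here is the same four-method shape on top of SQLite from the standard library, used in place of Redis so the example runs without a server. SQLiteTextCache is a hypothetical name, and it stores plain strings because the fields of CacheEntry are not shown in this tutorial; a real backend would subclass LLMCache and serialize CacheEntry instead.

```python
import sqlite3
from typing import Optional


class SQLiteTextCache:
    """Hypothetical persistent cache with the same four methods as LLMCache."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def get(self, key: str) -> Optional[str]:
        row = self._db.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def set(self, key: str, value: str) -> None:
        # INSERT OR REPLACE overwrites an existing row with the same key.
        with self._db:
            self._db.execute(
                "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)", (key, value)
            )

    def delete(self, key: str) -> bool:
        with self._db:
            cursor = self._db.execute("DELETE FROM cache WHERE key = ?", (key,))
        return cursor.rowcount > 0  # True only if a row was removed

    def clear(self) -> None:
        with self._db:
            self._db.execute("DELETE FROM cache")


cache = SQLiteTextCache()
cache.set("k1", "96")
```

Passing a file path instead of the default ":memory:" would let cached responses survive a process restart, which the dict-based InMemoryCache cannot do.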