Custom LLM with a response cache

Every tutorial so far passed a model name string and let the framework pick the provider. This tutorial shows what happens one level down: how to subclass LLM, implement the one method the framework requires, and plug in a cache so repeated prompts are answered without another model call.

Why subclass LLM?

The built-in Ollama integration covers most cases. You subclass LLM when you need to connect to a provider the framework does not support, add request-level logic (logging, auth headers, retries with custom backoff), or wire in a cache backed by Redis, SQLite, or any other store.

The interface is narrow on purpose: implement _generate_impl and the rest of the framework works as before.
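As one illustration of the request-level logic mentioned above, here is a minimal sketch of a helper that attaches an auth header to a JSON request. The function name build_request, the bearer-token scheme, and the example URL are all hypothetical; a helper like this could be called from inside _generate_impl when talking to a provider that requires authentication.

```python
import json
import urllib.request
from typing import Optional


def build_request(url: str, body: dict, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a JSON POST request, adding a bearer token when one is given."""
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        # Hypothetical auth scheme; use whatever your provider expects.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers=headers,
        method="POST",
    )


# Hypothetical endpoint, used only to show the resulting request object.
req = build_request("https://llm.example.com/v1/generate", {"prompt": "hi"}, api_key="secret")
print(req.get_header("Authorization"))
```

Nothing is sent over the network here; the request object is only constructed and inspected.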

The custom LLM

Subclass LLM and implement one method. The base class calls _generate_impl only on a cache miss; retries and cache storage are handled for you.

python

# my_ollama.py
import json
import urllib.request

from opensymbolicai.llm import LLM, LLMConfig, LLMResponse, TokenUsage

class MyOllamaLLM(LLM):

    _DISPLAY_NAME = "MyOllama"

    def __init__(self, model: str, cache=None) -> None:
        super().__init__(LLMConfig(provider="ollama", model=model), cache=cache)

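    def _generate_impl(self, prompt: str, **kwargs) -> LLMResponse:
        # Build the JSON body for Ollama's /api/generate endpoint.
        # Extra kwargs are merged in last, so they can override the keys above.
        payload = json.dumps({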
            "model": self.config.model,
            "prompt": prompt,
            "stream": False,
            **kwargs,
        }).encode()

        request = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )

        # With "stream": False, Ollama replies with a single JSON object.
        with urllib.request.urlopen(request) as resp:
            data = json.loads(resp.read())

        return LLMResponse(
            text=data["response"],
            usage=TokenUsage(
                input_tokens=data.get("prompt_eval_count", 0),
                output_tokens=data.get("eval_count", 0),
            ),
            provider="my-ollama",
            model=self.config.model,
        )

_generate_impl makes the HTTP call and maps the raw response to LLMResponse. That is the entire integration. The base class wraps it with retry logic and cache lookup.
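To make the division of labour concrete, here is a toy sketch of the kind of wrapper a base class could put around _generate_impl: look in the cache, call the implementation on a miss, retry on failure, and store the result. This is not the framework's actual code. The function generate_with_cache, its parameters, the plain-dict cache, and the use of the raw prompt as the key are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict


def generate_with_cache(
    prompt: str,
    impl: Callable[[str], str],
    cache: Dict[str, str],
    max_retries: int = 3,
    backoff_seconds: float = 0.0,
) -> str:
    """Return a cached response, or call impl (with retries) and store the result."""
    if prompt in cache:
        return cache[prompt]  # cache hit: impl is never called

    last_error = None
    for attempt in range(max_retries):
        try:
            response = impl(prompt)
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            last_error = exc
            time.sleep(backoff_seconds * (attempt + 1))  # linear backoff
            continue
        cache[prompt] = response  # store only successful responses
        return response
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error


calls = []


def flaky_impl(prompt: str) -> str:
    """Stand-in for _generate_impl: fail once, then succeed."""
    calls.append(prompt)
    if len(calls) == 1:
        raise ConnectionError("simulated network failure")
    return prompt.upper()


cache: Dict[str, str] = {}
first = generate_with_cache("hello", flaky_impl, cache)   # one failure, one success
second = generate_with_cache("hello", flaky_impl, cache)  # served from the cache
print(first, second, len(calls))  # HELLO HELLO 2
```

The stand-in implementation is called twice in total: once for the failed attempt and once for the retry. The second request never reaches it.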

The cache

Subclass LLMCache and implement four methods: get, set, delete, clear. InMemoryCache stores responses in a plain dict and counts hits and misses:

python

# my_ollama.py (continued)
from opensymbolicai.llm import CacheEntry, LLMCache

class InMemoryCache(LLMCache):

    def __init__(self) -> None:
        self._store: dict[str, CacheEntry] = {}
        self._hits = 0
        self._misses = 0

    @property
    def hits(self) -> int:
        return self._hits

    @property
    def misses(self) -> int:
        return self._misses

    def reset(self) -> None:
        # Zero the counters only; stored entries are kept.
        self._hits = 0
        self._misses = 0

    def get(self, key: str) -> CacheEntry | None:
        entry = self._store.get(key)
        if entry is not None:
            self._hits += 1
        else:
            self._misses += 1
        return entry

    def set(self, key: str, entry: CacheEntry) -> None:
        self._store[key] = entry

    def delete(self, key: str) -> bool:
        # True if the key was present and has been removed.
        return self._store.pop(key, None) is not None

    def clear(self) -> None:
        self._store.clear()

Wire the two together and pass the result to any agent:

python

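# main.py
from my_ollama import InMemoryCache, MyOllamaLLM

# CalculatorAgent is the agent class from the earlier tutorials; import it
# from wherever you defined it.
cache = InMemoryCache()
llm = MyOllamaLLM(model="qwen2.5-coder:7b", cache=cache)
agent = CalculatorAgent(llm=llm)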

The agent does not need to know a custom LLM is in use. Any agent that takes a model name string also accepts an LLM instance via llm=.

Run it

bash

uv add opensymbolicai-core
ollama pull qwen2.5-coder:7b
uv run main.py

main.py asks three calculator questions twice. The first run goes to Ollama; the second is served from the in-memory cache.

Sample output

text

Run 1 (cold cache)
----------------------------------------
Q: What is 12 multiplied by 8?
   Plan   : result = multiply(12, 8)
            return result
   Answer : 96
Q: What is 144 divided by 12?
   Plan   : result = divide(144, 12)
            return result
   Answer : 12.0
Q: What is 50 plus 37?
   Plan   : result = add(50, 37)
            return result
   Answer : 87

Cache — hits: 0  misses: 3

Run 2 (warm cache)
----------------------------------------
Q: What is 12 multiplied by 8?
   Plan   : result = multiply(12, 8)
            return result
   Answer : 96
Q: What is 144 divided by 12?
   Plan   : result = divide(144, 12)
            return result
   Answer : 12.0
Q: What is 50 plus 37?
   Plan   : result = add(50, 37)
            return result
   Answer : 87

Cache — hits: 3  misses: 0

What to notice

_generate_impl is never called in run 2. The base class computes a SHA-256 key from the model config and the prompt. On the second run each prompt matches a stored key and the cached LLMResponse is returned directly. The Ollama server receives no requests.
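The key derivation described above can be sketched as follows. Which config fields the framework hashes and how it serializes them are not shown in this tutorial, so the function cache_key, its field names, and the JSON serialization are assumptions; only the use of SHA-256 over config plus prompt comes from the text.

```python
import hashlib
import json


def cache_key(provider: str, model: str, prompt: str) -> str:
    """Derive a deterministic hex key from config fields and the prompt."""
    # sort_keys makes the serialization independent of dict ordering.
    material = json.dumps(
        {"provider": provider, "model": model, "prompt": prompt},
        sort_keys=True,
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


k1 = cache_key("ollama", "qwen2.5-coder:7b", "What is 12 multiplied by 8?")
k2 = cache_key("ollama", "qwen2.5-coder:7b", "What is 12 multiplied by 8?")
k3 = cache_key("ollama", "qwen2.5-coder:7b", "What is 50 plus 37?")
print(len(k1), k1 == k2, k1 == k3)  # 64 True False
```

Identical config and prompt always give the same 64-character key, which is why the second pass hits the cache. Changing either the prompt or the model gives a different key.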

The cache is shared across agent instances. main.py creates a fresh CalculatorAgent for each question but passes the same llm object. Because the LLM holds the cache, all three agents share the same response store. Caching at the LLM level is more useful than caching at the agent level: every agent that uses the same LLM instance benefits.

_DISPLAY_NAME appears in logs. Setting it to "MyOllama" means trace output and error messages say MyOllama rather than the default class name. A small detail, but useful when running multiple LLM backends in the same process.

The integration surface is narrow by design. Two things to implement:

What you implement	What you get for free
`_generate_impl`	retries, cache lookup and store
`LLMCache.get/set/delete/clear`	called automatically by every `LLM`

Swapping InMemoryCache for a Redis-backed implementation means changing the cache object passed to MyOllamaLLM and nothing else. The agent, the LLM class, and the framework all stay the same.
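As a rough sketch of what a persistent backend might look like, the class below exposes the same four methods on top of SQLite instead of Redis, because sqlite3 ships with Python. SQLiteStore is a hypothetical name, and to keep the example self-contained it does not inherit from LLMCache. A real version would subclass LLMCache and store CacheEntry objects, assuming they can be pickled.

```python
import pickle
import sqlite3
from typing import Any, Optional


class SQLiteStore:
    """Key-value store with the same four methods as the in-memory cache."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS entries (key TEXT PRIMARY KEY, value BLOB)"
        )
        self._db.commit()

    def get(self, key: str) -> Optional[Any]:
        row = self._db.execute(
            "SELECT value FROM entries WHERE key = ?", (key,)
        ).fetchone()
        # Only unpickle data this process (or one you trust) wrote.
        return None if row is None else pickle.loads(row[0])

    def set(self, key: str, entry: Any) -> None:
        self._db.execute(
            "INSERT OR REPLACE INTO entries (key, value) VALUES (?, ?)",
            (key, pickle.dumps(entry)),
        )
        self._db.commit()

    def delete(self, key: str) -> bool:
        cursor = self._db.execute("DELETE FROM entries WHERE key = ?", (key,))
        self._db.commit()
        return cursor.rowcount > 0  # True if a row was removed

    def clear(self) -> None:
        self._db.execute("DELETE FROM entries")
        self._db.commit()


store = SQLiteStore()  # pass a file path instead to persist across runs
store.set("k", {"text": "96"})
print(store.get("k"), store.delete("k"), store.delete("k"), store.get("k"))
# {'text': '96'} True False None
```

With a file path in place of ":memory:", cached responses would survive a restart, so a second invocation of main.py could start with a warm cache.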