Track 42·Providers & Integration

Two models, one agent

Use a vision model and a text model together in a single agent. The vision-language model (VLM) describes the image; the text model reasons about that description.

Intermediate · 8 min
Browse this tutorial's folder in tutorials-py: github.com/OpenSymbolicAI/tutorials-py/tree/main/42-multi-model

Not every question can be answered by one model. Some questions require reading an image, and a model that reads images well may not reason well over text. This tutorial wires two local models into a single agent: a VLM that sees, and a text model that thinks.

Why two models?

A text model like qwen2.5-coder:7b plans well and produces clean answers, but it cannot accept image input. A vision model like gemma4:e2b can describe images in detail, but it is weaker at structured reasoning.

Splitting the work by capability is more reliable than asking one model to do both. The text model writes the plan and produces the final answer; the VLM's only job is to describe what it sees.

The agent

ImageAgent takes two LLM instances as constructor arguments. The planning model is passed to super().__init__ as llm; the vision model is stored as self._vision_llm. Each primitive calls whichever model fits the task.

python
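# image_agent.py
import base64

# PlanExecute, primitive, and LLM are provided by opensymbolicai-core.
# Their import lines are omitted in this excerpt; see the tutorial folder.


class ImageAgent(PlanExecute):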

    def __init__(self, llm: LLM, vision_llm: LLM) -> None:
        super().__init__(llm=llm)
        self._vision_llm = vision_llm

    @primitive(read_only=True)
    def describe_image(self, path: str) -> str:
        """Use the vision model to describe the contents of an image file.

        Example: describe_image("photo.jpg")
                 -> "A landscape with mountains and a river..."
        """
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        return self._vision_llm.generate(
            "Describe everything you see in this image in detail.",
            images=[b64],
        ).text.strip()

    @primitive(read_only=True)
    def answer(self, question: str, description: str) -> str:
        """Use the text model to answer a specific question from a description.

        Example: answer("What colors are dominant?", "A mountain scene with...")
                 -> "Green and grey are the dominant colors."
        """
        prompt = (
            f"Image description: {description}\n\n"
            f"Question: {question}\n"
            f"Answer in one or two sentences."
        )
        return self._llm.generate(prompt).text.strip()

    @primitive(read_only=True)
    def respond(self, message: str) -> str:
        """Return the message as the final answer."""
        return message

The images=[b64] keyword passes through **kwargs in _generate_impl directly into the Ollama request payload. The same OllamaLLM class therefore works for both text and vision models: each instance names its own model, and Ollama serves the request with that model.
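
For orientation, the request body that reaches Ollama can be sketched roughly as below. This is a minimal illustration based on Ollama's documented /api/generate endpoint, which accepts a list of base64-encoded images. build_generate_payload is a hypothetical helper, not part of opensymbolicai-core, and the library may build its request differently.

```python
import base64
import json
from typing import Optional


def build_generate_payload(
    model: str, prompt: str, image_bytes: Optional[bytes] = None
) -> dict:
    """Build a JSON-serializable body for Ollama's POST /api/generate.

    Hypothetical helper for illustration; not part of opensymbolicai-core.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    if image_bytes is not None:
        # Ollama expects each image as a base64-encoded string.
        payload["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    return payload


text_payload = build_generate_payload(
    "qwen2.5-coder:7b", "Summarize this description."
)
vision_payload = build_generate_payload(
    "gemma4:e2b",
    "Describe everything you see in this image in detail.",
    image_bytes=b"\x89PNG placeholder bytes",  # stand-in, not a real image
)

print(json.dumps(text_payload, indent=2))
print(sorted(vision_payload))
```

The two bodies differ only in the model name and the presence of the images field, which is why one client class can serve both roles.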

Install

bash
uv add opensymbolicai-core
ollama pull qwen2.5-coder:7b
ollama pull gemma4:e2b

A sample image is downloaded automatically on first run.

Run it

bash
uv run main.py

Three questions are asked about the same image:

python
text_llm   = OllamaLLM(model="qwen2.5-coder:7b")
vision_llm = OllamaLLM(model="gemma4:e2b")

for question in QUESTIONS:
    agent = ImageAgent(llm=text_llm, vision_llm=vision_llm)
    result = agent.run(f"Look at {SAMPLE_IMAGE}. {question}")

Sample output

text
Text model  : qwen2.5-coder:7b
Vision model: gemma4:e2b
Image       : sample.jpg
==================================================

Q: What is in this image?
   Plan  : description = describe_image('sample.jpg')
           result = answer('What is in this image?', description)
           return respond(result)
   Answer: A panoramic view of a rugged, mountainous landscape with a deep
           valley, dense forestation, and clear bright lighting.

Q: What colors are most prominent?
   Plan  : description = describe_image('sample.jpg')
           colors = answer('What colors are most prominent?', description)
           return respond(colors)
   Answer: Dark green for the lush vegetation, gray for the rock formations,
           blue for the sky, and white for the wispy clouds.

Q: Is there any water visible?
   Plan  : description = describe_image('sample.jpg')
           result = answer('Is there any water visible?', description)
           return respond(result)
   Answer: Yes, there is a calm body of water running through the bottom
           of the valley.

What to notice

The text model never sees the image. qwen2.5-coder:7b writes the plan and calls answer with a text description. It never receives raw image bytes. The VLM never sees the question; it only describes what is visible. Each model stays within its strengths.
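
One way to check this separation is to replace both models with stand-ins that record what they receive. The sketch below needs no framework and no Ollama. RecordingModel is a hypothetical test double whose generate returns a plain string, whereas the real client returns a response object with a .text attribute.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingModel:
    """Stand-in for an LLM client that records every call it receives.

    Hypothetical test double; not part of opensymbolicai-core.
    """

    reply: str
    calls: list = field(default_factory=list)

    def generate(self, prompt: str, images=None) -> str:
        self.calls.append({"prompt": prompt, "images": images})
        return self.reply


vision = RecordingModel(reply="A valley with a river and pine forest.")
text = RecordingModel(reply="Yes, a river runs through the valley.")

question = "Is there any water visible?"

# Step 1: the vision model gets the image and a fixed prompt, never the question.
description = vision.generate(
    "Describe everything you see in this image in detail.",
    images=["<base64 image data>"],
)

# Step 2: the text model gets the question and the description, never the image.
answer = text.generate(
    f"Image description: {description}\n\nQuestion: {question}"
)

print(answer)
```

Inspecting vision.calls and text.calls afterwards shows which inputs each model saw, which makes the boundary easy to assert in a unit test.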

The VLM runs once and its description is cached. Every plan still calls describe_image, but both OllamaLLM instances are constructed with InMemoryCache() (the excerpts on this page omit that argument), so only the first question triggers a VLM call; the second and third questions get the same description from cache. The VLM is asked only once per unique image path per run.
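
The effect of request-level caching can be sketched with a small wrapper. This is not the implementation of InMemoryCache. CachedModel and CountingModel are hypothetical, and keying the cache on a hash of the prompt plus the image data is an assumption made for this sketch.

```python
import hashlib
import json


class CountingModel:
    """Hypothetical stand-in for a slow vision model; counts real calls."""

    def __init__(self) -> None:
        self.call_count = 0

    def generate(self, prompt: str, images=None) -> str:
        self.call_count += 1
        return f"description #{self.call_count}"


class CachedModel:
    """Wrap a model with a dict keyed by a hash of the request contents."""

    def __init__(self, model) -> None:
        self._model = model
        self._cache = {}

    def generate(self, prompt: str, images=None) -> str:
        # Identical prompt and image data produce the same cache key.
        key_material = json.dumps(
            {"prompt": prompt, "images": images or []}, sort_keys=True
        )
        key = hashlib.sha256(key_material.encode("utf-8")).hexdigest()
        if key not in self._cache:
            # Cache miss: forward the request to the wrapped model.
            self._cache[key] = self._model.generate(prompt, images=images)
        return self._cache[key]


slow = CountingModel()
cached = CachedModel(slow)

# Three identical requests, as in the three questions about one image.
results = [
    cached.generate("Describe the image.", images=["aGVsbG8="])
    for _ in range(3)
]
print(results, slow.call_count)
```

All three results are the same string and the underlying model is called once. A request with different image data misses the cache and triggers a second call.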

Injecting models as constructor arguments is the pattern. Extra models are passed alongside llm=, stored as instance variables, and called directly inside primitives. The planning model (passed as llm) stays responsible for writing plans; it does not know that other models exist.

python
text_llm   = OllamaLLM(model="qwen2.5-coder:7b")
vision_llm = OllamaLLM(model="gemma4:e2b")

agent = ImageAgent(llm=text_llm, vision_llm=vision_llm)

The same pattern extends to any combination: an embedding model, a code model, a reasoning model. Each is injected where needed and called only by the primitive that fits.