Two models, one agent
Use a vision model and a text model together in a single agent. The VLM describes the image; the text model reasons about what it saw.
Before you start
Not every question can be answered by one model. Some questions need a vision model to read an image, yet that same model may not reason well in text. This tutorial wires two local models into a single agent: a VLM that sees, and a text model that thinks.
Why two models?
A text model like qwen2.5-coder:7b plans well and produces clean answers,
but it cannot accept image input. A vision model like gemma4:e2b can
describe images in detail, but it is weaker at structured reasoning.
Splitting the work by capability is more reliable than asking one model to do both. The text model writes the plan and produces the final answer; the VLM's only job is to describe what it sees.
The agent
ImageAgent takes two LLM instances as constructor arguments. The planning
model is passed to super().__init__ as llm; the vision model is stored
as self._vision_llm. Each primitive calls whichever model fits the task.
# image_agent.py
import base64

# PlanExecute, primitive, and LLM come from opensymbolicai-core
# (their import lines are not shown in this excerpt).


class ImageAgent(PlanExecute):
    def __init__(self, llm: LLM, vision_llm: LLM) -> None:
        # The planning/text model goes to the base class; the VLM is kept separately.
        super().__init__(llm=llm)
        self._vision_llm = vision_llm

    @primitive(read_only=True)
    def describe_image(self, path: str) -> str:
        """Use the vision model to describe the contents of an image file.

        Example: describe_image("photo.jpg")
        -> "A landscape with mountains and a river..."
        """
        # Read the file and base64-encode it for the request.
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        return self._vision_llm.generate(
            "Describe everything you see in this image in detail.",
            images=[b64],
        ).text.strip()

    @primitive(read_only=True)
    def answer(self, question: str, description: str) -> str:
        """Use the text model to answer a specific question from a description.

        Example: answer("What colors are dominant?", "A mountain scene with...")
        -> "Green and grey are the dominant colors."
        """
        prompt = (
            f"Image description: {description}\n\n"
            f"Question: {question}\n"
            f"Answer in one or two sentences."
        )
        return self._llm.generate(prompt).text.strip()

    @primitive(read_only=True)
    def respond(self, message: str) -> str:
        """Return the message as the final answer."""
        return message

The images=[b64] keyword passes through **kwargs in _generate_impl
directly to the Ollama request payload. The same OllamaLLM class works
for text and vision models; Ollama routes the payload to the right model.
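To make the pass-through concrete, here is a minimal sketch of how an images keyword could end up in a request body. It is not the library's _generate_impl: build_generate_payload is a hypothetical helper written for illustration. The one external fact it leans on is that Ollama's /api/generate endpoint accepts an images list of base64-encoded strings next to model and prompt.

```python
import base64
import json


def build_generate_payload(model: str, prompt: str, **kwargs) -> dict:
    """Hypothetical helper: merge extra keyword arguments into the request body."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    # Anything the caller adds (for example images=[...]) is copied through unchanged.
    payload.update(kwargs)
    return payload


# Stand-in bytes; a real call would read them from an image file.
fake_image_bytes = b"\x89PNG placeholder bytes"
b64 = base64.b64encode(fake_image_bytes).decode()

vision_payload = build_generate_payload(
    "gemma4:e2b",
    "Describe everything you see in this image in detail.",
    images=[b64],
)
text_payload = build_generate_payload(
    "qwen2.5-coder:7b",
    "Summarize the description.",
)

# Both payloads are plain JSON; only the vision one carries an "images" key.
print(sorted(json.loads(json.dumps(vision_payload))))
print("images" in text_payload)
```

The same helper serves both calls, which mirrors the point above: the client code does not change between text and vision requests, only the model name and the optional images field do.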
Install
uv add opensymbolicai-core
ollama pull qwen2.5-coder:7b
ollama pull gemma4:e2b

A sample image is downloaded automatically on first run.
Run it
uv run main.py

Three questions are asked about the same image:
text_llm = OllamaLLM(model="qwen2.5-coder:7b")
vision_llm = OllamaLLM(model="gemma4:e2b")
for question in QUESTIONS:
    agent = ImageAgent(llm=text_llm, vision_llm=vision_llm)
    result = agent.run(f"Look at {SAMPLE_IMAGE}. {question}")

Sample output
Text model : qwen2.5-coder:7b
Vision model: gemma4:e2b
Image : sample.jpg
==================================================
Q: What is in this image?
Plan : description = describe_image('sample.jpg')
       result = answer('What is in this image?', description)
       return respond(result)
Answer: A panoramic view of a rugged, mountainous landscape with a deep
        valley, dense forestation, and clear bright lighting.

Q: What colors are most prominent?
Plan : description = describe_image('sample.jpg')
       colors = answer('What colors are most prominent?', description)
       return respond(colors)
Answer: Dark green for the lush vegetation, gray for the rock formations,
        blue for the sky, and white for the wispy clouds.

Q: Is there any water visible?
Plan : description = describe_image('sample.jpg')
       result = answer('Is there any water visible?', description)
       return respond(result)
Answer: Yes, there is a calm body of water running through the bottom
        of the valley.

What to notice
The text model never sees the image. qwen2.5-coder:7b writes the plan
and calls answer with a text description. It never receives raw image
bytes. The VLM never sees the question; it only describes what is visible.
Each model stays within its strengths.
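The separation can be checked with stand-in models that record what they receive. This is a sketch only: RecordingModel is a hypothetical stub, not an opensymbolicai-core class, and the two calls are made by hand rather than through a generated plan.

```python
from typing import Optional


class RecordingModel:
    """Hypothetical stand-in for an LLM client; records every call it receives."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.calls: list = []

    def generate(self, prompt: str, images: Optional[list] = None) -> str:
        self.calls.append({"prompt": prompt, "images": images})
        return self.reply


vision = RecordingModel("A valley with a river and forested slopes.")
text = RecordingModel("Yes, a river runs through the valley.")

question = "Is there any water visible?"

# Step 1: only the vision model receives image data, and it never sees the question.
description = vision.generate(
    "Describe everything you see in this image in detail.",
    images=["<base64 image data>"],
)

# Step 2: only the text model receives the question, together with the description.
answer = text.generate(
    f"Image description: {description}\n\n"
    f"Question: {question}\n"
    "Answer in one or two sentences."
)

print(answer)
print(text.calls[0]["images"])  # prints None: no image data reached the text model
```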
describe_image reaches the VLM once; after that it is served from cache.
Both OllamaLLM instances are constructed with InMemoryCache(). Every plan
calls describe_image, but only the first question triggers a VLM call; the
second and third questions get the same description from cache.
The VLM is asked only once per unique image path per run.
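The effect of caching can be illustrated with a small stand-in. CachedModel below is hypothetical and is not the library's InMemoryCache; it assumes the cache key is built from the model name, the prompt, and the image data, which may differ from what the real cache does.

```python
import hashlib
import json


class CachedModel:
    """Hypothetical wrapper: memoizes generate() on model name, prompt, and images."""

    def __init__(self, model: str, backend) -> None:
        self.model = model
        self._backend = backend  # callable that does the real (slow) work
        self._cache: dict = {}
        self.backend_calls = 0

    def _key(self, prompt: str, images: list) -> str:
        # Assumed key layout: a hash over everything that defines the request.
        raw = json.dumps([self.model, prompt, images])
        return hashlib.sha256(raw.encode()).hexdigest()

    def generate(self, prompt: str, images=None) -> str:
        images = images or []
        key = self._key(prompt, images)
        if key not in self._cache:
            self.backend_calls += 1
            self._cache[key] = self._backend(prompt, images)
        return self._cache[key]


def fake_vlm(prompt: str, images: list) -> str:
    """Stand-in for the vision model; returns a fixed-format description."""
    return f"description of {len(images)} image(s)"


vlm = CachedModel("gemma4:e2b", fake_vlm)
prompt = "Describe everything you see in this image in detail."

# Three questions about the same image produce three identical requests.
for _ in range(3):
    description = vlm.generate(prompt, images=["<same base64 data>"])

print(vlm.backend_calls)  # prints 1: the second and third requests hit the cache
```

Because the cache lives on the model object, it survives even when a fresh agent is built for each question, as long as the same model instances are passed in.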
Injecting models as constructor arguments is the pattern. Extra models
go in alongside llm=, are stored as instance variables, and are called
directly inside primitives. The planning model (passed as llm) stays
responsible for writing plans; it does not know that other models exist.
text_llm = OllamaLLM(model="qwen2.5-coder:7b")
vision_llm = OllamaLLM(model="gemma4:e2b")
agent = ImageAgent(llm=text_llm, vision_llm=vision_llm)

The same pattern extends to any combination: an embedding model, a code model, a reasoning model. Each is injected where needed and called only by the primitive that fits.
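As a sketch of that extension, the class below injects a third, hypothetical embedding model next to the other two. It uses plain Python stand-ins instead of PlanExecute and @primitive so that it runs on its own. FakeEmbedder, MultiModelAgent, and similarity are illustrative names, not part of opensymbolicai-core.

```python
import math


class FakeEmbedder:
    """Hypothetical embedding model: maps text to a letter-frequency vector."""

    def embed(self, text: str) -> list:
        counts = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1.0
        return counts


class MultiModelAgent:
    """Plain-Python stand-in for an agent that holds several injected models."""

    def __init__(self, llm, vision_llm, embedder) -> None:
        self._llm = llm                # planning / text model
        self._vision_llm = vision_llm  # used only for image description
        self._embedder = embedder      # used only by similarity()

    def similarity(self, a: str, b: str) -> float:
        """Cosine similarity between two texts, computed by the embedding model."""
        va = self._embedder.embed(a)
        vb = self._embedder.embed(b)
        dot = sum(x * y for x, y in zip(va, vb))
        norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
        return dot / norm if norm else 0.0


# The text and vision models are placeholders; only the embedder is exercised here.
agent = MultiModelAgent(llm=None, vision_llm=None, embedder=FakeEmbedder())
print(agent.similarity("abc", "xyz"))  # prints 0.0: the texts share no letters
```

In a real agent the similarity method would be one more primitive, and the planning model would call it by name without knowing which model sits behind it.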