Vision-Language

Zen vision-language models for multimodal image and text understanding, OCR, visual reasoning, and agentic tasks, scaling from 0.8B on-device models to 235B frontier models.

The Zen Vision-Language family spans edge deployment (0.8B) through frontier multimodal reasoning (235B MoE). Each model accepts image and text inputs, with specialized variants for instruction-following, agentic function-calling, and design generation. All models are built on the Zen MoDE (Mixture of Distilled Experts) architecture and released under the Apache 2.0 license.

Model Family

Model	Params	Context	HF	Paper
Zen 5	35B total / 3B active (MoE)	256K	weights	paper
Zen 5 Nano 0.8B	0.87B (dense)	—	weights	paper
Zen 5 Nano 2B	2.27B (dense)	—	weights	paper
Zen 5 Nano 4B (GGUF)	4.66B	—	weights	paper
Zen 5 Nano 9B (GGUF)	9.65B	—	weights	paper
Zen 3 VL	8.77B (30B MoE)	262K	weights	paper
Zen VL 4B Instruct	4B	32K	weights	paper
Zen VL 4B Agent	4B	32K	weights	paper
Zen VL 8B Instruct	8B	32K	weights	paper
Zen VL 8B Agent	8B	32K	weights	paper
Zen VL 30B Instruct	30B	256K	weights	paper
Zen VL 30B Agent	30B	256K	weights	paper
Zen Designer	235B total / 22B active (MoE)	128K	weights	paper
Zen Designer (GGUF)	235B total / 22B active (MoE)	256K	weights	paper
Zen Designer 235B A22B Instruct	236B	131K	weights	paper
Zen Designer 235B A22B Thinking	236B	131K	weights	paper

Quick Start

Load and run Zen VL 4B Instruct with Hugging Face Transformers:

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

model_id = "zenlm/zen-vl-4b-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Load an image
image = Image.open("photo.jpg")

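# Build multimodal input. The image placeholder in "content" tells the chat
# template where to put the image tokens. This list-of-parts format is the
# common Transformers convention; it is assumed to apply to this model's template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]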

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt"
).to(model.device)

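# Generate response
outputs = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so only the newly generated text is decoded
# (assumes a decoder-only model whose output begins with the prompt).
generated = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)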

API Access

Zen vision-language models are also available via the OpenAI-compatible Zen API endpoint at api.hanzo.ai. Use any Zen VL model ID directly with the OpenAI Python client or equivalent:

from openai import OpenAI

client = OpenAI(base_url="https://api.hanzo.ai/v1", api_key="your-api-key")

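response = client.chat.completions.create(
    model="zen-vl-4b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                # Placeholder URL; replace with a publicly reachable image.
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ]
)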

print(response.choices[0].message.content)
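
To send a local file instead of a hosted image, one option in the OpenAI-compatible message format is to embed the image as a base64 data URL. The sketch below builds such a message. The helper name `image_message` is hypothetical, and whether the Zen endpoint accepts data URLs is an assumption that should be checked against the API documentation.

```python
import base64
import mimetypes
from pathlib import Path


def image_message(path: str, prompt: str) -> dict:
    """Build one user message that carries a local image as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

The result can be passed as `messages=[image_message("photo.jpg", "What is in this image?")]` to `client.chat.completions.create`.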

Model Selection

Edge (0.8B–2B): Zen 5 Nano series for phones, embedded devices, Raspberry Pi
Balanced (4B–8B): Zen VL 4B/8B for laptops, on-premises, VRAM-constrained servers
Frontier Reasoning (30B): Zen VL 30B for high-throughput reasoning, long-context documents
Design & VQA (235B): Zen Designer series for visual generation, multimodal design tasks
OCR & Vision (8.77B MoE): Zen 3 VL for text extraction (32 languages), visual grounding
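
When matching a tier to hardware, a rough first check is the memory needed for the weights alone: parameter count times bytes per parameter. The sketch below applies that arithmetic to total parameter counts from the Model Family table. The function name and the dtype table are illustrative assumptions, and the estimate ignores activations, the KV cache, the vision encoder's working memory, and quantization overhead, so real usage will be higher.

```python
# Bytes per parameter for common storage formats (int4 ignores scale/zero-point overhead).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, dtype: str = "bf16") -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]


# Total parameter counts taken from the Model Family table.
for name, params in [("Zen 5 Nano 0.8B", 0.87), ("Zen 5 Nano 2B", 2.27), ("Zen 5", 35.0)]:
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB in bf16")
```

For MoE models the total parameter count is the relevant one here, since every expert must be resident in memory even though only a few run per token.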

Architecture

All Zen vision-language models use the Zen MoDE (Mixture of Distilled Experts) architecture, a sparse conditional-computation framework that activates only a subset of parameters for each token at inference time. This keeps per-token compute low enough for consumer GPUs while aiming to preserve frontier-class reasoning quality.

Key features:

Sparse MoE: Only the active parameters are computed per token (e.g., 3B active of 35B total in Zen 5)
Multimodal: Unified vision and language representation
Long context: Up to 262K tokens, depending on the model (expandable to 1M)
Tool-calling: Agent and instruct variants support agentic function-calling
Multilingual: Broad language coverage and 32-language OCR
Apache 2.0: Open-source with permissive licensing
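
To make the sparse MoE idea concrete, here is a minimal sketch of generic top-k expert routing, in which a router scores all experts but only the k best are evaluated for a given token. This is a textbook illustration, not the Zen MoDE routing scheme, whose details are not described here. All names (`top_k_route`, `moe_layer`, the toy experts) are hypothetical.

```python
import math


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    chosen = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x: float, experts, router_logits: list[float], k: int = 2) -> float:
    """Run only the selected experts and mix their outputs by router weight."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy "experts"; only two of them are evaluated for this token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
print(moe_layer(3.0, experts, router_logits=[0.1, 2.0, 2.0, -1.0], k=2))
```

In a real model the experts are feed-forward sub-networks and the router logits come from a learned projection of the token's hidden state. The compute saving comes from skipping the unselected experts entirely.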