Omni / Multimodal

Unified multimodal understanding across text, image, audio, and video with extended context windows and sparse MoE efficiency.

The Zen Omni model family unifies multimodal perception—text, image, audio, and video—in a single sparse MoE architecture. These models excel at cross-modal reasoning, speech-to-speech translation, visual understanding, and extended-context tasks.

Model Family

Model	Params	Context	HF	Paper
Zen Omni	30B (3B active MoE)	32K	weights	paper
Zen Omni 30B Instruct	35.3B	128K	weights	paper
Zen Omni 30B Thinking	31.7B	128K	weights	paper
Zen3 Omni	35.3B (1T MoE)	202K	weights	paper

Quick Start

Using Transformers

Load any Zen Omni model with Hugging Face Transformers:

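from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load base multimodal model
model_id = "zenlm/zen-omni"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Prepare image and text inputs
image_path = "path/to/image.jpg"
image = Image.open(image_path)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Describe what you see in this image."}
        ]
    }
]

# Render the chat template to a prompt string
text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False
)

# Pass the image with the text so the model actually receives it
# (assumes the processor accepts `images=`, as most Transformers vision processors do)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

# Generate response and decode only the newly generated tokens
output = model.generate(**inputs, max_new_tokens=512)
new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))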

Zen API

For production deployments, use the Zen API endpoint at api.hanzo.ai:

curl -X POST https://api.hanzo.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zen-omni-30b-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image.jpg"
            }
          },
          {
            "type": "text",
            "text": "What is in this image?"
          }
        ]
      }
    ],
    "max_tokens": 512
  }'
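
The same request can be sketched in Python using only the standard library. The endpoint, model name, headers, and JSON body are taken from the curl example; the helper names (`build_payload`, `build_request`, `describe_image`) are hypothetical, and the code assumes the response is a JSON document without assuming anything about its fields.

```python
import json
import urllib.request

# Endpoint taken from the curl example above
API_URL = "https://api.hanzo.ai/v1/chat/completions"


def build_payload(image_url, prompt, model="zen-omni-30b-instruct", max_tokens=512):
    """Build the same JSON body as the curl example."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def build_request(payload, api_key):
    """Wrap the payload in a POST request with the headers used by curl."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def describe_image(image_url, prompt, api_key):
    """Send the request and return the parsed JSON response (shape not assumed)."""
    request = build_request(build_payload(image_url, prompt), api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

Calling `describe_image("https://example.com/image.jpg", "What is in this image?", "YOUR_API_KEY")` issues the same request as the curl command. Keeping payload construction separate from the network call makes the body easy to inspect or log before sending.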

Model Variants

Each tier in the Omni family serves different deployment needs:

Zen Omni: Base multimodal model with efficient 3B-active sparse MoE and 32K context for standard inference.
Zen Omni 30B Instruct: Instruction-tuned variant optimized for chat, Q&A, and task-following with extended 128K context.
Zen Omni 30B Thinking: Chain-of-thought reasoning variant for complex problem-solving, math, and code with internal reasoning tokens.
Zen3 Omni: Flagship variant with a 202K context window, unified audio/video/image/text understanding, and a 1T MoE architecture.

Capabilities

Multimodal Understanding

Text: Support for 119 languages
Vision: Image analysis, visual reasoning, OCR across 30+ languages
Audio: Speech recognition (19 languages) and speech synthesis (10 languages)
Video: Frame-by-frame analysis and temporal reasoning

Cross-Modal Reasoning

Unified representations across modalities
Image-to-text generation and visual question-answering
Text-guided audio generation
Speech-to-speech translation with speaker preservation

Extended Context

Zen3 Omni supports 202K context for long-document understanding
Efficient sparse activation reduces compute during inference
Suitable for document analysis, multi-page transcription, and conversation history
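
As a rough illustration of choosing a model by context length, the sketch below checks which variants can hold a prompt plus the requested output. The window sizes come from the model table; reading "K" as 1,000 tokens is an assumption, and the helper name `models_that_fit` is hypothetical.

```python
# Context windows from the model table, in tokens.
# Assumption: "K" is read as 1,000 tokens; the exact limits may differ.
CONTEXT_WINDOWS = {
    "Zen Omni": 32_000,
    "Zen Omni 30B Instruct": 128_000,
    "Zen Omni 30B Thinking": 128_000,
    "Zen3 Omni": 202_000,
}


def models_that_fit(prompt_tokens, max_new_tokens=512):
    """Return the models whose window holds the prompt plus generated tokens."""
    needed = prompt_tokens + max_new_tokens
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]


print(models_that_fit(150_000))  # ['Zen3 Omni']
```

Under these assumptions, a 150,000-token document with 512 output tokens only fits Zen3 Omni, while a 10,000-token prompt fits every variant.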

Parameters & Context

Model	Total Params	Active Params	Context
Zen Omni	30B	3B (MoE)	32K
Zen Omni 30B Instruct	35.3B	—	128K
Zen Omni 30B Thinking	31.7B	—	128K
Zen3 Omni	35.3B	1T (MoE)	202K
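
To make the sparse-activation figure concrete, the short sketch below computes the share of parameters active per token for the base Zen Omni row (30B total, 3B active). The helper name `active_fraction` is hypothetical, and the result is only a proxy for compute savings, since real cost also depends on attention, routing, and memory traffic.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of parameters used per token, given sizes in billions."""
    return active_params_b / total_params_b


# Zen Omni: 30B total, 3B active per token
print(f"{active_fraction(30, 3):.0%}")  # 10%
```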

Use Cases

Accessibility: Dubbing, subtitle generation, audio description with multimodal context
Search: Multimodal retrieval combining text and visual understanding
Accessibility AI: Audio transcription with speaker diarization and visual cues
Document Analysis: Long-context understanding of PDFs with embedded images
Real-time Agents: Fast inference on edge or server with sparse MoE efficiency
