Omni / Multimodal

Unified multimodal understanding across text, image, audio, and video with extended context windows and sparse MoE efficiency.

The Zen Omni model family unifies multimodal perception (text, image, audio, and video) in a single sparse mixture-of-experts (MoE) architecture. These models excel at cross-modal reasoning, speech-to-speech translation, visual understanding, and extended-context tasks.

Model Family

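| Model | Params | Context | HF | Paper |
|---|---|---|---|---|
| Zen Omni | 30B (3B active MoE) | 32K | weights | paper |
| Zen Omni 30B Instruct | 35.3B | 128K | weights | paper |
| Zen Omni 30B Thinking | 31.7B | 128K | weights | paper |
| Zen3 Omni | 35.3B (1T MoE) | 202K | weights | paper |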

Quick Start

Using Transformers

Load any Zen Omni model with the Hugging Face Transformers library:

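from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load base multimodal model
model_id = "zenlm/zen-omni"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Prepare image and text inputs
image_path = "path/to/image.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Describe what you see in this image."}
        ]
    }
]

# Render the chat template to a prompt string
text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False
)

# Pass the image together with the prompt; a text-only call would drop it.
# Assumes the processor accepts an `images` argument, as most multimodal processors do.
image = Image.open(image_path).convert("RGB")
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens
output = model.generate(**inputs, max_new_tokens=512)
new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))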

Zen API

For production deployments, use the Zen API endpoint at api.hanzo.ai:

curl -X POST https://api.hanzo.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zen-omni-30b-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image.jpg"
            }
          },
          {
            "type": "text",
            "text": "What is in this image?"
          }
        ]
      }
    ],
    "max_tokens": 512
  }'
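
The same request can be assembled from Python. The sketch below uses only the standard library and mirrors the curl example's URL, headers, and payload. The helper names `build_request` and `send_request` are illustrative, not part of any Zen SDK, and the response format is assumed to be JSON rather than documented here.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.hanzo.ai/v1/chat/completions"


def build_request(api_key: str, image_url: str, question: str,
                  model: str = "zen-omni-30b-instruct",
                  max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl example."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_request(req: urllib.request.Request) -> dict:
    """Send the request and parse the body as JSON (assumed response format)."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `send_request(build_request("YOUR_API_KEY", "https://example.com/image.jpg", "What is in this image?"))` issues the request; keeping construction and sending separate makes the payload easy to inspect or log before any network call.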

Model Variants

Each tier in the Omni family serves different deployment needs:

  • Zen Omni: Base multimodal model with efficient 3B-active sparse MoE and 32K context for standard inference.
  • Zen Omni 30B Instruct: Instruction-tuned variant optimized for chat, Q&A, and task-following with extended 128K context.
  • Zen Omni 30B Thinking: Chain-of-thought reasoning variant for complex problem-solving, math, and code with internal reasoning tokens.
  • Zen3 Omni: Flagship variant with 202K context window, unified audio/video/image/text understanding, and trillion-expert MoE.
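
One way to turn the tiers above into a selection rule is a small lookup keyed on prompt length and whether chain-of-thought reasoning is wanted. This is a hypothetical heuristic, not an official recommendation: `Variant` and `pick_variant` are made-up names, "K" is read as 1,000 tokens, and only the Thinking variant is treated as a reasoning model because it is the only one described that way.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    context_tokens: int
    reasoning: bool


# Context sizes from the model table; "K" is read as 1,000 tokens (an assumption).
VARIANTS = [
    Variant("Zen Omni", 32_000, False),
    Variant("Zen Omni 30B Instruct", 128_000, False),
    Variant("Zen Omni 30B Thinking", 128_000, True),
    Variant("Zen3 Omni", 202_000, False),
]


def pick_variant(prompt_tokens: int, needs_reasoning: bool = False) -> Optional[Variant]:
    """Return the smallest-context variant that fits the prompt, or None if none fits."""
    candidates = [
        v for v in VARIANTS
        if v.context_tokens >= prompt_tokens and (v.reasoning or not needs_reasoning)
    ]
    # min() keeps the first entry on ties, so Instruct wins over Thinking by default.
    return min(candidates, key=lambda v: v.context_tokens, default=None)
```

For example, a 100,000-token prompt with no reasoning requirement resolves to Zen Omni 30B Instruct, while the same prompt with `needs_reasoning=True` resolves to Zen Omni 30B Thinking.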

Capabilities

Multimodal Understanding

  • Text: Support for 119 languages
  • Vision: Image analysis, visual reasoning, OCR across 30+ languages
  • Audio: Speech recognition (19 languages) and speech synthesis (10 languages)
  • Video: Frame-by-frame analysis and temporal reasoning
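
For frame-by-frame video analysis, a client typically has to decide which frames to send. How Zen Omni ingests video is not specified here, so the following is only a sketch of one common preprocessing choice, uniform sampling; `sample_frame_indices` is a hypothetical helper.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick evenly spaced frame indices, taking the midpoint of each segment."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


print(sample_frame_indices(100, 4))  # [12, 37, 62, 87]
```

Sampling segment midpoints rather than segment starts avoids over-weighting the first frame, which is often a title card or black frame.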

Cross-Modal Reasoning

  • Unified representations across modalities
  • Image-to-text generation and visual question-answering
  • Text-guided audio generation
  • Speech-to-speech translation with speaker preservation

Extended Context

  • Zen3 Omni supports 202K context for long-document understanding
  • Efficient sparse activation reduces compute during inference
  • Suitable for document analysis, multi-page transcription, and conversation history
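
To make the sparse-activation point concrete, the arithmetic below uses the base Zen Omni figures from this page (30B total, 3B active). The "2 FLOPs per active parameter per token" figure is a common rule of thumb for a forward pass that ignores attention cost; it is an approximation, not a measured number for these models.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters used for each token."""
    return active_params / total_params


def forward_flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs per active parameter (ignores attention)."""
    return 2.0 * active_params


total, active = 30e9, 3e9  # base Zen Omni: 30B total, 3B active

print(f"active fraction: {active_fraction(total, active):.0%}")  # active fraction: 10%

# Compare against a hypothetical dense model that uses all 30B parameters per token.
speedup = forward_flops_per_token(total) / forward_flops_per_token(active)
print(f"estimated compute reduction vs. dense: {speedup:.0f}x")
```

Under these assumptions, each token touches about a tenth of the weights, so per-token compute is roughly a tenth of an equally sized dense model, while memory still has to hold all 30B parameters.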

Parameters & Context

| Model | Total Params | Active Params | Context |
|---|---|---|---|
| Zen Omni | 30B | 3B (MoE) | 32K |
| Zen Omni 30B Instruct | 35.3B | - | 128K |
| Zen Omni 30B Thinking | 31.7B | - | 128K |
| Zen3 Omni | 35.3B | 1T (MoE) | 202K |

Use Cases

  • Accessibility: Dubbing, subtitle generation, and audio description with multimodal context, plus audio transcription with speaker diarization and visual cues
  • Search: Multimodal retrieval combining text and visual understanding
  • Document Analysis: Long-context understanding of PDFs with embedded images
  • Real-time Agents: Fast inference on edge or server with sparse MoE efficiency
