Omni / Multimodal
Unified multimodal understanding across text, image, audio, and video with extended context windows and sparse MoE efficiency.
The Zen Omni model family unifies multimodal perception (text, image, audio, and video) in a single sparse MoE architecture. The models target cross-modal reasoning, speech-to-speech translation, visual understanding, and extended-context tasks.
Model Family
| Model | Params | Context | HF | Paper |
|---|---|---|---|---|
| Zen Omni | 30B (3B active MoE) | 32K | weights | paper |
| Zen Omni 30B Instruct | 35.3B | 128K | weights | paper |
| Zen Omni 30B Thinking | 31.7B | 128K | weights | paper |
| Zen3 Omni | 35.3B (1T MoE) | 202K | weights | paper |
Quick Start
Using Transformers
Load any Zen Omni model with Hugging Face Transformers:
```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load base multimodal model
model_id = "zenlm/zen-omni"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Prepare image and text inputs
image_path = "path/to/image.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Describe what you see in this image."},
        ],
    }
]

# Render the chat template to a prompt string
text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)

# The processor needs the pixel data as well as the prompt. This assumes it
# accepts an `images` argument, as most vision-language processors do.
image = Image.open(image_path).convert("RGB")
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

# Generate response
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
Zen API
For production deployments, use the Zen API endpoint at api.hanzo.ai:
```bash
curl -X POST https://api.hanzo.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zen-omni-30b-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image.jpg"
            }
          },
          {
            "type": "text",
            "text": "What is in this image?"
          }
        ]
      }
    ],
    "max_tokens": 512
  }'
```
Model Variants
Each tier in the Omni family serves different deployment needs:
- Zen Omni: Base multimodal model with efficient 3B-active sparse MoE and 32K context for standard inference.
- Zen Omni 30B Instruct: Instruction-tuned variant optimized for chat, Q&A, and task-following with extended 128K context.
- Zen Omni 30B Thinking: Chain-of-thought reasoning variant for complex problem-solving, math, and code with internal reasoning tokens.
- Zen3 Omni: Flagship variant with a 202K context window, unified audio/video/image/text understanding, and the 1T MoE configuration listed in the model table.
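One way to turn the tiers above into a deployment decision is a small lookup keyed on required context length and tuning style. The sketch below is illustrative only: the `Variant` class, the `pick_variant` helper, and the `tuned_for` labels are hypothetical names introduced here, and the context sizes are copied from the Model Family table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    context_k: int   # context window in thousands of tokens, from the table
    tuned_for: str   # illustrative label, not an official field


# Values taken from the Model Family table; labels are assumptions.
VARIANTS = [
    Variant("Zen Omni", 32, "base"),
    Variant("Zen Omni 30B Instruct", 128, "chat"),
    Variant("Zen Omni 30B Thinking", 128, "reasoning"),
    Variant("Zen3 Omni", 202, "flagship"),
]


def pick_variant(min_context_k: int, tuned_for: Optional[str] = None) -> Variant:
    """Return the smallest-context variant that meets the requirements."""
    candidates = [
        v for v in VARIANTS
        if v.context_k >= min_context_k
        and (tuned_for is None or v.tuned_for == tuned_for)
    ]
    if not candidates:
        raise ValueError("no variant satisfies the requested context and tuning")
    # min() keeps the first match on ties, so table order breaks ties.
    return min(candidates, key=lambda v: v.context_k)


print(pick_variant(16).name)
print(pick_variant(100, tuned_for="reasoning").name)
print(pick_variant(150).name)
```

Preferring the smallest sufficient context window is a design choice made for this sketch; a real deployment would also weigh latency, cost, and task quality.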
Capabilities
Multimodal Understanding
- Text: Support for 119 languages
- Vision: Image analysis, visual reasoning, OCR across 30+ languages
- Audio: Speech recognition (19 languages) and speech synthesis (10 languages)
- Video: Frame-by-frame analysis and temporal reasoning
Cross-Modal Reasoning
- Unified representations across modalities
- Image-to-text generation and visual question-answering
- Text-guided audio generation
- Speech-to-speech translation with speaker preservation
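For visual question-answering, the request body has the same shape as the curl call in the Zen API section. The following is a minimal Python sketch of that request using only the standard library. The helper names `build_vqa_payload` and `send` are hypothetical, and the response is returned as parsed JSON without assuming its schema.

```python
import json
import urllib.request

API_URL = "https://api.hanzo.ai/v1/chat/completions"  # endpoint from the Zen API section


def build_vqa_payload(image_url: str, question: str,
                      model: str = "zen-omni-30b-instruct",
                      max_tokens: int = 512) -> dict:
    """Build a chat-completions body with one image part and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def send(payload: dict, api_key: str) -> dict:
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_vqa_payload("https://example.com/image.jpg", "What is in this image?")
print(json.dumps(payload, indent=2))
```

Calling `send(payload, "YOUR_API_KEY")` would issue the request; it is left out of the sketch so the block runs without network access or credentials.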
Extended Context
- Zen3 Omni supports 202K context for long-document understanding
- Efficient sparse activation reduces compute during inference
- Suitable for document analysis, multi-page transcription, and conversation history
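Even with a long context window, conversation history eventually has to be trimmed to fit. The sketch below shows one simple approach: keep the most recent messages that fit a token budget. It assumes text-only messages and a rough 4-characters-per-token heuristic, which is a placeholder rather than the model's tokenizer, and both function names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; swap in the real tokenizer for accurate counts."""
    return math.ceil(len(text) / chars_per_token)


def trim_history(messages: list, budget_tokens: int,
                 reserve_for_reply: int = 512) -> list:
    """Keep the newest messages whose estimated size fits within the budget."""
    remaining = budget_tokens - reserve_for_reply
    kept = []
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 40},
]
# 202K is the Zen3 Omni context size from the table, read here as 202,000 tokens.
print(len(trim_history(history, budget_tokens=202_000)))
print(len(trim_history(history, budget_tokens=700)))
```

Reserving tokens for the reply keeps the prompt from consuming the whole window; the 512 default mirrors the `max_tokens` value used in the API example.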
Parameters & Context
| Model | Total Params | Active Params | Context |
|---|---|---|---|
| Zen Omni | 30B | 3B (MoE) | 32K |
| Zen Omni 30B Instruct | 35.3B | — | 128K |
| Zen Omni 30B Thinking | 31.7B | — | 128K |
| Zen3 Omni | 35.3B | 1T (MoE) | 202K |
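The total and active columns answer different questions: in a sparse MoE, all weights generally have to be resident in memory, while only the active subset is used for each token. The sketch below works through that arithmetic for the rows whose totals are unambiguous. It counts weights only (no KV cache or activations), assumes the listed dtype sizes, and uses hypothetical helper names. Zen3 Omni is omitted because the table's 35.3B and 1T figures leave its total weight count unclear.

```python
# Bytes per parameter for common weight dtypes.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

# Total parameters in billions, copied from the table above.
TOTAL_PARAMS_B = {
    "Zen Omni": 30.0,
    "Zen Omni 30B Instruct": 35.3,
    "Zen Omni 30B Thinking": 31.7,
}


def weight_memory_gb(total_params_b: float, dtype: str = "bf16") -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    # (billions of params * 1e9) * bytes per param / 1e9 bytes per GB
    return total_params_b * BYTES_PER_PARAM[dtype]


def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of the weights used per token in a sparse MoE."""
    return active_params_b / total_params_b


for name, total in TOTAL_PARAMS_B.items():
    print(f"{name}: ~{weight_memory_gb(total):.1f} GB of weights in bf16")

# Zen Omni activates 3B of its 30B parameters per token.
print(f"Zen Omni active fraction: {active_fraction(3.0, 30.0):.0%}")
```

For Zen Omni this gives roughly 60 GB of bf16 weights with about 10% of them active per token, which is why sparse activation reduces compute but not the memory needed to hold the model.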
Use Cases
- Accessibility: Dubbing, subtitle generation, audio description with multimodal context
- Search: Multimodal retrieval combining text and visual understanding
- Accessibility AI: Audio transcription with speaker diarization and visual cues
- Document Analysis: Long-context understanding of PDFs with embedded images
- Real-time Agents: Fast inference on edge or server with sparse MoE efficiency