Chat & Reasoning
Zen chat and reasoning models ranging from 0.6B edge deployments to frontier 1M-context reasoning systems.
Chat & Reasoning Models
The Zen Chat & Reasoning family provides a complete spectrum of language models for conversational AI and multi-step reasoning, from ultra-lightweight edge models (0.6B) to frontier reasoning systems (1M context). Built on the Zen MoDE architecture (Mixture of Diverse/Distilled Experts), these models deliver strong performance across general chat, specialized reasoning, and domain-specific tasks.
Model Gallery
| Model | Parameters | Context | Weights | Paper |
|---|---|---|---|---|
| Zen 5 | — | 1M | weights | paper |
| Zen 5 Pro | — | 512K | weights | paper |
| Zen 5 Pro (GGUF) | 284B / 37B active | 1M | weights | paper |
| Zen 5 Mini | — | 256K | weights | paper |
| Zen 5 Flash | 4.02B | 32K | weights | paper |
| Zen 5 Max | MoE (IQ2_XXS quant) | 1M | weights | paper |
| Zen Pro | 8.19B | 128K | weights | paper |
| Zen Blog | 8.19B | 128K | weights | paper |
| Zen Multilingual | 8B | 128K | weights | paper |
| Zen3 Nano | 8.19B | 40K | weights | paper |
| Zen Scribe | 2.35B | 32K | weights | paper |
| Zen Eco 4B Instruct | 4.02B | 32K | weights | paper |
| Zen Eco 4B Thinking | 4.02B | 32K | weights | paper |
| Zen Eco Instruct | 4B | 32K | weights | paper |
| Zen Eco | 0.75B | 32K | weights | paper |
| Zen Nano 0.6B | 0.6B | 32K | weights | paper |
| Zen Nano | 0.6B | 32K | weights | paper |
Quick Start
Local Inference with Transformers
Use any model from the table above. Here's an example with Zen Eco 4B Instruct:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zenlm/zen-eco-4b-instruct"

# Load the tokenizer and model; "auto" picks the checkpoint dtype and available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Render the conversation with the model's chat template.
messages = [{"role": "user", "content": "Explain quantum computing in one sentence."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generate, then decode only the newly generated tokens (skip the prompt).
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

Using the Zen API
For production deployments, use the OpenAI-compatible API at api.hanzo.ai:

```bash
curl https://api.hanzo.ai/v1/chat/completions \
  -H "Authorization: Bearer $HANZO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zen-eco-4b-instruct",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    "max_tokens": 512
  }'
```

Get your API key at console.hanzo.ai (includes $5 free credit on signup).
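
The same request can be built from Python with only the standard library. This is a minimal sketch that mirrors the endpoint, headers, and payload of the curl example; `build_chat_request` and `send` are hypothetical helper names, and the response is assumed to follow the OpenAI chat-completions shape (`choices[0].message.content`), which this page does not spell out.

```python
import json
import os
import urllib.request

API_URL = "https://api.hanzo.ai/v1/chat/completions"  # endpoint from the curl example


def build_chat_request(prompt, model="zen-eco-4b-instruct", max_tokens=512, api_key=None):
    """Build (but do not send) a chat-completions request. Hypothetical helper."""
    # Fall back to the HANZO_API_KEY environment variable used in the curl example.
    api_key = api_key or os.environ.get("HANZO_API_KEY", "")
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the reply text.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (performs a real network call, so it is left commented out):
# print(send(build_chat_request("Explain quantum computing in one sentence.")))
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without network access.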
Model Selection Guide
For Edge & Mobile: Choose Zen Nano (0.6B) or Zen Eco (0.75–4B) for on-device deployment with minimal latency.
For Speed & Throughput: Zen 5 Flash (4.02B, 32K) delivers sub-100 ms latency from first token to finish, suited to high-volume routing.
For Reasoning & Depth: Zen 5 Pro (512K context) or Zen 5 (1M context) for complex multi-step tasks and long-document understanding.
For Content Generation: Zen Blog (8.19B) and Zen Scribe (2.35B) are tuned for structured writing and article generation.
For Multilingual: Zen Multilingual (8B, 128K) covers 100+ languages with strong cross-lingual understanding.
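
The guide above can be expressed as a small lookup. The sketch below is illustrative only: the use-case keys (`"edge"`, `"throughput"`, and so on) and the `recommend` function are hypothetical, the model names and context lengths are copied from the gallery table, and 1M is treated as 1000K for comparison purposes.

```python
# Use-case keys are hypothetical labels; names and context sizes (in K tokens)
# are taken from the model gallery table. 1M is assumed to mean 1000K here.
RECOMMENDATIONS = {
    "edge": [("Zen Nano", 32), ("Zen Eco", 32)],
    "throughput": [("Zen 5 Flash", 32)],
    "reasoning": [("Zen 5 Pro", 512), ("Zen 5", 1000)],
    "content": [("Zen Blog", 128), ("Zen Scribe", 32)],
    "multilingual": [("Zen Multilingual", 128)],
}


def recommend(use_case, min_context_k=0):
    """Return model names for a use case whose context is at least min_context_k."""
    key = use_case.strip().lower()
    if key not in RECOMMENDATIONS:
        raise ValueError(f"unknown use case: {use_case!r}")
    return [name for name, context_k in RECOMMENDATIONS[key] if context_k >= min_context_k]


print(recommend("reasoning"))
print(recommend("content", min_context_k=100))
```

Filtering on context length matters mostly for the reasoning and content tiers, where the listed models differ in how much input they accept.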
Architecture
All Zen chat models are built on Zen MoDE, a modern Mixture of Experts architecture featuring:
- Sparse Activation: Active parameters scale independently of total capacity
- Extended Context: Up to 1M tokens for frontier models
- Grouped Query Attention (GQA): Efficient inference without sacrificing quality
- Multilingual: Strong performance across 100+ languages
- Apache 2.0 License: Download, fine-tune, and deploy commercially
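
Two of these features lend themselves to quick arithmetic. The sketch below computes the share of parameters active per token for the one row in the gallery that lists both figures (Zen 5 Pro GGUF, 284B total / 37B active), and compares KV-cache size with and without grouped query attention. The layer count, head counts, and head dimension are hypothetical values chosen for illustration, not published Zen configurations.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of parameters used per token in a sparse (MoE) model."""
    return active_params_b / total_params_b


def kv_cache_gib(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache size in GiB for one sequence.

    The leading factor of 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to 16-bit storage such as bfloat16.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / 2**30


# Sparse activation: figures from the Zen 5 Pro (GGUF) table row.
print(f"active share: {active_fraction(284, 37):.1%}")

# GQA: hypothetical config (32 layers, 32 query heads, head_dim 128, 32K tokens).
full_mha = kv_cache_gib(seq_len=32768, n_layers=32, n_kv_heads=32, head_dim=128)
gqa = kv_cache_gib(seq_len=32768, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"KV cache with 32 KV heads: {full_mha} GiB, with 8 KV heads: {gqa} GiB")
```

Under these assumed dimensions, sharing each KV head across four query heads cuts the cache to a quarter, which is where GQA's inference savings come from at long context.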
Format Support
All models are available in multiple formats for flexible deployment:
- SafeTensors (primary) — Full precision (bfloat16) for training and fine-tuning
- GGUF (quantized) — Q4_K_M, Q5_K_M, Q8_0, F16 for CPU and edge inference
- MLX (Apple Silicon) — Metal-accelerated inference on M1/M2/M3 chips
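
To get a feel for which format fits a given device, weight memory can be estimated from parameter count and bits per weight. This is a rough sketch: the 16-bit and Q8_0 figures follow from the formats themselves, while the Q4_K_M and Q5_K_M values are approximate averages commonly reported for llama.cpp quantization and are assumptions here, since actual file sizes vary by model. The estimate ignores the KV cache and runtime overhead.

```python
# Approximate bits per weight. The K-quant values are assumptions (rough averages);
# real GGUF files mix tensor types, so actual sizes differ somewhat.
BITS_PER_WEIGHT = {
    "bfloat16": 16.0,
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}


def weights_gb(params_billion, fmt):
    """Estimated weight size in GB (10^9 bytes) for a given format."""
    if fmt not in BITS_PER_WEIGHT:
        raise ValueError(f"unknown format: {fmt!r}")
    return params_billion * BITS_PER_WEIGHT[fmt] / 8


# Example: a 4.02B-parameter model such as Zen Eco 4B Instruct.
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>8}: ~{weights_gb(4.02, fmt):.2f} GB")
```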