
Chat & Reasoning Models

The Zen Chat & Reasoning family covers conversational AI and multi-step reasoning, from 0.6B-parameter edge models to frontier reasoning systems with 1M-token context. The models are built on the Zen MoDE architecture (Mixture of Diverse/Distilled Experts) and target general chat, specialized reasoning, and domain-specific tasks.

Model Gallery

Model	Parameters	Context	Weights	Paper
Zen 5	—	1M	weights	paper
Zen 5 Pro	—	512K	weights	paper
Zen 5 Pro (GGUF)	284B / 37B active	1M	weights	paper
Zen 5 Mini	—	256K	weights	paper
Zen 5 Flash	4.02B	32K	weights	paper
Zen 5 Max	MoE (IQ2_XXS quant)	1M	weights	paper
Zen Pro	8.19B	128K	weights	paper
Zen Blog	8.19B	128K	weights	paper
Zen Multilingual	8B	128K	weights	paper
Zen3 Nano	8.19B	40K	weights	paper
Zen Scribe	2.35B	32K	weights	paper
Zen Eco 4B Instruct	4.02B	32K	weights	paper
Zen Eco 4B Thinking	4.02B	32K	weights	paper
Zen Eco Instruct	4B	32K	weights	paper
Zen Eco	0.75B	32K	weights	paper
Zen Nano 0.6B	0.6B	32K	weights	paper
Zen Nano	0.6B	32K	weights	paper

Quick Start

Local Inference with Transformers

Use any model from the table above. Here's an example with Zen Eco 4B Instruct:

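from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zenlm/zen-eco-4b-instruct"

# Load the tokenizer and weights. device_map="auto" requires the
# accelerate package and places the model on a GPU when one is available.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Render the conversation with the model's chat template and append the
# assistant prefix so that generation starts a new reply.
messages = [{"role": "user", "content": "Explain quantum computing in one sentence."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Slice off the prompt tokens so only the newly generated reply is printed.
reply_ids = outputs[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(reply_ids, skip_special_tokens=True))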

Using the Zen API

For production deployments, use the OpenAI-compatible API at api.hanzo.ai:

curl https://api.hanzo.ai/v1/chat/completions \
  -H "Authorization: Bearer $HANZO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zen-eco-4b-instruct",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    "max_tokens": 512
  }'

Get your API key at console.hanzo.ai. Signup includes $5 of free credit.
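
The curl request above can also be issued from Python. The following is a minimal sketch using only the standard library. The endpoint, model name, and HANZO_API_KEY variable come from the curl example; the function names build_chat_request and send_chat_request are illustrative, and the response parsing assumes the service returns the usual OpenAI-style choices[0].message.content field.

```python
import json
import os
import urllib.request

# Endpoint and model name are taken from the curl example; the helper
# names and defaults below are illustrative, not part of any Zen SDK.
API_URL = "https://api.hanzo.ai/v1/chat/completions"


def build_chat_request(prompt, api_key, model="zen-eco-4b-instruct", max_tokens=512):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request, timeout=30):
    """Send the request and return the assistant's reply text.

    Assumes the response follows the OpenAI chat completion schema.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    key = os.environ.get("HANZO_API_KEY")
    if key:
        req = build_chat_request("Explain quantum computing in one sentence.", key)
        print(send_chat_request(req))
    else:
        print("Set HANZO_API_KEY to send the request.")
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.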

Model Selection Guide

For Edge & Mobile: Choose Zen Nano (0.6B) or Zen Eco (0.75–4B) for on-device deployment with minimal latency.

For Speed & Throughput: Choose Zen 5 Flash (4.02B, 32K), which delivers sub-100ms latency from first token to completion and is suited to high-volume routing.

For Reasoning & Depth: Choose Zen 5 Pro (512K context) or Zen 5 (1M context) for complex multi-step tasks and long-document understanding.

For Content Generation: Zen Blog and Zen Scribe (2–8B) are tuned for structured writing and article generation.

For Multilingual: Zen Multilingual (8B, 128K) covers 100+ languages with strong cross-lingual understanding.
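
The guide above amounts to filtering by context window and parameter budget. A small sketch of that filter follows, using a subset of the Model Gallery rows that list a numeric parameter count. The ModelSpec type, the pick_model helper, and the rule of preferring the largest model that fits are assumptions made for illustration, not an official selection policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    params_b: float  # parameters in billions, from the Model Gallery table
    context_k: int   # context window in thousands of tokens


# Subset of the gallery: only rows that list a numeric parameter count.
CATALOG = [
    ModelSpec("Zen Nano", 0.6, 32),
    ModelSpec("Zen Eco", 0.75, 32),
    ModelSpec("Zen Scribe", 2.35, 32),
    ModelSpec("Zen 5 Flash", 4.02, 32),
    ModelSpec("Zen Pro", 8.19, 128),
]


def pick_model(min_context_k, max_params_b):
    """Return the largest model within the parameter budget that meets
    the context requirement, or None if nothing fits."""
    candidates = [
        m for m in CATALOG
        if m.context_k >= min_context_k and m.params_b <= max_params_b
    ]
    return max(candidates, key=lambda m: m.params_b, default=None)


# Example: an on-device budget of about 1B parameters with a 32K context.
edge_choice = pick_model(min_context_k=32, max_params_b=1.0)
print(edge_choice.name if edge_choice else "no model fits")
```

Task-specific tuning (for example Zen Blog for articles or Zen Multilingual for cross-lingual work) is not captured by size and context alone, so treat the result as a starting point.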

Architecture

All Zen chat models are built on Zen MoDE (Mixture of Diverse/Distilled Experts), a Mixture-of-Experts architecture. Key characteristics:

Sparse Activation: Active parameters scale independently of total capacity
Extended Context: Up to 1M tokens for frontier models
Grouped Query Attention (GQA): Efficient inference without sacrificing quality
Multilingual: Strong performance across 100+ languages
Apache 2.0 License: Download, fine-tune, and deploy commercially
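
Sparse activation means only a few experts run for each token, which is why a model such as Zen 5 Pro (GGUF) can list 284B total parameters with 37B active (roughly 13%). The sketch below shows generic top-k expert routing in plain Python. It is a textbook illustration and not the Zen MoDE router; the function names, the toy experts, and the choice of k=2 are all assumptions.

```python
import math


def top_k_route(router_logits, k):
    """Select the k highest-scoring experts and softmax-normalize their weights.

    Returns one gate weight per expert; unselected experts get 0.0.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits (ties keep their original order).
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    # Softmax over the chosen logits only, shifted by the max for stability.
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    gates = [0.0] * len(router_logits)
    for i, value in exps.items():
        gates[i] = value / total
    return gates


def moe_forward(x, experts, router_logits, k=2):
    """Combine the outputs of the selected experts; the others are never called."""
    gates = top_k_route(router_logits, k)
    return sum(g * experts[i](x) for i, g in enumerate(gates) if g > 0.0)


# Toy experts: scalar functions standing in for feed-forward blocks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
gates = top_k_route(logits, k=2)
output = moe_forward(3.0, experts, logits, k=2)
print(gates, output)
```

Because unselected experts are skipped entirely, compute per token scales with k rather than with the total number of experts.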

Format Support

All models are available in multiple formats for flexible deployment:

SafeTensors (primary) — Full precision (bfloat16) for training and fine-tuning
GGUF (quantized) — Q4_K_M, Q5_K_M, Q8_0, F16 for CPU and edge inference
MLX (Apple Silicon) — Metal-accelerated inference on M1/M2/M3 chips
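
When choosing among these formats, a rough estimate of weight size helps. The sketch below multiplies parameter count by bits per weight. The 16-bit and Q8_0 figures follow from their storage layouts; the Q4_K_M and Q5_K_M values are approximate averages assumed for illustration and vary by model. The estimate covers weights only and excludes the KV cache and activations.

```python
# Approximate bits per weight. F16/BF16 and Q8_0 follow from their storage
# layouts; the K-quant figures are rough averages and are assumptions here.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}


def weight_size_gb(params_billions, fmt):
    """Estimate weight storage in gigabytes (10^9 bytes) for a given format."""
    bits = BITS_PER_WEIGHT[fmt]
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billions * bits / 8


# Example: a 4.02B-parameter model such as Zen 5 Flash.
for fmt in ("BF16", "Q8_0", "Q5_K_M", "Q4_K_M"):
    print(f"{fmt}: ~{weight_size_gb(4.02, fmt):.1f} GB")
```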
