
Chat & Reasoning

Zen chat and reasoning models ranging from 0.6B edge deployments to frontier 1M-context reasoning systems.

Chat & Reasoning Models

The Zen Chat & Reasoning family provides a complete spectrum of language models for conversational AI and multi-step reasoning, from ultra-lightweight 0.6B-parameter edge models to frontier reasoning systems with 1M-token context windows. Built on the Zen MoDE architecture (Mixture of Diverse/Distilled Experts), these models deliver strong performance across general chat, specialized reasoning, and domain-specific tasks.

Model                Parameters           Context  Weights  Paper
Zen 5                -                    1M       weights  paper
Zen 5 Pro            -                    512K     weights  paper
Zen 5 Pro (GGUF)     284B / 37B active    1M       weights  paper
Zen 5 Mini           -                    256K     weights  paper
Zen 5 Flash          4.02B                32K      weights  paper
Zen 5 Max            MoE (IQ2_XXS quant)  1M       weights  paper
Zen Pro              8.19B                128K     weights  paper
Zen Blog             8.19B                128K     weights  paper
Zen Multilingual     8B                   128K     weights  paper
Zen3 Nano            8.19B                40K      weights  paper
Zen Scribe           2.35B                32K      weights  paper
Zen Eco 4B Instruct  4.02B                32K      weights  paper
Zen Eco 4B Thinking  4.02B                32K      weights  paper
Zen Eco Instruct     4B                   32K      weights  paper
Zen Eco              0.75B                32K      weights  paper
Zen Nano 0.6B        0.6B                 32K      weights  paper
Zen Nano             0.6B                 32K      weights  paper

Quick Start

Local Inference with Transformers

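Any SafeTensors model from the table above can be loaded the same way by changing model_id (GGUF builds such as Zen 5 Pro (GGUF) are meant for GGUF-compatible runtimes instead). Here's an example with Zen Eco 4B Instruct: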

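from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zenlm/zen-eco-4b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# torch_dtype="auto" keeps the checkpoint's stored precision;
# device_map="auto" places weights on available devices (needs the accelerate package)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Render the conversation with the model's chat template and append the
# assistant-turn prompt so generation starts a fresh reply
messages = [{"role": "user", "content": "Explain quantum computing in one sentence."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the echoed prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))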

Using the Zen API

For production deployments, use the OpenAI-compatible API at api.hanzo.ai:

curl https://api.hanzo.ai/v1/chat/completions \
  -H "Authorization: Bearer $HANZO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zen-eco-4b-instruct",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    "max_tokens": 512
  }'

Get your API key at console.hanzo.ai — includes $5 free credit on signup.
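The same request can be sketched in Python with only the standard library. This assumes the endpoint accepts exactly the JSON body from the curl example and, being OpenAI-compatible, returns the usual chat-completion shape (choices[0].message.content). The helper names build_chat_request, extract_reply, and chat are hypothetical and not part of any Hanzo SDK.

```python
import json
import os
import urllib.request

API_URL = "https://api.hanzo.ai/v1/chat/completions"


def build_chat_request(model, messages, max_tokens=512, api_key=None):
    """Build (but do not send) a POST request mirroring the curl example."""
    # Falls back to the HANZO_API_KEY environment variable, as in the curl call
    key = api_key if api_key is not None else os.environ.get("HANZO_API_KEY", "")
    body = json.dumps(
        {"model": model, "messages": messages, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_json):
    """Return the assistant text, assuming the OpenAI response shape."""
    return response_json["choices"][0]["message"]["content"]


def chat(model, messages, max_tokens=512):
    """Send the request and return the first choice's text (needs network access)."""
    req = build_chat_request(model, messages, max_tokens)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return extract_reply(json.load(resp))
```

With HANZO_API_KEY exported, chat("zen-eco-4b-instruct", [{"role": "user", "content": "Explain quantum computing in one sentence."}]) would send the same request as the curl command. Keeping request construction separate from sending makes the payload easy to inspect without a network call.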

Model Selection Guide

For Edge & Mobile: Choose Zen Nano (0.6B) or Zen Eco (0.75–4B) for on-device deployment with minimal latency.

For Speed & Throughput: Zen 5 Flash (4.02B, 32K) targets high-volume routing, with sub-100 ms latency from first token to finish.

For Reasoning & Depth: Choose Zen 5 Pro (512K context) or Zen 5 (1M context) for complex multi-step tasks and long-document understanding.

For Content Generation: Zen Blog and Zen Scribe (2–8B) are tuned for structured writing and article generation.

For Multilingual: Zen Multilingual (8B, 128K) covers 100+ languages with strong cross-lingual understanding.
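The context-window dimension of this guide can be made mechanical. The sketch below picks the model with the smallest context window from the table that still fits a prompt plus its generation budget. It ignores quality, latency, and language coverage, and it assumes 1K = 1,024 tokens and 1M = 1,048,576 tokens, which this page does not specify. CONTEXT_WINDOWS, parse_context, and pick_model are hypothetical names.

```python
# Context windows copied from the model table, listed roughly smallest model first.
# Interpreting "K" as 1,024 and "M" as 1,024 * 1,024 tokens is an assumption.
CONTEXT_WINDOWS = {
    "Zen Nano": "32K",
    "Zen Eco": "32K",
    "Zen 5 Flash": "32K",
    "Zen3 Nano": "40K",
    "Zen Multilingual": "128K",
    "Zen Pro": "128K",
    "Zen 5 Mini": "256K",
    "Zen 5 Pro": "512K",
    "Zen 5": "1M",
}


def parse_context(label):
    """Convert a table label such as '512K' or '1M' into a token count."""
    units = {"K": 1024, "M": 1024 * 1024}
    return int(label[:-1]) * units[label[-1]]


def pick_model(prompt_tokens, max_new_tokens, windows=CONTEXT_WINDOWS):
    """Return the first model with the smallest window that fits, or None."""
    needed = prompt_tokens + max_new_tokens
    fitting = [name for name, label in windows.items() if parse_context(label) >= needed]
    # min() returns the first minimal item, so ties keep the dictionary order
    return min(fitting, key=lambda name: parse_context(windows[name]), default=None)
```

For example, a 30,000-token prompt with a 4,000-token budget no longer fits a 32K window, so pick_model(30000, 4000) falls through to Zen3 Nano (40K).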

Architecture

All Zen chat models are built on Zen MoDE, a modern Mixture of Experts architecture featuring:

  • Sparse Activation: Each token is routed to a subset of experts, so active parameters (e.g. 37B of 284B for Zen 5 Pro (GGUF)) scale independently of total capacity
  • Extended Context: Up to 1M tokens for frontier models
  • Grouped Query Attention (GQA): Efficient inference without sacrificing quality
  • Multilingual: Strong performance across 100+ languages
  • Apache 2.0 License: Download, fine-tune, and deploy commercially
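Zen MoDE's router and attention configuration are not documented on this page, so the sketch below uses two generic building blocks instead: a standard top-k softmax gate, the usual way sparse MoE layers keep active parameters small, and the KV-cache size formula that shows why GQA helps. The layer counts, head counts, and head dimensions in the example are hypothetical, not Zen model configs.

```python
import math


def top_k_gate(router_logits, k):
    """Generic top-k MoE gating (not Zen MoDE's documented router).

    Keeps the k highest-scoring experts and softmaxes their logits, which equals
    renormalizing the full softmax over the selected experts. Only these k experts
    run for the token, so compute tracks k rather than the total expert count.
    """
    # sorted() is stable, so equal logits keep their original expert order
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV-cache size: keys and values for every layer, KV head, and position.

    With GQA, several query heads share one KV head, so shrinking kv_heads
    shrinks the cache proportionally without changing the number of query heads.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
```

With hypothetical settings of 32 layers, 128-dim heads, a 32,768-token sequence, and 16-bit values, 32 KV heads need 16 GiB of cache while 8 shared KV heads need 4 GiB. On the sparsity side, the table's 37B active of 284B total for Zen 5 Pro (GGUF) means roughly 13% of weights participate per token.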

Format Support

All models are available in multiple formats for flexible deployment:

  • SafeTensors (primary) — Full precision (bfloat16) for training and fine-tuning
  • GGUF (quantized) — Q4_K_M, Q5_K_M, Q8_0, F16 for CPU and edge inference
  • MLX (Apple Silicon) — Metal-accelerated inference on M1/M2/M3 chips
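When choosing between these formats, a quick weight-size estimate helps. The sketch below multiplies parameter count by bits per weight. The F16 and Q8_0 figures are exact for llama.cpp's formats (a Q8_0 block stores 32 int8 weights plus one 16-bit scale, i.e. 8.5 bits per weight), and BF16 SafeTensors use the same 16 bits as F16. The Q4_K_M and Q5_K_M values are rough averages and should be treated as assumptions, since those quants mix types across tensors. estimate_weight_gb is a hypothetical helper.

```python
# Bits per weight for each GGUF format. F16 and Q8_0 are exact; the K-quant
# values are approximate averages (assumption), since they vary per model.
BITS_PER_WEIGHT = {
    "F16": 16.0,     # same storage cost as bfloat16 SafeTensors
    "Q8_0": 8.5,     # 32 x int8 + one fp16 scale per block = 34 bytes / 32 weights
    "Q5_K_M": 5.7,   # approximate
    "Q4_K_M": 4.85,  # approximate
}


def estimate_weight_gb(params_billion, fmt):
    """Approximate weight storage in GB (10**9 bytes).

    Billions of parameters times bits per weight, divided by 8 bits per byte,
    gives billions of bytes. Excludes metadata, activations, and the KV cache.
    """
    return params_billion * BITS_PER_WEIGHT[fmt] / 8
```

For Zen Eco 4B Instruct (4.02B), this gives about 8.04 GB at F16 and about 4.27 GB at Q8_0; the Q4_K_M figure (roughly 2.4 GB) is only as good as the approximate bits-per-weight value behind it.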
