Audio & Speech

Advanced speech recognition, text-to-speech, speech-to-speech dubbing, and generative audio synthesis models from the Zen family.

Overview

The Zen Audio & Speech model family provides production-grade tools for speech recognition, synthesis, dubbing, and audio generation. Ranging from compact on-device models (the 0.6B variants) to full-featured systems, these models power voice agents, real-time transcription, voice cloning, and creative audio synthesis. All models are released under the Apache 2.0 license and are optimized for deployment across diverse hardware.

Model Family

| Model | Parameters | Context | Capabilities | HF | Paper |
|---|---|---|---|---|---|
| Zen3 ASR | 2.35B | - | High-accuracy streaming speech-to-text | weights | paper |
| Zen3 ASR 0.6B | 0.94B | - | Lightweight on-device speech recognition | weights | paper |
| Zen3 ASR Forced Aligner | 0.92B | - | Word & phoneme-level time alignment | weights | paper |
| Zen3 TTS | 1.93B | - | Natural speech synthesis with voice cloning | weights | paper |
| Zen3 TTS 0.6B | 0.91B | - | Efficient on-device text-to-speech | weights | paper |
| Zen3 TTS Custom Voice | 1.92B | - | Voice cloning from audio samples | weights | paper |
| Zen3 TTS Voice Design | 1.92B | - | Create voices from text descriptors | weights | paper |
| Zen Dub Live | 1.93B | 30s audio | Real-time speech-to-speech dubbing | weights | paper |
| Zen Foley | 1B | 10s audio | Text-to-audio for sound effects | weights | paper |
| Zen Musician | 6.22B | 30s audio | Text-to-music generation | weights | paper |
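
The Context column lists fixed audio windows (for example, 30 s for Zen Dub Live). A recording longer than the window has to be split before it is processed. The following is a minimal sketch of computing window boundaries; the function name, the optional overlap, and the idea of feeding chunks independently are assumptions for illustration, not behavior documented for these models.

```python
def chunk_bounds(total_seconds, window_seconds=30.0, overlap_seconds=0.0):
    """Return (start, end) times in seconds that cover a recording
    using windows of at most window_seconds each."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= overlap_seconds < window_seconds:
        raise ValueError("overlap_seconds must be in [0, window_seconds)")

    step = window_seconds - overlap_seconds
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window_seconds, total_seconds)
        bounds.append((start, end))
        if end >= total_seconds:
            break  # the last window already reaches the end of the recording
        start += step
    return bounds
```

For a 75-second recording and the default 30-second window, `chunk_bounds(75.0)` returns `[(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]`. A small overlap can help avoid cutting words at chunk edges, at the cost of processing some audio twice.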

Use Cases

  • Speech-to-Text: Transcribe audio with high accuracy in real-time using Zen3 ASR
  • Text-to-Speech: Synthesize natural voices in English and Chinese with voice cloning
  • Dubbing & Localization: Real-time speech-to-speech conversion maintaining voice identity
  • Audio Generation: Create music and foley sound effects from text prompts
  • Voice Alignment: Generate precise time-codes for subtitles and media sync
  • Voice Design: Create synthetic voices from text descriptors without reference audio
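
To illustrate the subtitle use case, the sketch below turns word-level timings into SubRip (SRT) cues. The input format, a list of `(word, start_seconds, end_seconds)` tuples, is a hypothetical stand-in; the actual output structure of Zen3 ASR Forced Aligner is described on its model card, not here.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group (word, start, end) tuples into numbered SRT cues.

    Each cue holds up to max_words words and spans from the start of its
    first word to the end of its last word.
    """
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(word for word, _, _ in group)
        start = format_timestamp(group[0][1])
        end = format_timestamp(group[-1][2])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Grouping by a fixed word count is the simplest policy; production subtitle tools usually also break on pauses and limit characters per line.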

Quick Start

Speech Recognition (ASR)

import librosa
import torch
from transformers import AutoModel, AutoTokenizer

# Load Zen3 ASR for transcription
model_id = "zenlm/zen-3-asr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Load the audio as a mono waveform resampled to 16 kHz
audio_path = "example.wav"
audio, sr = librosa.load(audio_path, sr=16000)

# The model expects audio features as input. Feature extraction and decoding
# are model-specific; see the model card for preprocessing details.
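
The snippet above loads audio at 16 kHz, which suggests (but does not confirm) that the model expects 16 kHz mono input. `librosa.load` resamples and downmixes for you, but audio that arrives through another loader may not be converted. The sketch below uses only the standard library to inspect a PCM WAV file; the helper names are illustrative, and the 16 kHz target is an assumption taken from the snippet.

```python
import wave


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def needs_resampling(path, target_rate=16000):
    """True if the file is not already mono at target_rate (assumed 16 kHz)."""
    rate, channels, _ = wav_info(path)
    return rate != target_rate or channels != 1
```

Running `needs_resampling("example.wav")` before inference is a cheap way to catch 44.1 kHz or stereo recordings early.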

Text-to-Speech (TTS)

import torch
from transformers import AutoModel, AutoTokenizer

# Load Zen3 TTS for speech synthesis
model_id = "zenlm/zen-3-tts"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Example: synthesize speech from text
text = "Hello, this is natural speech synthesis."
inputs = tokenizer(text, return_tensors="pt")

# Generate audio output
# (See model card for audio decoding)
with torch.no_grad():
    outputs = model.generate(**inputs)
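
Once the output has been decoded into a waveform (the decoding step is model-specific and covered by the model card), it still needs to be written to disk. The sketch below assumes the decoded audio is a mono sequence of floats in the range [-1.0, 1.0] and that you know its sample rate; both are assumptions, and `write_wav` is an illustrative helper rather than part of any Zen library.

```python
import struct
import wave


def write_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip out-of-range values instead of letting them wrap around
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))

    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
```

A call might look like `write_wav("speech.wav", waveform, sample_rate)`, where `waveform` and `sample_rate` come from the model's decoding step. For large outputs, a NumPy-based conversion or `soundfile` would be faster than this per-sample loop.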

Using the Zen API

For production deployments, use the OpenAI-compatible Zen API endpoint:

from openai import OpenAI

client = OpenAI(
    api_key="your-zen-api-key",
    base_url="https://api.hanzo.ai/v1"
)

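# Speech-to-Text via API
with open("audio.wav", "rb") as audio_file:
    audio_response = client.audio.transcriptions.create(
        model="zen-3-asr",
        file=audio_file,
    )
print(audio_response.text)

# Text-to-Speech via API
speech_response = client.audio.speech.create(
    model="zen-3-tts",
    voice="alloy",
    input="Hello, world!",
)
# Save the returned audio; the .mp3 extension assumes the OpenAI default response format
speech_response.write_to_file("speech.mp3")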
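
Hardcoding the API key, as in the example above, is fine for a quick test but risky in shared code. One common alternative is to read it from the environment. In this sketch the variable name `ZEN_API_KEY` and the helper `zen_client_kwargs` are hypothetical; only the base URL comes from this page.

```python
import os

ZEN_BASE_URL = "https://api.hanzo.ai/v1"  # endpoint shown in the example above


def zen_client_kwargs(env_var="ZEN_API_KEY"):
    """Build keyword arguments for OpenAI(...) from an environment variable.

    The variable name is an assumption; use whatever your deployment defines.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"api_key": api_key, "base_url": ZEN_BASE_URL}
```

The client can then be created with `client = OpenAI(**zen_client_kwargs())`, which keeps the key out of source control.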

Deployment Options

Local Inference with Transformers

All Zen audio models are compatible with Hugging Face Transformers. Download the weights from the model hub and run them locally with full control:

# Install dependencies (accelerate is required for device_map="auto")
pip install transformers torch librosa torchaudio accelerate

# Load and run any model
python -c "from transformers import AutoModel; m = AutoModel.from_pretrained('zenlm/zen-3-asr')"

Edge & On-Device

The 0.6B and 1B variants are optimized for edge deployment:

  • Zen3 ASR 0.6B and Zen3 TTS 0.6B for fast on-device transcription & synthesis
  • Quantize to GGUF, or to MLX for Apple Silicon
  • Run on Raspberry Pi, mobile, and IoT hardware
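
When sizing a model for an edge device, a useful first check is how much memory the weights alone occupy at a given precision. The sketch below is a back-of-the-envelope estimate using the parameter counts from the Model Family table; it ignores activations, runtime buffers, and the per-block scale metadata that real quantization formats add, so treat the results as lower bounds.

```python
# Approximate storage per parameter, in bytes, for common precisions
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions, precision):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    # Parameter counts as listed in the Model Family table
    for name, params in [("Zen3 ASR 0.6B", 0.94), ("Zen3 TTS 0.6B", 0.91)]:
        for precision in ("fp16", "int8", "int4"):
            print(f"{name} @ {precision}: ~{weight_memory_gb(params, precision):.2f} GB")
```

For example, Zen3 ASR 0.6B (0.94B parameters) needs roughly 1.88 GB at fp16 and about 0.47 GB at 4-bit precision, before any runtime overhead.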

Cloud API

Use the OpenAI-compatible endpoint at api.hanzo.ai for:

  • Managed inference with auto-scaling
  • Low latency (under 100 ms for most tasks)
  • Batch processing and streaming support
  • No model weights to manage locally
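
This page does not document a dedicated batch endpoint, so one simple way to process many files is to issue requests concurrently from the client side. The sketch below is an assumption-laden illustration: `transcribe_batch` is a hypothetical helper, and `transcribe_one` is any callable you supply, for instance a small function that wraps `client.audio.transcriptions.create`.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_batch(paths, transcribe_one, max_workers=4):
    """Call transcribe_one(path) concurrently for each path.

    Returns a dict mapping each path to its result, or to the exception
    raised for that path, in the same order as the input.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(transcribe_one, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one file fails
                results[path] = exc
    return results
```

Capturing exceptions per file means one corrupt recording does not abort the whole batch. Keep `max_workers` modest to stay within whatever rate limits your account has.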

Model Selection Guide

  • Best Transcription: Zen3 ASR (2.35B) for accuracy, or 0.6B for speed
  • Best Voice Synthesis: Zen3 TTS with voice cloning for natural results
  • Best Dubbing: Zen Dub Live for real-time speech-to-speech
  • Best Music Generation: Zen Musician for up to 30 seconds of music
  • Best On-Device: Zen3 ASR 0.6B or Zen3 TTS 0.6B for compact deployments

Training & Fine-tuning

All Zen audio models are available for fine-tuning. The architecture supports:

  • Speaker adaptation for custom voices
  • Domain-specific ASR training
  • Language expansion
  • Accent and emotion tuning

See the Zen Training repository for recipes and best practices.

Performance Benchmarks

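| Model | Speed | Accuracy | Quality | Device |
|---|---|---|---|---|
| Zen3 ASR | ~5x real-time | 95%+ word accuracy (under 5% WER) | Reference | Server |
| Zen3 ASR 0.6B | ~2x real-time | 90%+ word accuracy (under 10% WER) | Good | Edge |
| Zen3 TTS | ~100 ms/sec | - | Natural | Server |
| Zen3 TTS 0.6B | ~50 ms/sec | - | Good | Device |
| Zen Dub Live | Real-time | - | Natural | GPU |
| Zen Musician | ~5 s for 30 s | - | High-fidelity | GPU |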
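
ASR accuracy is conventionally measured with word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. WER is an error rate, so lower is better, and word accuracy is often reported as 1 - WER. The sketch below shows the standard calculation so you can reproduce such figures on your own data; it is a generic implementation, not the evaluation script used for the numbers in the table, and it applies no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For the reference "the cat sat on the mat" and the hypothesis "the cat sat on mat", one word is deleted out of six, giving a WER of 1/6 (about 16.7%). Real evaluations usually lowercase the text and strip punctuation first, and WER can exceed 100% when the hypothesis contains many insertions.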

License & Attribution

All Zen audio models are released under the Apache 2.0 license, permitting commercial use, modification, and redistribution. When using these models, please cite the relevant whitepapers linked in the Model Family table above and attribute the Zen team. Model weights are available on Hugging Face and source code is hosted in the Zen repository.
