Audio & Speech
Advanced speech recognition, text-to-speech, speech-to-speech dubbing, and generative audio synthesis models from the Zen family.
Overview
The Zen Audio & Speech model family provides production-grade tools for speech recognition, synthesis, dubbing, and audio generation. From compact on-device models (the 0.6B variants) to full-featured systems, these models power voice agents, real-time transcription, voice cloning, and creative audio synthesis. All models are released under the Apache 2.0 license and are optimized for deployment across diverse hardware.
Model Family
| Model | Parameters | Context | Capabilities | HF | Paper |
|---|---|---|---|---|---|
| Zen3 ASR | 2.35B | — | High-accuracy streaming speech-to-text | weights | paper |
| Zen3 ASR 0.6B | 0.94B | — | Lightweight on-device speech recognition | weights | paper |
| Zen3 ASR Forced Aligner | 0.92B | — | Word & phoneme-level time alignment | weights | paper |
| Zen3 TTS | 1.93B | — | Natural speech synthesis with voice cloning | weights | paper |
| Zen3 TTS 0.6B | 0.91B | — | Efficient on-device text-to-speech | weights | paper |
| Zen3 TTS Custom Voice | 1.92B | — | Voice cloning from audio samples | weights | paper |
| Zen3 TTS Voice Design | 1.92B | — | Create voices from text descriptors | weights | paper |
| Zen Dub Live | 1.93B | 30s audio | Real-time speech-to-speech dubbing | weights | paper |
| Zen Foley | 1B | 10s audio | Text-to-audio for sound effects | weights | paper |
| Zen Musician | 6.22B | 30s audio | Text-to-music generation | weights | paper |
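The Context column lists a window length for some models (30 seconds for Zen Dub Live, 10 seconds for Zen Foley). Assuming that figure acts as an upper bound on the audio handled in one pass, a longer recording would need to be split before processing. The sketch below shows one simple way to do that with non-overlapping windows; the helper name `split_into_windows` and the 16 kHz default are illustrative assumptions, not part of any Zen API.

```python
from typing import Iterator, Sequence


def split_into_windows(
    samples: Sequence[float],
    sample_rate: int = 16_000,      # assumed sample rate, not a documented requirement
    window_seconds: float = 30.0,   # taken from the Context column
) -> Iterator[Sequence[float]]:
    """Yield consecutive, non-overlapping windows of at most `window_seconds`."""
    if sample_rate <= 0 or window_seconds <= 0:
        raise ValueError("sample_rate and window_seconds must be positive")
    window_size = int(sample_rate * window_seconds)
    for start in range(0, len(samples), window_size):
        yield samples[start:start + window_size]


# 70 seconds of silence at 16 kHz splits into windows of 30 s, 30 s, and 10 s
audio = [0.0] * (70 * 16_000)
window_lengths = [len(w) / 16_000 for w in split_into_windows(audio)]
print(window_lengths)
```

Non-overlapping windows can cut a word in half at the boundary, so a real pipeline may prefer overlapping windows or splitting on silence.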
Use Cases
- Speech-to-Text: Transcribe audio with high accuracy in real time using Zen3 ASR
- Text-to-Speech: Synthesize natural voices in English and Chinese with voice cloning
- Dubbing & Localization: Real-time speech-to-speech conversion maintaining voice identity
- Audio Generation: Create music and foley sound effects from text prompts
- Voice Alignment: Generate precise time-codes for subtitles and media sync
- Voice Design: Create synthetic voices from text descriptors without reference audio
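To make the subtitle use case concrete, the sketch below turns word-level timings into SubRip (SRT) cues. It assumes the aligner's output can be reduced to `(word, start_seconds, end_seconds)` tuples; that tuple format and the helper names are hypothetical, so check the Zen3 ASR Forced Aligner model card for the actual output structure.

```python
from typing import List, Tuple

# Hypothetical alignment format: (text, start_seconds, end_seconds)
Word = Tuple[str, float, float]


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group aligned words into cues of at most `max_words` and render them as SRT."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start, end = group[0][1], group[-1][2]
        text = " ".join(word[0] for word in group)
        index = i // max_words + 1
        cues.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
    return "\n".join(cues)


aligned = [("Hello", 0.00, 0.42), ("world", 0.48, 0.95), ("again", 1.10, 1.60)]
print(words_to_srt(aligned, max_words=2))
```

Grouping by a fixed word count keeps the example short; production subtitles usually also break cues on pauses and cap the characters per line.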
Quick Start
Speech Recognition (ASR)
import librosa
import torch
from transformers import AutoModel, AutoTokenizer

# Load Zen3 ASR for transcription
model_id = "zenlm/zen-3-asr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Load the audio as mono, resampled to 16 kHz
audio_path = "example.wav"
audio, sr = librosa.load(audio_path, sr=16000)

# The model expects preprocessed audio features rather than a raw waveform.
# See the model card for preprocessing details.

Text-to-Speech (TTS)
import torch
from transformers import AutoModel, AutoTokenizer

# Load Zen3 TTS for speech synthesis
model_id = "zenlm/zen-3-tts"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Example: synthesize speech from text
text = "Hello, this is natural speech synthesis."
inputs = tokenizer(text, return_tensors="pt")

# Generate audio output without tracking gradients
# (See the model card for audio decoding)
with torch.no_grad():
    outputs = model.generate(**inputs)

Using the Zen API
For production deployments, use the OpenAI-compatible Zen API endpoint:
from openai import OpenAI

client = OpenAI(
    api_key="your-zen-api-key",
    base_url="https://api.hanzo.ai/v1",
)

# Speech-to-Text via API (the context manager closes the file after upload)
with open("audio.wav", "rb") as audio_file:
    audio_response = client.audio.transcriptions.create(
        model="zen-3-asr",
        file=audio_file,
    )
print(audio_response.text)

# Text-to-Speech via API
speech_response = client.audio.speech.create(
    model="zen-3-tts",
    voice="alloy",
    input="Hello, world!",
)
# Save the returned audio bytes (MP3 assumed here, the OpenAI API default)
speech_response.write_to_file("speech.mp3")

Deployment Options
Local Inference with Transformers
All Zen audio models are compatible with Hugging Face Transformers. Download from the model hub and run locally with full control:
# Install dependencies (accelerate is required when loading with device_map="auto")
pip install transformers torch accelerate librosa torchaudio
# Load and run any model
python -c "from transformers import AutoModel; m = AutoModel.from_pretrained('zenlm/zen-3-asr')"

Edge & On-Device
The 0.6B and 1B variants are optimized for edge deployment:
- Zen3 ASR 0.6B and Zen3 TTS 0.6B for fast on-device transcription & synthesis
- Quantize to GGUF for llama.cpp-style runtimes, or convert to MLX for Apple Silicon
- Run on Raspberry Pi, mobile, and IoT hardware
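Whether a model fits on a given device depends largely on the size of its weights, which can be estimated as parameter count times bytes per parameter. The sketch below applies that arithmetic to the parameter counts from the Model Family table. The precision labels and the helper `weight_size_gb` are illustrative; real quantized files (GGUF in particular) carry extra metadata and often use slightly more than the nominal bits per weight, and runtime memory for activations is not included.

```python
# Nominal storage cost per parameter for common precisions
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Parameter counts in billions, taken from the Model Family table
EDGE_MODELS = {"Zen3 ASR 0.6B": 0.94, "Zen3 TTS 0.6B": 0.91, "Zen Foley": 1.0}


def weight_size_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


for name, params in EDGE_MODELS.items():
    sizes = ", ".join(
        f"{precision}: {weight_size_gb(params, precision):.2f} GB"
        for precision in ("fp16", "int8", "int4")
    )
    print(f"{name} -> {sizes}")
```

By this estimate the 0.6B variants need roughly 1.9 GB of weights at fp16 and under 0.5 GB at 4-bit, which is the difference between not fitting and fitting on many small boards.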
Cloud API
Use the OpenAI-compatible endpoint at api.hanzo.ai for:
- Managed inference with auto-scaling
- Low latency (under 100 ms for most tasks)
- Batch processing and streaming support
- No model weights to manage locally
Model Selection Guide
- Best Transcription: Zen3 ASR (2.35B) for accuracy, or 0.6B for speed
- Best Voice Synthesis: Zen3 TTS with voice cloning for natural results
- Best Dubbing: Zen Dub Live for real-time speech-to-speech
- Best Music Generation: Zen Musician for up to 30 seconds of music
- Best On-Device: Zen3 ASR 0.6B or Zen3 TTS 0.6B for compact deployments
Training & Fine-tuning
All Zen audio models are available for fine-tuning. The architecture supports:
- Speaker adaptation for custom voices
- Domain-specific ASR training
- Language expansion
- Accent and emotion tuning
See the Zen Training repository for recipes and best practices.
Performance Benchmarks
| Model | Speed | Accuracy | Quality | Device |
|---|---|---|---|---|
| Zen3 ASR | ~5x real-time | 95%+ word accuracy (WER under 5%) | Reference | Server |
| Zen3 ASR 0.6B | ~2x real-time | 90%+ word accuracy (WER under 10%) | Good | Edge |
| Zen3 TTS | ~100 ms per second of audio | — | Natural | Server |
| Zen3 TTS 0.6B | ~50 ms per second of audio | — | Good | Device |
| Zen Dub Live | Real-time | — | Natural | GPU |
| Zen Musician | ~5s for 30s | — | High-fidelity | GPU |
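To reproduce figures like those in the Speed and Accuracy columns on your own data, two measurements are enough. Word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the reference length; lower is better, and word accuracy is commonly reported as 1 - WER. A speed of "5x real-time" means five seconds of audio are processed per second of compute. The sketch below computes both; the function names are illustrative, and the text normalization (lowercasing and whitespace splitting) is a simplifying assumption.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the ref prefix so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def realtime_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio processed per second of compute (5.0 means 5x real-time)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


wer = word_error_rate("the quick brown fox jumps", "the quick brown box jumps")
print(f"WER: {wer:.0%}, word accuracy: {1 - wer:.0%}")
print(f"Speed: {realtime_speedup(60.0, 12.0):.1f}x real-time")
```

Because insertions count as errors, WER can exceed 100% on very poor output, so it should be read as an error rate rather than as a bounded percentage.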
License & Attribution
All Zen audio models are released under the Apache 2.0 license, permitting commercial use, modification, and redistribution. When using these models, please cite the relevant whitepapers linked in the Model Family table above and attribute the Zen team. Model weights are available on Hugging Face and source code is hosted in the Zen repository.