Embeddings & Retrieval

State-of-the-art text embeddings for semantic search, RAG, and retrieval pipelines across multiple scales.

Overview

The Zen Embeddings & Retrieval model family provides multilingual text embeddings for semantic search, retrieval-augmented generation (RAG), and similarity matching. Built on the Zen MoDE (Mixture of Distilled Experts) architecture, the models range from 0.6B to 8B parameters, so you can trade accuracy against latency, and offer context windows of 8K to 32K tokens depending on the variant.

All models in this family are Apache 2.0 licensed and available on Hugging Face.

Model Family

Model                       Params   Context   Hugging Face   Paper
Zen Embedding               7.57B    8K        weights        paper
Zen Embedding 0.6B          0.6B     32K       weights        paper
Zen Embedding 0.6B (GGUF)   0.6B     8K        weights        paper
Zen Embedding 4B            4.02B    32K       weights        paper
Zen Embedding 8B            7.57B    32K       weights        paper
Zen Embedding 8B (GGUF)     8B       8K        weights        paper

Quick Start

Using Sentence Transformers (CPU / GPU)

from sentence_transformers import SentenceTransformer

# Load a Zen embedding model
model = SentenceTransformer("zenlm/zen-embedding-4B")

# Encode texts
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, embedding_dim)

# Compute similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
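
To make the similarity step concrete, the sketch below computes a pairwise cosine similarity matrix by hand. Cosine similarity is the default in Sentence Transformers unless a model's configuration selects a different function. The three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_matrix` is a hypothetical helper, not part of any library.

```python
import math

def cosine_matrix(vectors):
    """Return the pairwise cosine similarity matrix for a list of vectors."""
    # Scale every vector to unit length so a dot product equals the cosine.
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

# Toy 3-dimensional "embeddings" (illustrative values only).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]
sims = cosine_matrix(toy_embeddings)
for row in sims:
    print([round(s, 3) for s in row])
```

The diagonal is 1.0 because every vector is identical to itself, and the matrix is symmetric, which is also what you should see in the output of `model.similarity(embeddings, embeddings)`.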

Using Transformers (Feature Extraction)

from transformers import AutoTokenizer, AutoModel
import torch

model_name = "zenlm/zen-embedding-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode text
texts = ["Your text here", "Another example"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
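    # Pool by taking the first token's hidden state (CLS-style pooling).
    # Check the model card for the pooling the checkpoint was trained with.
    embeddings = outputs.last_hidden_state[:, 0, :]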

print(embeddings.shape)
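
The snippet above pools by taking the first token's hidden state. Which pooling strategy the Zen checkpoints expect is not stated on this page, so treat that choice as something to confirm against the model card. Mean pooling over non-padding tokens is a common alternative, sketched here with NumPy arrays standing in for `outputs.last_hidden_state` and `inputs["attention_mask"]`. `mean_pool` is a hypothetical helper and the numbers are toy values.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    hidden_states:  (batch, seq_len, dim) float array
    attention_mask: (batch, seq_len) array of 1 (real token) / 0 (padding)
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 token positions, 2 dimensions.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = mean_pool(hidden, mask)
print(pooled)
```

The padded position in the first sequence carries large values on purpose: they do not leak into the pooled vector because the mask zeroes them out before averaging.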

Using Zen API (Cloud)

For managed embeddings via the Zen API, use the OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.hanzo.ai/v1",
    api_key="your-api-key"
)

response = client.embeddings.create(
    model="zen-embedding-4B",
    input="Your text here"
)

embedding = response.data[0].embedding
print(len(embedding))
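
When embedding many texts, it is common to send them in batches rather than one request per string. The sketch below shows only the batching helper. It assumes the endpoint accepts a list for `input`, as the OpenAI embeddings API does, and the batch size of 2 is arbitrary. `batched` is a hypothetical helper, not part of the `openai` package.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = ["first text", "second text", "third text", "fourth text", "fifth text"]
batches = list(batched(texts, batch_size=2))
print(batches)

# Each batch could then be sent in one request, assuming list input is accepted:
# for batch in batches:
#     response = client.embeddings.create(model="zen-embedding-4B", input=batch)
#     vectors = [item.embedding for item in response.data]
```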

Model Selection Guide

  • Zen Embedding 0.6B: Edge devices, low-latency retrieval, on-device deployment
  • Zen Embedding 4B: Balanced quality and speed, production RAG systems, semantic search
  • Zen Embedding 8B: High-accuracy retrieval, demanding workloads, fine-grained semantic matching
  • GGUF variants: CPU inference with llama.cpp, quantized edge deployment

All models support multilingual text and produce dense embeddings suitable for vector databases (Pinecone, Weaviate, Milvus, Qdrant, etc.).
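
For capacity planning, the raw size of a dense index is roughly vectors × dimensions × bytes per value. The sketch below works through that arithmetic. The embedding dimension of 1024 and the corpus size are placeholder assumptions, not published figures for any Zen model, and real vector databases add index overhead on top.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for dense vectors (float32 by default), excluding index overhead."""
    return num_vectors * dim * bytes_per_value

# Placeholder numbers: 1,000,000 chunks, 1024-dimensional float32 vectors.
size = index_size_bytes(1_000_000, 1024)
print(f"{size / 1024**3:.2f} GiB")
```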

Features

  • Multilingual: Strong performance across 100+ languages
  • Long context: 8K–32K token support for document-level embeddings
  • Mixture of Experts: Efficient sparse computation in the 4B and 8B variants
  • Multiple formats: SafeTensors, GGUF (quantized), compatible with sentence-transformers
  • Apache 2.0: Fully open-source for commercial and research use
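
Documents longer than a model's context window still need to be split before embedding. A minimal sketch of fixed-size chunking with overlap follows. It operates on an already tokenized list, and the window and overlap sizes are illustrative rather than recommendations. `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split a token list into windows of at most `max_tokens`, sharing `overlap` tokens."""
    if max_tokens < 1 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens >= 1 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end
    return chunks

tokens = list(range(10))  # stand-in for real token IDs
chunks = chunk_tokens(tokens, max_tokens=4, overlap=1)
print(chunks)
```

In practice `max_tokens` would be set at or below the context length of the chosen variant, and each chunk would be decoded back to text (or passed as token IDs) before encoding.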

Integration with RAG

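Zen embeddings cover the retrieval step of a RAG pipeline: embed your documents once, embed each query at request time, and rank the documents by similarity to the query.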

# Example: retrieve relevant documents for a query
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("zenlm/zen-embedding-4B")

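# Embed your documents once; unit-length vectors make the dot product a cosine similarity
documents = ["Doc 1 text", "Doc 2 text", "Doc 3 text"]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# At query time, embed the question the same way
query = "What is semantic search?"
query_embedding = model.encode(query, normalize_embeddings=True)

# Score every document against the query and rank from most to least similar
similarities = np.dot(query_embedding, doc_embeddings.T)
top_indices = np.argsort(similarities)[::-1][:3]

for rank, idx in enumerate(top_indices, start=1):
    print(f"Rank {rank}: {documents[idx]} (score={similarities[idx]:.3f})")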
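
The ranking step can be checked in isolation with hand-made vectors. The sketch below normalizes the vectors so that the dot product is a cosine similarity, then returns the top-k documents with their scores. The two-dimensional vectors are toy values standing in for `model.encode(...)` output, and `top_k` is a hypothetical helper.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_index, cosine_score) pairs for the k most similar documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]

documents = ["Doc 1 text", "Doc 2 text", "Doc 3 text"]
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
query_vec = np.array([2.0, 1.0])

results = top_k(query_vec, doc_vecs, k=2)
for rank, (idx, score) in enumerate(results, start=1):
    print(f"Rank {rank}: {documents[idx]} (score={score:.3f})")
```

Note that the third document wins even though its raw vector is the longest: after normalization only direction matters, which is why ranking by an unnormalized dot product can give a different order.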

Advanced Usage

Fine-tuning

Fine-tune a Zen embedding model on your domain-specific data:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("zenlm/zen-embedding-4B")

# Prepare training pairs
train_examples = [
    InputExample(texts=["sentence1", "sentence2"], label=0.95),
    InputExample(texts=["sentence3", "sentence4"], label=0.10),
]

train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    warmup_steps=100
)
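
For intuition about what the labels in the training pairs mean: with its default settings, `CosineSimilarityLoss` penalizes the squared difference between the cosine similarity of the two sentence embeddings and the target label. The sketch below reproduces that arithmetic on toy vectors in plain Python. The vectors and the helper names `cosine` and `cosine_mse_loss` are illustrative assumptions, not library code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_mse_loss(pairs, labels):
    """Mean squared error between each pair's cosine similarity and its target label."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Toy embedding pairs with target similarities (illustrative values only).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction, cosine = 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine = 0.0
]
labels = [0.95, 0.10]
loss = cosine_mse_loss(pairs, labels)
print(round(loss, 5))
```

Labels should therefore be on the same scale as cosine similarity, with values near 1.0 for pairs that should embed close together and values near 0 for unrelated pairs.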

Quantization (GGUF)

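Convert and quantize for CPU-only deployment with the llama.cpp tools:

# Download the weights (convert_hf_to_gguf.py expects a local model directory)
huggingface-cli download zenlm/zen-embedding-4B --local-dir zen-embedding-4B

# Convert to GGUF format (run from a llama.cpp checkout)
python convert_hf_to_gguf.py zen-embedding-4B --outfile zen-embedding-4b.gguf --outtype f16

# Quantize
./llama-quantize zen-embedding-4b.gguf zen-embedding-4b-q4_k_m.gguf Q4_K_M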

Performance

Zen embedding models are evaluated on standard benchmarks including MTEB (Massive Text Embedding Benchmark), achieving competitive performance in:

  • Semantic textual similarity
  • Retrieval (the search step behind RAG)
  • Paraphrase detection
  • Clustering
  • Duplicate detection

See the Zen Embeddings & Retrieval whitepaper for detailed benchmark results.

License

All models in the Zen Embeddings family are released under the Apache 2.0 license, which permits use in commercial and open-source projects subject to the license's attribution and notice requirements.
