Embeddings & Retrieval
State-of-the-art text embeddings for semantic search, RAG, and retrieval pipelines across multiple scales.
Overview
The Zen Embeddings & Retrieval model family provides high-quality multilingual text embeddings for semantic search, retrieval-augmented generation (RAG), and similarity matching. Built on the Zen MoDE (Mixture of Distilled Experts) architecture, these models offer flexible trade-offs between accuracy and latency across 0.6B to 8B parameters, with context windows from 8K to 32K tokens.
All models in this family are Apache 2.0 licensed and available on Hugging Face.
Model Family
| Model | Params | Context | HF | Paper |
|---|---|---|---|---|
| Zen Embedding | 7.57B | 8K | weights | paper |
| Zen Embedding 0.6B | 0.6B | 32K | weights | paper |
| Zen Embedding 0.6B (GGUF) | 0.6B | 8K | weights | paper |
| Zen Embedding 4B | 4.02B | 32K | weights | paper |
| Zen Embedding 8B | 7.57B | 32K | weights | paper |
| Zen Embedding 8B (GGUF) | 8B | 8K | weights | paper |
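As a rough aid for reading the table, parameter counts translate into weight memory by simple arithmetic: parameters times bytes per parameter. The sketch below assumes the parameter counts listed above and nominal 16-bit and 4-bit precisions. The function name is illustrative, and real memory use is higher once activations, runtime buffers, and quantization metadata are included.

```python
# Rough weight-memory estimate: parameters x bits per parameter.
# Covers weights only; activations and runtime overhead are not included.

PARAMS_BILLIONS = {  # parameter counts taken from the table above
    "Zen Embedding 0.6B": 0.6,
    "Zen Embedding 4B": 4.02,
    "Zen Embedding 8B": 7.57,
}


def weight_memory_gib(params_billions: float, bits_per_param: float) -> float:
    """Return the approximate size of the weights in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    for name, params in PARAMS_BILLIONS.items():
        fp16 = weight_memory_gib(params, 16)
        # Nominal 4 bits per weight; real quantized files use slightly more.
        four_bit = weight_memory_gib(params, 4)
        print(f"{name}: ~{fp16:.1f} GiB at 16-bit, ~{four_bit:.1f} GiB at 4-bit")
```

These figures are only a first filter for hardware sizing; batch size and sequence length add to the total at inference time.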
Quick Start
Using Sentence Transformers (CPU / GPU)
from sentence_transformers import SentenceTransformer
# Load a Zen embedding model
model = SentenceTransformer("zenlm/zen-embedding-4B")
# Encode texts
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape) # (3, embedding_dim)
# Compute similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
Using Transformers (Feature Extraction)
from transformers import AutoTokenizer, AutoModel
import torch
model_name = "zenlm/zen-embedding-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Encode text
texts = ["Your text here", "Another example"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
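with torch.no_grad():
    outputs = model(**inputs)
# First-token ([CLS]) pooling; check the model card, since some embedding models expect mean or last-token pooling
embeddings = outputs.last_hidden_state[:, 0, :]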
print(embeddings.shape)
Using Zen API (Cloud)
For managed embeddings via the Zen API, use the OpenAI-compatible endpoint:
from openai import OpenAI
client = OpenAI(
    base_url="https://api.hanzo.ai/v1",
    api_key="your-api-key",
)
response = client.embeddings.create(
    model="zen-embedding-4B",
    input="Your text here",
)
embedding = response.data[0].embedding
print(len(embedding))
Model Selection Guide
- Zen Embedding 0.6B: Edge devices, low-latency retrieval, on-device deployment
- Zen Embedding 4B: Balanced quality and speed, production RAG systems, semantic search
- Zen Embedding 8B: High-accuracy retrieval, demanding workloads, fine-grained semantic matching
- GGUF variants: CPU inference with llama.cpp, quantized edge deployment
All models support multilingual text and produce dense embeddings suitable for vector databases (Pinecone, Weaviate, Milvus, Qdrant, etc.).
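When loading embeddings into a vector database, the metric configured on the index should match how the vectors are meant to be compared. A common convention, sketched below in plain Python, is to L2-normalize each vector so that the inner product equals cosine similarity. The helper names are illustrative, and whether Zen embeddings are returned already normalized is not stated here, so treat the normalization step as an assumption to verify.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


if __name__ == "__main__":
    a, b = [3.0, 4.0], [4.0, 3.0]
    # After normalization, the plain inner product matches cosine similarity.
    print(cosine_similarity(a, b))
    print(dot(l2_normalize(a), l2_normalize(b)))
```

With sentence-transformers, `model.encode(texts, normalize_embeddings=True)` returns unit-length vectors directly, so an inner-product index then behaves like a cosine index.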
Features
- Multilingual: Strong performance across 100+ languages
- Long context: 8K–32K token support for document-level embeddings
- Mixture of Experts: Efficient sparse computation in the 4B and 8B variants
- Multiple formats: SafeTensors, GGUF (quantized), compatible with sentence-transformers
- Apache 2.0: Fully open-source for commercial and research use
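Documents longer than a model's context window need to be split before embedding. The sketch below shows one common approach, overlapping fixed-size chunks. It counts whitespace-separated words as a stand-in for tokens, an assumption made to keep the example dependency-free; a real pipeline would count tokens with the model's tokenizer. The function name and the `max_tokens` and `overlap` parameters are illustrative.

```python
from typing import List


def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64) -> List[str]:
    """Split text into overlapping chunks of at most max_tokens words.

    Words approximate tokens here; swap in a tokenizer for real limits.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the current window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, max_tokens=4, overlap=1):
        print(chunk)
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of embedding some text twice.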
Integration with RAG
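In a RAG pipeline, Zen embeddings handle the retrieval step: embed the documents once, embed each query at request time, and rank documents by similarity: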
# Example: retrieve relevant documents for a query
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("zenlm/zen-embedding-4B")
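# Embed your documents once; unit-length vectors make the dot product equal cosine similarity
documents = ["Doc 1 text", "Doc 2 text", "Doc 3 text"]
doc_embeddings = model.encode(documents, normalize_embeddings=True)
# At query time, embed the question the same way
query = "What is semantic search?"
query_embedding = model.encode(query, normalize_embeddings=True)
# Score every document against the query and rank from most to least similar
similarities = np.dot(query_embedding, doc_embeddings.T)
top_indices = np.argsort(similarities)[::-1][:3]
for rank, idx in enumerate(top_indices, start=1):
    print(f"Rank {rank}: {documents[idx]} (score={similarities[idx]:.3f})")
Advanced Usage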
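When the corpus grows, fully sorting every score just to keep a handful of results is wasteful. A minimal sketch of top-k selection follows, using the standard library's `heapq` so it runs without dependencies. The function name and sample scores are hypothetical; the scores stand in for the query-document similarities computed during retrieval.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k(scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k highest scores, best first."""
    if k <= 0:
        return []
    # nlargest keeps only k candidates in memory instead of sorting everything.
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


if __name__ == "__main__":
    sample_scores = [0.12, 0.87, 0.45, 0.91]  # hypothetical similarity scores
    for rank, (idx, score) in enumerate(top_k(sample_scores, 2), start=1):
        print(f"Rank {rank}: document {idx} (score={score:.2f})")
```

With NumPy arrays, `np.argpartition` plays the same role; beyond what fits in memory, an approximate nearest-neighbor index in a vector database is the usual next step.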
Fine-tuning
Fine-tune a Zen embedding model on your domain-specific data:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
model = SentenceTransformer("zenlm/zen-embedding-4B")
# Prepare training pairs
train_examples = [
    InputExample(texts=["sentence1", "sentence2"], label=0.95),
    InputExample(texts=["sentence3", "sentence4"], label=0.10),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)
# model.fit is the legacy training API; recent sentence-transformers releases also offer SentenceTransformerTrainer
model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
Quantization (GGUF)
Convert and quantize for CPU-only deployment:
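# Download the weights, then convert to GGUF with llama.cpp's converter script (run from a llama.cpp checkout)
huggingface-cli download zenlm/zen-embedding-4B --local-dir ./zen-embedding-4B
python convert_hf_to_gguf.py ./zen-embedding-4B --outfile zen-embedding-4b.gguf --outtype f16
# Quantize
./llama-quantize zen-embedding-4b.gguf zen-embedding-4b-q4_k_m.gguf Q4_K_M
Performance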
Zen embedding models are evaluated on standard benchmarks including MTEB (Massive Text Embedding Benchmark), achieving competitive performance in:
- Semantic textual similarity
- Retrieval (the search step behind retrieval-augmented generation)
- Paraphrase detection
- Clustering accuracy
- Duplicate detection
See the Zen Embeddings & Retrieval whitepaper for detailed benchmark results.
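For orientation on how retrieval quality is typically scored in benchmarks of this kind, the sketch below computes nDCG@k, a ranking metric widely used for retrieval evaluation. It is a generic illustration, not the evaluation code behind the reported results. The function names and relevance labels are hypothetical, and `relevances` is assumed to hold the graded relevance of every judged document in the order the system ranked them.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top k positions."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0.0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(relevances, k) / ideal


if __name__ == "__main__":
    ranked_relevances = [0, 1, 0, 1]  # hypothetical labels, in ranked order
    print(f"nDCG@3 = {ndcg_at_k(ranked_relevances, 3):.3f}")
```

A score of 1.0 means the most relevant documents were ranked first; relevant documents pushed lower in the list lower the score.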
License
All models in the Zen Embeddings family are released under the Apache 2.0 license, which permits use, modification, and redistribution in commercial and open-source projects, subject to the license's notice and attribution requirements.