Embeddings & Retrieval

Multilingual text embeddings for semantic search, RAG, and retrieval pipelines, in sizes from 0.6B to 8B parameters.

Overview

The Zen Embeddings & Retrieval model family provides high-quality multilingual text embeddings for semantic search, retrieval-augmented generation (RAG), and similarity matching. Built on the Zen MoDE (Mixture of Distilled Experts) architecture, these models offer flexible trade-offs between accuracy and latency across 0.6B to 8B parameters, with context windows from 8K to 32K tokens.

All models in this family are Apache 2.0 licensed and available on Hugging Face.

Model Family

Model	Params	Context	HF	Paper
Zen Embedding	7.57B	8K	weights	paper
Zen Embedding 0.6B	0.6B	32K	weights	paper
Zen Embedding 0.6B (GGUF)	0.6B	8K	weights	paper
Zen Embedding 4B	4.02B	32K	weights	paper
Zen Embedding 8B	7.57B	32K	weights	paper
Zen Embedding 8B (GGUF)	8B	8K	weights	paper

Quick Start

Using Sentence Transformers (CPU / GPU)

from sentence_transformers import SentenceTransformer

# Load a Zen embedding model
model = SentenceTransformer("zenlm/zen-embedding-4B")

# Encode texts
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, embedding_dim)

# Compute similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
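
For reference, `model.similarity` scores every pair of embeddings, and Sentence Transformers uses cosine similarity unless the model is configured otherwise. The sketch below reproduces that computation in plain Python. The 3-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `similarity_matrix` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Pairwise cosine similarities, analogous to model.similarity(e, e)."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy vectors standing in for real embeddings (hypothetical values).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as the first, different length
    [0.0, 3.0, 0.0],  # orthogonal to the first two
]
matrix = similarity_matrix(toy_embeddings)
```

Because cosine similarity ignores vector length, the first two toy vectors score as identical in direction, while the third is unrelated to both.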

Using Transformers (Feature Extraction)

from transformers import AutoTokenizer, AutoModel
import torch

model_name = "zenlm/zen-embedding-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode text
texts = ["Your text here", "Another example"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    # First-token (CLS) pooling; check the model card for the pooling strategy the model expects
    embeddings = outputs.last_hidden_state[:, 0, :]

print(embeddings.shape)

Using Zen API (Cloud)

For managed embeddings via the Zen API, use the OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.hanzo.ai/v1",
    api_key="your-api-key"
)

response = client.embeddings.create(
    model="zen-embedding-4B",
    input="Your text here"
)

embedding = response.data[0].embedding
print(len(embedding))
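
OpenAI-compatible embedding endpoints generally accept a list of strings as `input`, so several texts can be embedded per request. The sketch below groups texts into batches before calling the endpoint. `batched`, `embed_all`, and the batch size of 64 are hypothetical choices; any per-request limit of the Zen API is not stated here and should be checked.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(client, texts, model="zen-embedding-4B", batch_size=64):
    """Embed `texts` in batches using an OpenAI-compatible client.

    `client` is expected to be an `openai.OpenAI` instance configured
    with the base URL and API key for the endpoint.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # Sort by index so vectors line up with the input order.
        for item in sorted(response.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

A call such as `embed_all(client, documents)` would then return one vector per document, in the same order as the input list.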

Model Selection Guide

Zen Embedding 0.6B: Edge devices, low-latency retrieval, on-device deployment
Zen Embedding 4B: Balanced quality and speed, production RAG systems, semantic search
Zen Embedding 8B: High-accuracy retrieval, demanding workloads, fine-grained semantic matching
GGUF variants: CPU inference with llama.cpp, quantized edge deployment

All models support multilingual text and produce dense embeddings suitable for vector databases (Pinecone, Weaviate, Milvus, Qdrant, etc.).
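
As a rough aid to choosing a size, the memory needed just to hold the weights is approximately the parameter count times the bytes per parameter. The sketch below applies that arithmetic to the parameter counts in the Model Family table. The bits-per-weight figures are assumptions (16 for half precision, about 4.5 as a nominal value for a 4-bit GGUF quantization), and the estimate ignores activations and runtime overhead.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Parameter counts taken from the Model Family table.
MODEL_PARAMS_B = {
    "Zen Embedding 0.6B": 0.6,
    "Zen Embedding 4B": 4.02,
    "Zen Embedding 8B": 7.57,
}

for name, params in MODEL_PARAMS_B.items():
    fp16 = weight_memory_gb(params, 16)   # half precision (assumed)
    q4 = weight_memory_gb(params, 4.5)    # nominal 4-bit quantization (assumed)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at ~4.5-bit")
```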

Features

Multilingual: Strong performance across 100+ languages
Long context: 8K–32K token support for document-level embeddings
Mixture of Experts: Efficient sparse computation in the 4B and 8B variants
Multiple formats: SafeTensors, GGUF (quantized), compatible with sentence-transformers
Apache 2.0: Fully open-source for commercial and research use
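
Documents longer than a model's context window (8K or 32K tokens, depending on the variant) have to be truncated or split before embedding. A minimal sketch of overlapping fixed-size chunking follows. It splits on whitespace as a stand-in for the model tokenizer, and `chunk_tokens`, `max_tokens`, and `overlap` are hypothetical names chosen for illustration.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split a token list into windows of at most max_tokens,
    with `overlap` tokens shared between consecutive windows."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks

# Whitespace words approximate tokens here; a real pipeline would count
# tokens with the model's own tokenizer.
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, max_tokens=4, overlap=1)
texts = [" ".join(chunk) for chunk in chunks]
```

Each resulting chunk can then be embedded separately, with the overlap reducing the chance that a relevant passage is cut in half at a chunk boundary.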

Integration with RAG

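Zen embeddings can serve as the retrieval step of a RAG pipeline: embed the documents once, embed each incoming query, and rank the documents by similarity to the query: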

# Example: retrieve relevant documents for a query
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("zenlm/zen-embedding-4B")

# Embed your documents once; unit-length vectors make the dot product equal to cosine similarity
documents = ["Doc 1 text", "Doc 2 text", "Doc 3 text"]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# At query time, embed the question
query = "What is semantic search?"
query_embedding = model.encode(query, normalize_embeddings=True)

# Compute similarities and rank from most to least similar
similarities = np.dot(doc_embeddings, query_embedding)
top_indices = np.argsort(similarities)[::-1][:3]

for rank, idx in enumerate(top_indices, start=1):
    print(f"Rank {rank}: {documents[idx]} (score {similarities[idx]:.3f})")
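
Once the documents are ranked, the generation step of RAG typically places the top results in the prompt sent to a language model. A minimal sketch of that assembly step follows, assuming a simple numbered-context template. `build_rag_prompt` and the prompt wording are hypothetical and not part of any Zen API.

```python
def build_rag_prompt(query, retrieved_docs, max_docs=3):
    """Format retrieved documents and the user question into one prompt string."""
    selected = retrieved_docs[:max_docs]
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(selected, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Documents are assumed to be already sorted by similarity, best first.
prompt = build_rag_prompt(
    "What is semantic search?",
    ["Doc 2 text", "Doc 1 text", "Doc 3 text"],
    max_docs=2,
)
print(prompt)
```

Numbering the context entries makes it possible to ask the language model to cite which document supports its answer.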

Advanced Usage

Fine-tuning

Fine-tune a Zen embedding model on your domain-specific data:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("zenlm/zen-embedding-4B")

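# Prepare training pairs; each label is the target cosine similarity for the pair.
# Two pairs are shown for brevity; real fine-tuning needs far more data.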
train_examples = [
    InputExample(texts=["sentence1", "sentence2"], label=0.95),
    InputExample(texts=["sentence3", "sentence4"], label=0.10),
]

train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=1,
    warmup_steps=100
)

Quantization (GGUF)

Convert and quantize for CPU-only deployment:

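# Download the weights, then convert to GGUF with the script from the llama.cpp repository
huggingface-cli download zenlm/zen-embedding-4B --local-dir zen-embedding-4B
python convert_hf_to_gguf.py zen-embedding-4B --outfile zen-embedding-4b.gguf --outtype f16

# Quantize
./llama-quantize zen-embedding-4b.gguf zen-embedding-4b-q4_k_m.gguf Q4_K_M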

Performance

Zen embedding models are evaluated on standard benchmarks including MTEB (Massive Text Embedding Benchmark), achieving competitive performance in:

Semantic textual similarity
Retrieval
Paraphrase detection
Clustering accuracy
Duplicate detection

See the Zen Embeddings & Retrieval whitepaper for detailed benchmark results.
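
MTEB retrieval tasks are commonly scored with nDCG@10, which rewards placing relevant documents near the top of the ranking. The sketch below computes nDCG@k for a single query from graded relevance labels. The labels are made up for illustration and are not Zen benchmark results; `dcg_at_k` and `ndcg_at_k` are hypothetical helper names.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering.

    `relevances` lists the relevance label of each returned document in
    ranked order and is assumed to cover every judged document for the query.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([1, 1, 0, 0], k=10)  # relevant documents ranked first
swapped = ndcg_at_k([0, 1, 1, 0], k=10)  # relevant documents pushed down
```

A ranking that puts every relevant document first scores 1.0, and the score drops as relevant documents move lower in the list.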

License

All models in the Zen Embeddings family are released under the Apache 2.0 license, which permits use in commercial and open-source projects subject to the license's notice and attribution conditions.
