Introducing the Ettin Reranker Family

2026-05-19 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Advanced, extended

Summary

Hugging Face has released the Ettin Reranker family: six new Sentence Transformers CrossEncoder models ranging from 17 million to 1 billion parameters, published on May 19, 2026. Built on the Ettin ModernBERT encoders, the models achieve state-of-the-art performance for their size on the MTEB(eng, v2) Retrieval and NanoBEIR benchmarks. The smallest, 17M model outperforms the 33M "ms-marco-MiniLM-L12-v2" by +0.051 NDCG@10 on MTEB, while the 1B model comes within 0.0001 NDCG@10 of its 1.54B teacher, "mxbai-rerank-large-v2". The rerankers were trained with a pointwise MSE distillation recipe on a dataset of ~143M "(query, document, teacher_score)" triples. They are also fast: the 17M model processes 7517 pairs per second on an NVIDIA H100 80GB, and running in "bfloat16" precision with unpadded Flash Attention 2 yields 1.7x-8.3x speedups. All models are released under the Apache 2.0 license.
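
The accuracy figures above are NDCG@10 scores, which reward placing highly relevant documents near the top of the first ten results. The following is a minimal sketch of the metric, assuming graded relevance labels listed in the order the reranker ranked the documents, and assuming that list contains every judged document for the query (evaluators such as trec_eval build the ideal ranking from all relevance judgments). The gain is linear here; some implementations use 2^rel - 1 instead.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain of the first k relevance labels (linear gain)."""
    # Position 1 is discounted by log2(2) = 1, position 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in the order a reranker returned four documents:
# the last two are swapped relative to the ideal order.
print(round(ndcg_at_k([3, 2, 0, 1]), 4))
```

Because the score is normalized to [0, 1], a gap such as +0.051 between the 17M Ettin model and ms-marco-MiniLM-L12-v2 is an absolute difference on that scale.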

Key takeaway

For AI engineers optimizing search or RAG systems, consider replacing your current cross-encoders with the Ettin Reranker family. These models offer higher accuracy and faster inference, especially when run in "bfloat16" with Flash Attention 2. Swapping legacy MiniLM rerankers for the 17M or 32M Ettin models is a low-risk, high-impact upgrade to both latency and search quality in retrieve-then-rerank pipelines.

Key insights

Ettin Rerankers combine state-of-the-art accuracy for their size with fast inference at every size from 17M to 1B parameters, thanks to distillation from a strong teacher and an efficient ModernBERT-based architecture.

Principles

Distillation from strong teachers can yield smaller, faster models with comparable performance.
Unpadded Flash Attention 2 boosts Transformer throughput and reduces memory use: padding tokens are removed before attention, and the full attention matrix is never materialized.
Cross-encoders enhance retrieval accuracy by jointly encoding query-document pairs.
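
Unpadding means removing padding tokens before attention, so that a batch of (query, document) pairs with very different lengths costs only as much as its real tokens. FlashAttention-2's variable-length kernels consume the packed tokens together with cumulative sequence-length offsets. Below is a plain-Python sketch of that bookkeeping, assuming a batch of token IDs with a 0/1 attention mask and treating ID 0 as padding (an assumption for the example). It illustrates the data layout only and is not the ModernBERT or Sentence Transformers implementation.

```python
from itertools import accumulate


def unpad(batch_ids, attention_mask):
    """Pack a padded batch into one flat token stream plus cumulative sequence lengths.

    Returns (flat_ids, cu_seqlens, max_seqlen): sequence i occupies
    flat_ids[cu_seqlens[i]:cu_seqlens[i + 1]], with no padding in between.
    """
    flat_ids, lengths = [], []
    for ids, mask in zip(batch_ids, attention_mask):
        kept = [tok for tok, m in zip(ids, mask) if m == 1]  # drop padding positions
        flat_ids.extend(kept)
        lengths.append(len(kept))
    cu_seqlens = [0, *accumulate(lengths)]  # running offsets into flat_ids
    return flat_ids, cu_seqlens, max(lengths, default=0)


# Three pairs of different lengths, right-padded to 6 tokens (0 = padding here).
batch_ids = [
    [101, 7, 8, 102, 0, 0],
    [101, 9, 102, 0, 0, 0],
    [101, 1, 2, 3, 4, 102],
]
mask = [[1 if tok != 0 else 0 for tok in row] for row in batch_ids]
flat_ids, cu_seqlens, max_seqlen = unpad(batch_ids, mask)
print(cu_seqlens, max_seqlen)  # [0, 4, 7, 13] 6
```

In this toy batch, 5 of the 18 padded positions are padding, so roughly 28% of a padded kernel's per-token work would be wasted. Reranking inputs vary widely in length, which is why unpadding pays off for cross-encoders.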

Method

Pointwise MSE distillation trains smaller cross-encoders to match the raw logits of a larger teacher model ("mxbai-rerank-large-v2") on a diverse dataset of ~143M "(query, document, teacher_score)" triples.
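
The core of the recipe is a regression: the student's raw logit for each pair is pushed toward the teacher's raw logit with a mean squared error loss. The following is a minimal sketch under a strong simplifying assumption: each (query, document) pair is reduced to a single hypothetical scalar feature x, so the student is a two-parameter linear model trained by plain gradient descent. A real student is a full cross-encoder trained in a deep learning framework; only the loss and the pointwise data layout carry over.

```python
def pointwise_mse(student_logits, teacher_logits):
    """Mean squared error between student and teacher raw logits, one per pair."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("need exactly one teacher score per student prediction")
    return sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits)) / len(student_logits)


def distill_linear_student(features, teacher_logits, lr=0.05, steps=500):
    """Fit a toy student s = w * x + b to teacher logits by gradient descent on MSE."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(steps):
        preds = [w * x + b for x in features]
        # For L = mean((s - t)^2), dL/ds_i = 2 * (s_i - t_i) / n.
        grads = [2 * (p - t) / n for p, t in zip(preds, teacher_logits)]
        w -= lr * sum(g * x for g, x in zip(grads, features))  # chain rule: ds_i/dw = x_i
        b -= lr * sum(grads)  # ds_i/db = 1
    return w, b


# Hypothetical per-pair features and the teacher's raw logits for those pairs.
features = [-1.0, 0.0, 1.0, 2.0]
teacher = [2.0 * x - 1.0 for x in features]
w, b = distill_linear_student(features, teacher)
print(round(w, 3), round(b, 3))  # 2.0 -1.0
```

Because the loss is pointwise, every training example is an independent (query, document, teacher_score) row with no listwise grouping, which is what lets a dataset of this size be stored and shuffled as flat triples.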

In practice

Implement retrieve-then-rerank pipelines for improved search accuracy.
Use "bfloat16" and Flash Attention 2 for optimal reranker inference speed.
Swap legacy MiniLM rerankers with Ettin 17M or 32M for better quality and latency.
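
The retrieve-then-rerank pattern, and the reason swapping rerankers is low-risk, can be sketched as a two-stage function in which the reranker is just a pluggable scoring function. This is a minimal sketch under stated assumptions: the token-overlap first stage and the length-normalized second stage are hypothetical stand-ins so the code runs offline. In a real pipeline the first stage would be a BM25 or embedding retriever and the second a Sentence Transformers `CrossEncoder`, whose `predict` method scores a list of (query, document) pairs.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    retrieve_score: Callable[[str, str], float],
    rerank_scores: Callable[[str, List[str]], Sequence[float]],
    first_stage_k: int = 100,
    top_k: int = 10,
) -> List[Tuple[int, float]]:
    """Return (corpus index, reranker score) pairs for the top_k documents."""
    # Stage 1: a cheap scorer ranks the whole corpus; keep a shortlist of candidates.
    candidates = sorted(
        range(len(corpus)),
        key=lambda i: retrieve_score(query, corpus[i]),
        reverse=True,
    )[:first_stage_k]
    # Stage 2: the (more expensive) reranker scores each (query, candidate) pair jointly.
    scores = rerank_scores(query, [corpus[i] for i in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return reranked[:top_k]


# Hypothetical toy scorers so the sketch runs without downloading a model.
def token_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_reranker(query: str, docs: List[str]) -> List[float]:
    # Overlap normalized by document length. With a real cross-encoder this would be
    # something like: lambda q, docs: model.predict([(q, d) for d in docs])
    return [token_overlap(query, d) / len(d.split()) for d in docs]


corpus = [
    "ettin reranker models",
    "cats sleep all day",
    "reranker models for search ranking pipelines",
    "ettin",
]
print(retrieve_then_rerank(
    "ettin reranker", corpus, token_overlap, toy_reranker, first_stage_k=3, top_k=2
))  # [(3, 1.0), (0, 0.6666666666666666)]
```

Because the reranker enters only through `rerank_scores`, moving from a MiniLM cross-encoder to an Ettin model changes one function and leaves retrieval, shortlist size, and downstream code untouched. When loading the model, bf16 and Flash Attention 2 are usually requested through the `model_kwargs` argument of `CrossEncoder` (for example `torch_dtype` and `attn_implementation`, passed on to Hugging Face Transformers); check the model documentation for the recommended settings.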

Topics

Ettin Reranker
Cross-Encoder Models
Model Distillation
Information Retrieval
Flash Attention 2
Sentence Transformers

Best for: NLP Engineer, AI Architect, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.