Introducing the Ettin Reranker Family
Summary
Hugging Face has released the Ettin Reranker family (published on May 19, 2026), comprising six new Sentence Transformers CrossEncoder models ranging from 17 million to 1 billion parameters. Built on Ettin ModernBERT encoders, the models are reported to achieve state-of-the-art performance for their respective sizes on the MTEB(eng, v2) Retrieval and NanoBEIR benchmarks. The smallest 17M model surpasses the 33M "ms-marco-MiniLM-L12-v2" by 0.051 NDCG@10 on MTEB, while the 1B model comes within 0.0001 NDCG@10 of its 1.54B teacher, "mxbai-rerank-large-v2". The rerankers were trained with a pointwise MSE distillation recipe on a dataset of ~143M "(query, document, teacher_score)" triples. They are also fast: the 17M model processes 7517 pairs per second on an NVIDIA H100 80GB, and "bfloat16" precision combined with unpadded Flash Attention 2 yields 1.7x-8.3x speedups. All models are released under the Apache 2.0 license.
Key takeaway
AI engineers optimizing search or RAG systems should consider replacing their current cross-encoders with the Ettin Reranker family. These models offer higher accuracy at each size and faster inference, especially when run with "bfloat16" and Flash Attention 2. Replacing legacy MiniLM rerankers with the 17M or 32M Ettin models is a low-risk, high-impact upgrade to both latency and search quality in retrieve-then-rerank pipelines.
Key insights
Ettin Rerankers combine distillation from a strong teacher with an optimized architecture to deliver state-of-the-art accuracy and speed at each model size.
Principles
- Distillation from strong teachers can yield smaller, faster models with comparable performance.
- Unpadded Flash Attention 2 significantly boosts throughput and reduces memory for Transformer models.
- Cross-encoders enhance retrieval accuracy by jointly encoding query-document pairs.
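To make the unpadding principle concrete, here is a minimal sketch of the token arithmetic behind it. It assumes that a padded batch spends compute on every position up to the longest sequence, while an unpadded batch only processes real tokens. The function names and the example batch are hypothetical, and token counts are only a rough proxy for cost, not a reproduction of the reported speedups.

```python
def padded_token_count(lengths):
    """Positions processed when every sequence is padded to the longest one."""
    return len(lengths) * max(lengths) if lengths else 0


def unpadded_token_count(lengths):
    """Positions processed when pad tokens are removed and sequences are packed."""
    return sum(lengths)


def padding_waste(lengths):
    """Fraction of padded positions that hold pad tokens rather than real tokens."""
    padded = padded_token_count(lengths)
    if padded == 0:
        return 0.0
    return 1.0 - unpadded_token_count(lengths) / padded


# Hypothetical batch of (query, document) pair lengths in tokens.
batch_lengths = [12, 40, 512, 64]

print(padded_token_count(batch_lengths))    # 4 sequences * 512 positions
print(unpadded_token_count(batch_lengths))  # only the real tokens
print(round(padding_waste(batch_lengths), 3))
```

The more the pair lengths vary within a batch, the larger the share of padded positions, which is one reason unpadding tends to help most on mixed-length reranking workloads.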
Method
Pointwise MSE distillation trains smaller cross-encoders to match, one pair at a time, the raw logits of a larger teacher model ("mxbai-rerank-large-v2") on a diverse dataset of ~143M "(query, document, teacher_score)" triples.
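The objective described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' training code: it assumes the loss is the mean squared difference between the student's logit and the teacher's score for each pair, and the helper names and toy numbers are hypothetical.

```python
def pointwise_mse_loss(student_logits, teacher_scores):
    """Mean squared error between student logits and teacher scores, pair by pair."""
    if len(student_logits) != len(teacher_scores):
        raise ValueError("student and teacher must score the same pairs")
    if not student_logits:
        raise ValueError("need at least one (query, document) pair")
    squared_errors = [(s - t) ** 2 for s, t in zip(student_logits, teacher_scores)]
    return sum(squared_errors) / len(squared_errors)


def pointwise_mse_grad(student_logits, teacher_scores):
    """Gradient of the loss with respect to each student logit."""
    n = len(student_logits)
    return [2.0 * (s - t) / n for s, t in zip(student_logits, teacher_scores)]


# Toy example: two pairs, teacher scores are raw (unbounded) logits.
teacher = [1.0, 4.0]
student = [1.0, 2.0]
loss_before = pointwise_mse_loss(student, teacher)

# One gradient step applied directly to the logits (learning rate is arbitrary).
learning_rate = 0.5
grads = pointwise_mse_grad(student, teacher)
student = [s - learning_rate * g for s, g in zip(student, grads)]
loss_after = pointwise_mse_loss(student, teacher)

print(loss_before, loss_after)
```

Because each pair contributes its own independent error term, the loss needs no grouping of documents by query, which is what distinguishes a pointwise objective from pairwise or listwise ones.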
In practice
- Implement retrieve-then-rerank pipelines for improved search accuracy.
- Use "bfloat16" and Flash Attention 2 for optimal reranker inference speed.
- Replace legacy MiniLM rerankers with Ettin 17M or 32M for better quality and latency.
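The practices above can be sketched as a small retrieve-then-rerank helper. This is a hedged sketch under stated assumptions: "rerank", "overlap_scorer", and "load_cross_encoder_scorer" are hypothetical helper names; the model id passed to the loader is a placeholder to be replaced with a real Ettin reranker id from the Hub; and the "model_kwargs" values assume a recent Sentence Transformers release, a GPU with bfloat16 support, and an installed Flash Attention 2 build. The original article may configure attention differently.

```python
from typing import Callable, List, Sequence, Tuple

ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: List[str], score_pairs: ScoreFn,
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Score each (query, candidate) pair jointly and keep the top_k candidates."""
    if not candidates:
        return []
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]


def overlap_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer for demonstration only: counts shared lowercase words."""
    return [float(len(set(q.lower().split()) & set(d.lower().split())))
            for q, d in pairs]


def load_cross_encoder_scorer(model_id: str) -> ScoreFn:
    """Assumed setup: load a CrossEncoder in bfloat16 with Flash Attention 2.

    model_id is a placeholder; this function is not called in the demo below
    because it needs a GPU, the flash-attn package, and a model download.
    """
    import torch
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(
        model_id,
        model_kwargs={
            "torch_dtype": torch.bfloat16,
            "attn_implementation": "flash_attention_2",
        },
    )
    return model.predict


# First-stage retrieval is assumed to have produced these candidates already.
retrieved = [
    "flash attention speeds up transformers",
    "cooking pasta at home",
    "attention is all you need",
]
top = rerank("flash attention", retrieved, overlap_scorer, top_k=2)
print(top)
```

Keeping the scorer behind a plain callable means a legacy reranker and a replacement can be compared by changing only the function passed to "rerank", with the first-stage retriever left untouched.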
Topics
- Ettin Reranker
- Cross-Encoder Models
- Model Distillation
- Information Retrieval
- Flash Attention 2
- Sentence Transformers
Code references
- huggingface/blog
- huggingface/sentence-transformers
- huggingface/kernels
- embeddings-benchmark/mteb
- beir-cellar/beir
Best for: NLP Engineer, AI Architect, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.