Rescaling MLM-Head for Neural Sparse Retrieval
Summary
Learned sparse retrieval (LSR) models like SPLADE, which use BERT-style masked language models (MLM) as backbone encoders, often suffer performance degradation and training collapse when stronger pretrained encoders are used. This issue stems from a scale mismatch in the MLM head, where inflated L2 norms amplify sparse activations, distort matching scores, and destabilize contrastive training. Researchers introduce a zero-cost initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This adjustment significantly improves training stability for large-norm backbones such as ModernBERT and Ettin, turning unstable runs into competitive sparse retrievers. The corrected models frequently match or surpass the classic BERT-SPLADE baseline on both in-domain and out-of-domain retrieval benchmarks, indicating that MLM-head scale calibration, not just encoder capacity, is a critical bottleneck.
Key takeaway
Machine learning engineers adapting strong pretrained encoders for learned sparse retrieval should rescale the MLM-head projection at initialization, before SPLADE training begins. This zero-cost adjustment prevents training collapse and performance degradation, enabling backbones like ModernBERT and Ettin to reach competitive sparse retrieval effectiveness. Look beyond encoder capacity alone: the scale of the MLM head also needs careful calibration for robust performance.
Key insights
MLM-head scale mismatch causes performance issues in learned sparse retrieval; a simple rescaling fixes it.
Principles
- MLM-head L2 norm impacts sparse retrieval stability.
- Scale mismatch can distort matching scores.
- Encoder capacity isn't the sole bottleneck for LSR.
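To make the scale argument concrete, here is a minimal sketch of how inflated MLM-head logits feed through a SPLADE-style activation into the matching score. It assumes the commonly used formulation (max pooling of log(1 + ReLU(logit)) over tokens, then a dot product between query and document vectors). The toy logits, the factor of 4.0, and the helper names `splade_weights`, `dot`, and `scale` are illustrative assumptions, not values or code from the original article.

```python
import math


def splade_weights(token_logits):
    """Max-pool log(1 + ReLU(logit)) over tokens for each vocabulary entry."""
    vocab_size = len(token_logits[0])
    return [
        max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        for j in range(vocab_size)
    ]


def dot(u, v):
    """Dot product used as the query-document matching score."""
    return sum(a * b for a, b in zip(u, v))


def scale(token_logits, factor):
    """Multiply every logit by a constant, mimicking a larger-norm MLM head."""
    return [[factor * x for x in row] for row in token_logits]


# Toy logits: 2 tokens x 4 vocabulary entries (made-up numbers).
query_logits = [[1.0, -0.5, 0.2, 0.0], [0.3, 0.1, -1.0, 0.4]]
doc_logits = [[0.8, 0.0, -0.2, 0.5], [0.1, 0.6, 0.3, -0.4]]

base_score = dot(splade_weights(query_logits), splade_weights(doc_logits))

# Same inputs, but with logits inflated by an assumed factor of 4.0.
inflated_score = dot(
    splade_weights(scale(query_logits, 4.0)),
    splade_weights(scale(doc_logits, 4.0)),
)

print(f"base score:     {base_score:.4f}")
print(f"inflated score: {inflated_score:.4f}")
```

In this toy setting the set of active vocabulary entries stays the same, but every activation grows, so the matching score grows with it. That is the kind of distortion the principles above describe.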
Method
Rescale the MLM-head projection by a constant factor at initialization before SPLADE training to correct scale mismatch and improve stability.
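The method can be sketched in a framework-agnostic way as a single multiplication applied to the projection matrix before training starts. This summary does not say how the constant is chosen, so the sketch below assumes one plausible option: pick the factor so that the matrix's Frobenius norm matches a reference value. The toy matrix, `target_norm`, and the helper names are hypothetical, and the treatment of the bias term is not covered here.

```python
import math


def frobenius_norm(weight):
    """L2 (Frobenius) norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in weight for x in row))


def rescale_projection(weight, factor):
    """Return a copy of the projection matrix multiplied by a constant factor."""
    return [[factor * x for x in row] for row in weight]


# Hypothetical toy MLM-head projection (vocab_size x hidden_size)
# standing in for a large-norm backbone.
head_weight = [[3.0, -4.0], [6.0, 8.0], [0.0, 5.0]]

# Assumption: the constant is chosen to hit a reference norm. The source
# only states that a constant factor is applied at initialization.
target_norm = 1.5
factor = target_norm / frobenius_norm(head_weight)

# Apply once, before any SPLADE training step.
rescaled_weight = rescale_projection(head_weight, factor)

print(f"norm before: {frobenius_norm(head_weight):.4f}")
print(f"norm after:  {frobenius_norm(rescaled_weight):.4f}")
```

Because the correction is a one-time change to the initial weights, it adds no cost during training or inference, which is consistent with the "zero-cost" description above.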
In practice
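- Apply the MLM-head rescaling once, at initialization, before SPLADE training.
- Use it when adopting large-norm backbones such as ModernBERT or Ettin.
- Expect more stable SPLADE training, with runs that previously collapsed becoming competitive.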
Topics
- Learned Sparse Retrieval
- SPLADE
- Masked Language Models
- Neural Information Retrieval
- Model Initialization
- Encoder Backbones
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.