Rescaling MLM-Head for Neural Sparse Retrieval

2026-06-17 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Information Retrieval · Depth: Expert, quick

Summary

Learned sparse retrieval (LSR) models like SPLADE, which use BERT-style masked language models (MLM) as backbone encoders, often suffer performance degradation and training collapse when stronger pretrained encoders are used. This issue stems from a scale mismatch in the MLM head, where inflated L2 norms amplify sparse activations, distort matching scores, and destabilize contrastive training. Researchers introduce a zero-cost initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This adjustment significantly improves training stability for large-norm backbones such as ModernBERT and Ettin, turning unstable runs into competitive sparse retrievers. The corrected models frequently match or surpass the classic BERT-SPLADE baseline on both in-domain and out-of-domain retrieval benchmarks, indicating that MLM-head scale calibration, not just encoder capacity, is a critical bottleneck.

Key takeaway

Machine learning engineers adapting strong pretrained encoders for learned sparse retrieval should rescale the MLM-head projection at initialization. This zero-cost adjustment prevents training collapse and performance degradation, enabling models like ModernBERT and Ettin to reach competitive sparse retrieval effectiveness. Encoder capacity alone is not enough: the MLM-head scale also needs to be calibrated for robust performance.

Key insights

MLM-head scale mismatch causes performance issues in learned sparse retrieval; a simple rescaling fixes it.

Principles

An inflated MLM-head L2 norm destabilizes contrastive training of sparse retrievers.
Scale mismatch amplifies sparse activations and distorts matching scores.
Encoder capacity isn't the sole bottleneck for LSR.
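
To make the second principle concrete, here is a minimal, framework-free sketch of how a uniformly inflated logit scale carries through to the matching score. It assumes the commonly used SPLADE formulation (max pooling of log(1 + ReLU(logit)) over tokens) and made-up toy logits; the 10x factor is purely illustrative and is not a value from the article.

```python
import math

def splade_vector(token_logits):
    """Max-pool log(1 + ReLU(logit)) over tokens for each vocabulary entry."""
    vocab_size = len(token_logits[0])
    return [
        max(math.log1p(max(row[v], 0.0)) for row in token_logits)
        for v in range(vocab_size)
    ]

def dot(a, b):
    """Sparse matching score: dot product of two vocabulary-sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def scale_logits(token_logits, factor):
    """Simulate an MLM head whose output scale is inflated by `factor`."""
    return [[factor * x for x in row] for row in token_logits]

# Toy logits: 2 tokens x 4 vocabulary entries (made-up numbers).
query_logits = [[1.0, -0.5, 0.2, 0.0], [0.3, 2.0, -1.0, 0.1]]
doc_logits = [[0.8, 1.5, -0.2, 0.0], [0.1, 0.4, 0.6, -0.3]]

base_score = dot(splade_vector(query_logits), splade_vector(doc_logits))

# Illustrative 10x inflation of both encoders' logits.
inflated_score = dot(
    splade_vector(scale_logits(query_logits, 10.0)),
    splade_vector(scale_logits(doc_logits, 10.0)),
)

print(f"base score: {base_score:.3f}, inflated score: {inflated_score:.3f}")
```

The log saturation dampens the inflation but does not cancel it, so every positive activation grows and the score grows with it.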

Method

Rescale the MLM-head projection by a constant factor at initialization before SPLADE training to correct scale mismatch and improve stability.
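
The rescaling step can be sketched without any framework as below. This is a toy illustration, not the authors' implementation. Two choices are assumptions, since the summary only states that a constant factor is used: the factor is derived from a hypothetical `reference_norm`, and the bias is scaled together with the weights so that the logits shrink by exactly that factor.

```python
import math

def mean_row_norm(weight):
    """Average L2 norm of the rows (one row per vocabulary entry)."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in weight) / len(weight)

def rescale_head(weight, bias, factor):
    """Return copies of the projection parameters multiplied by a constant."""
    new_weight = [[factor * w for w in row] for row in weight]
    new_bias = [factor * b for b in bias]  # ASSUMPTION: bias is scaled too
    return new_weight, new_bias

def project(weight, bias, hidden):
    """Vocabulary logits for one hidden state: W h + b."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]

# Toy head: 2 vocabulary entries x 2 hidden dimensions (made-up numbers).
weight = [[3.0, 4.0], [6.0, 8.0]]
bias = [0.5, -1.0]
hidden = [1.0, 2.0]

# ASSUMPTION: pick the factor so the mean row norm hits a hypothetical target.
reference_norm = 1.5
factor = reference_norm / mean_row_norm(weight)

# Applied once, before any SPLADE training step.
new_weight, new_bias = rescale_head(weight, bias, factor)

old_logits = project(weight, bias, hidden)
new_logits = project(new_weight, new_bias, hidden)
print(f"factor: {factor:.3f}")
```

Because the change happens once at initialization, it adds no parameters and no per-step compute, which is what makes the correction zero-cost.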

In practice

Apply the MLM-head rescaling at initialization, before SPLADE training begins.
Prioritize it for large-norm backbones such as ModernBERT and Ettin.
Use it to stabilize SPLADE training runs that would otherwise collapse.

Topics

Learned Sparse Retrieval
SPLADE
Masked Language Models
Neural Information Retrieval
Model Initialization
Encoder Backbones

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.