I Pruned Half of FinBERT’s Attention Heads — and It Got Better
Summary
A compression study on FinBERT, a BERT-base model pre-trained on 4.9B tokens of financial text, found that pruning nearly half of its attention heads (70 of 144, about 49%) raised Macro F1 on a 3-class sentiment task from 0.8876 to 0.8966. The study applied five compression techniques: knowledge distillation (Vanilla KD and Intermediate KD), INT8 quantization (PTQ and QAT), and structured attention-head pruning at 30% and 50% sparsity. Distillation and quantization cut model size sharply (to 76 MB and 48 MB, respectively) and CPU latency to roughly 12-20 ms, at a cost of about 13 F1 points. The pruned models kept the original 438 MB size and similar latency (roughly 264-266 ms), yet scored higher on Macro F1 than the unpruned teacher model. The improvement is attributed to the removal of low-entropy, redundant attention heads, which reduced noise and acted as a form of structured regularization.
Key takeaway
If you are an NLP engineer optimizing financial sentiment models, audit your attention heads with entropy scoring before committing to expensive compression. Iterative pruning, specifically a prune-recover loop, can both reduce model complexity and improve accuracy by removing noisy, redundant heads. If you implement knowledge distillation, apply T² scaling to the KL loss to maintain gradient magnitude, especially at high temperatures.
Key insights
Pruning redundant attention heads can improve model accuracy, especially for narrow tasks.
Principles
- Over-parameterization can introduce noise, not just redundancy.
- Iterative pruning with recovery fine-tuning is crucial for performance.
- Simpler KD objectives often outperform complex ones for divergent architectures.
Method
An iterative pruning algorithm: score attention heads by entropy-based importance, remove the lowest-importance N%, fine-tune for 3 recovery epochs, then repeat the cycle.
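The loop described above can be sketched in plain Python. This is a minimal illustration, not the article's implementation: the function names, the `recover` callback, and the toy attention maps are all hypothetical, and it assumes that importance equals mean attention entropy (so low-entropy heads are pruned first, in line with the summary). In a real run the attention maps would be collected again after each recovery round.

```python
import math

def head_entropy(attn_rows):
    """Mean Shannon entropy (in nats) over one head's attention rows."""
    total = 0.0
    for row in attn_rows:
        total += -sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(attn_rows)

def select_heads_to_prune(head_attn, active, fraction):
    """Return the lowest-scoring `fraction` of the still-active heads.

    Assumption: importance == mean entropy, so low-entropy heads go first.
    """
    n_prune = int(len(active) * fraction)
    # Sort the keys first so ties are broken deterministically.
    ranked = sorted(sorted(active), key=lambda h: head_entropy(head_attn[h]))
    return set(ranked[:n_prune])

def iterative_prune(head_attn, fraction, rounds, recover=None):
    """Repeat: score heads, prune the lowest fraction, run recovery."""
    active = set(head_attn)
    for _ in range(rounds):
        active -= select_heads_to_prune(head_attn, active, fraction)
        if recover is not None:
            # Hypothetical hook: e.g. fine-tune for 3 recovery epochs.
            recover(active)
    return active

# Toy attention maps keyed by (layer, head); each row sums to 1.
head_attn = {
    (0, 0): [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],        # one-hot, entropy 0
    (0, 1): [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]],  # uniform
    (1, 0): [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],      # sharply peaked
    (1, 1): [[0.5, 0.5, 0.0], [0.5, 0.25, 0.25]],      # moderately spread
}

survivors = iterative_prune(head_attn, fraction=0.5, rounds=1)
print(sorted(survivors))  # [(0, 1), (1, 1)]
```

With a real model, the kept set would then be turned into a head mask or passed to the framework's own head-pruning utility before the recovery fine-tune.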
In practice
- Compute entropy-based head importance to detect redundancy in early layers.
- Apply T² scaling to KL loss in distillation to preserve gradient magnitude.
- Consider Vanilla KD over Intermediate KD for architecturally dissimilar student models.
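The T² scaling point can be made concrete with a small sketch of the soft-target distillation loss. It uses only the standard library; the names `log_softmax`, `kd_loss`, and `scale_by_t2` are illustrative, not taken from the article. Gradients of the softened KL term shrink roughly as 1/T², which is why the loss is multiplied by T² to keep its contribution comparable across temperatures.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum for s in scaled]

def kd_loss(student_logits, teacher_logits, temperature, scale_by_t2=True):
    """KL(teacher || student) on temperature-softened distributions."""
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    # Multiply by T^2 so gradient magnitude does not vanish at high T.
    return kl * temperature ** 2 if scale_by_t2 else kl

# Hypothetical logits for a 3-class sentiment example.
teacher = [3.0, 0.0, -2.0]
student = [0.1, 0.2, 0.3]
for t in (1.0, 4.0, 8.0):
    print(t, kd_loss(student, teacher, t, scale_by_t2=False), kd_loss(student, teacher, t))
```

In a training loop this term is typically combined with the ordinary cross-entropy on hard labels through a weighting factor, which is not shown here.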
Topics
- FinBERT
- Attention Head Pruning
- Model Compression
- Knowledge Distillation
- Financial NLP
Best for: NLP Engineer, AI Engineer, AI Scientist, Machine Learning Engineer, Deep Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.