I Pruned Half of FinBERT’s Attention Heads — and It Got Better

2026-03-06 · Source: NLP on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Model Optimization · Depth: Advanced, medium

Summary

A compression study on FinBERT, a BERT-base model pre-trained on 4.9B tokens of financial text, found that pruning nearly half of its attention heads (70 of 144) raised Macro F1 on a 3-class sentiment task from 0.8876 to 0.8966. The study applied five compression techniques: knowledge distillation (Vanilla KD and Intermediate KD), INT8 quantization (PTQ and QAT), and structured attention pruning at 30% and 50% sparsity. Distillation and quantization shrank the model substantially (to 76 MB and 48 MB respectively) and cut CPU latency to ~12-20 ms, at the cost of ~13 F1 points. The pruned models kept the original 438 MB size and similar latency (~264-266 ms), yet outperformed the unpruned teacher model. The study attributes this improvement to the removal of low-entropy, redundant attention heads, which reduced noise and acted as a form of structured regularization.

Key takeaway

NLP engineers optimizing financial sentiment models should audit attention heads with entropy scoring before committing to expensive compression. Iterative pruning with a prune-recover loop can both reduce model complexity and improve accuracy by removing noisy, redundant heads. When implementing knowledge distillation, apply T² scaling to the KL loss to maintain gradient magnitude, especially at high temperatures.

Key insights

Pruning redundant attention heads can improve model accuracy, especially for narrow tasks.

Principles

Over-parameterization can introduce noise, not just redundancy.
Iterative pruning with recovery fine-tuning is crucial for performance.
Simpler KD objectives often outperform complex ones for divergent architectures.

Method

Iterative pruning: score attention heads by entropy-based importance, remove the lowest-scoring N%, fine-tune for 3 recovery epochs, then repeat the cycle.
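
The prune-recover cycle can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's implementation: it assumes a head's importance is its mean attention entropy (so low-entropy heads go first, in line with the summary), and `collect_attention` and `fine_tune` are hypothetical callbacks standing in for a forward pass over validation data and a recovery fine-tuning run. `step_fraction` plays the role of N%.

```python
import math
from typing import Callable, Dict, List, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)
AttentionRows = Dict[Head, List[List[float]]]  # per head: one distribution per query


def attention_entropy(row: List[float]) -> float:
    """Shannon entropy (in nats) of one attention distribution over key positions."""
    return -sum(p * math.log(p) for p in row if p > 0.0)


def score_heads(attn: AttentionRows) -> Dict[Head, float]:
    """Assumed importance score: entropy averaged over all query rows of a head."""
    return {
        head: sum(attention_entropy(r) for r in rows) / len(rows)
        for head, rows in attn.items()
    }


def iterative_prune(
    collect_attention: Callable[[Set[Head]], AttentionRows],  # hypothetical callback
    fine_tune: Callable[[Set[Head], int], None],              # hypothetical callback
    total_heads: int,
    target_fraction: float,
    step_fraction: float,
    recovery_epochs: int = 3,
) -> Set[Head]:
    """Prune the lowest-scoring heads in steps, fine-tuning after every step."""
    pruned: Set[Head] = set()
    target = round(target_fraction * total_heads)
    step = max(1, round(step_fraction * total_heads))
    while len(pruned) < target:
        # Re-score on the current model so scores reflect earlier pruning rounds.
        scores = score_heads(collect_attention(pruned))
        alive = sorted((h for h in scores if h not in pruned), key=scores.get)
        if not alive:
            break  # nothing left to prune
        n = min(step, target - len(pruned))
        pruned.update(alive[:n])
        fine_tune(pruned, recovery_epochs)  # recovery epochs before the next round
    return pruned
```

Re-scoring inside the loop is a design choice in this sketch: a head's attention pattern can shift after its neighbours are removed and the model is fine-tuned, so scores from the first pass may be stale by the second.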

In practice

Compute head importance via entropy for early layer redundancy detection.
Apply T² scaling to KL loss in distillation to preserve gradient magnitude.
Consider Vanilla KD over Intermediate KD for architecturally dissimilar student models.
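
As a rough illustration of the T² item above, the sketch below computes a temperature-softened KL distillation loss in plain Python, together with its gradient with respect to the student logits. The function names are illustrative and not taken from the referenced repository. Dividing logits by T puts a 1/T factor into the gradient, and the gap between the two softened distributions itself shrinks roughly as 1/T at high temperature, so without the T² factor the soft-target gradient shrinks roughly as 1/T².

```python
import math
from typing import List


def log_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def kd_loss(student_logits: List[float], teacher_logits: List[float],
            temperature: float) -> float:
    """T^2 * KL(teacher || student), both distributions softened by temperature."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


def kd_grad(student_logits: List[float], teacher_logits: List[float],
            temperature: float) -> List[float]:
    """Analytic gradient of kd_loss w.r.t. the student logits: T * (q - p).

    Without the T^2 factor the gradient would be (q - p) / T instead.
    """
    p = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    q = [math.exp(lq) for lq in log_softmax(student_logits, temperature)]
    return [temperature * (qi - pi) for pi, qi in zip(p, q)]
```

In a PyTorch training loop the same term is commonly written as `F.kl_div(F.log_softmax(student / T, dim=-1), F.softmax(teacher / T, dim=-1), reduction="batchmean") * T**2`, usually combined with a cross-entropy term on the hard labels; the weighting between the two is a hyperparameter the summary does not specify.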

Topics

FinBERT
Attention Head Pruning
Model Compression
Knowledge Distillation
Financial NLP

Code references

Rohanjain2312/FinCompress

Best for: NLP Engineer, AI Engineer, AI Scientist, Machine Learning Engineer, Deep Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.