Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models
Summary
Causal Attribution Pruning (CAP) is a training-free method for preserving reasoning performance in large language models during compression. CAP identifies critical attention heads by measuring their causal impact on reasoning tasks: on a small calibration set, it estimates how much performance degrades when each head is masked. These head-level causal scores are then converted into weight-level importance values that guide fine-grained weight pruning. CAP was evaluated on Llama-3-8B-Instruct and Mistral-7B-Instruct across the GSM8K, StrategyQA, and ARC-Challenge benchmarks at 10%, 20%, and 50% sparsity. Its largest reported improvement is a relative accuracy gain of up to 61% over the Wanda baseline on ARC-Challenge for Llama-3 at 20% sparsity. CAP is effective at moderate sparsity (10-20%), but at 50% sparsity it is limited by coarse MLP attribution, and it transfers less well to Mixture-of-Experts architectures.
Key takeaway
Machine Learning Engineers optimizing LLMs for reasoning tasks should consider Causal Attribution Pruning (CAP) for moderate compression. At 10-20% sparsity, CAP significantly outperforms correlational methods like Wanda and preserves reasoning accuracy on benchmarks like ARC-Challenge. However, avoid CAP for sparsity above 40% or with Mixture-of-Experts architectures, as coarse MLP attribution can cause model collapse. Prioritize task-aligned calibration for optimal results.
Key insights
Causal Attribution Pruning (CAP) uses interventional head masking to identify and protect critical attention heads for reasoning performance.
Principles
- Causal attribution via masking directly quantifies functional contribution.
- Reasoning-focused calibration aligns pruning with target capabilities.
- Weight-level pruning with head-level scores preserves fine-grained control.
Method
CAP measures the expected increase in loss when each attention head is masked on a calibration set, converts these head-level scores into weight-level importance values, and then prunes weights by importance-weighted magnitude.
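The three steps of the method can be sketched on a toy problem. This is a minimal illustration and not the paper's implementation: the loss function `toy_loss`, the per-head effects in `HEAD_EFFECT`, the weight values, and the exact score-to-importance rule (clamped head score times absolute weight, with one global threshold) are all assumptions made for the example.

```python
from statistics import median

# Hypothetical per-head contribution to the loss when a head is removed.
HEAD_EFFECT = [0.9, 0.05, 0.4]

def toy_loss(batch, mask):
    """Stand-in for a model's loss; mask[h] = 0.0 ablates head h."""
    return batch + sum(e * (1.0 - m) for e, m in zip(HEAD_EFFECT, mask))

def head_causal_scores(loss_fn, num_heads, batches):
    """Score each head by the median loss increase observed when it is masked."""
    keep_all = [1.0] * num_heads
    scores = []
    for h in range(num_heads):
        mask = list(keep_all)
        mask[h] = 0.0
        deltas = [loss_fn(b, mask) - loss_fn(b, keep_all) for b in batches]
        scores.append(median(deltas))
    return scores

def prune_by_weighted_magnitude(weights_by_head, scores, sparsity, eps=1e-8):
    """Zero the weights with the lowest (head score * |weight|) importance.

    Assumed conversion rule: every weight inherits its head's score, clamped
    at zero. Ties at the threshold are all pruned.
    """
    importance = [
        [(max(s, 0.0) + eps) * abs(w) for w in ws]
        for ws, s in zip(weights_by_head, scores)
    ]
    flat = sorted(v for row in importance for v in row)
    n_prune = int(sparsity * len(flat))
    if n_prune == 0:
        return [list(ws) for ws in weights_by_head]
    threshold = flat[n_prune - 1]
    return [
        [0.0 if imp <= threshold else w for w, imp in zip(ws, imps)]
        for ws, imps in zip(weights_by_head, importance)
    ]

# Toy calibration "batches" and two weights per head.
batches = [1.0, 2.0, 3.0]
weights = [[0.5, -0.2], [0.6, -0.7], [0.3, 0.1]]

scores = head_causal_scores(toy_loss, num_heads=3, batches=batches)
pruned = prune_by_weighted_magnitude(weights, scores, sparsity=0.5)
print(scores)
print(pruned)
```

In this toy setting, the large weights of the low-impact second head are removed while the smaller weight `-0.2` in the high-impact first head survives, which plain magnitude pruning would not do.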
In practice
- Use CAP for 10-20% sparsity to preserve LLM reasoning.
- Calibrate pruning on task-specific data for better alignment.
- Employ median aggregation for robust causal score estimation.
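The case for median aggregation can be seen in a small numeric example. The per-batch loss increases below are invented for illustration; the last value stands in for a single atypical calibration batch.

```python
from statistics import mean, median

# Hypothetical loss increases for one masked head across five calibration
# batches; the final batch is an outlier.
deltas = [0.02, 0.03, 0.01, 0.02, 1.50]

mean_score = mean(deltas)      # pulled upward by the single outlier batch
median_score = median(deltas)  # reflects the typical batch

print(mean_score, median_score)
```

With the mean, this head would look far more critical than four of the five batches suggest; the median keeps its score close to the typical effect.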
Topics
- Causal Attribution Pruning
- LLM Pruning
- Attention Heads
- Reasoning Benchmarks
- Llama-3-8B-Instruct
- Mistral-7B-Instruct
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.