Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Causal Attribution Pruning (CAP) is a training-free method for compressing large language models while preserving their reasoning performance. CAP identifies critical attention heads by their causal impact on reasoning tasks: it estimates how much performance degrades when each head is masked on a small calibration set. These head-level causal scores are then converted into weight-level importance values that guide fine-grained weight pruning. CAP was evaluated on Llama-3-8B-Instruct and Mistral-7B-Instruct across the GSM8K, StrategyQA, and ARC-Challenge benchmarks at 10%, 20%, and 50% sparsity. It improved on the Wanda baseline, with relative accuracy gains of up to 61% on ARC-Challenge for Llama-3 at 20% sparsity. CAP is effective at moderate sparsity (10-20%) but is limited at 50% sparsity by coarse MLP attribution, and it transfers less well to Mixture-of-Experts architectures.
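
To make the "relative accuracy gain" figure concrete, here is the conventional definition, assuming the summary uses it (the summary does not report the underlying absolute accuracies, so none are filled in here):

```latex
\text{relative gain}
  = \frac{\mathrm{Acc}_{\mathrm{CAP}} - \mathrm{Acc}_{\mathrm{Wanda}}}{\mathrm{Acc}_{\mathrm{Wanda}}}
```

Under this reading, a 61% relative gain means CAP's accuracy is 1.61 times the baseline's in that setting, not 61 percentage points higher.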

Key takeaway

Machine Learning Engineers optimizing LLMs for reasoning tasks should consider Causal Attribution Pruning (CAP) for moderate compression. CAP significantly outperforms correlational methods like Wanda at 10-20% sparsity, preserving reasoning accuracy on benchmarks like ARC-Challenge. Avoid CAP for sparsity above 40% or with Mixture-of-Experts architectures, however, as coarse MLP attribution can cause model collapse. For best results, calibrate on data aligned with the target task.

Key insights

Causal Attribution Pruning (CAP) uses interventional head masking to identify and protect critical attention heads for reasoning performance.

Principles

Causal attribution via masking directly quantifies functional contribution.
Reasoning-focused calibration aligns pruning with target capabilities.
Weight-level pruning with head-level scores preserves fine-grained control.

Method

CAP measures expected loss increase when masking attention heads on a calibration set, converts scores to weight importance, then prunes by importance-weighted magnitude.
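
The three steps can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the "heads" are plain linear maps standing in for attention heads, the loss is mean squared error, and the names `head_outputs`, `causal_head_scores`, and `prune_by_weighted_magnitude` are hypothetical. The importance rule used here (head score times absolute weight) is one plausible reading of "importance-weighted magnitude", not a formula confirmed by the source.

```python
import numpy as np

def head_outputs(x, heads, mask):
    # x: (n, d_in); heads: (H, d_in, d_out); mask: (H,) with 1 = keep, 0 = masked.
    # The toy model output is the mask-weighted sum of per-head linear maps.
    return np.einsum("nd,hdo,h->no", x, heads, mask)

def loss(x, y, heads, mask):
    # Mean squared error on the calibration batch.
    return float(np.mean((head_outputs(x, heads, mask) - y) ** 2))

def causal_head_scores(x, y, heads):
    # Interventional score: loss with head h masked minus the unmasked loss.
    n_heads = heads.shape[0]
    full = np.ones(n_heads)
    base = loss(x, y, heads, full)
    scores = np.empty(n_heads)
    for h in range(n_heads):
        mask = full.copy()
        mask[h] = 0.0
        scores[h] = loss(x, y, heads, mask) - base
    return scores

def prune_by_weighted_magnitude(heads, scores, sparsity, eps=1e-8):
    # Assumed conversion: weight importance = head score * |weight|.
    # Scores are floored at eps so heads with non-positive scores keep a valid ordering.
    importance = np.maximum(scores, eps)[:, None, None] * np.abs(heads)
    k = int(sparsity * importance.size)  # number of weights to zero out
    order = np.argsort(importance, axis=None)  # flat indices, least important first
    pruned = heads.copy().ravel()
    pruned[order[:k]] = 0.0
    return pruned.reshape(heads.shape)

# Synthetic calibration data: 4 heads, with head 0 made deliberately dominant.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 6))
heads = rng.normal(size=(4, 6, 3))
heads[0] *= 5.0
y = head_outputs(x, heads, np.ones(4))  # targets come from the dense model

scores = causal_head_scores(x, y, heads)
pruned = prune_by_weighted_magnitude(heads, scores, sparsity=0.2)
```

Because the score multiplies every weight in a head, weights in high-impact heads are protected even when their raw magnitude is small, while pruning still happens per weight rather than per head.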

In practice

Use CAP for 10-20% sparsity to preserve LLM reasoning.
Calibrate pruning on task-specific data for better alignment.
Employ median aggregation for robust causal score estimation.
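
A small example shows why median aggregation can be more robust than a mean. It assumes the aggregation runs over per-example loss increases on the calibration set, which the summary does not specify; the numbers are made up for illustration and `aggregate_scores` is a hypothetical helper.

```python
import numpy as np

def aggregate_scores(deltas, how="median"):
    # deltas: (n_examples, n_heads) loss increase per calibration example and head.
    if how == "median":
        return np.median(deltas, axis=0)
    if how == "mean":
        return np.mean(deltas, axis=0)
    raise ValueError(f"unknown aggregation: {how}")

# Illustrative values only: head 0 matters consistently, head 1 barely matters
# except on a single outlier example.
deltas = np.array([
    [0.10, 0.02],
    [0.12, 0.03],
    [0.11, 0.01],
    [0.09, 0.02],
    [0.10, 5.00],  # outlier for head 1
])

median_scores = aggregate_scores(deltas, "median")
mean_scores = aggregate_scores(deltas, "mean")
```

With the median, head 0 ranks as more important than head 1; with the mean, the single outlier flips the ranking.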

Topics

Causal Attribution Pruning
LLM Pruning
Attention Heads
Reasoning Benchmarks
Llama-3-8B-Instruct
Mistral-7B-Instruct

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.