Two-Stage Regularization-Based Structured Pruning for LLMs

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

ELDeR (Efficient LLMs through Data-Driven Regularized Layer-wise Pruning) is a pruning paradigm designed to reduce the computational and memory costs of Large Language Models (LLMs) without significant performance degradation. Traditional methods prune first and then rely on costly recovery fine-tuning (RFT) to repair the damage; ELDeR instead regularizes first and prunes afterward. It iteratively learns a weight for each transformer layer using a small dataset, then applies $\ell_{1}$-norm or $\ell_{2}$-norm regularization to the difference between the input and output of the layers with smaller weights. Shrinking that difference pushes those layers toward an identity mapping and forces the information they carry into the remaining layers, so removing them afterward loses little. Experiments on LLaMA2-7B, LLaMA2-13B, LLaMA3-8B, OPT-2.7B, OPT-13B, and Phi-2 show that ELDeR achieves better perplexity and accuracy than other layer-wise structured pruning methods such as SLEB, ShortGPT, and LaCo, while significantly reducing the computational cost of RFT. For instance, ELDeR reduced LLaMA2-7B's perplexity by 20% compared to ShortGPT and achieved a 75% throughput increase and a 46% latency reduction on OPT-13B at a 50% pruning ratio.

Key takeaway

For AI Engineers and Research Scientists optimizing LLM deployment, ELDeR is an alternative to prune-then-finetune pipelines. Its regularization-then-prune paradigm delivers substantial compression and acceleration (e.g., 1.75x throughput on OPT-13B at a 50% pruning ratio) while maintaining performance on generation and zero-shot tasks, often without extensive recovery fine-tuning. The result is lower computational overhead and more efficient use of resources when deploying large models.
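
The percentages quoted for OPT-13B translate into multiplicative factors with simple arithmetic. The sketch below only restates the reported 75% throughput increase and 46% latency reduction as factors; the helper names are hypothetical and not taken from the paper.

```python
def throughput_factor(percent_increase: float) -> float:
    """Multiplicative throughput factor for a given percentage increase."""
    return 1.0 + percent_increase / 100.0


def latency_factor(percent_reduction: float) -> float:
    """Fraction of the baseline latency that remains after a percentage reduction."""
    return 1.0 - percent_reduction / 100.0


if __name__ == "__main__":
    # Figures reported for OPT-13B at a 50% pruning ratio.
    print(f"throughput: {throughput_factor(75):.2f}x of baseline")  # 1.75x
    print(f"latency:    {latency_factor(46):.2f}x of baseline")     # 0.54x
```

A 75% increase is therefore the same claim as the 1.75x throughput figure, and a 46% latency reduction means each request takes roughly 0.54 of its original time.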

Key insights

Regularization before pruning can effectively transfer information, preserving LLM performance and reducing fine-tuning needs.

Principles

Iterative layer weight learning is crucial for pruning effectiveness.
Regularization can mitigate information loss during pruning.
High input-output similarity in layers facilitates information transfer.
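
One way to make "input-output similarity" concrete is to compare a layer's input hidden state with its output hidden state. The sketch below uses cosine similarity on plain Python lists. Cosine similarity is an assumption chosen for illustration, since the summary does not say which similarity measure the paper uses, and `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_similarity(x_in: Sequence[float], x_out: Sequence[float]) -> float:
    """Cosine similarity between a layer's input and output hidden states.

    Values near 1.0 mean the layer changes the direction of its input
    very little, i.e. it behaves almost like an identity mapping.
    """
    dot = sum(a * b for a, b in zip(x_in, x_out))
    norm_in = math.sqrt(sum(a * a for a in x_in))
    norm_out = math.sqrt(sum(b * b for b in x_out))
    if norm_in == 0.0 or norm_out == 0.0:
        return 0.0  # similarity is undefined for zero vectors; report 0.0
    return dot / (norm_in * norm_out)


if __name__ == "__main__":
    hidden_in = [1.0, 2.0, 3.0]
    hidden_out = [1.1, 2.0, 2.9]  # toy output that barely differs from the input
    print(cosine_similarity(hidden_in, hidden_out))
```

In a real model the comparison would run over hidden states collected from calibration samples rather than over toy vectors.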

Method

ELDeR iteratively learns layer weights with an $\ell_{1}$-norm loss, applies $\ell_{1}$-norm or $\ell_{2}$-norm regularization to the input-output difference of the low-weight layers, and then prunes those layers.
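
The selection and penalty steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `strength` coefficient, the use of an unsquared $\ell_{2}$ norm, and the rule of picking the lowest-weight layers up to the pruning ratio are all hypothetical choices made for the example.

```python
import math
from typing import List, Sequence, Tuple


def select_low_weight_layers(layer_weights: Sequence[float],
                             pruning_ratio: float) -> List[int]:
    """Indices of the layers with the smallest learned weights (assumed rule)."""
    num_to_prune = int(len(layer_weights) * pruning_ratio)
    order = sorted(range(len(layer_weights)), key=lambda i: layer_weights[i])
    return sorted(order[:num_to_prune])


def difference_penalty(x_in: Sequence[float], x_out: Sequence[float],
                       norm: str = "l1") -> float:
    """Norm of (output - input) for a single layer."""
    diffs = [o - i for i, o in zip(x_in, x_out)]
    if norm == "l1":
        return sum(abs(d) for d in diffs)
    if norm == "l2":
        return math.sqrt(sum(d * d for d in diffs))
    raise ValueError(f"unsupported norm: {norm}")


def regularization_term(hidden_pairs: Sequence[Tuple[Sequence[float], Sequence[float]]],
                        selected: Sequence[int],
                        norm: str = "l1",
                        strength: float = 1.0) -> float:
    """Sum of input-output difference penalties over the selected layers."""
    return strength * sum(
        difference_penalty(*hidden_pairs[i], norm=norm) for i in selected
    )


if __name__ == "__main__":
    weights = [0.9, 0.1, 0.5, 0.2]           # toy learned layer weights
    pairs = [([1.0, 2.0], [1.5, 2.5])] * 4   # toy (input, output) per layer
    chosen = select_low_weight_layers(weights, pruning_ratio=0.5)
    print(chosen)                            # layers 1 and 3 have the lowest weights
    print(regularization_term(pairs, chosen, norm="l1"))
```

During training, a term like this would be added to the task loss so that the selected layers drift toward an identity mapping before they are removed.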

In practice

Use small datasets (e.g., 128 samples) for layer weight learning.
Prioritize iterative over one-shot layer weight learning for better performance.
Use either the $\ell_{1}$-norm or the $\ell_{2}$-norm for the regularization step that precedes pruning.

Topics

Large Language Models
Structured Pruning
Layer-wise Pruning
Regularization
Model Compression

Code references

tatsu-lab/stanford_alpaca

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.