Two-Stage Regularization-Based Structured Pruning for LLMs
Summary
ELDeR (Efficient LLMs through Data-Driven Regularized Layer-wise Pruning) is a pruning paradigm designed to reduce the computational and memory costs of Large Language Models (LLMs) without significant performance degradation. Traditional prune-then-finetune methods remove parameters first and then often require costly recovery fine-tuning (RFT); ELDeR instead follows a regularization-then-prune approach. It iteratively learns a weight for each transformer layer using a small dataset, then applies $\ell_{1}$-norm or $\ell_{2}$-norm regularization to the difference between the input and output of the layers with smaller weights. Penalizing this difference pushes the information carried by those layers into the remaining layers, so removing them afterward loses less. Experiments on LLaMA2-7B, LLaMA2-13B, LLaMA3-8B, OPT-2.7B, OPT-13B, and Phi-2 show that ELDeR achieves better perplexity and accuracy than other layer-wise structured pruning methods such as SLEB, ShortGPT, and LaCo, while significantly reducing the computational cost of RFT. For instance, ELDeR reduced LLaMA2-7B's perplexity by 20% compared to ShortGPT and achieved a 75% throughput increase and a 46% latency reduction on OPT-13B at a 50% pruning ratio.
Key takeaway
For AI engineers and research scientists optimizing LLM deployment, ELDeR offers an alternative to the traditional prune-then-finetune pipeline. Its regularization-then-prune paradigm compresses and accelerates models (e.g., 1.75x throughput on OPT-13B at a 50% pruning ratio) while maintaining performance across generation and zero-shot tasks, often without extensive recovery fine-tuning. This reduces computational overhead and makes deploying large models more resource-efficient.
Key insights
Regularization before pruning can effectively transfer information, preserving LLM performance and reducing fine-tuning needs.
Principles
- Iterative layer weight learning is crucial for pruning effectiveness.
- Regularization can mitigate information loss during pruning.
- High input-output similarity in layers facilitates information transfer.
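The last principle can be made concrete with a small measurement. The sketch below is an illustration, not the paper's metric: it assumes cosine similarity as the similarity measure, and the function name `io_cosine_similarity` and the array shapes are hypothetical.

```python
import numpy as np


def io_cosine_similarity(layer_input: np.ndarray, layer_output: np.ndarray) -> float:
    """Mean per-token cosine similarity between a layer's input and output.

    Both arrays are assumed to have shape (..., hidden_size); leading
    dimensions (batch, sequence) are flattened into a list of token vectors.
    """
    hidden = layer_input.shape[-1]
    x = layer_input.reshape(-1, hidden)
    y = layer_output.reshape(-1, hidden)
    dots = np.sum(x * y, axis=1)
    norms = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    # Guard against division by zero for all-zero token vectors.
    return float(np.mean(dots / np.maximum(norms, 1e-12)))


# Toy usage: a layer that barely changes its input scores close to 1.
rng = np.random.default_rng(0)
hidden_in = rng.normal(size=(4, 8))
hidden_out = hidden_in + 0.01 * rng.normal(size=(4, 8))
print(io_cosine_similarity(hidden_in, hidden_out))
```

A layer whose score is close to 1 changes its input very little, which is the situation in which removing it should cost the least.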
Method
ELDeR iteratively learns layer weights with $\ell_{1}$-norm loss, then applies $\ell_{1}$-norm or $\ell_{2}$-norm regularization to the input-output difference of low-weight layers, followed by pruning.
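A minimal sketch of the regularization term described above, under stated assumptions: the layer weights are taken as already learned, hidden states are plain NumPy arrays, and the names `select_low_weight_layers` and `residual_penalty` are hypothetical rather than taken from the paper's code. How the penalty is combined with the task loss (for example, through a coefficient) is not specified in this summary and is left out.

```python
import numpy as np


def select_low_weight_layers(layer_weights: np.ndarray, num_prune: int) -> list:
    """Return the indices of the `num_prune` layers with the smallest weights."""
    order = np.argsort(layer_weights)  # ascending: smallest weights first
    return sorted(int(i) for i in order[:num_prune])


def residual_penalty(layer_inputs, layer_outputs, selected, norm: str = "l2") -> float:
    """Sum of l1 or l2 norms of (output - input) over the selected layers.

    Driving this penalty toward zero makes each selected layer behave like
    an identity map, so dropping it later changes the model's output less.
    """
    total = 0.0
    for i in selected:
        diff = layer_outputs[i] - layer_inputs[i]
        if norm == "l1":
            total += float(np.abs(diff).sum())
        elif norm == "l2":
            total += float(np.sqrt((diff ** 2).sum()))
        else:
            raise ValueError("norm must be 'l1' or 'l2'")
    return total


# Toy usage with four layers and two-dimensional hidden states.
weights = np.array([0.9, 0.1, 0.5, 0.2])
inputs = [np.zeros(2) for _ in range(4)]
outputs = [np.ones(2), np.array([3.0, 4.0]), np.ones(2), np.array([1.0, 0.0])]
chosen = select_low_weight_layers(weights, num_prune=2)
print(chosen, residual_penalty(inputs, outputs, chosen, norm="l2"))
```

Once the penalty has been minimized during training, pruning amounts to removing the selected layers from the model.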
In practice
- Use small datasets (e.g., 128 samples) for layer weight learning.
- Prioritize iterative over one-shot layer weight learning for better performance.
- Consider $\ell_{1}$-norm or $\ell_{2}$-norm regularization on low-weight layers before pruning them.
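This summary does not spell out how the iterative procedure works, so the sketch below illustrates only the general difference between scoring layers once and re-scoring after each selection. Everything in it is an assumption made for illustration: `score_fn` is a hypothetical callback standing in for whatever produces layer weights from the small calibration set, and `toy_scores` is a made-up example in which two layers are redundant with each other.

```python
from typing import Callable, Dict, List


def one_shot_selection(score_fn: Callable[[List[int]], Dict[int, float]],
                       num_layers: int, num_prune: int) -> List[int]:
    """Score every layer once and take the `num_prune` lowest-scoring ones."""
    scores = score_fn(list(range(num_layers)))
    ranked = sorted(range(num_layers), key=lambda i: scores[i])
    return sorted(ranked[:num_prune])


def iterative_selection(score_fn: Callable[[List[int]], Dict[int, float]],
                        num_layers: int, num_prune: int) -> List[int]:
    """Re-score the remaining layers after each selection."""
    remaining = list(range(num_layers))
    selected = []
    for _ in range(num_prune):
        scores = score_fn(remaining)
        worst = min(remaining, key=lambda i: scores[i])
        selected.append(worst)
        remaining.remove(worst)
    return sorted(selected)


def toy_scores(remaining: List[int]) -> Dict[int, float]:
    """Hypothetical scores: layer 2 only looks unimportant while layer 1 exists."""
    base = {0: 0.9, 1: 0.1, 2: 0.2, 3: 0.5}
    if 1 not in remaining:
        base[2] = 0.8  # layer 2 takes over layer 1's role once layer 1 is gone
    return {i: base[i] for i in remaining}


print(one_shot_selection(toy_scores, num_layers=4, num_prune=2))
print(iterative_selection(toy_scores, num_layers=4, num_prune=2))
```

In the toy case, one-shot selection marks both redundant layers for removal, while the iterative variant notices that the second one becomes important once the first is selected and picks a different layer instead.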
Topics
- Large Language Models
- Structured Pruning
- Layer-wise Pruning
- Regularization
- Model Compression
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.