GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
Summary
GRASPrune is a structured pruning framework designed to reduce the serving costs of large language models (LLMs) by jointly pruning feed-forward network (FFN) channels and KV head groups. Applied post-pretraining, GRASPrune operates under a single global budget, learning lightweight gate scores with a projected straight-through estimator to enforce a hard mask at each step while keeping backbone weights frozen. After mask fixation, the framework calibrates and folds scaling factors into the pruned weights, yielding a smaller, dense checkpoint without additional inference parameters. For example, on LLaMA-2-7B, GRASPrune removed 50% of parameters, achieving 12.18 perplexity on WikiText-2 and competitive zero-shot accuracy across five benchmarks. This process required only four epochs on 512 unlabeled calibration sequences using a single NVIDIA A100 80GB GPU, without full model fine-tuning.
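To illustrate why folding leaves no extra inference parameters, here is a minimal sketch on a toy feed-forward block. It is not the paper's implementation. It assumes a ReLU activation in place of LLaMA's gated FFN, and one calibrated scale per kept channel applied after the activation. The function names and the `mask` and `scale` variables are all hypothetical.

```python
def _hidden(x, w_up):
    # Toy up-projection followed by ReLU (a stand-in for the real activation).
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_up]

def ffn_masked(x, w_up, w_down, mask, scale):
    """Reference path: compute every channel, then gate it by mask * scale."""
    gated = [h * m * s for h, m, s in zip(_hidden(x, w_up), mask, scale)]
    return [sum(w * g for w, g in zip(row, gated)) for row in w_down]

def fold(w_up, w_down, mask, scale):
    """Drop pruned channels and absorb each kept channel's scale into w_down."""
    kept = [i for i, m in enumerate(mask) if m]
    w_up_small = [w_up[i] for i in kept]                      # fewer rows
    w_down_small = [[row[i] * scale[i] for i in kept] for row in w_down]
    return w_up_small, w_down_small

def ffn_dense(x, w_up, w_down):
    """Pruned model: plain dense FFN, no mask or scale tensors needed."""
    hidden = _hidden(x, w_up)
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_down]

x = [1.0, -2.0]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]   # 4 hidden channels
w_down = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, -1.0]]
mask = [1, 0, 1, 1]                                        # channel 1 is pruned
scale = [1.5, 1.0, 0.5, 2.0]                               # assumed calibrated values

reference = ffn_masked(x, w_up, w_down, mask, scale)
w_up_small, w_down_small = fold(w_up, w_down, mask, scale)
folded = ffn_dense(x, w_up_small, w_down_small)
```

In this toy setting the folded dense block reproduces the masked-and-scaled output while storing only three of the four channels.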
Key takeaway
For AI engineers optimizing LLM deployment costs, GRASPrune offers a way to cut model size and memory footprint without full fine-tuning. On LLaMA-2-7B it removed 50% of parameters using only 512 unlabeled calibration sequences and a single A100 GPU, which translates into more efficient inference and lower serving costs.
Key insights
GRASPrune makes LLM serving more efficient by combining structured pruning under a single global budget with post-pruning calibration.
Principles
- Prune FFN channels and KV head groups jointly.
- Enforce budget constraints during learning, not after.
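One way to picture a single global budget shared by both kinds of unit is a projection that ranks all FFN channels and KV head groups together and keeps the best-scoring ones that fit. The sketch below is an assumption about how such a projection could look, not the paper's procedure: it is a greedy selection over parameter costs, and `project_to_budget`, `scores`, `costs`, and `budget` are hypothetical names.

```python
def project_to_budget(scores, costs, budget):
    """Return a hard 0/1 mask over all prunable units.

    Units are visited from highest to lowest score and kept whenever
    their cost still fits in the remaining global budget.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    mask = [0] * len(scores)
    spent = 0
    for i in order:
        if spent + costs[i] <= budget:
            mask[i] = 1
            spent += costs[i]
    return mask

# Three FFN channels (cost 3 each) and two KV head groups (cost 8 each)
# compete in one pool; the costs are illustrative only.
scores = [0.9, 0.1, 0.5, 0.8, 0.3]
costs = [3, 3, 3, 8, 8]
mask = project_to_budget(scores, costs, budget=14)
kept_cost = sum(c for c, m in zip(costs, mask) if m)
```

Because the projection runs at every step, the mask never exceeds the budget during learning, which is the point of the second principle.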
Method
GRASPrune learns gate scores with a projected straight-through estimator to enforce a hard mask, then calibrates and folds scaling factors into pruned weights to create a smaller, dense model.
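The gate-learning step can be sketched on a toy problem. This is an illustrative simplification rather than the paper's training loop: the frozen backbone is reduced to fixed per-unit `contributions`, the projection is a plain top-k with uniform costs, and the loss is a squared error against a `target`. All names are hypothetical. The straight-through part is the update line, which applies the gradient with respect to the hard mask directly to the scores.

```python
def top_k_mask(scores, k):
    """Projection: hard mask that keeps exactly the k highest-scoring units."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]

def ste_step(scores, contributions, target, k, lr):
    """One update of the gate scores; the contributions stay frozen."""
    mask = top_k_mask(scores, k)                       # forward pass uses the hard mask
    y = sum(m * a for m, a in zip(mask, contributions))
    err = y - target
    loss = 0.5 * err * err
    grad_mask = [err * a for a in contributions]       # dLoss/dmask
    # Straight-through estimator: treat dmask/dscore as 1, so the mask
    # gradient is applied to the scores unchanged.
    new_scores = [s - lr * g for s, g in zip(scores, grad_mask)]
    return new_scores, mask, loss

scores = [0.5, 0.4, 0.3, 0.2]
contributions = [1.0, 1.0, 3.0, 0.0]
history = []
for _ in range(2):
    scores, mask, loss = ste_step(scores, contributions, target=4.0, k=2, lr=0.1)
    history.append((mask, loss))
```

In this toy run the first step keeps units 0 and 1, and the score update then promotes unit 2 into the mask, so the second step hits the target while still keeping exactly two units.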
In practice
- Apply GRASPrune to LLaMA-2-7B for 50% parameter reduction.
- Utilize 512 unlabeled sequences for calibration.
Topics
- GRASPrune
- Structured Pruning
- Large Language Models
- Global Gating
- FFN Channels Pruning
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.