GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

2026-04-21 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

GRASPrune is a structured pruning framework designed to reduce the serving costs of large language models (LLMs) by jointly pruning feed-forward network (FFN) channels and KV head groups. Applied post-pretraining, GRASPrune operates under a single global budget, learning lightweight gate scores with a projected straight-through estimator to enforce a hard mask at each step while keeping backbone weights frozen. After mask fixation, the framework calibrates and folds scaling factors into the pruned weights, yielding a smaller, dense checkpoint without additional inference parameters. For example, on LLaMA-2-7B, GRASPrune removed 50% of parameters, achieving 12.18 perplexity on WikiText-2 and competitive zero-shot accuracy across five benchmarks. This process required only four epochs on 512 unlabeled calibration sequences using a single NVIDIA A100 80GB GPU, without full model fine-tuning.

Key takeaway

For AI engineers optimizing LLM deployment costs, GRASPrune reduces model size and memory footprint without full fine-tuning. On LLaMA-2-7B it removes 50% of parameters using only 512 unlabeled calibration sequences and a single A100 GPU, which makes inference more efficient and lowers serving costs.

Key insights

GRASPrune enables efficient LLM serving through structured pruning under a single global budget, followed by post-pruning calibration.

Principles

Prune FFN channels and KV head groups jointly.
Enforce budget constraints during learning, not after.
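
To make the two principles concrete, here is a minimal sketch of one way a single global budget could be enforced across both unit types at once. It is an illustration under stated assumptions, not the procedure from the original article: the Unit record, the project_to_budget function, and the greedy score-ordered selection rule are all hypothetical, and the parameter costs are toy numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str     # hypothetical identifier, e.g. "ffn0" or "kv0"
    score: float  # gate score; higher means more important (assumption)
    cost: int     # parameters this unit occupies if it is kept


def project_to_budget(units, keep_budget):
    """Greedily keep the highest-scoring units whose total cost fits the budget.

    FFN channels and KV head groups compete in one ranking, so the budget
    is global rather than per layer or per unit type.
    """
    kept, used = set(), 0
    for unit in sorted(units, key=lambda u: u.score, reverse=True):
        if used + unit.cost <= keep_budget:
            kept.add(unit.name)
            used += unit.cost
    return kept, used
```

Calling project_to_budget on a mixed list of FFN and KV units returns the names that survive and the parameters they use. Because the check runs for every candidate mask, the budget holds during learning rather than being imposed afterwards. Note that this greedy rule can skip an expensive unit and still admit cheaper, lower-scoring ones.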

Method

GRASPrune learns gate scores with a projected straight-through estimator to enforce a hard mask, then calibrates and folds scaling factors into pruned weights to create a smaller, dense model.
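
The gate-learning step can be sketched as follows. This is a toy version of a projected straight-through update, written under assumptions: the function names hard_mask and ste_step are hypothetical, the projection keeps a fixed count k of units instead of a parameter budget, and the gradient with respect to the mask is supplied by the caller rather than computed from a frozen backbone.

```python
def hard_mask(scores, k):
    """Projection: 1.0 for the k highest-scoring units, 0.0 for the rest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]


def ste_step(scores, grad_wrt_mask, k, lr=0.1):
    """One straight-through update of the gate scores.

    The forward pass would use hard_mask(scores, k). In the backward pass
    the non-differentiable projection is treated as the identity, so the
    gradient with respect to the mask is applied directly to the scores.
    The mask is then re-projected, so it stays hard at every step.
    """
    new_scores = [s - lr * g for s, g in zip(scores, grad_wrt_mask)]
    return new_scores, hard_mask(new_scores, k)
```

With scores [0.5, 0.4, 0.3], k = 2, and a mask gradient of [1.0, 0.0, -1.0], a step with lr = 0.2 moves the first unit out of the kept set and the third unit in. Only the scores change; no backbone weight is touched.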

In practice

Apply GRASPrune to LLaMA-2-7B for a 50% parameter reduction.
Use 512 unlabeled sequences for calibration (four epochs on a single NVIDIA A100 80GB GPU in the reported setup).
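
Once the mask is fixed and scaling factors are calibrated, folding them into the weights is what yields a smaller dense checkpoint with no extra inference parameters. The sketch below shows the idea for an FFN down-projection. It assumes one scale per hidden channel applied to the activation entering the down-projection; that placement, and the names matvec, gated_output, and fold_scales, are illustrative assumptions rather than details confirmed by the article.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gated_output(W_down, hidden, keep, scales):
    """Reference: full-size weights with the mask and scales applied at run time."""
    kept = set(keep)
    gated = [h * scales[j] if j in kept else 0.0 for j, h in enumerate(hidden)]
    return matvec(W_down, gated)


def fold_scales(W_down, keep, scales):
    """Drop pruned columns and multiply each kept column by its scale.

    The result is a smaller dense matrix that needs no gate or scale
    parameters at inference time.
    """
    return [[row[j] * scales[j] for j in keep] for row in W_down]
```

Applying the folded matrix to only the kept hidden channels gives the same output as the gated full-size layer, which is why the scales can be discarded after folding.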

Topics

GRASPrune
Structured Pruning
Large Language Models
Global Gating
FFN Channels Pruning

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.