Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

A new benchmarking framework uses a GEMM-centric taxonomy to systematically evaluate pruning methods for large language models (LLMs), which accelerate inference by removing computation along different dimensions. The practical speedup from pruning has been poorly understood because methods differ in execution behavior and depend on the hardware they run on. The framework enables implementation-consistent comparison across the pruning design space and characterizes the acceleration-quality Pareto frontier. Results indicate that static depth pruning is the strongest Pareto-optimal baseline, achieving near-theoretical acceleration in memory-bound scenarios. During prefill, the optimal strategy depends on the tolerated quality loss: static depth pruning is best at low loss (0%-4%), dynamic depth pruning at moderate loss (5%-16%), and static width pruning at higher loss (17%-26%). These findings offer a unified view of the practical limits of pruning-based LLM acceleration.

Key takeaway

If you are a Machine Learning Engineer optimizing LLM inference, choose your pruning strategy based on the quality loss you can accept. For prefill, the benchmark finds static depth pruning best at minimal loss (0%-4%), dynamic depth pruning at moderate loss (5%-16%), and static width pruning at higher loss (17%-26%). In memory-bound scenarios, static depth pruning is the strongest baseline and achieves near-theoretical acceleration.
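
The prefill guidance above can be written down as a small lookup. This is a minimal sketch, not code from the paper: the function name `recommend_prefill_strategy` and the table `PREFILL_BANDS` are hypothetical, and the bands are copied from the summary as reported. The summary leaves gaps between bands (for example between 4% and 5%) and gives no guidance beyond 26%, so the sketch returns `None` there rather than guessing.

```python
from typing import Optional

# Prefill-phase bands as reported in the summary: (low %, high %, strategy).
# The table name and layout are assumptions made for this illustration.
PREFILL_BANDS = [
    (0.0, 4.0, "static depth pruning"),
    (5.0, 16.0, "dynamic depth pruning"),
    (17.0, 26.0, "static width pruning"),
]


def recommend_prefill_strategy(quality_loss_pct: float) -> Optional[str]:
    """Return the reported best prefill strategy for a quality-loss budget.

    Returns None when the budget falls outside the reported bands or in a
    gap between them, because the summary gives no guidance for those values.
    """
    for low, high, strategy in PREFILL_BANDS:
        if low <= quality_loss_pct <= high:
            return strategy
    return None


if __name__ == "__main__":
    for budget in (2.0, 10.0, 20.0, 30.0):
        print(budget, "->", recommend_prefill_strategy(budget))
```

Treat the output as a starting point for your own measurements: the bands describe one benchmark's findings and may shift with model, hardware, and workload.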

Key insights

A GEMM-centric taxonomy unifies LLM pruning evaluation, revealing practical acceleration limits and optimal strategies across quality loss levels.
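
To see why a GEMM-level view is useful, it helps to count the work in a matrix multiply. The sketch below is an illustration under stated assumptions, not the paper's taxonomy: it models each layer as two projections (up and down), ignores attention, and uses made-up layer sizes. The helper names `gemm_flops` and `mlp_stack_flops` are hypothetical. It shows that removing layers (depth) and shrinking a projection dimension (width) can yield the same FLOP count while changing the GEMMs in different ways, which is one reason FLOPs alone may not predict measured speedup.

```python
def gemm_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) multiply: one multiply and one add per term."""
    return 2 * m * k * n


def mlp_stack_flops(tokens: int, hidden: int, intermediate: int, layers: int) -> int:
    """FLOPs for a simplified stack where each layer is an up and a down projection."""
    per_layer = (
        gemm_flops(tokens, hidden, intermediate)    # up projection
        + gemm_flops(tokens, intermediate, hidden)  # down projection
    )
    return layers * per_layer


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from the paper.
    dense = mlp_stack_flops(tokens=1024, hidden=4096, intermediate=11008, layers=32)
    # Depth pruning: fewer GEMMs, each with unchanged shape.
    depth = mlp_stack_flops(tokens=1024, hidden=4096, intermediate=11008, layers=24)
    # Width pruning: same number of GEMMs, each with a smaller inner dimension.
    width = mlp_stack_flops(tokens=1024, hidden=4096, intermediate=8256, layers=32)
    print("depth-pruned / dense:", depth / dense)
    print("width-pruned / dense:", width / dense)
```

Both pruned variants keep 75% of the dense FLOPs here, yet the first runs fewer kernels of the original shape and the second runs the same number of smaller kernels, so their real latency on a given device can differ.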

Method

The work introduces a GEMM-centric taxonomy to reorganize pruning methods, then builds a unified benchmarking framework for consistent comparison and Pareto frontier characterization.
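
Characterizing a Pareto frontier from benchmark results can be sketched as follows. This is a generic illustration, not the framework's implementation: the `Result` type, the `pareto_frontier` function, and the sample numbers are all hypothetical placeholders. A configuration is on the frontier if no other configuration has both lower (or equal) quality loss and higher (or equal) speedup.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    method: str
    speedup: float       # measured speedup over the dense model (higher is better)
    quality_loss: float  # quality loss in percent (lower is better)


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Return the results not dominated in (quality_loss, speedup)."""
    # Sort by quality loss ascending; break ties by speedup descending so the
    # best candidate at each loss level is seen first.
    ordered = sorted(results, key=lambda r: (r.quality_loss, -r.speedup))
    frontier: List[Result] = []
    best_speedup = float("-inf")
    for r in ordered:
        # Keep a point only if it is faster than every point with lower or
        # equal quality loss seen so far.
        if r.speedup > best_speedup:
            frontier.append(r)
            best_speedup = r.speedup
    return frontier


if __name__ == "__main__":
    # Placeholder measurements for illustration; not results from the paper.
    sample = [
        Result("config-a", 1.2, 2.0),
        Result("config-b", 1.1, 3.0),   # dominated by config-a
        Result("config-c", 1.5, 10.0),
        Result("config-d", 1.4, 12.0),  # dominated by config-c
    ]
    for r in pareto_frontier(sample):
        print(r.method, r.speedup, r.quality_loss)
```

The comparison is only meaningful when every method is measured under the same implementation and hardware conditions, which is the consistency the benchmarking framework is meant to provide.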

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.