LLM-Pruning Collection: A JAX Based Repo For Structured And Unstructured LLM Compression
Summary
Princeton Zlab researchers have released LLM-Pruning Collection, a JAX-based repository that unifies major pruning algorithms for large language models (LLMs) in a single, reproducible framework. The collection aims to simplify comparison of block-level, layer-level, and weight-level pruning methods under consistent training and evaluation stacks on both GPUs and TPUs. It includes implementations of Minitron, ShortGPT, Wanda, SparseGPT, Magnitude, Sheared Llama, and LLM-Pruner. The repository integrates FMS-FSDP for GPU training and MaxText for TPU training, alongside JAX-compatible evaluation scripts built around lm-eval-harness; for MaxText checkpoints, these scripts are reported to give a speedup of roughly 2 to 4 times. The collection also provides "paper vs reproduced" tables for checking results against established baselines.
Key takeaway
For AI engineers and research scientists focused on LLM compression, LLM-Pruning Collection offers a standardized environment for comparing and implementing pruning techniques. You can use the repository to reproduce established pruning results, experiment with different granularity levels (block, layer, weight), and check your own compression strategies against known baselines on both GPUs and TPUs.
Key insights
LLM-Pruning Collection unifies diverse LLM pruning methods within a consistent JAX-based framework for reproducible comparison.
Principles
- Pruning can occur at block, layer, or weight levels.
- Post-training pruning can achieve high sparsity without retraining.
- Layer redundancy enables direct layer deletion for compression.
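To make the weight-level and layer-level principles concrete, here is a minimal JAX sketch. It is not taken from the repository: the function names `magnitude_prune` and `block_influence` are hypothetical, and the layer score follows the cosine-similarity idea commonly associated with ShortGPT (a layer whose output is nearly parallel to its input changes little and is a candidate for deletion).

```python
import jax.numpy as jnp


def magnitude_prune(w, sparsity):
    """Unstructured (weight-level) pruning: zero the smallest-magnitude entries.

    Hypothetical helper for illustration; `sparsity` is the fraction of
    weights to remove across the whole matrix.
    """
    k = int(round(sparsity * w.size))  # number of weights to remove
    if k == 0:
        return w
    # Sort flat indices by |w|; the first k are the smallest magnitudes.
    order = jnp.argsort(jnp.abs(w).ravel())
    mask = jnp.ones(w.size, dtype=bool).at[order[:k]].set(False)
    return jnp.where(mask.reshape(w.shape), w, 0.0)


def block_influence(h_in, h_out, eps=1e-8):
    """Layer-level redundancy score: 1 - mean cosine similarity.

    h_in, h_out: (tokens, hidden) hidden states entering and leaving a layer.
    A score near 0 suggests the layer barely transforms its input.
    """
    num = jnp.sum(h_in * h_out, axis=-1)
    den = jnp.linalg.norm(h_in, axis=-1) * jnp.linalg.norm(h_out, axis=-1) + eps
    return 1.0 - jnp.mean(num / den)


w = jnp.array([[0.1, -2.0], [3.0, -0.05]])
pruned = magnitude_prune(w, 0.5)   # keeps -2.0 and 3.0, zeros the rest

h = jnp.array([[1.0, 2.0], [3.0, 4.0]])
score_identity = block_influence(h, h)    # layer that changes nothing
score_flipped = block_influence(h, -h)    # layer that reverses direction
```

In a layer-deletion workflow, scores like these would be computed per layer on calibration data and the lowest-scoring layers removed first; how the repository actually organizes this step is not described in the summary.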
Method
The repository provides a unified workflow for LLM pruning, integrating various algorithms with shared training (FMS-FSDP, MaxText) and evaluation (lm-eval-harness) pipelines across GPUs and TPUs.
In practice
- Reproduce pruning results for Llama 2 7B models.
- Apply Minitron-style pruning to Llama 3.1 8B.
- Evaluate Wanda, SparseGPT, Magnitude on various benchmarks.
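As a rough illustration of how Wanda differs from plain magnitude pruning, the sketch below scores each weight by its magnitude times the L2 norm of the matching input feature over calibration activations, then removes the lowest-scoring weights within each output row. The function name `wanda_prune` and the toy shapes are assumptions for illustration, not the repository's API.

```python
import jax.numpy as jnp


def wanda_prune(w, x, sparsity):
    """Wanda-style pruning sketch (hypothetical helper).

    w: (out_features, in_features) weight matrix.
    x: (tokens, in_features) calibration activations feeding this layer.
    sparsity: fraction of weights removed in each output row.
    """
    feat_norm = jnp.linalg.norm(x, axis=0)       # (in_features,)
    score = jnp.abs(w) * feat_norm[None, :]      # |W_ij| * ||X_j||_2
    k = int(sparsity * w.shape[1])               # weights removed per row
    if k == 0:
        return w
    # For each row, indices of the k lowest-scoring weights.
    prune_idx = jnp.argsort(score, axis=1)[:, :k]
    rows = jnp.arange(w.shape[0])[:, None]
    mask = jnp.ones(w.shape, dtype=bool).at[rows, prune_idx].set(False)
    return jnp.where(mask, w, 0.0)


w = jnp.array([[1.0, 0.5], [0.2, 0.3]])
# Feature 1 has much larger activations than feature 0.
x = jnp.array([[0.1, 10.0], [0.1, 10.0]])
pruned = wanda_prune(w, x, 0.5)
```

In this toy case the first row keeps 0.5 rather than the larger weight 1.0, because the feature that 0.5 multiplies carries far larger activations; magnitude-only pruning would have made the opposite choice.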
Topics
- LLM Pruning
- JAX Framework
- Model Compression Algorithms
- Large Language Models
- GPU/TPU Training
Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Deep Learning Engineer, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by MarkTechPost.