Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling
Summary
A study by Ocean Monjur, Shahriar Kabir Nahin, and Anshuman Chhabra from the University of South Florida investigates how Large Language Model (LLM) pruning affects Test-Time Scaling (TTS) performance. Prior work assumed that pruning degrades TTS reasoning. The authors' experiments on the s1.1-7B and Qwen3-8B LLMs across four reasoning benchmarks (MATH500, AIME24, AMC23, GPQA-Diamond) instead show that unstructured pruning methods such as Magnitude and Wanda consistently augment TTS performance, while structured pruning is confirmed to degrade it. The study also compares layer-wise sparsity allocation strategies: Uniform, Outlier Weighted Layerwise Sparsity (OWL), and LayerIF. Non-uniform strategies can mitigate performance degradation in weaker models like s1.1-7B, while uniform allocation remains competitive for more performant LLMs like Qwen3-8B.
Key takeaway
AI engineers optimizing LLM inference cost and performance should consider unstructured pruning. Magnitude and Wanda in particular can reduce the number of non-zero weights while also improving Test-Time Scaling, and the pruned models can potentially outperform their unpruned counterparts. Layer-wise sparsity allocation is worth evaluating as well: non-uniform approaches like OWL or LayerIF can help less performant models, while uniform allocation may suffice for robust LLMs like Qwen3-8B.
Key insights
Unstructured pruning can enhance LLM test-time scaling performance, challenging prior assumptions about pruning's detrimental effects.
Principles
- LLMs pruned with unstructured methods can outperform their unpruned counterparts under TTS.
- Structured pruning degrades TTS performance.
- Parameter redundancy can lead to overthinking in LLMs.
Method
The study evaluates unstructured pruning (Magnitude, Wanda) and structured pruning (ShortGPT) on the s1.1-7B and Qwen3-8B LLMs across four reasoning benchmarks. It varies the thinking token limit from 512 to 8192 tokens and compares three sparsity allocation strategies (Uniform, OWL, LayerIF) at 10% and 20% global sparsity.
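To make the two unstructured methods concrete, here is a minimal pure-Python sketch on a single toy weight matrix. It is not the authors' code. It assumes the commonly described formulations: Magnitude pruning zeroes the weights with the smallest absolute value, and Wanda scores each weight by its absolute value times the L2 norm of the matching input feature over a calibration batch, then prunes within each output row. The function names, the toy matrix, and the calibration activations are all hypothetical.

```python
import math


def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest |w| in the matrix."""
    rows, cols = len(weights), len(weights[0])
    positions = [(i, j) for i in range(rows) for j in range(cols)]
    positions.sort(key=lambda p: abs(weights[p[0]][p[1]]))
    num_to_drop = int(rows * cols * sparsity)
    pruned = [list(row) for row in weights]
    for i, j in positions[:num_to_drop]:
        pruned[i][j] = 0.0
    return pruned


def wanda_prune(weights, activations, sparsity):
    """Zero, per output row, the weights with the smallest |w| * ||x_j||_2 score.

    `activations` is a list of calibration inputs, one list of features per sample.
    """
    cols = len(weights[0])
    # L2 norm of each input feature across the calibration samples.
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(cols)]
    num_to_drop = int(cols * sparsity)
    pruned = []
    for row in weights:
        order = sorted(range(cols), key=lambda j: abs(row[j]) * norms[j])
        drop = set(order[:num_to_drop])
        pruned.append([0.0 if j in drop else row[j] for j in range(cols)])
    return pruned


# Toy data (hypothetical): 2 output rows, 4 input features.
W = [[0.1, -2.0, 0.5, 3.0],
     [1.0, 0.2, -0.3, 0.05]]
# Feature 0 has large activations, so Wanda keeps even its small weights.
X = [[10.0, 0.1, 1.0, 0.1],
     [10.0, 0.1, 1.0, 0.1]]

magnitude_result = magnitude_prune(W, 0.5)
wanda_result = wanda_prune(W, X, 0.5)
```

On this toy input the two methods disagree: Magnitude removes the small weight 0.1 in the first row, while Wanda keeps it because the corresponding input feature is strongly activated. Both leave the matrix shape unchanged, which is what makes them unstructured, in contrast to structured methods that remove whole components such as layers.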
In practice
- Prefer unstructured pruning (Magnitude, Wanda) over structured pruning when the pruned LLM will be used with test-time scaling.
- Consider non-uniform sparsity allocation for weaker LLMs.
- Uniform sparsity allocation is effective for robust LLMs.
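The difference between uniform and non-uniform allocation can be sketched in a few lines of Python. This is an illustration only: it does not reproduce the OWL or LayerIF scoring rules. It assumes that each layer already has some importance score, that all layers hold the same number of weights, and that more important layers should be pruned less. The `spread` parameter and the linear rule are hypothetical choices that keep the average sparsity equal to the global target.

```python
def uniform_allocation(num_layers, target):
    """Every layer gets the global target sparsity."""
    return [target] * num_layers


def score_based_allocation(scores, target, spread=0.05):
    """Shift sparsity away from high-score layers (hypothetical linear rule).

    Scores are min-max normalized, then each layer's sparsity deviates from
    `target` by at most `spread`. The deviations sum to zero, so the mean
    sparsity stays at `target` when layers are equally sized.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [target] * len(scores)
    normalized = [(s - lo) / (hi - lo) for s in scores]
    mean = sum(normalized) / len(normalized)
    return [target + spread * (mean - n) for n in normalized]


# Hypothetical importance scores for a 3-layer toy model at 20% global sparsity.
layer_scores = [1.0, 2.0, 3.0]
uniform = uniform_allocation(len(layer_scores), 0.2)
non_uniform = score_based_allocation(layer_scores, 0.2)
```

With these toy scores the least important layer is pruned slightly above 20% and the most important slightly below, while the average stays at the global target. Each per-layer ratio would then be passed to a pruning method such as Magnitude or Wanda.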
Topics
- LLM Pruning
- Test-Time Scaling
- Unstructured Pruning
- Structured Pruning
- Sparsity Allocation Strategies
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.