Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

Source: cs.AI updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A study by Ocean Monjur, Shahriar Kabir Nahin, and Anshuman Chhabra of the University of South Florida investigates how pruning Large Language Models (LLMs) affects Test-Time Scaling (TTS) performance. Contrary to the prior assumption that pruning degrades TTS reasoning, their experiments with s1.1-7B and Qwen3-8B on four reasoning benchmarks (MATH500, AIME24, AMC23, GPQA-Diamond) show that unstructured pruning methods such as Magnitude and Wanda consistently improve TTS performance. Structured pruning, by contrast, is confirmed to degrade it. The study also compares layer-wise sparsity allocation strategies (Uniform, Outlier Weighted Layerwise Sparsity (OWL), and LayerIF) and finds that non-uniform strategies can mitigate performance degradation in weaker models such as s1.1-7B, while uniform allocation remains competitive for stronger models such as Qwen3-8B.

Key takeaway

AI engineers optimizing LLM inference cost and performance should consider unstructured pruning. Magnitude and Wanda in particular can reduce model size while also improving Test-Time Scaling, potentially outperforming the unpruned model. The layer-wise sparsity allocation strategy is worth evaluating as well: non-uniform approaches such as OWL or LayerIF can help weaker models such as s1.1-7B, while uniform allocation may suffice for stronger models such as Qwen3-8B.
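
To make the two unstructured criteria concrete, here is a minimal NumPy sketch of how a keep-mask might be built for a single linear layer. It follows the commonly described forms of the two methods (Magnitude ranks weights by absolute value; Wanda scales each weight's magnitude by the L2 norm of its input feature over calibration data and ranks within each output row). The function names, the per-layer comparison group for Magnitude, and the toy arrays are assumptions for illustration, not the study's code.

```python
import numpy as np


def magnitude_mask(weight: np.ndarray, sparsity: float) -> np.ndarray:
    """Boolean keep-mask dropping the `sparsity` fraction of smallest-|w| entries.

    Assumption: entries are compared across the whole layer.
    """
    n_prune = int(round(weight.size * sparsity))
    keep = np.ones(weight.size, dtype=bool)
    if n_prune > 0:
        # argsort is ascending, so the first n_prune indices are the smallest magnitudes.
        order = np.argsort(np.abs(weight).ravel(), kind="stable")
        keep[order[:n_prune]] = False
    return keep.reshape(weight.shape)


def wanda_mask(weight: np.ndarray, activations: np.ndarray, sparsity: float) -> np.ndarray:
    """Boolean keep-mask from a Wanda-style score |W_ij| * ||X_j||_2.

    weight:      (out_features, in_features)
    activations: (n_tokens, in_features) calibration inputs to this layer
    Entries are ranked within each output row.
    """
    feature_norm = np.linalg.norm(activations, axis=0)  # shape (in_features,)
    scores = np.abs(weight) * feature_norm              # broadcasts over rows
    n_prune = int(round(weight.shape[1] * sparsity))
    keep = np.ones_like(weight, dtype=bool)
    if n_prune > 0:
        drop = np.argsort(scores, axis=1, kind="stable")[:, :n_prune]
        np.put_along_axis(keep, drop, False, axis=1)
    return keep


# Toy example (hypothetical numbers): one output row, four input features.
W = np.array([[1.0, -5.0, 2.0, 0.1]])
X = np.array([[10.0, 0.1, 1.0, 1.0]])  # feature 0 has large activations

print(magnitude_mask(W, 0.5))  # [[False  True  True False]]
print(wanda_mask(W, X, 0.5))   # [[ True False  True False]]
```

In the toy example the two criteria disagree: Magnitude drops the weight of 1.0, while the Wanda-style score keeps it because its input feature carries large activations, and drops the weight of -5.0 instead. Multiplying a weight matrix by either mask zeroes the pruned entries without changing the matrix shape, which is what distinguishes unstructured pruning from structured methods that remove whole components.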

Key insights

Unstructured pruning can enhance LLM test-time scaling performance, challenging prior assumptions about pruning's detrimental effects.

Principles

Method

The study evaluates unstructured pruning (Magnitude, Wanda) and structured pruning (ShortGPT) on s1.1-7B and Qwen3-8B across four reasoning benchmarks, varying the thinking-token limit (512 to 8192) and the layer-wise sparsity allocation strategy (Uniform, OWL, LayerIF) at 10% and 20% global sparsity.
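
The difference between uniform and non-uniform allocation at a fixed global sparsity can be illustrated with a small sketch. The function below is a generic stand-in, not the OWL or LayerIF formula: it takes a hypothetical per-layer importance score (in OWL this role is played by an outlier-based statistic, in LayerIF by an influence-based one), assigns lower sparsity to higher-scoring layers, and keeps the mean at the global target. It assumes all layers have the same number of weights, so that the mean of the per-layer ratios equals the global sparsity; `max_shift` is an illustrative parameter.

```python
import numpy as np


def allocate_sparsity(layer_scores, target: float, max_shift: float) -> np.ndarray:
    """Per-layer sparsity ratios with mean `target`, each within target +/- max_shift.

    layer_scores: hypothetical importance score per layer (higher = prune less).
    Assumes equal-sized layers, so the mean ratio equals the global sparsity.
    """
    if not 0.0 <= target - max_shift and target + max_shift <= 1.0:
        raise ValueError("target +/- max_shift must stay within [0, 1]")
    scores = np.asarray(layer_scores, dtype=float)
    centered = scores - scores.mean()
    span = np.abs(centered).max()
    if span == 0:
        # All layers look equally important: fall back to uniform allocation.
        return np.full(scores.shape, target)
    # Centered scores sum to zero, so the shifts cancel and the mean stays at `target`.
    return target - max_shift * centered / span


# Uniform allocation: every layer gets the global ratio.
uniform = np.full(3, 0.20)

# Non-uniform allocation from made-up scores for three layers.
non_uniform = allocate_sparsity([0.9, 0.5, 0.1], target=0.20, max_shift=0.08)

print(uniform)             # [0.2 0.2 0.2]
print(non_uniform.mean())  # approximately 0.2
```

With these made-up scores the three layers receive roughly 12%, 20%, and 28% sparsity, so the most important layer is pruned least while the overall budget matches the uniform case.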

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.