The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer
Summary
A new study reveals a "benchmark illusion" in compressed large language models: pruned LLMs can pass multiple-choice evaluations yet fail to generate correct answers to open-ended questions. The researchers found that high-sparsity pruning, particularly with the Wanda method, often causes models to underperform in greedy open generation while still recognizing the correct answer in a multiple-choice format. The correct answer is typically demoted in output probability rather than erased, and it frequently reappears under beam search, sampling, or a single in-context example. Standard multiple-choice benchmarks can therefore significantly overstate the practical usability of compressed LLMs, which is a critical evaluation blind spot. The findings emphasize the need to test compressed models on their generative capabilities, not solely on recognition.
Key takeaway
If you are a machine learning engineer deploying compressed LLMs, do not rely on multiple-choice benchmarks alone to assess model usability. Recognition scores can hide models that fail at open-ended generation, even when the correct answer has only been demoted rather than lost. Add generative evaluations, potentially with beam search, sampling, or few-shot prompting, to confirm that your pruned models meet practical performance requirements.
Key insights
Pruned LLMs often recognize correct answers in multiple-choice but fail to generate them, revealing a benchmark illusion.
Principles
- High-sparsity pruning can demote correct answers in output probability without erasing them.
- Multiple-choice benchmarks overstate compressed LLM usability.
- Generative testing is crucial for compressed models.
Method
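The study examines multilingual question answering, tracking the same questions before and after pruning to identify recognition-only errors: cases where the compressed model still picks the correct option in multiple choice but no longer generates it in open-ended form.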
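The per-question tracking described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual pipeline: the `QuestionResult` fields, the category names, and the rule of excluding questions the unpruned model could not answer are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class QuestionResult:
    """Outcome of one question before and after pruning (hypothetical schema)."""
    qid: str
    dense_gen_correct: bool    # unpruned model, greedy open generation
    pruned_mc_correct: bool    # pruned model, multiple-choice format
    pruned_gen_correct: bool   # pruned model, greedy open generation


def classify(result: QuestionResult) -> str:
    """Assign one question to a category (names are illustrative)."""
    if not result.dense_gen_correct:
        # Assumption: only questions the unpruned model answered are tracked.
        return "excluded"
    if result.pruned_gen_correct:
        return "retained"
    if result.pruned_mc_correct:
        # Recognized in multiple choice, but not generated.
        return "recognition_only"
    return "lost"


def summarize(results: list[QuestionResult]) -> Counter:
    """Count how many questions fall into each category."""
    return Counter(classify(r) for r in results)
```

A large `recognition_only` count relative to `lost` would be the pattern the summary calls the benchmark illusion: the multiple-choice score stays high while generation degrades.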
In practice
- Use beam search or sampling to recover demoted answers.
- Add one in-context example to aid pruned LLM generation.
- Prioritize generative evaluations for compressed LLMs.
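To make the "demoted, not erased" idea concrete, the sketch below measures how often the gold answer appears within the top k candidates a pruned model produces. It assumes you already have scored candidate answers per question (for example, beam search hypotheses with their log-probabilities); the function names, the input layout, and the case-insensitive exact-match rule are hypothetical choices, not the study's metric.

```python
from typing import Optional

# One scored candidate: (answer text, log-probability or any comparable score).
Candidate = tuple[str, float]


def gold_rank(candidates: list[Candidate], gold: str) -> Optional[int]:
    """Return the 1-based rank of the gold answer by score, or None if absent."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    target = gold.strip().lower()
    for rank, (answer, _score) in enumerate(ranked, start=1):
        # Assumption: case-insensitive exact match counts as correct.
        if answer.strip().lower() == target:
            return rank
    return None


def recovered_within(items: list[tuple[list[Candidate], str]], k: int) -> float:
    """Fraction of questions whose gold answer ranks in the top k candidates.

    k=1 is a rough proxy for greedy decoding (only the top candidate counts);
    larger k approximates what a wider beam or repeated sampling could surface.
    """
    if not items:
        return 0.0
    hits = 0
    for candidates, gold in items:
        rank = gold_rank(candidates, gold)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(items)
```

Comparing `recovered_within(items, 1)` with `recovered_within(items, 5)` gives a rough sense of how many answers sit just below the top candidate, which is the gap the practices above aim to close.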
Topics
- LLM Compression
- Model Pruning
- LLM Benchmarking
- Generative Evaluation
- Wanda Algorithm
- Multiple-Choice Bias
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.