The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

Source: Computation and Language · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

A new study identifies a "benchmark illusion" in compressed large language models: pruned LLMs can pass multiple-choice evaluations yet fail to produce the correct answer to open-ended questions. The researchers found that high-sparsity pruning, particularly with the Wanda method, often causes models to underperform in greedy open-ended generation while still recognizing the correct answer in a multiple-choice format. The correct answer is typically demoted in output probability rather than erased, and it frequently reappears under beam search, sampling, or a single in-context example. Standard multiple-choice benchmarks can therefore significantly overstate the practical usability of compressed LLMs, which is a critical evaluation blind spot. The findings underline the need to test compressed models on generation, not only on recognition.
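To make "demoted rather than erased" concrete, here is a minimal sketch using a toy probability distribution over whole answers. The question, the candidate answers, and every probability are hypothetical values chosen for illustration; they are not taken from the study, and a real model scores token sequences rather than a small answer table.

```python
import random

def greedy_pick(probs):
    """Return the single most probable answer (analogue of greedy decoding)."""
    return max(probs, key=probs.get)

def rank_of(probs, answer):
    """1-based rank of `answer` when candidates are sorted by probability."""
    ordered = sorted(probs, key=probs.get, reverse=True)
    return ordered.index(answer) + 1

def top_k(probs, k):
    """The k most probable answers, a stand-in for what a width-k beam keeps."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

def sample_hits(probs, answer, n, seed=0):
    """Count how often `answer` appears in n samples drawn from `probs`."""
    rng = random.Random(seed)
    answers = list(probs)
    weights = [probs[a] for a in answers]
    return rng.choices(answers, weights=weights, k=n).count(answer)

# Hypothetical distributions for the question "What is the capital of Australia?"
dense = {"Canberra": 0.70, "Sydney": 0.20, "Melbourne": 0.10}
pruned = {"Sydney": 0.45, "Canberra": 0.40, "Melbourne": 0.15}

print(greedy_pick(dense))            # Canberra
print(greedy_pick(pruned))           # Sydney: greedy decoding now misses
print(rank_of(pruned, "Canberra"))   # 2: demoted, not erased
print("Canberra" in top_k(pruned, 2))
print(sample_hits(pruned, "Canberra", n=20))
```

In this toy case the correct answer drops from rank 1 to rank 2 after pruning, so greedy decoding fails while a wider search or repeated sampling can still surface it.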

Key takeaway

Machine learning engineers deploying compressed LLMs should move beyond multiple-choice benchmarks to assess model usability accurately. Relying solely on recognition scores can lead to deploying models that fail at open-ended generative tasks, even when the correct answer is only demoted rather than lost. Add generative evaluations, potentially with beam search, sampling, or few-shot prompting, to confirm that pruned models meet practical performance requirements.
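One way to act on this is to report recognition and generation accuracy side by side on the same questions. The sketch below assumes two hypothetical callables, `choose(question, options)` and `generate(question)`, that wrap whatever model is under test, and it uses simple normalized exact match as the scoring rule. Both the interface and the metric are assumptions for illustration, not the study's protocol.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())

def evaluate(items, choose, generate):
    """Score the same items in multiple-choice and open-generation mode.

    items:    list of dicts with "question", "options", and "answer" keys
    choose:   hypothetical callable (question, options) -> chosen option text
    generate: hypothetical callable (question) -> free-form answer text
    """
    if not items:
        raise ValueError("items must not be empty")
    mc_hits = gen_hits = 0
    for item in items:
        gold = normalize(item["answer"])
        if normalize(choose(item["question"], item["options"])) == gold:
            mc_hits += 1
        if normalize(generate(item["question"])) == gold:
            gen_hits += 1
    n = len(items)
    return {
        "mc_acc": mc_hits / n,
        "gen_acc": gen_hits / n,
        "gap": (mc_hits - gen_hits) / n,  # positive: recognition beats generation
    }
```

A large positive `gap` for a pruned model, relative to the gap of its dense counterpart, is the kind of signal that a multiple-choice score alone would hide. Exact match is strict; a real harness would likely need a more forgiving answer matcher.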

Key insights

Pruned LLMs often recognize correct answers in multiple-choice settings but fail to generate them, revealing a benchmark illusion.

Principles

Method

Study multilingual question answering by tracking each question before and after pruning to identify recognition-only errors in compressed LLMs.
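The per-question tracking described here can be sketched as a small bookkeeping step. The category names and the exact definition of a recognition-only error below are assumptions made for illustration: a question counts as recognition-only when the dense model generated the right answer, the pruned model no longer does, but the pruned model still picks it in multiple-choice form.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    """Per-question correctness flags (field names are hypothetical)."""
    qid: str
    dense_gen: bool   # dense model generated the correct answer
    pruned_gen: bool  # pruned model generated the correct answer
    pruned_mc: bool   # pruned model chose the correct multiple-choice option

def classify(o):
    """Assign one question to a transition category."""
    if o.pruned_gen:
        return "retained" if o.dense_gen else "gained"
    if not o.dense_gen:
        return "never_generated"
    # Generation broke after pruning; check whether recognition survived.
    return "recognition_only" if o.pruned_mc else "lost"

def summarize(outcomes):
    """Count questions per transition category."""
    return Counter(classify(o) for o in outcomes)

outcomes = [
    Outcome("q1", dense_gen=True, pruned_gen=True, pruned_mc=True),
    Outcome("q2", dense_gen=True, pruned_gen=False, pruned_mc=True),
    Outcome("q3", dense_gen=True, pruned_gen=False, pruned_mc=False),
    Outcome("q4", dense_gen=False, pruned_gen=False, pruned_mc=True),
]
print(summarize(outcomes))
```

For a multilingual benchmark, the same summary could be computed per language by grouping outcomes before counting.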

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.