The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

2026-06-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

A new study describes a "benchmark illusion" in compressed large language models: pruned LLMs can pass multiple-choice evaluations yet fail to generate correct answers to open-ended questions. The researchers found that high-sparsity pruning, particularly with the Wanda method, often causes models to underperform in greedy open generation while they still recognize the correct answer in a multiple-choice format. The correct answer is typically demoted in output probability rather than erased, and it frequently reappears under beam search, under sampling, or with a single in-context example. Standard multiple-choice benchmarks can therefore significantly overstate the practical usability of compressed LLMs, which is a critical evaluation blind spot. The findings call for testing compressed models on generation, not only on recognition.

Key takeaway

Machine learning engineers deploying compressed LLMs need to move beyond multiple-choice benchmarks to assess model usability accurately. Relying only on recognition scores can lead to deploying models that fail at open-ended generation, even when the correct answer has merely been demoted rather than lost. Add generative evaluations, potentially with beam search, sampling, or few-shot prompting, to confirm that pruned models meet practical performance requirements.

Key insights

Pruned LLMs often recognize correct answers in multiple-choice but fail to generate them, revealing a benchmark illusion.

Principles

High-sparsity pruning can demote correct answers.
Multiple-choice benchmarks overstate compressed LLM usability.
Generative testing is crucial for compressed models.
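
To make the first two principles concrete, here is a minimal Python sketch of how a demoted answer can still win a multiple-choice comparison. It treats whole answers as atomic candidates with a single log-probability each, which is a simplification, and the scores are invented for illustration; they are not figures from the study. The function names are hypothetical.

```python
from typing import Dict, List


def mc_pick(scores: Dict[str, float], options: List[str]) -> str:
    """Multiple-choice style: compare only the listed options."""
    return max(options, key=lambda option: scores[option])


def greedy_pick(scores: Dict[str, float]) -> str:
    """Open-generation style: take the single most probable continuation."""
    return max(scores, key=scores.get)


def rank_of(scores: Dict[str, float], answer: str) -> int:
    """1-based rank of an answer among all candidates, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(answer) + 1


# Invented log-probabilities for a pruned model (illustrative only).
toy_scores = {
    "the capital": -0.9,
    "It is": -1.2,
    "Paris": -2.1,
    "Lyon": -3.5,
    "Marseille": -4.0,
    "Nice": -4.2,
}
toy_options = ["Paris", "Lyon", "Marseille", "Nice"]

print(mc_pick(toy_scores, toy_options))  # Paris
print(greedy_pick(toy_scores))           # the capital
print(rank_of(toy_scores, "Paris"))      # 3
```

In this toy case the correct answer beats every distractor, so the multiple-choice score looks fine, while greedy decoding never reaches it because two other continuations rank higher.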

Method

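The study examines multilingual question answering, tracking individual questions before and after pruning to identify recognition-only errors in compressed LLMs, where the model still recognizes the correct answer in multiple-choice form but does not generate it.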
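
One way to picture this kind of per-question tracking is the sketch below. It assumes you already have, for each question ID, a correctness flag for the multiple-choice and the greedy-generation setting from both the dense and the pruned model. The `Outcome` type, the label names, and the bucketing are hypothetical choices for illustration, not the study's own protocol.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Outcome:
    mc_correct: bool   # picked the right option in multiple-choice format
    gen_correct: bool  # produced the right answer with greedy generation


def label(dense: Outcome, pruned: Outcome) -> str:
    """Bucket one question by what the pruned model can still do."""
    if pruned.mc_correct and not pruned.gen_correct:
        # Recognized but not generated; note whether pruning introduced the gap.
        if dense.gen_correct:
            return "recognition_only_after_pruning"
        return "recognition_only_already"
    if pruned.mc_correct and pruned.gen_correct:
        return "both_correct"
    if pruned.gen_correct:
        return "generation_only"
    return "both_wrong"


def tally(dense: Dict[str, Outcome], pruned: Dict[str, Outcome]) -> Counter:
    """Count labels over the question IDs present in both result sets."""
    shared = dense.keys() & pruned.keys()
    return Counter(label(dense[q], pruned[q]) for q in shared)
```

A large "recognition_only_after_pruning" count next to a healthy multiple-choice accuracy would be the signature of the illusion described here.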

In practice

Use beam search or sampling to recover demoted answers.
Add one in-context example to aid pruned LLM generation.
Prioritize generative evaluations for compressed LLMs.
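
The practices above can be sketched as a small evaluation helper in Python. It assumes a caller-supplied `generate(prompt, n)` function that returns `n` candidate answers, best first (for example from beam search or repeated sampling); that callable, the prompt template, and the exact-match normalization are all assumptions for illustration rather than anything the study prescribes.

```python
import re
from typing import Callable, Dict, Iterable, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before matching."""
    stripped = re.sub(r"[^\w\s]", "", text)
    return " ".join(stripped.lower().split())


def one_shot_prompt(example_q: str, example_a: str, question: str) -> str:
    """Prepend a single solved example to the question (hypothetical template)."""
    return (
        f"Question: {example_q}\nAnswer: {example_a}\n\n"
        f"Question: {question}\nAnswer:"
    )


def contains_gold(candidates: Iterable[str], gold: str) -> bool:
    """True if any candidate matches the gold answer after normalization."""
    target = normalize(gold)
    return any(normalize(c) == target for c in candidates)


def compare_decoding(
    generate: Callable[[str, int], List[str]],
    prompt: str,
    gold: str,
    k: int = 5,
) -> Dict[str, bool]:
    """Check the gold answer under one candidate (greedy-like) and under k."""
    return {
        "top_1": contains_gold(generate(prompt, 1), gold),
        f"top_{k}": contains_gold(generate(prompt, k), gold),
    }
```

A question that fails at `top_1` but passes at `top_k`, or passes once `one_shot_prompt` is used, is a candidate for a demoted rather than erased answer. Exact match after normalization is a deliberately strict scoring choice; a looser matcher may suit free-form answers better.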

Topics

LLM Compression
Model Pruning
LLM Benchmarking
Generative Evaluation
Wanda Algorithm
Multiple-Choice Bias

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.