Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Summary
A new study introduces a 2x2 framework that functionally dissociates uncertainty from correctness in large language models (LLMs) by analyzing internal sparse autoencoder (SAE) features. Applying the framework to Llama-3.1-8B and Gemma-2-9B, the researchers identified three distinct feature populations: pure uncertainty, pure incorrectness, and confounded features. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: they show statistically significant activation differences, yet suppressing them produces near-zero accuracy changes. Confounded features, which encode both signals, are detrimental to output quality: their targeted suppression yielded a 1.1% accuracy improvement and a 75% entropy reduction on MMLU, and the effects transferred to the ARC-Challenge and RACE benchmarks. The study also found that just 3 confounded features from a single mid-network layer predict model correctness with an AUROC of ~0.79, enabling selective abstention that raised accuracy from 62% to 81% at 53% coverage.
Key takeaway
For AI engineers and research scientists working on LLM reliability, the practical point is that uncertainty and correctness are functionally dissociable inside the model. Prioritize identifying and suppressing "confounded features": in the study, this intervention improved accuracy and substantially reduced output entropy. Also consider using a small set of these confounded features for internal-feature-based abstention, which raises accuracy on the answered questions by letting the model decline predictions that are likely to be wrong.
Key insights
LLM uncertainty and correctness are distinct internal phenomena encoded by functionally different feature populations.
Principles
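- Pure uncertainty features are essential for accuracy: suppressing them severely degrades it.
- Pure incorrectness features are functionally inert: suppressing them leaves accuracy nearly unchanged.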
- Confounded features degrade output quality.
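Each principle above comes from suppressing a feature population and measuring what happens to accuracy. The summary does not say how suppression is implemented, so the following is only a minimal sketch under one common assumption: the chosen SAE latents are zeroed and the activation is rebuilt from the SAE decoder. The names `decode` and `suppress`, the toy decoder, and the feature indices are all hypothetical.

```python
def decode(latents, decoder, bias):
    """Rebuild an activation vector: bias + sum_i latents[i] * decoder[i]."""
    out = list(bias)
    for activation, direction in zip(latents, decoder):
        for j, weight in enumerate(direction):
            out[j] += activation * weight
    return out


def suppress(latents, feature_ids):
    """Return a copy of the SAE latents with the chosen features set to zero.

    Assumption: suppression means zero-ablation; the study may use a
    different intervention (for example clamping or scaling).
    """
    ids = set(feature_ids)
    return [0.0 if i in ids else a for i, a in enumerate(latents)]


# Toy SAE with 3 features writing into a 2-dimensional activation space.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0]
latents = [2.0, 3.0, 1.0]

original = decode(latents, decoder, bias)
# Pretend feature 2 was flagged as confounded and ablate it.
edited = decode(suppress(latents, [2]), decoder, bias)
```

In a real model the edited activation would be written back into the forward pass at the SAE's layer, and accuracy and output entropy would be compared with and without the edit.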
Method
A 2x2 quadrant framework partitions predictions by correctness and confidence, using sparse autoencoders and Mann-Whitney U tests to identify distinct feature populations for targeted suppression.
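The sketch below illustrates how such a quadrant-based test could be wired together. It is not the authors' code. It assumes that confidence is defined by thresholding output entropy, that the uncertainty contrast is run among correct predictions, and that the correctness contrast is run among confident predictions; the summary specifies none of these choices. The Mann-Whitney U p-value uses a plain normal approximation without tie or continuity correction, and all function names are hypothetical.

```python
from math import erfc, sqrt


def quadrant(correct, entropy, threshold):
    """Label one prediction by correctness and confidence (low entropy = confident)."""
    confidence = "confident" if entropy < threshold else "uncertain"
    return ("correct" if correct else "incorrect", confidence)


def mann_whitney_p(xs, ys):
    """Two-sided Mann-Whitney U p-value via the normal approximation."""
    n1, n2 = len(xs), len(ys)
    # U counts pairs where x exceeds y; ties contribute one half.
    u = sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return erfc(abs(z) / sqrt(2))


def classify_feature(activations, labels, alpha=0.05):
    """Assign one SAE feature to a population from its per-quadrant activations."""
    groups = {}
    for value, label in zip(activations, labels):
        groups.setdefault(label, []).append(value)
    baseline = groups[("correct", "confident")]
    # Assumed contrasts: vary one factor while holding the other fixed.
    tracks_uncertainty = mann_whitney_p(groups[("correct", "uncertain")], baseline) < alpha
    tracks_incorrectness = mann_whitney_p(groups[("incorrect", "confident")], baseline) < alpha
    if tracks_uncertainty and tracks_incorrectness:
        return "confounded"
    if tracks_uncertainty:
        return "pure uncertainty"
    if tracks_incorrectness:
        return "pure incorrectness"
    return "neither"
```

With thousands of SAE features tested at once, a multiple-comparison correction would be needed in practice; that step is omitted here for brevity.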
In practice
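- Suppress confounded features to improve accuracy and reduce output entropy.
- Use a small set of confounded features (3 from a single mid-network layer in the study) to predict correctness and abstain selectively.
- Validate the findings on your own architecture; the study covers Llama-3.1-8B and Gemma-2-9B.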
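To make the abstention item concrete, here is a minimal sketch of how the two reported quantities, AUROC and accuracy at a given coverage, can be computed from a correctness score. It assumes each prediction already has a scalar score produced by some probe over the confounded features; the summary does not describe that probe, so the scores, the threshold, and the function names below are illustrative only.

```python
def auroc(scores, correct):
    """Chance that a random correct example outscores a random incorrect one (ties count 0.5)."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def selective_accuracy(scores, correct, threshold):
    """Answer only when score >= threshold; return (coverage, accuracy on answered items)."""
    kept = [c for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(kept) / len(scores)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Toy data: higher score means the probe expects the answer to be correct.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
correct = [True, True, False, True, False, False]

full = selective_accuracy(scores, correct, threshold=0.0)       # answer everything
selective = selective_accuracy(scores, correct, threshold=0.5)  # abstain on low scores
```

Raising the threshold trades coverage for accuracy, which is the same trade-off behind the reported move from 62% to 81% accuracy at 53% coverage.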
Topics
- Sparse Autoencoders
- LLM Uncertainty Quantification
- Model Correctness Prediction
- Feature Suppression
- Llama-3.1-8B
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.