Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States
Summary
A study of Qwen3-14B challenges the common interpretation of linear probes in large language models, finding that high probe accuracy for reasoning types reflects task format rather than distinct internal computational structures. The researchers probed the model's hidden states on three benchmarks: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and αNLI (abductive). Linear probes initially achieved 100% cross-validated accuracy at layer 32, with distinct manifold geometry and intrinsic dimensionalities of 20.6, 28.5, and 33.6. However, a four-stage confound analysis, including residualization of format features such as source identity, option count, and response length, reduced probe accuracy to chance level. Trace-anchor similarity showed only 42.5% agreement with the intended reasoning modes, suggesting a largely uniform reasoning strategy. Causal steering experiments with random-direction controls (n=20) yielded p=0.286, providing no evidence of a functional link between the observed geometry and reasoning-mode selection.
Key takeaway
AI scientists and machine learning engineers who interpret LLM internal states should critically re-evaluate linear probing results. High probe accuracy for reasoning modes may reflect nothing more than task format differences, not genuine computational distinctions. To avoid misreading model capabilities, build format deconfounding (for example, residual analysis) and random-direction controls into interpretability pipelines. These checks help establish that a probe detects functional structure rather than superficial artifacts, which in turn supports more accurate model development and evaluation.
Key insights
Linear probes in LLMs often detect task format confounds, not distinct reasoning mode representations, challenging common interpretability claims.
Principles
- High linear probe accuracy is insufficient evidence for distinct internal representations.
- Reasoning mode labels are often confounded with dataset source.
- LLMs may employ a largely uniform reasoning strategy across task types.
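The confounding named in the second principle can be checked before any probe is trained. The following is a minimal sketch, assuming each example carries a source label and a mode label; the helper names `modes_per_source` and `fully_confounded` are hypothetical and not taken from the study. If every source maps to exactly one mode, a probe that merely recognizes the source will also "predict" the mode.

```python
from collections import defaultdict

def modes_per_source(sources, modes):
    """Map each dataset source to the set of reasoning-mode labels it carries."""
    table = defaultdict(set)
    for source, mode in zip(sources, modes):
        table[source].add(mode)
    return dict(table)

def fully_confounded(sources, modes):
    """True when the mode label is a function of the source label alone."""
    return all(len(ms) == 1 for ms in modes_per_source(sources, modes).values())

# Label layout described in the summary: one benchmark per reasoning mode.
sources = ["LogiQA 2.0", "ARC-Challenge", "αNLI", "LogiQA 2.0"]
modes = ["deductive", "inductive", "abductive", "deductive"]

print(modes_per_source(sources, modes))
print("mode fully determined by source:", fully_confounded(sources, modes))
```

With one benchmark per mode, the check reports full confounding, so mode-probe accuracy alone cannot separate the two explanations.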
Method
The pipeline combines multi-source dataset construction, hidden-state extraction, layer-wise linear probing, format confound analysis (residualization), and causal steering with random-direction controls.
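The residualization step can be illustrated on synthetic data. This is a sketch under stated assumptions, not the study's implementation: the hidden states are random vectors whose only class signal comes from the source, the probe is a simple least-squares classifier, and the functions `residualize` and `probe_accuracy` are hypothetical names. For simplicity the residualization is fit on all examples before the train/test split.

```python
import numpy as np

def residualize(H, F):
    """Remove the part of H that is linearly predictable from format features F."""
    X = np.column_stack([np.ones(len(F)), F])       # intercept + format features
    B, *_ = np.linalg.lstsq(X, H, rcond=None)       # least-squares fit H ≈ X @ B
    return H - X @ B

def probe_accuracy(H, y, n_classes, train_frac=0.7, seed=0):
    """Held-out accuracy of a one-vs-rest least-squares linear probe."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    train, test = idx[:cut], idx[cut:]
    X = np.column_stack([np.ones(len(H)), H])
    Y = np.eye(n_classes)[y]                        # one-hot targets
    W, *_ = np.linalg.lstsq(X[train], Y[train], rcond=None)
    pred = (X[test] @ W).argmax(axis=1)
    return float((pred == y[test]).mean())

# Synthetic stand-in for hidden states (assumption: signal comes only from source).
rng = np.random.default_rng(0)
n, d, k = 600, 16, 3
source = rng.integers(0, k, size=n)
mode = source.copy()                                # mode label fully confounded with source
source_means = 3.0 * rng.normal(size=(k, d))
H = source_means[source] + rng.normal(size=(n, d))

# Format features: one-hot source identity (first column dropped; the intercept covers it).
F = np.eye(k)[source][:, 1:]

acc_raw = probe_accuracy(H, mode, k)
acc_res = probe_accuracy(residualize(H, F), mode, k)
print(f"probe accuracy before residualization: {acc_raw:.2f}")
print(f"probe accuracy after residualization:  {acc_res:.2f}")
```

In this toy setting the probe is near-perfect on the raw vectors and falls to roughly chance (about one in three) once the source-predictable component is removed, which is the pattern the summary describes. Real pipelines would add further format features such as option count and response length as extra columns of `F`.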
In practice
- Always report source-prediction accuracy alongside mode-prediction.
- Implement residual analysis to deconfound format features.
- Use random-direction controls in steering-vector experiments.
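The random-direction control in the last bullet can be sketched as an empirical p-value. This is a minimal illustration, not the study's code: `effect_fn` is a hypothetical callable that steers the model along a direction and returns a scalar effect size, and the (1 + count) / (1 + n) convention is an assumption about how such a p-value might be computed.

```python
import numpy as np

def random_direction_p_value(effect_fn, direction, n_random=20, seed=0):
    """Share of norm-matched random directions whose effect is at least as
    large as the candidate direction's, with the usual +1 correction."""
    rng = np.random.default_rng(seed)
    direction = np.asarray(direction, dtype=float)
    norm = np.linalg.norm(direction)
    observed = effect_fn(direction)

    n_as_large = 0
    for _ in range(n_random):
        r = rng.normal(size=direction.shape)
        r *= norm / np.linalg.norm(r)               # match the candidate's norm
        if effect_fn(r) >= observed:
            n_as_large += 1
    return (1 + n_as_large) / (1 + n_random)

# Toy effect functions standing in for a real steering experiment.
w = np.zeros(64)
w[0] = 1.0

def aligned_effect(v):
    """Effect depends on alignment with w, so the direction matters."""
    return abs(float(v @ w))

def flat_effect(v):
    """Effect ignores the direction entirely."""
    return 0.0

p_aligned = random_direction_p_value(aligned_effect, w)
p_flat = random_direction_p_value(flat_effect, w)
print(p_aligned, p_flat)
```

When the direction genuinely drives the effect, no random direction matches it and the p-value sits at its floor of 1/21; when the effect is indifferent to direction, the p-value is 1. Under this convention with 20 controls, a value of 6/21 ≈ 0.286 would arise when five random directions match or exceed the candidate, though this summary does not state how the study computed its p-value.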
Topics
- Linear Probing
- LLM Interpretability
- Reasoning Modes
- Format Confounding
- Causal Steering
- Qwen3-14B
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.