Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A study on Qwen3-14B challenges the common interpretation of linear probes in large language models, finding that high probe accuracy for reasoning types reflects task format rather than distinct internal computational structures. Researchers probed the model's hidden states on the LogiQA 2.0 (deductive), ARC-Challenge (inductive), and αNLI (abductive) benchmarks. Linear probes initially achieved 100% cross-validated accuracy at layer 32, and the three classes showed distinct manifold geometry, with intrinsic dimensionalities of 20.6, 28.5, and 33.6. However, a four-stage confound analysis, which included residualizing format features such as source identity, option count, and response length, reduced probe accuracy to chance level. Trace-anchor similarity showed only 42.5% agreement with the intended reasoning modes, suggesting that the model applies a largely uniform reasoning strategy. Causal steering experiments with random-direction controls (n=20) yielded p=0.286, so they provide no evidence of a functional link between the observed geometry and reasoning mode selection.
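
As a rough illustration of how a steering effect can be compared against random-direction controls, the sketch below computes a one-sided empirical p-value using the common (k + 1) / (n + 1) convention. The summary does not say which test the study used, so the formula, the function name `empirical_p_value`, and all numbers in the example are assumptions made for illustration.

```python
import numpy as np

def empirical_p_value(observed_effect, control_effects):
    """One-sided empirical p-value with add-one smoothing: (k + 1) / (n + 1).

    k counts the random-direction controls whose effect is at least as
    large as the effect of the probe-derived steering direction.
    """
    control_effects = np.asarray(control_effects, dtype=float)
    k = int(np.sum(control_effects >= observed_effect))
    return (k + 1) / (control_effects.size + 1)

# Synthetic illustration only: 20 made-up control effects, 5 of which
# match or exceed a made-up observed effect of 0.5.
controls = [0.9, 0.8, 0.7, 0.6, 0.5] + [0.1] * 15
p = empirical_p_value(0.5, controls)
print(round(p, 3))  # (5 + 1) / (20 + 1) = 0.286
```

Under this convention, 20 controls can only produce multiples of 1/21, and 6/21 rounds to the reported 0.286. That match is suggestive rather than confirmed, since the source does not state how many controls matched or exceeded the steering effect.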

Key takeaway

AI scientists and machine learning engineers who interpret LLM internal states should critically re-evaluate their linear probing results. High probe accuracy for reasoning modes may reflect nothing more than task format differences, not genuine computational distinctions. To avoid misreading model capabilities, build format deconfounding (for example, residual analysis) and random-direction controls into interpretability pipelines. These checks help confirm that a probe is detecting functional structure rather than superficial artifacts, which in turn supports more accurate model development and evaluation.

Key insights

Linear probes in LLMs often detect task format confounds, not distinct reasoning mode representations, challenging common interpretability claims.

Principles

High linear probe accuracy is insufficient evidence for distinct internal representations.
Reasoning mode labels are often confounded with dataset source.
LLMs may employ a largely uniform reasoning strategy across task types.

Method

The pipeline combines multi-source dataset construction, hidden-state extraction, layer-wise linear probing, format confound analysis (residualization), and causal steering with random-direction controls.
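
The layer-wise probing step can be sketched as follows. This is a minimal NumPy example on synthetic data, not the study's code: the "hidden states" are random arrays with a class signal that grows with depth, the probe is a ridge-regularized least-squares classifier (one of several possible linear probes), and the helper names `fit_linear_probe`, `predict`, and `cv_accuracy` are hypothetical.

```python
import numpy as np

def fit_linear_probe(X, y, n_classes, ridge=1e-2):
    """Fit a least-squares linear classifier on one-hot targets (with bias)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    Y = np.eye(n_classes)[y]
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def predict(W, X):
    """Predict the class with the largest linear score."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmax(Xb @ W, axis=1)

def cv_accuracy(X, y, n_classes, n_folds=5, seed=0):
    """k-fold cross-validated accuracy of the linear probe."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    correct = 0
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        W = fit_linear_probe(X[train], y[train], n_classes)
        correct += int(np.sum(predict(W, X[test]) == y[test]))
    return correct / len(y)

# Synthetic stand-in for extracted hidden states:
# shape (n_layers, n_examples, d_model). Real activations would come
# from the model; these are random and purely illustrative.
rng = np.random.default_rng(0)
n_layers, n, d = 4, 300, 16
y = rng.integers(0, 3, size=n)           # three hypothetical mode labels
class_means = rng.normal(size=(3, d))
hidden = rng.normal(size=(n_layers, n, d))
for layer in range(n_layers):
    hidden[layer] += layer * class_means[y]  # signal grows with depth

layer_acc = [cv_accuracy(hidden[layer], y, 3) for layer in range(n_layers)]
for layer, acc in enumerate(layer_acc):
    print(f"layer {layer}: cross-validated probe accuracy = {acc:.2f}")
```

In this toy setup the first layer carries no class signal, so its accuracy should sit near chance (about one third), while the deepest layer should be close to perfect. High accuracy here says only that the labels are linearly decodable, not why they are.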

In practice

Always report source-prediction accuracy alongside mode-prediction accuracy.
Implement residual analysis to deconfound format features.
Use random-direction controls in steering-vector experiments.
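
The first two practices above can be sketched on synthetic data. The example below assumes a fully confounded design in which each reasoning-mode label coincides with one dataset source, builds toy "hidden states" that encode only format features, and compares probe accuracy before and after regressing those features out. The feature values, the helpers `probe_accuracy` and `residualize`, and the least-squares probe are all illustrative assumptions rather than the study's implementation.

```python
import numpy as np

def probe_accuracy(X, y, n_classes, train_frac=0.7, ridge=1e-2, seed=0):
    """Held-out accuracy of a least-squares linear probe (with bias)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    tr, te = idx[:cut], idx[cut:]
    Xb = np.hstack([X, np.ones((len(y), 1))])
    Y = np.eye(n_classes)[y[tr]]
    A = Xb[tr].T @ Xb[tr] + ridge * np.eye(Xb.shape[1])
    W = np.linalg.solve(A, Xb[tr].T @ Y)
    return float(np.mean(np.argmax(Xb[te] @ W, axis=1) == y[te]))

def residualize(H, F):
    """Remove the part of H that is linearly predictable from F."""
    Fb = np.hstack([F, np.ones((F.shape[0], 1))])
    beta, *_ = np.linalg.lstsq(Fb, H, rcond=None)
    return H - Fb @ beta

rng = np.random.default_rng(0)
n, d = 900, 16
source = rng.integers(0, 3, size=n)   # dataset of origin
mode = source.copy()                  # assumed: mode label == source

# Made-up format features: source identity, option count, response length.
option_count = np.array([4.0, 5.0, 2.0])[source]
response_len = 40.0 + 25.0 * source + rng.normal(scale=5.0, size=n)
F = np.column_stack([np.eye(3)[source], option_count, response_len])
F_std = (F - F.mean(axis=0)) / F.std(axis=0)

# Toy hidden states that carry format information plus noise only.
H = F_std @ rng.normal(size=(F.shape[1], d)) + rng.normal(size=(n, d))

acc_mode = probe_accuracy(H, mode, 3)
acc_source = probe_accuracy(H, source, 3)  # same labels here by construction
acc_after = probe_accuracy(residualize(H, F), mode, 3)

print(f"mode accuracy:   {acc_mode:.2f}")
print(f"source accuracy: {acc_source:.2f}")
print(f"mode accuracy after residualization: {acc_after:.2f}")
```

When source accuracy matches mode accuracy and residualization drives the probe to chance, the probe result cannot be attributed to reasoning mode. For simplicity the sketch fits the residualization on all examples before splitting; a stricter variant would fit it on the training split only.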

Topics

Linear Probing
LLM Interpretability
Reasoning Modes
Format Confounding
Causal Steering
Qwen3-14B

Code references

SubramanyamSahoo/Linear-Probes-Detect-Task-Format-Not-Reasoning-Mode

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.