Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate
Summary
High under-triage rates have been reported for consumer Large Language Models (LLMs) under constrained multiple-choice output, compared with free text. This study concludes that these apparent triage failures stem from the output format, not from a lack of clinical knowledge. Using sparse-autoencoder (SAE) features in Gemma 3 4B/12B IT and Qwen3-8B, the researchers found that medical features activate on the clinical narrative under both formats but fall silent at the multiple-choice decision token. Three independent methods (natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization) confirmed that scaffold and format features, not medical features, drive the decision logits. The multiple-choice penalty inverts under structured and natural-language input, and failures are dominated by "off-by-one" errors.
Key takeaway
When evaluating LLM clinical triage performance, AI scientists should recognize that apparent knowledge failures often reflect output-format biases rather than a deficit in the model's internal clinical understanding. Investigate the influence of scaffold and format features on decision logits, especially when comparing multiple-choice with free-text outputs. This perspective shifts diagnostic effort from knowledge retrieval to the model's decision-mapping mechanisms, and may reveal "off-by-one" errors as a primary failure mode.
Key insights
LLM clinical triage failures originate in output format mechanisms, not internal clinical knowledge representation.
Principles
- Same medical features fire across output formats
- Output format features drive decision logits, not clinical features
- Triage failures are often "off-by-one" errors
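To make the "off-by-one" principle concrete, the sketch below labels each prediction by its direction and distance on an ordinal triage scale. The four-level scale, the function names, and the example pairs are all hypothetical and chosen for illustration; they are not the scale or data used in the study.

```python
from collections import Counter

# Hypothetical ordinal triage scale, least to most urgent (an assumption,
# not the scale used in the study).
LEVELS = ["self-care", "routine", "urgent", "emergency"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def classify_error(gold: str, predicted: str) -> str:
    """Label one prediction by direction and distance on the ordinal scale."""
    gap = RANK[predicted] - RANK[gold]
    if gap == 0:
        return "correct"
    direction = "under" if gap < 0 else "over"
    size = "off-by-one" if abs(gap) == 1 else "far"
    return f"{direction}-triage ({size})"


def tally(pairs):
    """Count error labels over (gold, predicted) pairs."""
    return Counter(classify_error(g, p) for g, p in pairs)


# Made-up illustrative pairs, not data from the study.
example = [
    ("emergency", "urgent"),
    ("urgent", "urgent"),
    ("urgent", "routine"),
    ("routine", "emergency"),
]
counts = tally(example)
```

Separating adjacent-level misses from distant ones is one way to check whether a high under-triage rate reflects a near miss in the decision mapping rather than a wholesale misreading of the case.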
Method
The study used sparse-autoencoder (SAE) features together with three analyses (natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization) to examine the LLMs' internal representations.
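As a rough illustration of what decision-token logit attribution can look like, the sketch below uses a common linear approximation: each SAE feature's contribution to the logit difference between two answer tokens is its activation times the dot product of its decoder direction with the difference of the two unembedding vectors. This is not the study's code. It assumes the SAE reconstructs the final residual stream and ignores the final normalization layer, and every name and number is hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def feature_logit_attribution(activations, decoder, unembed_a, unembed_b):
    """Per-feature contribution to logit(A) - logit(B) at one token position.

    Assumes the SAE reconstructs the final residual stream and ignores the
    final normalization layer, so each feature contributes linearly:
        activation_i * (decoder_i . (unembed_a - unembed_b))
    """
    direction = [a - b for a, b in zip(unembed_a, unembed_b)]
    return [act * dot(dec, direction) for act, dec in zip(activations, decoder)]


# Toy numbers (all hypothetical): 3 features, residual width 2.
feature_names = ["chest-pain feature", "answer-letter feature", "list-format feature"]
activations = [0.0, 2.0, 1.0]   # the medical feature is silent at the decision token
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
unembed_a = [0.5, 1.0]          # unembedding vector for option token "A"
unembed_b = [0.0, 0.0]          # unembedding vector for option token "B"

scores = feature_logit_attribution(activations, decoder, unembed_a, unembed_b)

# Rank features by the magnitude of their contribution to the decision.
ranked = sorted(zip(feature_names, scores), key=lambda kv: abs(kv[1]), reverse=True)
```

In this toy setup the silent medical feature contributes nothing, so the ranking is led by the format-like features, which mirrors the pattern the summary describes.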
In practice
- Analyze SAE features to diagnose LLM decision-making
- Investigate decision-token logit attribution for format influence
- Treat scaffold and format features as critical to output reliability
Topics
- Large Language Models
- Clinical Triage
- Sparse Autoencoders
- Model Evaluation
- Output Format
- Medical AI
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.