Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study of apparent large language model (LLM) triage failures concludes that they stem from the output format, not from a lack of clinical knowledge. Its starting point is the high under-triage rate reported for consumer LLMs when answers are constrained to multiple choice rather than free text. Using sparse-autoencoder (SAE) features in Gemma 3 4B/12B IT and Qwen3-8B, the researchers found that medical features activate on the clinical narrative under both formats but fall silent at the multiple-choice decision token. Three independent methods (natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization) confirmed that scaffold and format features, not medical features, drive the decision logits. The multiple-choice penalty inverts under structured and natural-language input, and failures are dominated by "off-by-one" errors.

Key takeaway

If you evaluate LLM clinical triage performance, treat apparent knowledge failures as possible output-format artifacts rather than deficits in the model's internal clinical understanding. Investigate the influence of scaffold and format features on decision logits, especially when comparing multiple-choice with free-text outputs. This shifts diagnostic effort from knowledge retrieval to the model's decision-mapping mechanisms, and may reveal "off-by-one" errors as a primary failure mode.

Key insights

LLM clinical triage failures originate in output format mechanisms, not internal clinical knowledge representation.

Principles

Same medical features fire across output formats
Output format features drive decision logits, not clinical features
Triage failures are often "off-by-one" errors
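
The "off-by-one" principle can be checked on any evaluation set with a simple error breakdown. The sketch below is illustrative only: it assumes triage levels are encoded as integers on an ordinal scale where a higher number means more urgent, and it reads "off-by-one" as a prediction exactly one level away from the reference. The function name `error_breakdown` and the toy labels are hypothetical and do not come from the study.

```python
from collections import Counter


def error_breakdown(gold, pred):
    """Summarize triage errors on an ordinal scale (assumed: higher = more urgent)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    counts = Counter()
    for g, p in zip(gold, pred):
        diff = p - g
        if diff == 0:
            counts["correct"] += 1
        elif diff < 0:
            counts["under_triage"] += 1  # predicted less urgent than reference
        else:
            counts["over_triage"] += 1   # predicted more urgent than reference
        if abs(diff) == 1:
            counts["off_by_one"] += 1    # adjacent-level miss, either direction
    errors = counts["under_triage"] + counts["over_triage"]
    return {
        "n": len(gold),
        "correct": counts["correct"],
        "under_triage": counts["under_triage"],
        "over_triage": counts["over_triage"],
        "off_by_one": counts["off_by_one"],
        "off_by_one_share_of_errors": counts["off_by_one"] / errors if errors else 0.0,
    }


# Hypothetical toy labels, not data from the study.
gold = [3, 2, 1, 0, 3, 2]
pred = [2, 2, 1, 1, 0, 1]
summary = error_breakdown(gold, pred)
print(summary)
```

Reporting the off-by-one share separately from the raw under-triage rate helps distinguish near misses on the label mapping from gross misjudgments of acuity.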

Method

The study used sparse-autoencoder (SAE) features, together with natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization, to analyze LLM internal representations.
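
To make decision-token logit attribution concrete, here is a minimal sketch of one common linear form of it, which may differ from the study's exact procedure. It assumes each SAE feature writes to the residual stream along a decoder direction, and that the logit of an answer token is approximately the dot product of the residual stream with that token's unembedding vector (any final normalization is ignored). Under those assumptions, a feature's contribution is its activation times the projection of its decoder direction onto the unembedding vector. All names and numbers are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def feature_logit_attribution(activations, decoder_dirs, unembed_vec):
    """Per-feature contribution to one answer token's logit.

    contribution_i = activation_i * (decoder_dir_i . unembed_vec)
    Assumes a linear readout from the residual stream (no final norm).
    """
    return [a * dot(d, unembed_vec) for a, d in zip(activations, decoder_dirs)]


# Hypothetical toy values, not taken from the study.
feature_labels = ["medical", "format", "scaffold"]
activations = [0.0, 2.0, 1.5]      # SAE activations at the decision token
decoder_dirs = [
    [1.0, 0.0, 0.0],               # direction written by the medical feature
    [0.0, 1.0, 0.0],               # direction written by the format feature
    [0.0, 0.5, 0.5],               # direction written by the scaffold feature
]
unembed_option = [0.5, 1.0, 0.5]   # unembedding vector for one answer option

attributions = feature_logit_attribution(activations, decoder_dirs, unembed_option)
ranked = sorted(zip(feature_labels, attributions),
                key=lambda item: abs(item[1]), reverse=True)
for label, value in ranked:
    print(f"{label:10s} {value:+.3f}")
```

In this toy setup the medical feature is inactive at the decision token, so it contributes nothing to the logit regardless of how well its direction aligns with the answer option.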

In practice

Analyze SAE features to diagnose LLM decision-making
Investigate decision-token logit attribution for format influence
Audit scaffold and format features, which are critical for output reliability
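
One way to act on these points is to compare where a feature is active: on the clinical narrative, at the decision token, or both. The sketch below assumes you already have one activation value per token for a given SAE feature, plus the token span of the narrative and the index of the decision token. The helper `activation_profile`, the threshold, and the toy activations are all hypothetical illustrations, not the study's tooling.

```python
def activation_profile(acts_per_token, narrative_span, decision_index, threshold=0.0):
    """Compare a feature's peak narrative activation with its decision-token activation.

    narrative_span is a (start, end) pair with end exclusive.
    """
    start, end = narrative_span
    narrative = acts_per_token[start:end]
    peak = max(narrative) if narrative else 0.0
    at_decision = acts_per_token[decision_index]
    return {
        "narrative_peak": peak,
        "decision": at_decision,
        # Active somewhere on the narrative, but not at the decision token.
        "silent_at_decision": peak > threshold and at_decision <= threshold,
    }


# Hypothetical per-token activations for two features on a 5-token prompt.
medical_acts = [0.0, 1.2, 0.8, 0.0, 0.0]
format_acts = [0.0, 0.0, 0.0, 0.5, 2.0]

medical_profile = activation_profile(medical_acts, narrative_span=(0, 3), decision_index=4)
format_profile = activation_profile(format_acts, narrative_span=(0, 3), decision_index=4)
print("medical:", medical_profile)
print("format: ", format_profile)
```

A feature flagged as silent at the decision token still encodes the clinical content earlier in the prompt, which is the pattern the summary describes for medical features under multiple-choice output.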

Topics

Large Language Models
Clinical Triage
Sparse Autoencoders
Model Evaluation
Output Format
Medical AI

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.