The Shape of Wisdom: Decision Trajectories in Language Models
Summary
A study on language model decision trajectories, titled "The Shape of Wisdom," analyzed 9,000 trajectories across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3 models using the MMLU benchmark. The research describes each trajectory by its current answer margin, the next-layer change in that margin, and the distance from a decision flip. A key finding is that correctness and stability are distinct properties, with the largest group of answers being "unstable-correct." Further analysis on a traced subset revealed that in stable-correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not. Span deletion experiments showed that removing answer-supporting text hurts the margin, and removing distractor-like text helps it. This work provides a reproducible method to identify settled versus fragile answers and their influencing sources.
Key takeaway
For machine learning engineers evaluating language model robustness, understanding decision trajectories reveals which answers are fragile despite being correct. You should analyze these trajectories to identify and mitigate potential failure points in critical applications, especially when deploying models like Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, or Mistral-7B-Instruct-v0.3. This approach helps ensure model reliability beyond simple accuracy metrics.
Key insights
Correctness and stability are distinct properties of language model decision trajectories.
Principles
- LM answers evolve across layers, not just at output.
- Unstable-correct answers form the largest group.
- In stable-correct cases on the traced subset, the average attention scalar points toward the correct answer; the average MLP scalar does not.
Method
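The study describes each trajectory by three quantities: the current answer margin, the next-layer change in that margin, and the distance from a decision flip. It then traces a subset of trajectories to analyze attention and MLP scalar contributions, and runs span deletion experiments to measure how removing parts of the prompt changes the margin.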
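The three trajectory descriptors can be illustrated with a minimal sketch. The summary does not give the study's exact formulas, so everything here is an assumption: the margin is taken as the correct option's score minus the best competing score at each layer, the distance from a decision flip is read as the absolute margin, and the `classify` rule with its `tail` parameter is a hypothetical stand-in for whatever stability criterion the authors use.

```python
from typing import Dict, List, Optional


def answer_margins(layer_scores: List[List[float]], correct: int) -> List[float]:
    """Per-layer margin: correct-option score minus the best competing score (assumed definition)."""
    margins = []
    for scores in layer_scores:
        best_other = max(s for i, s in enumerate(scores) if i != correct)
        margins.append(scores[correct] - best_other)
    return margins


def describe(margins: List[float]) -> List[Dict[str, Optional[float]]]:
    """Attach the three descriptors to every layer of one trajectory."""
    rows = []
    for layer, m in enumerate(margins):
        # Next-layer change in the margin; undefined at the last layer.
        drift = margins[layer + 1] - m if layer + 1 < len(margins) else None
        # Assumed reading of "distance from a decision flip": how far the
        # margin is from zero, where the preferred answer would change.
        rows.append({"layer": layer, "margin": m, "drift": drift, "flip_distance": abs(m)})
    return rows


def classify(margins: List[float], tail: int = 2) -> str:
    """Hypothetical rule: 'stable' if the last `tail` layers all agree with the final decision."""
    final_correct = margins[-1] > 0
    stable = all((m > 0) == final_correct for m in margins[-tail:])
    return f"{'stable' if stable else 'unstable'}-{'correct' if final_correct else 'wrong'}"


# Toy per-layer scores for four answer options (made-up numbers); option 0 is correct.
layer_scores = [
    [1.0, 3.0, 2.0, 0.0],
    [3.5, 3.0, 2.0, 0.0],
    [2.5, 3.0, 2.0, 0.0],
    [5.0, 3.0, 2.0, 0.0],
]
margins = answer_margins(layer_scores, correct=0)
rows = describe(margins)
label = classify(margins)
print(margins, label)
```

In this toy trajectory the final answer is correct, but the margin changes sign one layer before the output, so the hypothetical rule labels it as unstable. That is the kind of case final-answer accuracy alone would not reveal.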
In practice
- Identify fragile LM answers before deployment.
- Pinpoint text segments influencing LM decisions.
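A span deletion check of the kind described in the summary could be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: `span_deletion_effects` is a hypothetical helper, and `toy_margin` is a keyword-counting stand-in for a real model call that would return the answer margin for a prompt.

```python
from typing import Callable, List, Tuple


def span_deletion_effects(
    prompt: str,
    spans: List[str],
    margin_fn: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Return (span, margin change) for each span removed on its own."""
    base = margin_fn(prompt)
    effects = []
    for span in spans:
        ablated = prompt.replace(span, "", 1)  # delete the first occurrence only
        effects.append((span, margin_fn(ablated) - base))
    return effects


def toy_margin(prompt: str) -> float:
    # Stand-in for a model: supporting text raises the margin,
    # distractor-like text lowers it. A real margin_fn would run the LM.
    return float(prompt.count("boils at 100") - prompt.count("some say 90"))


prompt = (
    "Water boils at 100 C at sea level. "
    "However, some say 90 C. "
    "At what temperature does water boil?"
)
spans = ["Water boils at 100 C at sea level.", "However, some say 90 C."]
effects = span_deletion_effects(prompt, spans, toy_margin)
for span, delta in effects:
    print(f"{delta:+.1f}  {span}")
```

A negative change marks a span that supports the answer (removing it hurts the margin); a positive change marks distractor-like text (removing it helps).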
Topics
- Language Models
- Decision Trajectories
- MMLU Benchmark
- Model Stability
- Attention Mechanisms
- MLP Layers
- Model Interpretability
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.