Localizing Anchoring Pathways in Language Models
Summary
A study published on 2026-06-11 investigates how irrelevant numerical information in a prompt anchors language model judgments, particularly in numerical reasoning tasks. The researchers used a controlled multiple-choice setup and defined a logit-difference metric to track behavioral anchoring. They then applied attribution-based circuit localization to 7B to 8B Qwen and Llama models, in both base and instruction-tuned variants, and found that edge-level methods recover anchor-sensitive signals more faithfully than node-level methods. Low-anchor and high-anchor circuits transfer strongly within a single model, which suggests a shared pathway structure. Transfer between base and instruction-tuned variants was sparse, however, indicating that post-training significantly changes which internal pathways matter most for these decisions. The work provides a mechanistic account of how anchoring-related decision signals are carried within language models.
Key takeaway
For AI scientists and machine learning engineers working on model robustness and fairness, the mechanistic basis of biases such as anchoring matters. This research shows that anchoring effects are localized to specific internal pathways, and that post-training significantly changes which of those pathways drive the decision. Prioritize investigating how fine-tuning or instruction tuning modifies internal decision-making when building language models that are more reliable and less susceptible to anchoring.
Key insights
Irrelevant numerical anchors in prompts shift LM judgments via specific, localizable internal pathways.
Principles
- Anchoring effects are mechanistically localizable within LMs.
- Edge-level attribution recovers anchor-sensitive signals more faithfully than node-level attribution.
- Post-training alters critical decision pathways.
Method
A logit-difference metric tracks behavioral anchoring. Attribution-based circuit localization on multiple-choice setups identifies anchor-sensitive signal pathways.
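The summary does not give the exact form of the logit-difference metric, so the following is only a minimal sketch under stated assumptions: the metric is taken to be the logit of one answer option minus the logit of another, and the anchoring effect is taken to be the change in that difference when an anchor is added to the prompt. The function names, the option indices, and the example logits are all hypothetical.

```python
from typing import Sequence


def logit_difference(logits: Sequence[float], target_idx: int, baseline_idx: int) -> float:
    """Logit of the target answer option minus the logit of a baseline option.

    Assumption: `logits` holds one score per multiple-choice option.
    """
    return logits[target_idx] - logits[baseline_idx]


def anchoring_shift(
    logits_with_anchor: Sequence[float],
    logits_without_anchor: Sequence[float],
    target_idx: int,
    baseline_idx: int,
) -> float:
    """Change in the logit difference caused by adding the anchor to the prompt.

    A positive value means the anchor pushed the model toward the target option.
    """
    anchored = logit_difference(logits_with_anchor, target_idx, baseline_idx)
    neutral = logit_difference(logits_without_anchor, target_idx, baseline_idx)
    return anchored - neutral


# Hypothetical logits over four answer options (illustrative values, not from the study).
neutral_logits = [1.0, 2.0, 0.5, 0.0]
anchored_logits = [1.0, 3.5, 0.5, 0.0]

print(anchoring_shift(anchored_logits, neutral_logits, target_idx=1, baseline_idx=0))  # 1.5
```

A scalar of this kind is convenient for circuit localization because it can be differentiated with respect to internal activations, which is what attribution methods need.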
In practice
- Use edge-level attribution for LM circuit analysis.
- Consider post-training impact on model decision pathways.
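To illustrate what edge-level attribution can look like, here is a toy sketch of a first-order edge score: the change in an upstream component's activation between a clean and a corrupted prompt, dotted with the gradient of the metric at the downstream component's input. The summary does not say which attribution method the study used, so this formula, the component names, and all numbers are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Edge = Tuple[str, str]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def edge_scores(
    clean_acts: Dict[str, Vector],
    corrupt_acts: Dict[str, Vector],
    grads_at_input: Dict[str, Vector],
    edges: List[Edge],
) -> Dict[Edge, float]:
    """First-order score per edge (upstream, downstream).

    score = (corrupt activation - clean activation of upstream)
            . (gradient of the metric w.r.t. the downstream input)
    """
    scores: Dict[Edge, float] = {}
    for upstream, downstream in edges:
        delta = [c - k for c, k in zip(corrupt_acts[upstream], clean_acts[upstream])]
        scores[(upstream, downstream)] = dot(delta, grads_at_input[downstream])
    return scores


def top_k_edges(scores: Dict[Edge, float], k: int) -> List[Edge]:
    """Keep the k edges with the largest absolute score as the candidate circuit."""
    return sorted(scores, key=lambda e: abs(scores[e]), reverse=True)[:k]


# Hypothetical two-dimensional activations and gradients for a tiny graph.
clean = {"head_a": [1.0, 0.0], "head_b": [0.5, 0.5]}
corrupt = {"head_a": [0.0, 0.0], "head_b": [0.5, 1.5]}
grads = {"mlp_x": [2.0, 1.0], "mlp_y": [0.0, -3.0]}
graph = [("head_a", "mlp_x"), ("head_a", "mlp_y"), ("head_b", "mlp_x"), ("head_b", "mlp_y")]

scores = edge_scores(clean, corrupt, grads, graph)
print(top_k_edges(scores, 2))  # [('head_b', 'mlp_y'), ('head_a', 'mlp_x')]
```

In this toy graph the same upstream component gets different scores depending on which downstream component it feeds. A node-level score assigns one number per component and cannot express that distinction.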
Topics
- Language Models
- Anchoring Effects
- Circuit Localization
- Attribution Methods
- Qwen
- Llama
- Instruction Tuning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.