Localizing Anchoring Pathways in Language Models
Summary
Research on anchoring effects in language models examines how irrelevant numbers in a prompt shift LLM judgments. Using a controlled multiple-choice task and a logit-difference metric, the researchers applied attribution-based circuit localization, specifically EAP-IG (edge attribution patching with integrated gradients), to 7B-8B Qwen and Llama models in both base and instruction-tuned variants. Edge-level localization recovered anchor-sensitive signals more faithfully than node-level methods. Within a model, low- and high-anchor circuits transfer well to each other and share pathway structure, with approximately two-thirds of their top 5% edges overlapping. Transfer between base and instruction-tuned variants is less consistent, suggesting that post-training significantly alters which pathways are most relevant. Attribution in Qwen models falls in mid-to-late layers, while Llama models concentrate it earlier, indicating family-specific differences in localization patterns.
Key takeaway
If you work on LLM robustness, analyze anchoring bias at the level of individual edges rather than whole nodes. Prioritize methods like EAP-IG to identify the specific pathways involved; low- and high-anchor effects share much of their internal structure. Because instruction tuning can alter these pathways, re-evaluate bias mitigation strategies after post-training to confirm they remain effective.
Key insights
Anchoring bias in LLMs is localized to sparse, shared internal pathways, but these pathways shift with instruction tuning.
Principles
- Edge-level attribution localizes anchor-sensitive behavior more faithfully than node-level attribution.
- Low and high anchor effects share substantial circuit structure.
- Instruction tuning alters critical sparse edges and pathway importance.
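The shared-structure claim rests on comparing the highest-scoring edges of two circuits, which the summary reports as roughly two-thirds overlap among the top 5% of edges. The sketch below shows one way such an overlap could be computed. It is an illustration under assumptions, not the paper's procedure: the function names, the ranking by absolute score, the definition of overlap as shared edges divided by the number of selected edges, and the toy scores are all hypothetical.

```python
from typing import Dict, Set


def top_fraction_edges(scores: Dict[str, float], fraction: float = 0.05) -> Set[str]:
    """Return the edges whose absolute attribution score is in the top `fraction`.

    Assumption: edges are ranked by |score|; at least one edge is always kept.
    """
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(scores, key=lambda edge: abs(scores[edge]), reverse=True)
    return set(ranked[:k])


def overlap_fraction(scores_a: Dict[str, float],
                     scores_b: Dict[str, float],
                     fraction: float = 0.05) -> float:
    """Share of top edges in circuit A that also appear among the top edges of B.

    Assumption: both dictionaries score the same set of edges, so both
    top sets have the same size.
    """
    top_a = top_fraction_edges(scores_a, fraction)
    top_b = top_fraction_edges(scores_b, fraction)
    return len(top_a & top_b) / len(top_a)


# Toy, made-up scores for six hypothetical edges (not data from the study).
low_anchor = {"e1": 0.9, "e2": -0.8, "e3": 0.7, "e4": 0.1, "e5": 0.05, "e6": 0.0}
high_anchor = {"e1": -0.6, "e2": 0.5, "e3": 0.01, "e4": 0.4, "e5": 0.02, "e6": 0.0}

# A fraction of 0.5 keeps three edges here; two of them (e1, e2) are shared.
shared = overlap_fraction(low_anchor, high_anchor, fraction=0.5)
```

With real attribution scores, `fraction=0.05` would correspond to the top 5% threshold mentioned in the summary.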
Method
Circuit localization applies edge attribution patching with integrated gradients (EAP-IG) to a logit-difference metric that compares the correct answer option with the anchor option in a multiple-choice task, identifying anchor-sensitive pathways.
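To make the metric concrete, here is a minimal sketch of a logit difference over multiple-choice options. The sign convention (correct option minus anchor option), the function names, and the example logits are assumptions for illustration; the summary does not specify these details.

```python
from typing import List, Sequence, Tuple


def logit_difference(option_logits: Sequence[float],
                     correct_idx: int,
                     anchor_idx: int) -> float:
    """Logit of the correct option minus the logit of the anchor option.

    Assumed convention: positive values mean the model prefers the correct
    answer, negative values mean it is pulled toward the anchor.
    """
    return option_logits[correct_idx] - option_logits[anchor_idx]


def mean_logit_difference(examples: List[Tuple[Sequence[float], int, int]]) -> float:
    """Average the logit difference over (logits, correct_idx, anchor_idx) examples."""
    diffs = [logit_difference(logits, c, a) for logits, c, a in examples]
    return sum(diffs) / len(diffs)


# Made-up logits over four answer options for two hypothetical prompts.
examples = [
    ([2.0, 0.5, 1.0, -1.0], 0, 2),  # correct option favored: 2.0 - 1.0
    ([0.0, 3.0, 1.5, 0.5], 2, 1),   # anchor option favored: 1.5 - 3.0
]
average = mean_logit_difference(examples)
```

An attribution method such as EAP-IG would then score each edge by how much it contributes to this scalar.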
In practice
- Prioritize EAP-IG for precise mechanistic interpretability studies.
- Re-evaluate bias mitigation strategies post-instruction tuning.
- Account for model family when interpreting layer-wise attribution patterns.
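One simple way to compare where attribution sits across model families is to aggregate edge scores by layer. The sketch below is a hypothetical illustration: it assumes each edge has already been assigned to a single layer index (for example, the layer of its destination node), uses absolute scores, and summarizes the profile with an attribution-weighted mean layer. None of these choices come from the source.

```python
from typing import Iterable, List, Tuple


def layer_profile(edge_scores: Iterable[Tuple[int, float]], n_layers: int) -> List[float]:
    """Share of total absolute attribution that falls in each layer.

    Assumption: each edge is given as (layer_index, score).
    """
    totals = [0.0] * n_layers
    for layer, score in edge_scores:
        totals[layer] += abs(score)
    total = sum(totals)
    if total == 0:
        return totals
    return [t / total for t in totals]


def mean_layer(profile: List[float]) -> float:
    """Attribution-weighted average layer index; higher means later layers."""
    return sum(index * share for index, share in enumerate(profile))


# Made-up scores for a hypothetical four-layer model.
edges = [(0, 0.5), (1, -1.5), (3, 2.0)]
profile = layer_profile(edges, n_layers=4)
center = mean_layer(profile)
```

Dividing `center` by the number of layers gives a depth fraction that can be compared between models with different layer counts.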
Topics
- Language Models
- Anchoring Bias
- Mechanistic Interpretability
- Circuit Localization
- Attribution Patching
- Instruction Tuning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.