Localizing Anchoring Pathways in Language Models

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Research on anchoring effects in language models examines how irrelevant numerical anchors in a prompt shift LLM judgments. Using a controlled multiple-choice task and a logit-difference metric, the researchers applied attribution-based circuit localization, specifically edge attribution patching with integrated gradients (EAP-IG), to 7B-8B Qwen and Llama base and instruction-tuned models. Edge-level localization recovered anchor-sensitive signals more faithfully than node-level methods. Within a model, low- and high-anchor circuits transfer well to each other and share pathway structure, with approximately two-thirds of their top 5% edges overlapping. Transferability between base and instruction-tuned variants is less consistent, suggesting that post-training alters the most relevant pathways. Localization patterns also differ by family: attribution in Qwen models falls in mid-to-late layers, while Llama models concentrate it earlier.
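
The two-thirds overlap figure can be made concrete with a small sketch. It assumes each circuit is a mapping from a hypothetical edge name to an attribution score, that edges are ranked by absolute score, and that overlap is the shared fraction of the two top-5% sets. The study's exact definition may differ, and the scores below are toy values, not results.

```python
def top_fraction_edges(scores, fraction=0.05):
    """Return the names of the highest-|score| edges (top `fraction` of all edges)."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(scores, key=lambda edge: abs(scores[edge]), reverse=True)
    return set(ranked[:k])


def top_edge_overlap(scores_a, scores_b, fraction=0.05):
    """Share of circuit A's top edges that also appear in circuit B's top edges."""
    top_a = top_fraction_edges(scores_a, fraction)
    top_b = top_fraction_edges(scores_b, fraction)
    return len(top_a & top_b) / len(top_a)


# Toy scores over 60 hypothetical edges; not data from the study.
low_anchor = {f"e{i}": 60.0 - i for i in range(60)}
high_anchor = dict(low_anchor)
high_anchor["e2"] = 0.5    # demote one edge in the high-anchor circuit
high_anchor["e5"] = 100.0  # promote another edge into the top set

overlap = top_edge_overlap(low_anchor, high_anchor)
print(f"top-5% overlap: {overlap:.2f}")
```

Because both circuits are scored over the same edge set, the two top-5% sets have equal size and the ratio is symmetric.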

Key takeaway

For AI scientists and machine learning engineers working on LLM robustness, understanding anchoring bias calls for granular, edge-level circuit analysis. Prefer methods such as EAP-IG for identifying specific pathways, and note that low- and high-anchor effects share much of their internal structure. Instruction tuning can alter these pathways, so re-evaluate bias mitigation strategies after post-training to confirm they remain effective.

Key insights

Anchoring bias in LLMs is localized to sparse, shared internal pathways, but these pathways shift with instruction tuning.

Principles

Edge-level attribution localizes anchor-sensitive behavior more faithfully than node-level attribution.
Low- and high-anchor effects share substantial circuit structure.
Instruction tuning alters critical sparse edges and pathway importance.

Method

Circuit localization applies EAP-IG, an attribution patching method, to a logit-difference metric that compares the correct answer option with the anchor option in a multiple-choice task, identifying the anchor-sensitive pathways.
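
A minimal sketch of the kind of logit-difference metric described above. It assumes the metric is the logit of the correct option minus the logit of the anchor option, averaged over examples. The sign convention, the function names, and the toy logits are illustrative assumptions, not the study's implementation.

```python
from statistics import mean


def logit_difference(option_logits, correct_idx, anchor_idx):
    """Logit of the correct option minus the logit of the anchor option."""
    return option_logits[correct_idx] - option_logits[anchor_idx]


def mean_logit_difference(examples):
    """Average the metric over (option_logits, correct_idx, anchor_idx) triples."""
    return mean(logit_difference(logits, c, a) for logits, c, a in examples)


# Toy logits over four answer options; not data from the study.
examples = [
    ([2.0, 0.5, -1.0, 0.0], 0, 1),    # model still prefers the correct option
    ([0.25, 1.25, 0.0, -0.5], 0, 1),  # model is pulled toward the anchor option
]
metric = mean_logit_difference(examples)
print(metric)
```

Under this convention a lower value means the anchor option gains ground on the correct one, so an attribution method differentiates this scalar to score edges.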

In practice

Prefer edge-level methods such as EAP-IG over node-level methods when localizing anchoring behavior.
Re-evaluate bias mitigation strategies after instruction tuning.
Account for model family when reading layer-wise attribution: Qwen concentrates it in mid-to-late layers, Llama earlier.
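
One way to compare layer-wise attribution patterns across model families is to summarize edge scores per layer. The sketch below assumes edges are given as hypothetical (source layer, destination layer, score) tuples and credits each edge's absolute score to its destination layer. Both choices are assumptions made for illustration.

```python
from collections import defaultdict


def layer_attribution_profile(edges):
    """Share of total |attribution| received per destination layer.

    `edges` is an iterable of (src_layer, dst_layer, score) tuples.
    """
    totals = defaultdict(float)
    for _src, dst, score in edges:
        totals[dst] += abs(score)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {layer: totals[layer] / grand_total for layer in sorted(totals)}


def attribution_weighted_depth(profile, num_layers):
    """Mean destination layer scaled to [0, 1]; larger means later in the network."""
    return sum(layer * share for layer, share in profile.items()) / (num_layers - 1)


# Toy edges for a 4-layer model; not data from the study.
toy_edges = [(0, 1, 0.5), (1, 3, -1.0), (2, 3, 0.5)]
profile = layer_attribution_profile(toy_edges)
depth = attribution_weighted_depth(profile, num_layers=4)
print(profile, round(depth, 3))
```

Scaling depth by the layer count makes the summary comparable between models with different numbers of layers.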

Topics

Language Models
Anchoring Bias
Mechanistic Interpretability
Circuit Localization
Attribution Patching
Instruction Tuning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.