Localizing Anchoring Pathways in Language Models

2026-06-11 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study published on 2026-06-11 investigates how irrelevant numerical information in a prompt anchors language model judgments, particularly in numerical reasoning tasks. The researchers use a controlled multiple-choice setup and define a logit-difference metric to track behavioral anchoring. Applying attribution-based circuit localization to 7B-8B Qwen and Llama models, both base and instruction-tuned, they find that edge-level methods recover anchor-sensitive signals more faithfully than node-level methods. Circuits found for low and high anchors transfer strongly to each other within a single model, suggesting shared pathway structure. Transfer between base and instruction-tuned variants, however, is sparse, indicating that post-training substantially changes which internal pathways matter most for these decisions. The work offers a mechanistic account of how anchoring-related decision signals are carried within language models.

Key takeaway

For AI scientists and machine learning engineers working on model robustness and fairness, the mechanistic basis of biases such as anchoring matters. This research indicates that anchoring effects can be localized to specific internal pathways, and that post-training substantially changes which of those pathways are critical. Prioritize investigating how fine-tuning or instruction tuning modifies internal decision-making when building language models that are more reliable and less susceptible to anchoring.

Key insights

Irrelevant numerical anchors in prompts shift LM judgments via specific, localizable internal pathways.

Principles

Anchoring effects are mechanistically localizable within LMs.
Edge-level attribution recovers anchor-sensitive signals more faithfully than node-level attribution.
Post-training alters critical decision pathways.

Method

A logit-difference metric tracks behavioral anchoring in a controlled multiple-choice setup. Attribution-based circuit localization then identifies the pathways that carry anchor-sensitive signals.
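
As an illustration of what a logit-difference metric could look like, here is a minimal sketch. The summary does not give the study's exact definition, so everything below is an assumption: the function names, the choice to contrast the answer option nearest the high anchor with the option nearest the low anchor, and the example logits are all hypothetical.

```python
import numpy as np

def logit_difference(logits, high_idx, low_idx):
    """Logit of the high-anchor-consistent option minus the low-anchor-consistent one.

    Assumed definition for illustration only; not taken from the study.
    """
    logits = np.asarray(logits, dtype=float)
    return float(logits[high_idx] - logits[low_idx])

def anchoring_shift(logits_high_anchor, logits_low_anchor, high_idx, low_idx):
    """Change in the logit difference when the irrelevant anchor moves from low to high."""
    return (logit_difference(logits_high_anchor, high_idx, low_idx)
            - logit_difference(logits_low_anchor, high_idx, low_idx))

# Hypothetical answer-option logits (A, B, C, D) for the same question,
# once with a low anchor in the prompt and once with a high anchor.
logits_low = [2.0, 1.0, 0.5, 0.0]
logits_high = [1.0, 1.0, 0.5, 1.5]

# Option D (index 3) is assumed to sit nearest the high anchor, option A (index 0) nearest the low one.
shift = anchoring_shift(logits_high, logits_low, high_idx=3, low_idx=0)
print(shift)
```

Under this assumed definition, a positive shift means the high anchor pulled the model toward the high-anchor-consistent option, and a shift near zero means the anchor had little effect on the choice.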

In practice

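Prefer edge-level over node-level attribution when localizing LM circuits.
Account for how post-training changes decision pathways; circuits found in a base model may not carry over to its instruction-tuned variant.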
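
To make the edge-versus-node distinction concrete, the toy sketch below uses a first-order, attribution-patching-style score. This is an assumed stand-in, since the summary does not say which attribution method the study used. The function names and all numbers are hypothetical. An edge score is the dot product between the change in an upstream component's output (clean minus corrupted) and the gradient of the metric with respect to a downstream component's input; a node score here is simply the sum of that node's outgoing edge scores.

```python
import numpy as np

def edge_scores(delta_up, grad_down):
    """First-order edge scores: score[i, j] = delta_up[i] . grad_down[j]."""
    return np.asarray(delta_up, dtype=float) @ np.asarray(grad_down, dtype=float).T

def node_scores(edges):
    """Node score per upstream component: the sum of its outgoing edge scores."""
    return edges.sum(axis=1)

# Hypothetical toy values: 2 upstream nodes, 2 downstream nodes, hidden size 2.
delta_up = np.array([[1.0, 0.0],     # upstream node 0: clean minus corrupted output
                     [0.0, 0.5]])    # upstream node 1
grad_down = np.array([[2.0, 1.0],    # d(metric)/d(input) for downstream node 0
                      [-2.0, 1.0]])  # d(metric)/d(input) for downstream node 1

edges = edge_scores(delta_up, grad_down)
nodes = node_scores(edges)
print(edges)
print(nodes)
```

In this toy case upstream node 0 has two edges with scores of equal size and opposite sign, so its node-level score cancels to zero even though each edge carries signal. This is one plausible reason edge-level methods could recover pathways that node-level methods miss; the study's own explanation may differ.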

Topics

Language Models
Anchoring Effects
Circuit Localization
Attribution Methods
Qwen
Llama
Instruction Tuning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.