Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Summary
Perturbation probing is a diagnostic method that generates task-specific causal hypotheses about FFN neurons in large language models (LLMs) using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of approximately 150 passes. Across eight behavioral circuits, 13 models, and four architecture families, the technique identified two primary circuit structures. "Opposition circuits" suppress pre-training tendencies: in safety refusal, 50 neurons (0.014% of the total) control refusal templates, and intervening on them alters 80% of response formats on 520 AdvBench prompts with near-zero harmful compliance. "Routing circuits" manage pre-training behaviors that are distributed via attention: in language selection, injecting a direction into the residual stream switches output from English to Chinese on 99.1% of 580 prompts in specific bilingual models. The FFN-to-skip signal ratio, computed from the initial two passes, distinguishes the two structures and predicts which intervention is suitable. Circuit topology also varies between models. In Qwen3.5-2B, ablating 20 neurons eliminates sycophantic capitulation, while amplifying 10 related neurons improves factual correction from 52% to 88% on TruthfulQA.
Key takeaway
For research scientists investigating LLM interpretability and behavioral control, perturbation probing offers an inexpensive way to locate and edit the components behind a specific behavior. Consider applying the two-pass diagnostic to identify the FFN neurons responsible for opposition-style behaviors such as safety refusal, and use the FFN-to-skip signal ratio to recognize routing-style behaviors such as language switching, where residual-stream injection is the better fit. Either way, the edits give fine-grained control over model outputs without extensive retraining.
Key insights
Perturbation probing diagnoses FFN circuits in LLMs cheaply enough to guide precise edits to model behavior.
Principles
- FFN-to-skip signal ratio predicts intervention type.
- Circuit topology varies by LLM architecture.
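The summary does not give a formula for the FFN-to-skip signal ratio, so the following is only a minimal sketch under stated assumptions: the ratio is taken to be the size of the change in the FFN output between the two passes divided by the size of the change on the skip (residual) path. The function names, the threshold of 1.0, and the mapping from a high ratio to neuron-level edits are all hypothetical illustrations, not the paper's definitions.

```python
import math

def l2_norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def ffn_to_skip_ratio(ffn_base, ffn_pert, skip_base, skip_pert):
    """Assumed definition: ||delta FFN output|| / ||delta skip path||."""
    ffn_delta = [b - a for a, b in zip(ffn_base, ffn_pert)]
    skip_delta = [b - a for a, b in zip(skip_base, skip_pert)]
    # Small epsilon avoids division by zero when the skip path is unchanged.
    return l2_norm(ffn_delta) / (l2_norm(skip_delta) + 1e-12)

def suggest_intervention(ratio, threshold=1.0):
    """Hypothetical rule: FFN-dominated signal suggests neuron-level edits."""
    if ratio >= threshold:
        return "neuron-level (ablate/amplify)"
    return "residual-stream injection"

# Toy vectors standing in for one layer's outputs on the two passes.
ratio = ffn_to_skip_ratio([0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [1.0, 1.0])
choice = suggest_intervention(ratio)
print(round(ratio, 3), choice)
```

In a real model the four vectors would come from hooks on a transformer block, and the threshold would need to be calibrated per model rather than fixed.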
Method
Perturbation probing runs two forward passes per prompt to generate causal hypotheses about FFN neurons, then tests those hypotheses with a one-time intervention sweep of approximately 150 passes over the identified neurons. Neither step requires backpropagation.
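The two-pass step can be sketched on a toy FFN layer. This is an illustration under assumptions, not the paper's procedure: the summary does not say what the perturbation is or how neurons are scored, so the sketch assumes a baseline input and a perturbed input and ranks neurons by the absolute change in their activation. `ffn_hidden`, `probe_neurons`, and `W_IN` are hypothetical names.

```python
def ffn_hidden(x, w_in):
    """ReLU hidden activations of a toy FFN layer: h = relu(W_in @ x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]

def probe_neurons(x_base, x_perturbed, w_in, top_k=2):
    """Rank neurons by activation change between two forward passes."""
    h_base = ffn_hidden(x_base, w_in)        # pass 1: baseline input
    h_pert = ffn_hidden(x_perturbed, w_in)   # pass 2: perturbed input
    scores = [abs(b - a) for a, b in zip(h_base, h_pert)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k], scores

# Toy input weights: 4 neurons reading a 3-dimensional input.
W_IN = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

# The perturbation changes only the second input dimension.
candidates, scores = probe_neurons([1.0, 0.5, 0.2], [1.0, 2.5, 0.2], W_IN)
print(candidates, scores)
```

The returned candidates are only hypotheses. In the method as summarized, the separate intervention sweep is what tests whether those neurons actually drive the behavior.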
In practice
- Ablate 50 neurons for safety refusal control.
- Amplify 10 neurons to improve factual correction.
- Use residual-stream injection for language steering.
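The three interventions listed above can be sketched on plain vectors. This is a minimal illustration, assuming ablation means zeroing a neuron's activation, amplification means multiplying it by a constant, and injection means adding a scaled unit direction to the residual stream. The helper names, the scale factors, and the toy values are hypothetical; the summary does not state the values used in the paper.

```python
import math

def scale_neurons(hidden, neuron_ids, factor):
    """Copy hidden activations, scaling the selected neurons.

    factor=0.0 ablates the neurons; factor>1.0 amplifies them.
    """
    ids = set(neuron_ids)
    return [h * factor if i in ids else h for i, h in enumerate(hidden)]

def inject_direction(residual, direction, alpha):
    """Add alpha times the unit-normalized direction to the residual vector."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [r + alpha * d / norm for r, d in zip(residual, direction)]

hidden = [0.5, 2.0, 1.0, 0.0]
ablated = scale_neurons(hidden, [1], 0.0)      # silence neuron 1
amplified = scale_neurons(hidden, [2], 3.0)    # triple neuron 2
steered = inject_direction([1.0, 1.0], [3.0, 4.0], alpha=5.0)
print(ablated, amplified, steered)
```

In a real model these functions would run inside forward hooks on the chosen layer, so the edit applies at every generated token rather than to a single vector.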
Topics
- Perturbation Probing
- FFN Behavioral Circuits
- Mechanistic Interpretability
- Opposition Circuits
- Routing Circuits
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.