Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

2026-04-30 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

Perturbation probing is a diagnostic method that generates task-specific causal hypotheses about FFN neurons in large language models (LLMs) using two forward passes per prompt and no backpropagation. A one-time intervention sweep of approximately 150 passes then tests those hypotheses. Across eight behavioral circuits, 13 models, and four architecture families, the technique identified two primary circuit structures. "Opposition circuits" suppress pre-training tendencies. In safety refusal, for example, 50 neurons (0.014% of the total) control refusal templates, and intervening on them alters 80% of response formats on 520 AdvBench prompts with near-zero harmful compliance. "Routing circuits" manage pre-training behaviors that are distributed via attention. In language selection, for example, residual-stream direction injection switches output from English to Chinese on 99.1% of 580 prompts in specific bilingual models. The FFN-to-skip signal ratio, derived from the initial two passes, distinguishes the two structures and predicts which intervention is suitable. Circuit topology varies across models: in Qwen3.5-2B, ablating 20 neurons eliminates sycophantic capitulation, while amplifying 10 related neurons improves factual correction from 52% to 88% on TruthfulQA.

Key takeaway

For research scientists investigating LLM interpretability and behavioral control, Perturbation Probing offers a precise, efficient toolkit for understanding and editing model behavior. You should consider applying this two-pass diagnostic to identify and target specific FFN neurons responsible for behaviors like safety refusal or language switching, enabling fine-grained control over model outputs without extensive retraining.

Key insights

Perturbation probing efficiently diagnoses LLM FFN circuits to understand and precisely edit model behavior.

Principles

FFN-to-skip signal ratio predicts intervention type.
Circuit topology varies by LLM architecture.
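
The summary does not give a formula for the FFN-to-skip signal ratio, so the following is only a minimal sketch under a stated assumption: that the ratio compares how much the FFN output changes between the two passes with how much the skip (residual) path changes. The function names and toy vectors are hypothetical and are not taken from the paper.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def ffn_to_skip_ratio(ffn_a, ffn_b, skip_a, skip_b):
    """Assumed definition: norm of the FFN-output change between two passes
    divided by the norm of the skip-path change between the same passes."""
    d_ffn = l2_norm([b - a for a, b in zip(ffn_a, ffn_b)])
    d_skip = l2_norm([b - a for a, b in zip(skip_a, skip_b)])
    if d_skip == 0.0:
        # No change on the skip path: the ratio is unbounded unless the
        # FFN path is also unchanged.
        return math.inf if d_ffn > 0.0 else 0.0
    return d_ffn / d_skip


# Toy activations at one layer for pass A and pass B (illustrative values).
ffn_a = [0.0, 0.0, 0.0]
ffn_b = [3.0, 4.0, 0.0]      # FFN change has norm 5
skip_a = [1.0, 1.0, 1.0]
skip_b = [1.0, 2.0, 1.0]     # skip change has norm 1

ratio = ffn_to_skip_ratio(ffn_a, ffn_b, skip_a, skip_b)
print(ratio)  # 5.0
```

Under this reading, a large ratio would suggest the behavior is carried mainly by FFN neurons (a candidate for neuron-level ablation or amplification), while a small ratio would point toward a signal arriving through the skip path (a candidate for residual-stream injection). The summary gives no decision threshold, so none is assumed here.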

Method

Perturbation probing runs two forward passes per prompt, with no backpropagation, to generate causal hypotheses about FFN neurons. A one-time intervention sweep of approximately 150 passes over the identified neurons follows.
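
As an illustration of the two-pass idea, here is a hedged sketch on a toy one-layer FFN. It assumes the two passes are a baseline pass and a perturbed pass over the same prompt, and that candidate neurons are those whose activations shift most between them. The summary does not say what the perturbation is or how neurons are scored, so the scoring rule, `rank_neurons`, and the weights below are all hypothetical.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def hidden_activations(w_in, x):
    """Hidden layer of a toy FFN: relu(W_in @ x), one row of W_in per neuron."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]


def rank_neurons(w_in, x_base, x_perturbed, top_k):
    """Two forward passes, no gradients: score each neuron by the absolute
    change in its activation and return the top_k neuron indices."""
    base = hidden_activations(w_in, x_base)        # pass 1
    pert = hidden_activations(w_in, x_perturbed)   # pass 2
    shifts = [abs(p - b) for b, p in zip(base, pert)]
    order = sorted(range(len(shifts)), key=lambda i: shifts[i], reverse=True)
    return order[:top_k], shifts


# Hypothetical weights: three hidden neurons reading a 2-dimensional input.
W_IN = [
    [1.0, 0.0],   # neuron 0 reads only feature 0
    [0.0, 1.0],   # neuron 1 reads only feature 1
    [0.5, 0.5],   # neuron 2 reads both
]
x_base = [1.0, 1.0]
x_perturbed = [1.0, 3.0]   # only feature 1 is perturbed

candidates, shifts = rank_neurons(W_IN, x_base, x_perturbed, top_k=2)
print(candidates)  # [1, 2]
print(shifts)      # [0.0, 2.0, 1.0]
```

The ranked indices are only hypotheses. In the method as summarized, the separate intervention sweep is what tests whether the candidate neurons actually control the behavior.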

In practice

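Ablate the 50 identified neurons to control safety-refusal templates.
Amplify 10 neurons to improve factual correction (52% to 88% on TruthfulQA in Qwen3.5-2B).
Use residual-stream direction injection for language steering (English to Chinese output).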
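
The three interventions listed above can be sketched on plain vectors. This is a minimal illustration, assuming that ablation means setting a neuron's activation to zero, amplification means multiplying it by a factor greater than one, and injection means adding a scaled direction vector to the residual stream. The helper names, neuron indices, and scale values are hypothetical, since the summary does not specify them.

```python
def scale_neurons(hidden, neuron_ids, factor):
    """Return a copy of the hidden activations with the chosen neurons
    multiplied by factor (0.0 ablates, values above 1.0 amplify)."""
    targets = set(neuron_ids)
    return [h * factor if i in targets else h for i, h in enumerate(hidden)]


def inject_direction(residual, direction, strength):
    """Return residual + strength * direction, element by element."""
    return [r + strength * d for r, d in zip(residual, direction)]


hidden = [0.25, 1.5, 0.5, 0.0]

ablated = scale_neurons(hidden, [1], 0.0)     # zero out neuron 1
amplified = scale_neurons(hidden, [2], 2.0)   # double neuron 2
print(ablated)    # [0.25, 0.0, 0.5, 0.0]
print(amplified)  # [0.25, 1.5, 1.0, 0.0]

residual = [1.0, -1.0, 0.5]
direction = [0.0, 1.0, 0.0]                   # stand-in steering direction
steered = inject_direction(residual, direction, 4.0)
print(steered)    # [1.0, 3.0, 0.5]
```

In a real model these operations would be applied inside the forward pass at a chosen layer, for instance through an activation hook, and the effect would be measured on generated outputs rather than on raw vectors.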

Topics

Perturbation Probing
FFN Behavioral Circuits
Mechanistic Interpretability
Opposition Circuits
Routing Circuits

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.