Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

Perturbation probing is a diagnostic method that generates task-specific causal hypotheses about FFN neurons in large language models (LLMs) using two forward passes per prompt and no backpropagation. A one-time intervention sweep of approximately 150 passes follows.

Across eight behavioral circuits, 13 models, and four architecture families, the technique identified two primary circuit structures.

"Opposition circuits" suppress pre-training tendencies. In safety refusal, 50 neurons (0.014% of the total) control refusal templates, altering 80% of response formats on 520 AdvBench prompts with near-zero harmful compliance.

"Routing circuits" manage pre-training behaviors that are distributed via attention. In language selection, injecting a direction into the residual stream switches output from English to Chinese on 99.1% of 580 prompts in specific bilingual models.

The FFN-to-skip signal ratio, derived from the initial two passes, distinguishes the two structures and predicts intervention suitability. Circuit topology varies: in Qwen3.5-2B, ablating 20 neurons eliminates sycophantic capitulation, while amplifying 10 related neurons improves factual correction from 52% to 88% on TruthfulQA.
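The cost and sparsity figures in the summary can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and using 520 prompts for the diagnostic stage is an assumption (the summary reports 520 AdvBench prompts for evaluation, not necessarily for probing).

```python
def forward_pass_budget(num_prompts, passes_per_prompt=2, sweep_passes=150):
    """Total forward passes: per-prompt diagnostic passes plus the one-time sweep.

    The defaults come from the summary (two passes per prompt, roughly 150 sweep passes).
    """
    return num_prompts * passes_per_prompt + sweep_passes


def implied_total_neurons(selected, percent_of_total):
    """Neuron count implied by 'selected neurons are percent_of_total % of the total'."""
    return selected / (percent_of_total / 100.0)


# Assumption: 520 prompts are probed; the summary does not state the diagnostic set size.
budget = forward_pass_budget(520)

# 50 neurons reported as 0.014% of the total implies roughly 357,000 neurons overall.
total_neurons = implied_total_neurons(50, 0.014)

print(budget)                # 1190
print(round(total_neurons))  # 357143
```

Under these assumptions the whole diagnostic costs on the order of a thousand forward passes and no backward passes, which is where the efficiency claim comes from.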

Key takeaway

For research scientists investigating LLM interpretability and behavioral control, perturbation probing offers a precise, efficient toolkit for understanding and editing model behavior. Consider applying the two-pass diagnostic to locate the FFN neurons or residual-stream directions behind behaviors such as safety refusal or language switching, which enables fine-grained control over model outputs without extensive retraining.
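The three interventions named in the summary (ablating neurons, amplifying neurons, and injecting a direction into the residual stream) can be sketched on a toy FFN block. This is a minimal illustration, not the paper's implementation: the block layout (ReLU FFN plus a skip connection), the function names, and the tiny weights are all assumptions made for the example.

```python
def ffn_block(x, w_in, w_out, neuron_scale=None):
    """Toy residual FFN block: output = x + W_out @ relu(W_in @ x).

    neuron_scale maps a hidden-neuron index to a multiplier applied to its
    activation: 0.0 ablates the neuron, values above 1.0 amplify it.
    """
    neuron_scale = neuron_scale or {}
    hidden = [max(0.0, sum(w * a for w, a in zip(row, x))) for row in w_in]
    hidden = [h * neuron_scale.get(j, 1.0) for j, h in enumerate(hidden)]
    ffn_out = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return [a + f for a, f in zip(x, ffn_out)]  # skip path plus FFN path


def inject_direction(residual, direction, alpha):
    """Add alpha times a fixed direction to a residual-stream vector."""
    return [r + alpha * d for r, d in zip(residual, direction)]


# Hypothetical 2-dimensional model with 2 hidden neurons (identity weights).
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]

baseline = ffn_block(x, w_in, w_out)                            # [2.0, 4.0]
ablated = ffn_block(x, w_in, w_out, neuron_scale={0: 0.0})      # [1.0, 4.0]
amplified = ffn_block(x, w_in, w_out, neuron_scale={1: 3.0})    # [2.0, 8.0]
steered = inject_direction(baseline, [0.0, 1.0], alpha=0.5)     # [2.0, 4.5]
```

In a real model the same scaling would typically be applied through a forward hook on the FFN activations, and the injected direction would have to be estimated from data; neither detail is specified in the summary.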

Key insights

Perturbation probing efficiently diagnoses LLM FFN circuits to understand and precisely edit model behavior.

Principles

Method

Perturbation probing uses two forward passes per prompt to generate causal hypotheses about FFN neurons, then runs a one-time intervention sweep of approximately 150 passes over the identified neurons. Neither stage requires backpropagation.
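One way to picture the two-pass step is a clean pass and a perturbed pass through the same FFN block, with neurons ranked by how much their contribution changes. The sketch below is a toy under stated assumptions: the summary does not say what the perturbation is, how neurons are scored, or how the FFN-to-skip signal ratio is defined, so the input offset `delta`, the scoring rule, and the ratio used here are all hypothetical stand-ins.

```python
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def ffn_pass(x, w_in, w_out):
    """One forward pass through a toy ReLU FFN; returns (hidden activations, FFN output)."""
    hidden = [max(0.0, sum(w * a for w, a in zip(row, x))) for row in w_in]
    ffn_out = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return hidden, ffn_out


def two_pass_probe(x, delta, w_in, w_out, top_k=1):
    """Rank neurons from two forward passes; no gradients are computed.

    Assumed score: |change in activation| times the norm of the neuron's output weights.
    Assumed ratio: change in FFN output divided by change on the skip path (||delta||).
    """
    if norm(delta) == 0.0:
        raise ValueError("delta must be non-zero")
    h_clean, out_clean = ffn_pass(x, w_in, w_out)                    # pass 1: clean
    x_pert = [a + d for a, d in zip(x, delta)]
    h_pert, out_pert = ffn_pass(x_pert, w_in, w_out)                 # pass 2: perturbed

    scores = []
    for j in range(len(h_clean)):
        out_weight_norm = norm([row[j] for row in w_out])
        scores.append(abs(h_pert[j] - h_clean[j]) * out_weight_norm)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

    ffn_shift = norm([a - b for a, b in zip(out_pert, out_clean)])
    ratio = ffn_shift / norm(delta)
    return ranked[:top_k], ratio


# Hypothetical weights: neuron 0 writes strongly to the output, neuron 1 writes nothing.
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[2.0, 0.0], [0.0, 0.0]]
candidates, ffn_to_skip = two_pass_probe([1.0, 1.0], [1.0, 0.0], w_in, w_out)
print(candidates, ffn_to_skip)  # [0] 2.0
```

In this toy, a ratio above 1 means the FFN path moves more than the skip path under the perturbation. How the paper maps its ratio onto opposition versus routing circuits is not described in the summary, so no threshold is suggested here.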

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.