Demystifying Variance in Circuit Discovery of LLMs

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new study investigates the substantial variability of circuit discovery methods for Large Language Models (LLMs), a key technique in mechanistic interpretability. EAP-IG, a method widely used for its strong performance on unfaithfulness metrics, suffers from three kinds of variance: resampling variance, where the discovered circuit changes with each new data batch; rephrasing variance, where the circuit shifts when prompts are rephrased; and sample-wise variance, where unfaithfulness fluctuates widely across individual samples. The paper introduces CEAP, a new circuit discovery method with theoretical guarantees that significantly reduces resampling variance. It shows that rephrasing variance arises because different prompt templates activate distinct circuits, which suggests that steering LLMs with a single comprehensive task circuit is inherently difficult. The study also argues that sample-wise variance is largely benign: it often stems from the definition of unfaithfulness and from selective contribution scaling rather than from defects in the circuit. Sparsity, often claimed to yield more compact circuits, does not solve the rephrasing problem.
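One simple way to make resampling variance concrete is to run circuit discovery on several data batches and measure how much the selected edge sets overlap. The sketch below is illustrative only and is not the paper's metric or implementation: the per-edge scores are simulated, and the helper names (`top_k_edges`, `jaccard`, `mean_pairwise_jaccard`) are hypothetical. It assumes a circuit is formed by keeping the k edges with the largest absolute attribution score.

```python
import random
from itertools import combinations


def top_k_edges(scores, k):
    """Return the k edges with the largest absolute score (assumed selection rule)."""
    ranked = sorted(scores, key=lambda edge: abs(scores[edge]), reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity of two edge sets; 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(circuits):
    """Average Jaccard similarity over all pairs of circuits (needs at least two)."""
    pairs = list(combinations(circuits, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Simulated stand-in for per-batch attribution scores (hypothetical data).
rng = random.Random(0)
true_scores = {f"e{i}": rng.gauss(0, 1) for i in range(50)}


def noisy_scores(noise):
    """Pretend each batch yields the true score plus batch-specific noise."""
    return {edge: s + rng.gauss(0, noise) for edge, s in true_scores.items()}


stable = [top_k_edges(noisy_scores(0.01), 10) for _ in range(5)]
unstable = [top_k_edges(noisy_scores(2.0), 10) for _ in range(5)]

print("low-noise overlap: ", mean_pairwise_jaccard(stable))
print("high-noise overlap:", mean_pairwise_jaccard(unstable))
```

A mean overlap close to 1 means the circuit barely changes across batches; a low overlap is the symptom the summary calls resampling variance. The same comparison can be run across rephrased prompt sets instead of resampled batches.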

Key takeaway

For AI scientists and NLP engineers interpreting LLM behavior: circuit discovery results can vary significantly from run to run. Interpretability studies should account for rephrasing variance, because different prompt templates activate distinct circuits, which makes comprehensive steering difficult. CEAP can reduce resampling variance, but fluctuations in sample-wise unfaithfulness scores may stem from how unfaithfulness is defined rather than from defects in the circuit. In practice, this calls for validating circuits across diverse prompt templates.
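As a rough illustration of validating across prompt templates, the sketch below compares circuits discovered separately for each template and reports which edges are shared by all of them and which appear under only one. It is an assumption-laden example, not the study's procedure: the function name `template_overlap_report`, the template names, and the edge labels are all hypothetical.

```python
def template_overlap_report(circuits_by_template):
    """Summarize edge overlap across per-template circuits.

    circuits_by_template maps a template name to the set of edges
    discovered for that template (at least one template is required).
    """
    edge_sets = list(circuits_by_template.values())
    shared = set.intersection(*edge_sets)  # edges present for every template
    union = set.union(*edge_sets)          # edges present for any template

    specific = {}
    for name, edges in circuits_by_template.items():
        others = set().union(
            *(e for n, e in circuits_by_template.items() if n != name)
        )
        specific[name] = edges - others    # edges seen only for this template

    return {
        "shared_edges": shared,
        "shared_fraction": len(shared) / len(union) if union else 1.0,
        "template_specific": specific,
    }


# Hypothetical circuits for three phrasings of the same task.
circuits = {
    "template_a": {"h0.1->mlp2", "mlp2->out", "h1.3->out"},
    "template_b": {"h0.1->mlp2", "mlp2->out", "h2.0->out"},
    "template_c": {"mlp2->out", "h3.4->mlp5"},
}
report = template_overlap_report(circuits)
print(report["shared_edges"], report["shared_fraction"])
```

A small shared fraction would be consistent with the finding that different templates activate distinct circuits, and it signals that a circuit found with one phrasing should not be treated as the circuit for the whole task.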

Key insights

LLM circuit discovery shows significant variance. CEAP reduces resampling variance, but rephrasing variance implies that steering LLMs with a comprehensive task circuit is inherently difficult.

Principles

Method

CEAP, a new circuit discovery method with theoretical guarantees, improves on EAP-IG by substantially reducing resampling variance in mechanistic interpretability of LLMs.

Topics

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.