ADAG: Automatically Describing Attribution Graphs
Summary
ADAG is an automated, end-to-end pipeline for describing attribution graphs in language models, replacing the manual human interpretation that circuit-tracing research has previously relied on. Developed by researchers from Stanford University and Transluce, ADAG introduces "attribution profiles", which quantify a feature's functional role through its input and output gradient effects. A novel clustering algorithm groups features into "supernodes", and an LLM explainer-simulator setup generates and scores natural-language explanations for these groups. On known circuit-tracing tasks, ADAG recovers interpretable circuits; it also identifies steerable clusters responsible for a harmful-advice jailbreak in Llama 3.1 8B Instruct. Efficiency improvements let the pipeline handle large datasets: processing 10,000 examples from the math dataset on Llama 3.1 8B Instruct takes approximately 6 hours on 4 H100 80GB GPUs.
Key takeaway
For AI Scientists and Research Scientists focused on LLM interpretability, ADAG automates the interpretation of attribution graphs, a step that previously required manual circuit analysis, so interpretation can scale to larger datasets. You can use ADAG to pinpoint the neuron clusters responsible for specific behaviors, such as jailbreaks, and potentially to steer model outputs, which bears on both the safety and the control of large language models like Llama 3.1 8B Instruct.
Key insights
Automated circuit tracing and interpretation can reveal LLM internal computations and identify steerable features.
Principles
- Attribution profiles quantify feature roles via gradient effects.
- Multi-view spectral clustering groups functionally similar features.
- LLM explainer-simulator generates and scores natural-language explanations.
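To make the first principle concrete, here is a minimal sketch of an attribution profile on a toy model. It assumes a single hidden ReLU layer whose units stand in for features, and it uses gradient-times-activation as the attribution rule. The function name `attribution_profiles`, the weight matrices, and the toy model are all hypothetical; this summary does not give ADAG's exact definitions, so treat this as an illustration of the idea rather than the paper's formulation.

```python
import numpy as np

def attribution_profiles(x, W_in, W_out):
    """Gradient-times-activation profiles for each hidden feature of a toy ReLU layer.

    Hypothetical setup: h = relu(W_in @ x), logits = W_out @ h.
    Returns (input_profile, output_profile) with one row per feature.
    """
    pre = W_in @ x
    h = np.maximum(pre, 0.0)
    active = (pre > 0).astype(float)  # derivative of relu w.r.t. its input
    # Input side: (d h_j / d x_i) * x_i, how much input i drives feature j.
    input_profile = active[:, None] * W_in * x[None, :]
    # Output side: (d logit_k / d h_j) * h_j, how much feature j drives logit k.
    output_profile = (W_out * h[None, :]).T
    return input_profile, output_profile

# Toy example: 2 inputs, 3 features, 1 output logit (all values made up).
x = np.array([1.0, 2.0])
W_in = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
W_out = np.array([[1.0, -1.0, 5.0]])
input_profile, output_profile = attribution_profiles(x, W_in, W_out)
```

In this toy case the third feature is inactive, so both of its profile rows are zero. Because the layer has no bias, each row of `input_profile` sums to that feature's activation.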
Method
ADAG identifies important features via gradient-based attribution, constructs attribution profiles, groups features using multi-view spectral clustering, and describes supernodes in natural language with an LLM explainer-simulator.
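The grouping step can be sketched as follows. This is a simplified stand-in, not ADAG's own clustering algorithm (which the paper describes as novel and which this summary does not specify): it treats the input-side and output-side profiles as two views, averages their cosine-similarity affinities, and runs standard normalized spectral clustering with a small deterministic k-means. All function names and the toy profile values are assumptions made for illustration.

```python
import numpy as np

def cosine_affinity(P):
    """Non-negative cosine similarity between the rows of a profile matrix."""
    norms = np.linalg.norm(P, axis=1, keepdims=True)
    U = P / np.maximum(norms, 1e-12)
    return np.clip(U @ U.T, 0.0, None)

def kmeans(E, k, n_iter=20):
    """Tiny k-means with deterministic farthest-point initialization."""
    centers = [E[0]]
    for _ in range(1, k):
        d2 = np.min([((E - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(E[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = E[labels == c].mean(axis=0)
    return labels

def multiview_spectral_clusters(views, k):
    """Average per-view affinities, then cluster the top-k spectral embedding."""
    A = sum(cosine_affinity(P) for P in views) / len(views)
    deg = A.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    M = inv_sqrt[:, None] * A * inv_sqrt[None, :]  # symmetric normalized affinity
    _, vecs = np.linalg.eigh(M)                    # eigenvalues in ascending order
    E = vecs[:, -k:]                               # eigenvectors of the k largest
    E = E / np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)
    return kmeans(E, k)

# Toy profiles for 4 features: features 0-1 behave alike, as do features 2-3.
input_view = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
output_view = np.array([[1.0, 0.0, 0.0], [1.0, 0.2, 0.0],
                        [0.0, 0.0, 1.0], [0.0, 0.1, 1.0]])
labels = multiview_spectral_clusters([input_view, output_view], k=2)
```

Each resulting label group plays the role of a supernode: a set of features with similar input-side and output-side behavior that can then be handed to an explainer for a natural-language description. Averaging affinities is only one way to combine views; other multi-view schemes weight or co-regularize the views instead.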
In practice
- Use ADAG to automate circuit interpretation for LLM safety.
- Identify steerable neuron clusters for model behavior control.
- Apply gradient-based attribution for efficient circuit tracing.
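As a rough illustration of what steering a cluster means, the sketch below rescales the activations of a chosen group of hidden units in a toy ReLU layer and compares the output before and after. The function `steer_logits`, the toy weights, and the choice of multiplicative scaling are assumptions for this example; how ADAG itself intervenes on clusters in Llama 3.1 8B Instruct is not detailed in this summary.

```python
import numpy as np

def steer_logits(x, W_in, W_out, cluster, scale):
    """Return (baseline_logits, steered_logits) for a toy model.

    Hypothetical setup: h = relu(W_in @ x), logits = W_out @ h.
    `cluster` lists the hidden-unit indices to rescale; scale=0.0 ablates them.
    """
    h = np.maximum(W_in @ x, 0.0)
    h_steered = h.copy()
    h_steered[cluster] *= scale  # intervene only on the chosen cluster
    return W_out @ h, W_out @ h_steered

# Toy example (all values made up): ablate the cluster containing feature 1.
x = np.array([1.0, 2.0])
W_in = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
W_out = np.array([[1.0, -1.0, 5.0]])
baseline, steered = steer_logits(x, W_in, W_out, cluster=[1], scale=0.0)
```

If ablating or amplifying a cluster shifts the output in the direction its description predicts, that is evidence the cluster is causally involved in the behavior rather than merely correlated with it.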
Topics
- ADAG Pipeline
- Language Model Interpretability
- Circuit Tracing
- Attribution Graphs
- Attribution Profiles
Best for: AI Scientist, Research Scientist, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.