ADAG: Automatically Describing Attribution Graphs

2026-04-10 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

ADAG is an automated, end-to-end pipeline for describing attribution graphs in language models. It targets a bottleneck in circuit-tracing research, which has so far relied on manual human interpretation. Developed by researchers from Stanford University and Transluce, ADAG introduces "attribution profiles", which quantify a feature's functional role through its input and output gradient effects. A novel clustering algorithm groups features into "supernodes", and an LLM explainer-simulator setup generates and scores natural-language explanations for these groups. On known circuit-tracing tasks, ADAG recovers interpretable circuits, and it identifies steerable clusters responsible for a harmful-advice jailbreak in Llama 3.1 8B Instruct. Efficiency improvements let the pipeline process large datasets: 10,000 examples for the math dataset on Llama 3.1 8B Instruct take approximately 6 hours on 4 H100 80GB GPUs.

Key takeaway

For AI Scientists and Research Scientists working on LLM interpretability, ADAG automates the description of attribution graphs, a step that previously required manual circuit analysis. This makes interpretation more scalable and helps identify the neuron clusters that drive a behavior. You can use ADAG to pinpoint features responsible for specific behaviors, such as jailbreaks, and potentially steer model outputs, which may improve safety and control for large language models like Llama 3.1 8B Instruct.

Key insights

Automated circuit tracing and interpretation can reveal LLM internal computations and identify steerable features.

Principles

Attribution profiles quantify feature roles via gradient effects.
Multi-view spectral clustering groups functionally similar features.
LLM explainer-simulator generates and scores natural-language explanations.
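
To make the first two principles concrete, here is a minimal sketch of grouping features by their attribution profiles. It is not ADAG's algorithm: the summary does not specify how the views are combined, so this sketch assumes each feature has an input-profile row and an output-profile row, averages the two cosine-similarity affinities, and runs standard spectral clustering on the result. All function names and the toy profile values are hypothetical.

```python
import numpy as np

def cosine_affinity(profiles):
    """Non-negative cosine similarity between the rows of a profile matrix."""
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    unit = profiles / np.clip(norms, 1e-12, None)
    return np.clip(unit @ unit.T, 0.0, None)

def spectral_embed(affinity, k):
    """Row-normalized top-k eigenvectors of the symmetrically normalized affinity."""
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.clip(degree, 1e-12, None))
    normalized = affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(normalized)  # eigenvalues come back in ascending order
    embed = vecs[:, -k:]
    return embed / np.clip(np.linalg.norm(embed, axis=1, keepdims=True), 1e-12, None)

def kmeans(points, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def cluster_features(input_profiles, output_profiles, k):
    """Assumed multi-view step: average the two per-view affinities, then cluster."""
    affinity = 0.5 * (cosine_affinity(input_profiles) + cosine_affinity(output_profiles))
    return kmeans(spectral_embed(affinity, k), k)

# Toy profiles (made up): rows are features, columns are attribution sources/targets.
input_profiles = np.array([
    [1.0, 0.9, 0.0, 0.0], [0.9, 1.0, 0.0, 0.0], [1.0, 1.0, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.8], [0.0, 0.1, 0.9, 1.0], [0.0, 0.0, 1.0, 1.0],
])
output_profiles = np.array([
    [1.0, 0.0], [0.9, 0.1], [1.0, 0.0],
    [0.0, 1.0], [0.1, 0.9], [0.0, 1.0],
])
labels = cluster_features(input_profiles, output_profiles, k=2)
print(labels)  # one supernode id per feature
```

Averaging affinities is only one way to fuse views; the point of the sketch is that two features land in the same supernode when both what drives them and what they drive look alike.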

Method

ADAG identifies important features via gradient-based attribution, constructs attribution profiles, groups features using multi-view spectral clustering, and describes supernodes in natural language with an LLM explainer-simulator.
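
The last step of the method, scoring explanations with a simulator, can be sketched as follows. This is an illustration under stated assumptions, not ADAG's implementation: the simulator here is a hypothetical keyword matcher standing in for an LLM call, and the score is assumed to be the Pearson correlation between true and simulated activations, a common choice in automated interpretability that the summary does not confirm for ADAG. The tokens and activations are made up.

```python
import numpy as np

def keyword_simulator(explanation_keywords, tokens):
    """Hypothetical stand-in for an LLM simulator: predicts 1.0 for tokens the explanation names."""
    return np.array([1.0 if t.lower() in explanation_keywords else 0.0 for t in tokens])

def explanation_score(true_acts, simulated_acts):
    """Pearson correlation between true and simulated activations (0.0 if either is constant)."""
    true_acts = np.asarray(true_acts, dtype=float)
    simulated_acts = np.asarray(simulated_acts, dtype=float)
    if true_acts.std() == 0 or simulated_acts.std() == 0:
        return 0.0
    return float(np.corrcoef(true_acts, simulated_acts)[0, 1])

tokens = ["3", "plus", "4", "equals", "7"]
true_acts = [0.9, 0.0, 0.8, 0.1, 1.0]  # made-up supernode activations per token

# Candidate explanations, each reduced to the tokens it claims the supernode fires on.
candidates = {
    "fires on digits": {"3", "4", "7"},
    "fires on operator words": {"plus", "equals"},
}
scores = {
    name: explanation_score(true_acts, keyword_simulator(keywords, tokens))
    for name, keywords in candidates.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

An explanation that lets the simulator reproduce the real activation pattern scores high; one that predicts the opposite pattern scores below zero, which gives a way to rank candidate descriptions without a human in the loop.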

In practice

Use ADAG to automate circuit interpretation for LLM safety.
Identify steerable neuron clusters for model behavior control.
Apply gradient-based attribution for efficient circuit tracing.
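
As a rough illustration of the attribution and steering ideas in this list, the sketch below uses a toy linear readout rather than a real language model. It assumes gradient-times-activation as the attribution rule and treats steering as rescaling the activations of a chosen cluster; both are simplifying assumptions, and the weights, activations, and function names are hypothetical.

```python
import numpy as np

def feature_attributions(w_out, acts, target):
    """Gradient-times-activation attribution of each feature to one output logit.

    For a linear readout logits = w_out @ acts, the gradient of logits[target]
    with respect to acts is simply the row w_out[target].
    """
    grad = w_out[target]
    return grad * acts

def steer(acts, cluster, scale):
    """Return a copy of acts with the features in `cluster` multiplied by `scale`."""
    steered = acts.copy()
    steered[list(cluster)] *= scale
    return steered

# Toy readout: 2 output logits, 4 features (all values made up).
w_out = np.array([
    [2.0, 1.5, 0.0, -0.5],
    [0.0, 0.2, 1.0, 1.0],
])
acts = np.array([1.0, 2.0, 0.5, 1.0])
target = 0

attr = feature_attributions(w_out, acts, target)
cluster = np.argsort(-np.abs(attr))[:2]  # the two features with the largest |attribution|

baseline = w_out @ acts
ablated = w_out @ steer(acts, cluster, 0.0)  # scale=0.0 switches the cluster off
print(attr, cluster, baseline[target], ablated[target])
```

In a real model the gradient would come from backpropagation through the network, and the cluster would be a supernode rather than a top-k cut, but the check is the same: if suppressing the cluster moves the target logit, the cluster is a candidate handle for steering.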

Topics

ADAG Pipeline
Language Model Interpretability
Circuit Tracing
Attribution Graphs
Attribution Profiles

Code references

TransluceAI/circuits

Best for: AI Scientist, Research Scientist, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.