Evaluating the Interpretability of Sparse Autoencoders with Concept Annotations

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

A new human-grounded evaluation framework quantifies the interpretability of Sparse Autoencoders (SAEs) by aligning their latent representations with human-annotated concepts. The framework avoids user studies and instead validates concept matching through targeted attribute perturbations. It introduces `synCUB` and `synCOCO`, synthetic benchmarks of paired images that differ by a single attribute. It also proposes `Fully-Binary Matching Pursuit (FBMP)` for many-to-one mappings between SAE latents and concepts, which outperforms one-to-one baselines, and a `Targeted Attribute Perturbation Alignment Score (TAPAScore)` that tests whether matched latents respond selectively to the perturbed concept. Both the matching results and TAPAScore reliably distinguish trained from untrained SAEs. For SAEs trained on CLIP and DINOv2 embeddings, increased overcompleteness can reduce perturbation alignment, which suggests that moderate dictionary sizes yield the most interpretable SAEs.

Key takeaway

For AI Scientists and Machine Learning Engineers developing or evaluating Sparse Autoencoders, this framework offers a human-grounded way to quantify interpretability without running user studies. Consider adopting `FBMP` to match latents to annotated concepts and `TAPAScore` to check that those latents respond selectively. The findings suggest that a moderate dictionary size, rather than the largest one available, may give more interpretable SAEs, at least for embeddings from models like CLIP or DINOv2.

Key insights

A new framework evaluates Sparse Autoencoder interpretability by aligning latents with human concepts using synthetic data and perturbation scores.

Method

Construct `synCUB`/`synCOCO` synthetic benchmarks. Apply `Fully-Binary Matching Pursuit (FBMP)` for concept matching. Evaluate with `Targeted Attribute Perturbation Alignment Score (TAPAScore)`.
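The matching and perturbation steps can be illustrated with a minimal sketch. This summary does not give the paper's definitions of `FBMP` or `TAPAScore`, so the code below is a simplified stand-in, not the authors' method. It assumes binarized latent activations, binary concept labels, a greedy OR-union of several latents per concept scored by F1, and a selectivity score equal to the fraction of image pairs in which the latents matched to the changed attribute respond most strongly. The names `greedy_binary_match` and `perturbation_selectivity` are hypothetical.

```python
import numpy as np

def f1_score(pred, target):
    """F1 between two boolean vectors; 0.0 when there are no true positives."""
    tp = np.sum(pred & target)
    if tp == 0:
        return 0.0
    precision = tp / pred.sum()
    recall = tp / target.sum()
    return float(2 * precision * recall / (precision + recall))

def greedy_binary_match(latents_on, concept, max_latents=3):
    """Greedily pick latents whose OR-union best matches a binary concept.

    Assumed stand-in for many-to-one matching, not the paper's FBMP.
    latents_on: (n_samples, n_latents) bool, True where a latent fires.
    concept:    (n_samples,) bool, True where the annotated concept is present.
    """
    chosen, best = [], 0.0
    union = np.zeros_like(concept, dtype=bool)
    for _ in range(max_latents):
        scores = [f1_score(union | latents_on[:, j], concept)
                  for j in range(latents_on.shape[1])]
        j = int(np.argmax(scores))
        if scores[j] <= best:  # stop when no latent improves the match
            break
        chosen.append(j)
        union |= latents_on[:, j]
        best = scores[j]
    return chosen, best

def perturbation_selectivity(z_before, z_after, matched, changed_attr):
    """Fraction of pairs where the changed attribute's latents move the most.

    Assumed stand-in for a perturbation alignment score, not TAPAScore itself.
    z_before, z_after: (n_pairs, n_latents) activations for the paired images.
    matched:      dict mapping attribute name -> non-empty list of latent indices.
    changed_attr: sequence of the attribute that differs in each pair.
    """
    delta = np.abs(z_after - z_before)
    attrs = list(matched)
    hits = 0
    for i, changed in enumerate(changed_attr):
        response = [delta[i, matched[a]].mean() for a in attrs]
        hits += attrs[int(np.argmax(response))] == changed
    return hits / len(changed_attr)
```

With real data, `latents_on` would come from thresholding SAE activations on annotated images, and `z_before` / `z_after` from encoding the two images of each single-attribute pair. An untrained SAE should score near chance on the second function, which gives a simple sanity baseline.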

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.