Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital - Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

A new method, "Semantic Optimal Transport," addresses two challenges in interpreting language models with Sparse Autoencoders (SAEs): matching semantically similar features across multiple layers, and compressing large feature circuits into interpretable supernodes. The authors, Tue M. Cao, Nguyen Do, and My T. Thai, treat both as one problem: estimating semantic distances between SAE features that live on different activation manifolds. Their distributional framework represents each feature as an activation-weighted distribution over hidden states rather than as a single decoder vector. Projecting these distributions into a shared reference space and comparing them with the Wasserstein distance yields a single semantic metric for cross-layer feature comparison. The authors prove that this metric is invariant to activation rescaling, is stable under perturbations, and recovers true matches. Empirically, it outperforms decoder-vector and LLM-based baselines, captures subtle functional distinctions, and compresses large feature circuits automatically.
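The objects named in the summary can be written down compactly. The notation below is an assumption made for illustration and is not taken from the paper: a_f(x_i) is the activation of feature f on input x_i, h_i is the corresponding hidden state, and \phi and \psi are hypothetical maps from each layer's activation space into the shared reference space.

```latex
% Feature f as an activation-weighted distribution over hidden states
% (normalizing the weights to sum to 1 is an assumption of this sketch).
\mu_f = \sum_{i} w_i \, \delta_{h_i},
\qquad
w_i = \frac{a_f(x_i)}{\sum_{j} a_f(x_j)}

% Semantic distance between feature f (one layer) and feature g (another layer),
% after pushing both distributions into the shared reference space.
d(f, g) = W_p\!\left( \phi_{\#}\mu_f ,\; \psi_{\#}\mu_g \right)

% Standard p-Wasserstein distance between distributions \mu and \nu.
W_p(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu, \nu)} \int \lVert u - v \rVert^{p} \, d\gamma(u, v) \right)^{1/p}
```

Under this assumed normalization, replacing a_f by c·a_f for any c > 0 leaves every weight w_i unchanged, which is one way the stated invariance to activation rescaling could arise.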

Key takeaway

For AI Scientists and Machine Learning Engineers who interpret large language models, Semantic Optimal Transport offers a robust way to analyze Sparse Autoencoder features. It matches semantically similar features across layers and automatically compresses complex feature circuits into interpretable supernodes, which can make it easier to understand model internals and debug emergent behaviors. Consider integrating this distributional framework into mechanistic interpretability work.

Key insights

Semantic Optimal Transport unifies SAE feature matching and circuit compression via activation-weighted distributions and Wasserstein distance.

Principles

Method

The method represents each SAE feature as an activation-weighted distribution over hidden states, projects these distributions into a shared reference space, and compares them with the Wasserstein distance, yielding a single semantic metric that applies across layers.
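The three steps in the paragraph above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the projection matrices `to_ref_a` and `to_ref_b` are hypothetical linear maps into the shared reference space, all function names are invented for this example, and the multi-dimensional Wasserstein distance is approximated with a sliced Wasserstein estimate (an average of exact 1-D distances along random directions), which is a substitute technique chosen for simplicity.

```python
import math
import random


def feature_distribution(hidden_states, activations):
    """Keep hidden states where the feature fires; weight them by normalized activation."""
    pairs = [(h, a) for h, a in zip(hidden_states, activations) if a > 0]
    total = sum(a for _, a in pairs)
    if total == 0:
        raise ValueError("feature never activates on this sample")
    points = [h for h, _ in pairs]
    weights = [a / total for _, a in pairs]  # normalization is an assumption of this sketch
    return points, weights


def project(points, matrix):
    """Apply a linear map, given as a list of rows, to every point."""
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix] for p in points]


def wasserstein_1d(xs, wx, ys, wy):
    """Exact 1-Wasserstein distance between two weighted 1-D samples.

    Uses W1 = integral of |F_x(t) - F_y(t)| dt over the merged, sorted support.
    """
    events = sorted(
        [(x, w, 0.0) for x, w in zip(xs, wx)] + [(y, 0.0, w) for y, w in zip(ys, wy)]
    )
    cdf_x = cdf_y = dist = 0.0
    for (v, mass_x, mass_y), (v_next, _, _) in zip(events, events[1:]):
        cdf_x += mass_x
        cdf_y += mass_y
        dist += abs(cdf_x - cdf_y) * (v_next - v)
    return dist


def sliced_wasserstein(px, wx, py, wy, n_slices=64, seed=0):
    """Average 1-D Wasserstein distance over random unit directions (an approximation)."""
    rng = random.Random(seed)
    dim = len(px[0])
    total = 0.0
    for _ in range(n_slices):
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        xs = [sum(c * v for c, v in zip(d, p)) for p in px]
        ys = [sum(c * v for c, v in zip(d, p)) for p in py]
        total += wasserstein_1d(xs, wx, ys, wy)
    return total / n_slices


def semantic_distance(hidden_a, acts_a, to_ref_a, hidden_b, acts_b, to_ref_b):
    """Distance between two features that may come from different layers."""
    pa, wa = feature_distribution(hidden_a, acts_a)
    pb, wb = feature_distribution(hidden_b, acts_b)
    return sliced_wasserstein(project(pa, to_ref_a), wa, project(pb, to_ref_b), wb)


# Toy usage: a 3-D layer and a 2-D layer, both mapped into a 2-D reference space.
layer_a = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [2.0, 2.0, 0.5]]
layer_b = [[0.1, 0.9], [1.2, 0.1], [1.9, 2.1]]
to_ref_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical: drop the third coordinate
to_ref_b = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical: identity
print(semantic_distance(layer_a, [0.5, 0.0, 1.5], to_ref_a,
                        layer_b, [0.4, 0.0, 1.6], to_ref_b))
```

Because the weights are normalized inside `feature_distribution`, multiplying a feature's activations by a positive constant leaves the computed distance unchanged up to floating-point rounding. How the shared reference space is actually constructed is not specified in this summary, so the linear maps here are placeholders.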

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.