MonoLoss: A Training Objective for Interpretable Monosemantic Representations
Summary
MonoLoss is a novel training objective designed to enhance the interpretability of neural network representations by promoting monosemantic features, where individual neurons respond to single, coherent concepts. The core innovation involves reformulating the existing MonoScore metric, which quantifies monosemanticity, from a computationally expensive $O(N^2)$ pairwise comparison to an efficient $O(N)$ single-pass algorithm. This speedup, achieving up to $1200\times$ faster evaluation and $159\times$ faster training on OpenImagesV7, enables MonoScore to be used as a direct training signal. When integrated into Sparse Autoencoders (SAEs) across various architectures (BatchTopK, TopK, JumpReLU) and vision encoders (CLIP, SigLIP2, ViT), MonoLoss consistently increases MonoScore and improves class purity, with one instance showing purity rising from 0.152 to 0.723. Additionally, using MonoLoss as an auxiliary regularizer during finetuning of ResNet-50 and CLIP-ViT-B/32 models yields accuracy gains of up to 0.6% on ImageNet-1K with minimal computational overhead.
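The summary does not spell out the MonoScore definition, so the following is only a sketch of why a quadratic pairwise score can collapse to a single pass. It assumes that the score for neuron $k$ is an activation-weighted average of pairwise cosine similarities, with nonnegative activations $a_i$ and unit-norm image embeddings $e_i$ over $N$ samples. Expanding $\lVert \sum_i a_i e_i \rVert^2$ and $(\sum_i a_i)^2$ gives the pairwise sums directly:

$$
\begin{aligned}
\sum_{i<j} a_i a_j \langle e_i, e_j \rangle &= \tfrac{1}{2}\Big(\big\lVert \textstyle\sum_i a_i e_i \big\rVert^2 - \sum_i a_i^2\Big), \\
\sum_{i<j} a_i a_j &= \tfrac{1}{2}\Big(\big(\textstyle\sum_i a_i\big)^2 - \sum_i a_i^2\Big), \\
\text{MS}_k &= \frac{\big\lVert \sum_i a_i e_i \big\rVert^2 - \sum_i a_i^2}{\big(\sum_i a_i\big)^2 - \sum_i a_i^2}.
\end{aligned}
$$

Under these assumptions, every term on the right is a running sum over samples, so the cost is linear in $N$ rather than quadratic.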
Key takeaway
For research scientists and machine learning engineers focused on model interpretability, MonoLoss offers a practical way to obtain more semantically coherent neural representations. Because the objective is plug-and-play, you can optimize directly for monosemanticity during SAE training or vision model finetuning, which leads to clearer feature activation patterns and modest accuracy improvements on classification tasks at negligible computational overhead. Consider sweeping the MonoLoss weight $\lambda$ to balance monosemanticity gains against reconstruction quality or task performance.
Key insights
MonoLoss turns a quadratic-time monosemanticity metric into a linear-time training objective, which makes features more interpretable and yields modest accuracy gains.
Principles
- Monosemanticity improves neural network interpretability.
- Directly optimizing interpretability metrics during training is effective.
- Computational efficiency is crucial for integrating metrics into training.
Method
Reformulate MonoScore from $O(N^2)$ to $O(N)$ using single-pass statistics, then apply it as a differentiable loss function $\mathcal{L}_{\text{mono}} = 1 - \operatorname{mean}_k\big(\text{MS}_k^{(\text{batch})}\big)$ to augment base training objectives.
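The method can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. It assumes that MonoScore is an activation-weighted mean of pairwise cosine similarities over unit-norm embeddings, and that the base objective and MonoLoss are combined as `base_loss + lam * mono_loss`. The function names, the `eps` guard, and the `lam` parameter are all hypothetical. In real training the same expressions would be written in an autodiff framework so that gradients flow through the activations.

```python
import numpy as np

def monoscore_pairwise(acts, embs):
    """O(N^2) reference under the assumed definition.

    acts: (N, K) nonnegative activations, one column per neuron.
    embs: (N, D) unit-norm embeddings.
    """
    sims = embs @ embs.T                          # (N, N) cosine similarities
    off_diag = ~np.eye(acts.shape[0], dtype=bool) # exclude i == j pairs
    scores = np.empty(acts.shape[1])
    for k in range(acts.shape[1]):
        w = np.outer(acts[:, k], acts[:, k])      # pair weights a_i * a_j
        scores[k] = (w * sims)[off_diag].sum() / w[off_diag].sum()
    return scores

def monoscore_linear(acts, embs, eps=1e-12):
    """Same score from single-pass statistics, linear in N."""
    s1 = acts.sum(axis=0)                         # sum_i a_i, shape (K,)
    s2 = (acts ** 2).sum(axis=0)                  # sum_i a_i^2, shape (K,)
    v = acts.T @ embs                             # sum_i a_i e_i, shape (K, D)
    num = (v ** 2).sum(axis=1) - s2
    den = s1 ** 2 - s2
    return num / (den + eps)                      # eps guards near-empty neurons

def mono_loss(acts, embs):
    """L_mono = 1 - mean over neurons of the batch MonoScore."""
    return 1.0 - monoscore_linear(acts, embs).mean()

def total_loss(base_loss, acts, embs, lam=0.1):
    """Hypothetical combination: base objective plus weighted MonoLoss."""
    return base_loss + lam * mono_loss(acts, embs)
```

Both score functions should agree up to floating-point error on the same batch. Only the second avoids materializing the $N \times N$ similarity matrix.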
In practice
- Integrate MonoLoss into SAE training for more interpretable features.
- Apply MonoLoss as a regularizer during vision model finetuning.
- Use the linear-time MonoScore for efficient large-scale evaluation.
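For large-scale evaluation, a single-pass score can be accumulated batch by batch without holding the dataset in memory. The sketch below is a hypothetical accumulator, not part of any released code. It assumes the score depends only on three running sums per neuron: the sum of activations, the sum of squared activations, and the activation-weighted sum of unit-norm embeddings. It also assumes activations arrive already nonnegative and scaled, since any dataset-wide normalization would need its own pass.

```python
import numpy as np

class StreamingMonoScore:
    """Hypothetical accumulator for a linear-time, per-neuron score."""

    def __init__(self, num_neurons, emb_dim):
        self.s1 = np.zeros(num_neurons)            # running sum_i a_i
        self.s2 = np.zeros(num_neurons)            # running sum_i a_i^2
        self.v = np.zeros((num_neurons, emb_dim))  # running sum_i a_i e_i

    def update(self, acts, embs):
        """acts: (B, K) nonnegative activations; embs: (B, D) unit-norm."""
        self.s1 += acts.sum(axis=0)
        self.s2 += (acts ** 2).sum(axis=0)
        self.v += acts.T @ embs

    def compute(self, eps=1e-12):
        """Return one score per neuron from the accumulated statistics."""
        num = (self.v ** 2).sum(axis=1) - self.s2
        den = self.s1 ** 2 - self.s2
        return num / (den + eps)
```

Memory use is $O(KD)$ regardless of dataset size, and the result does not depend on how samples are split into batches.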
Topics
- Monosemanticity
- Sparse Autoencoders
- Neural Network Interpretability
- Training Objectives
- Computational Efficiency
Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.