Size Doesn't Matter: Cosine-Scored Sparse Autoencoders
Summary
Cosine-scored sparse autoencoders (SAEs) address a limitation of standard SAEs: feature activation is computed as an inner product, so it scales with both directional alignment and input norm. This is a problem because sublayer normalization in the model discards magnitude, so a standard SAE detects a quantity the model does not use and wastes dictionary slots on "norm detectors." The proposed method replaces the inner-product score with a learned blend of cosine similarity and input magnitude, letting the optimizer decide how much the score depends on norm, either globally or per feature. In the reported experiments, the optimizer consistently chooses less than half magnitude dependence. At matched reconstruction, the cosine encoder learns features that align with human-recognizable concepts more often than a standard SAE does, and it uses dictionary slots more efficiently. The forward-pass score geometry is identified as the primary lever for this improvement, which suggests cosine scoring as a default for dictionary learning on normalized representations, although its advantage does not hold for every task or depth.
Key takeaway
If you are a machine learning engineer building sparse autoencoders on normalized representations, consider adopting cosine scoring as the default. It improves feature interpretability and dictionary slot utilization by decoupling feature activation from input magnitude, which sublayer normalization discards. The result can be more meaningful, human-recognizable learned features, even though the advantage is not universal across tasks.
Key insights
Cosine-scored sparse autoencoders improve feature learning by decoupling activation from input norm, aligning features with human concepts.
Principles
- Standard SAEs waste dictionary slots on norm detection.
- Sublayer normalization discards magnitude, making norm-dependent scoring inefficient.
- Learned cosine-magnitude blend improves feature interpretability.
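The first two principles can be checked numerically. The sketch below is illustrative only and is not taken from the original article: `layer_norm` is a simplified normalization without learned gain or bias, and the random `x`, `w`, and `scale` values are assumptions chosen for the demonstration. It shows that rescaling an input leaves the normalized output essentially unchanged, while an inner-product score grows with the scale and a cosine score does not.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Simplified LayerNorm (no learned gain/bias): zero mean, unit variance per row."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def inner_product_score(x, w):
    """Standard SAE-style score: depends on direction AND on the norm of x."""
    return x @ w

def cosine_score(x, w, eps=1e-8):
    """Cosine similarity between each row of x and w: depends on direction only."""
    return (x @ w) / (np.linalg.norm(x, axis=-1) * np.linalg.norm(w) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))   # hypothetical activations
w = rng.normal(size=16)        # hypothetical feature direction
scale = 10.0

# Normalization hides the rescaling from everything downstream of it.
same_after_norm = np.allclose(layer_norm(x), layer_norm(scale * x), atol=1e-4)

# The inner-product score still reacts to the rescaling (ratio close to `scale`).
ip_ratio = inner_product_score(scale * x, w) / inner_product_score(x, w)

# The cosine score ignores it.
cos_unchanged = np.allclose(cosine_score(x, w), cosine_score(scale * x, w))

print(same_after_norm, ip_ratio, cos_unchanged)
```

The small `atol` in the first comparison accounts for the `eps` term, which makes the invariance approximate rather than exact.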
Method
Replace the standard inner product score in SAEs with a learned blend of cosine similarity and input magnitude, allowing global or per-feature optimization of norm dependence.
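The summary does not give the exact form of the blend, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes the score takes the form cos(theta) * ||x||^alpha, where `alpha` is a hypothetical blend parameter: `alpha = 0` gives pure cosine scoring, and `alpha = 1` with unit-norm encoder rows recovers the inner product. The function names, the ReLU activation, and the bias term are likewise assumptions. In training, `alpha` would be a learned parameter; here it is passed in as a fixed value.

```python
import numpy as np

def blended_scores(x, W, alpha, eps=1e-8):
    """Assumed blend: cos(theta) * ||x||**alpha for every dictionary feature.

    x:     (batch, d) input activations
    W:     (n_features, d) encoder directions
    alpha: scalar (global) or (n_features,) array (per-feature), in [0, 1]
    """
    x_norm = np.linalg.norm(x, axis=-1, keepdims=True)    # (batch, 1)
    w_norm = np.linalg.norm(W, axis=-1, keepdims=True)    # (n_features, 1)
    cos = (x / (x_norm + eps)) @ (W / (w_norm + eps)).T   # (batch, n_features)
    # Broadcasting covers both cases: scalar alpha or one alpha per feature.
    return cos * (x_norm + eps) ** alpha

def encode(x, W, alpha, b=0.0):
    """Hypothetical encoder: ReLU over the blended score plus a bias."""
    return np.maximum(blended_scores(x, W, alpha) + b, 0.0)

# Usage with made-up shapes: 5 inputs, 8 dimensions, 12 dictionary features.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
W = rng.normal(size=(12, 8))

global_codes = encode(x, W, alpha=0.3)                    # one alpha for all features
per_feature_codes = encode(x, W, alpha=np.full(12, 0.3))  # one alpha per feature
```

Passing a scalar or a vector for `alpha` mirrors the global and per-feature options described above, and a value below 0.5 corresponds to the "less than half magnitude dependence" regime the summary reports.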
In practice
- Implement cosine scoring for SAEs on normalized representations.
- Prioritize feature interpretability in dictionary learning.
- Evaluate cosine encoders on your own tasks and depths, since their advantage is not universal.
Topics
- Sparse Autoencoders
- Feature Learning
- Cosine Similarity
- Neural Network Interpretability
- Representation Learning
- Machine Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.