Rational Sparse Autoencoder
Summary
Rational Sparse Autoencoder (RSAE) replaces the fixed encoder nonlinearities of standard Sparse Autoencoders (SAEs), such as ReLU, JumpReLU, and TopK, with trainable rational functions. Hard-coded sparsity mechanisms can distort the reconstruction-versus-sparsity trade-off; rational activations can approximate the existing SAE primitives while offering a richer function class that adapts to the geometry of the pre-activations. The implementation is a two-stage pipeline. First, an initialization procedure copies the baseline SAE weights and calibrates rational coefficients obtained via relaxed Remez exchange on synthetic data. Second, the model is fine-tuned under a standard sparsity-regularized reconstruction objective. Empirically, RSAE consistently improves both reconstruction-side and downstream-behavior metrics on residual-stream activations of three open-weight language models, across all three baseline activation families and the tested sparsity ranges. The upgrade adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU, without sacrificing feature-level interpretability.
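For concreteness, a rational activation of degree (m, n) can be written as a ratio of two polynomials. This summary does not give the exact parameterization RSAE uses, so the form below is an assumption: it uses a pole-free denominator, a common convention for learnable rational activations, and the symbols a_k, b_j, m, and n are illustrative.

```latex
r(x) \;=\; \frac{P(x)}{Q(x)}
     \;=\; \frac{a_0 + a_1 x + \dots + a_m x^m}
                {1 + \left|\, b_1 x + \dots + b_n x^n \,\right|}
```

Setting every b_j to zero reduces r to an ordinary polynomial, and the denominator is at least 1 everywhere, so r has no poles. If one coefficient set were shared across all latents (again an assumption), the activation would add m + n + 1 scalars, which is in line with the "handful of scalar parameters per autoencoder" described above.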
Key takeaway
Machine learning engineers optimizing sparse autoencoders for language model interpretability should consider adopting Rational Sparse Autoencoders (RSAE). RSAE's trainable rational activations consistently improve reconstruction and downstream metrics over SAEs with fixed nonlinearities. The upgrade adds only a handful of scalar parameters, runs in minutes on a single consumer GPU, and preserves feature-level interpretability.
Key insights
Rational Sparse Autoencoders (RSAE) use trainable rational functions for encoder activations, improving reconstruction and downstream metrics while preserving feature-level interpretability.
Principles
- Fixed encoder nonlinearities constrain SAE performance.
- Trainable rational functions offer activation flexibility.
- Trainable activations improve both reconstruction and downstream-behavior metrics.
Method
RSAE employs a two-stage pipeline: initialize from the baseline SAE weights with rational coefficients calibrated via relaxed Remez exchange on synthetic data, then fine-tune under a standard sparsity-regularized reconstruction objective.
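The initialization stage can be illustrated by fitting rational coefficients so that the activation starts out close to a baseline nonlinearity. The sketch below is not the relaxed Remez exchange named above: it swaps in a simpler linearized least-squares fit as a stand-in. The function names (`fit_rational`, `eval_rational`), the degrees, and the grid range are all hypothetical choices made for illustration.

```python
import numpy as np
from numpy.polynomial.polynomial import polyval


def fit_rational(x, target, num_degree, den_degree):
    """Linearized least-squares fit of r(x) = P(x) / (1 + sum_j b_j x^j).

    Stand-in for Remez exchange (assumption, not the paper's procedure).
    Solves P(x_i) - target_i * sum_j b_j x_i^j ~= target_i, which is linear
    in the unknowns. Requires den_degree >= 1.
    Returns (a, b) in ascending order: a[k] multiplies x^k, b[j-1] multiplies x^j.
    """
    num_cols = np.vander(x, num_degree + 1, increasing=True)         # 1, x, ..., x^m
    den_cols = np.vander(x, den_degree + 1, increasing=True)[:, 1:]  # x, ..., x^n
    design = np.hstack([num_cols, -target[:, None] * den_cols])
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs[: num_degree + 1], coeffs[num_degree + 1:]


def eval_rational(x, a, b):
    """Evaluate P(x) / (1 + b_1 x + ... + b_n x^n) with ascending coefficients."""
    return polyval(x, a) / (1.0 + x * polyval(x, b))


# Synthetic pre-activations on an assumed range, with ReLU as the baseline target.
grid = np.linspace(-3.0, 3.0, 401)
relu = np.maximum(grid, 0.0)
a, b = fit_rational(grid, relu, num_degree=3, den_degree=2)
max_err = np.max(np.abs(eval_rational(grid, a, b) - relu))
print(f"max |r(x) - ReLU(x)| on the grid: {max_err:.4f}")
```

A least-squares fit minimizes average error, whereas a Remez-style procedure targets the worst-case error, so the two generally give different coefficients. The linearized denominator is also not guaranteed to stay positive, so a practical initialization would check the fitted function for poles on the range of interest.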
In practice
- Replace fixed SAE activations with trainable rational functions.
- Calibrate scale parameters alongside rational coefficients.
- Fine-tune with a standard sparsity-regularized reconstruction objective.
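The three steps above can be sketched as a single forward pass plus loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the L1 penalty is one common choice of sparsity regularizer, the way `scale` enters (rescaling pre-activations and outputs) is a guess, and every name here (`rational_act`, `rsae_loss`, `l1_coeff`) is hypothetical.

```python
import numpy as np
from numpy.polynomial.polynomial import polyval


def rational_act(z, a, b):
    """Pole-free rational activation: P(z) / (1 + |b_1 z + ... + b_n z^n|).

    Coefficients are in ascending order; b[0] multiplies z^1.
    """
    return polyval(z, a) / (1.0 + np.abs(z * polyval(z, b)))


def rsae_loss(x, W_enc, b_enc, W_dec, b_dec, a, b, scale, l1_coeff):
    """Reconstruction error plus an L1 sparsity penalty on the latents."""
    pre = x @ W_enc + b_enc                             # encoder pre-activations
    latents = scale * rational_act(pre / scale, a, b)   # assumed use of the scale parameter
    x_hat = latents @ W_dec + b_dec                     # decoder reconstruction
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(latents), axis=-1))
    return recon + l1_coeff * sparsity, latents


# Toy dimensions and random weights standing in for a trained baseline SAE.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
x = rng.normal(size=(4, d_model))
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()
b_enc, b_dec = np.zeros(d_sae), np.zeros(d_model)
a = np.array([0.0, 1.0])   # P(z) = z
b = np.array([0.0])        # Q(z) = 1, so the activation starts as the identity

loss, latents = rsae_loss(x, W_enc, b_enc, W_dec, b_dec, a, b, scale=1.0, l1_coeff=0.01)
print(f"loss = {loss:.4f}, latents shape = {latents.shape}")
```

In an actual fine-tuning run, `a`, `b`, and `scale` would be trained jointly with the encoder and decoder weights using an autodiff framework; NumPy is used here only to keep the objective easy to read.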
Topics
- Sparse Autoencoders
- Mechanistic Interpretability
- Rational Functions
- Language Models
- Encoder Activations
- Model Fine-tuning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.