Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision
Summary
Cross-Layer Transcoders (CLTs) are introduced as sparse, depth-aware proxy models for MLP blocks within Vision Transformers (ViTs), aiming to enhance interpretability and trustworthiness. Unlike Sparse Autoencoders, CLTs capture cross-layer computational structure by reconstructing post-MLP activations from sparse embeddings of preceding layers. The researchers trained CLTs on CLIP ViT-B/32 and ViT-B/16 across CIFAR-100, COCO, and ImageNet-100 datasets, demonstrating high reconstruction fidelity (cosine similarities 0.92–0.97, R² 0.89–0.95) and preserved or improved CLIP zero-shot classification accuracy. CLTs provide faithful attribution through cross-layer contribution scores, revealing that the final representation is concentrated in a small subset of dominant layers. Ablation experiments showed removing the single highest-scored layer degraded accuracy by up to 5.6%, while retaining only the top-4 layers recovered near-baseline performance.
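The two fidelity metrics quoted above can be computed roughly as follows. This is a minimal sketch, not the authors' evaluation code: the function names are hypothetical, and the choice to pool R² over all tokens and dimensions is an assumption, since the summary does not say how the metric was aggregated.

```python
import numpy as np

def mean_cosine_similarity(y, y_hat, eps=1e-8):
    """Mean per-token cosine similarity between true and reconstructed activations.

    y, y_hat: arrays of shape (n_tokens, d_model).
    """
    num = (y * y_hat).sum(axis=-1)
    den = np.linalg.norm(y, axis=-1) * np.linalg.norm(y_hat, axis=-1) + eps
    return float((num / den).mean())

def r_squared(y, y_hat):
    """Fraction of variance explained, pooled over tokens and dimensions (assumed)."""
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean(axis=0)) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)
```

A perfect reconstruction gives values of 1 for both metrics, while predicting the per-dimension mean for every token gives an R² of 0.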
Key takeaway
Research scientists working on Vision Transformer interpretability should consider integrating Cross-Layer Transcoders (CLTs) into their analysis workflows. CLTs provide a robust method for understanding cross-layer contributions and identifying critical layers, especially for the [CLS] token, without significantly impacting zero-shot classification performance. This enables a more granular, depth-aware understanding of how information flows and aggregates within ViTs, which is crucial for building more transparent and controllable models.
Key insights
Cross-Layer Transcoders offer a depth-aware, interpretable proxy for Vision Transformer MLP blocks while preserving or improving CLIP zero-shot accuracy in the reported experiments.
Principles
- Cross-layer context improves interpretability over single-layer analysis.
- Final ViT representations are shaped by a few dominant layers.
- [CLS] tokens aggregate information broadly across layers.
Method
CLTs reconstruct post-MLP activations from sparse features of preceding layers using an encoder-decoder scheme. Three sparsity functions (JumpReLU, ReLU-Top-k, Abs-Top-k) were tested, with ReLU-Top-k often performing best.
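The encoder-decoder scheme and the three sparsity functions can be sketched as below. This is an illustration under stated assumptions, not the paper's implementation: the names `clt_forward`, `W_enc`, `b_enc`, and `W_dec` are hypothetical, the encoder is assumed to be a single linear map per layer, and the reconstruction at a target layer is assumed to be the sum of decoded sparse codes from that layer and every earlier one.

```python
import numpy as np

def relu_top_k(pre, k):
    """Apply ReLU, then keep the k largest activations per row and zero the rest."""
    act = np.maximum(pre, 0.0)
    idx = np.argpartition(act, -k, axis=-1)[..., -k:]
    out = np.zeros_like(act)
    np.put_along_axis(out, idx, np.take_along_axis(act, idx, axis=-1), axis=-1)
    return out

def abs_top_k(pre, k):
    """Keep the k largest-magnitude entries per row, preserving their sign."""
    idx = np.argpartition(np.abs(pre), -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return out

def jump_relu(pre, theta):
    """Pass values through unchanged where they exceed the threshold, else zero."""
    return np.where(pre > theta, pre, 0.0)

def clt_forward(mlp_inputs, W_enc, b_enc, W_dec, k):
    """Sketch of a cross-layer transcoder forward pass (assumed structure).

    mlp_inputs: list of arrays (n_tokens, d_model), one per layer.
    W_enc[l]:   (d_model, n_features) encoder weights for layer l.
    b_enc[l]:   (n_features,) encoder bias for layer l.
    W_dec[(src, tgt)]: (n_features, d_model) decoder from layer src to
                       layer tgt, defined for src <= tgt.
    Returns the sparse codes and the reconstructed post-MLP activations.
    """
    codes = [relu_top_k(x @ W_enc[l] + b_enc[l], k)
             for l, x in enumerate(mlp_inputs)]
    recons = []
    for tgt in range(len(mlp_inputs)):
        # Each target layer sums contributions from itself and all earlier layers.
        recons.append(sum(codes[src] @ W_dec[(src, tgt)]
                          for src in range(tgt + 1)))
    return codes, recons
```

In this sketch the sparsity function is fixed to ReLU-Top-k, the variant the summary reports as often performing best; swapping in `abs_top_k` or `jump_relu` only changes the line that builds `codes`.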
In practice
- Replace ViT MLP blocks with CLTs for interpretability.
- Focus on [CLS] token or late-layer substitutions for minimal accuracy impact.
- Use projection-based attribution scores to identify critical layers.
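One way a projection-based layer score and a top-n selection might look is sketched below. The exact scoring formula is not given in this summary, so the definition here is an assumption: each source layer's decoded contribution to the final representation is projected onto the direction of the summed representation, which makes the scores sum to 1. The function names are hypothetical.

```python
import numpy as np

def projection_scores(contribs):
    """Score each source layer by projecting its contribution onto the total.

    contribs: array (n_layers, d_model); row l is layer l's decoded
              contribution to the final representation.
    Returns an array of n_layers scores that sum to 1 (assumed definition).
    """
    total = contribs.sum(axis=0)
    return contribs @ total / (total @ total)

def top_layers(scores, n):
    """Indices of the n highest-scoring layers, best first."""
    return np.argsort(scores)[::-1][:n]

def keep_only(contribs, layer_idx):
    """Rebuild the representation from a chosen subset of layers."""
    return contribs[layer_idx].sum(axis=0)
```

Under these assumptions, an ablation like the one described in the summary would compare downstream accuracy using `keep_only(contribs, top_layers(scores, 4))` against the representation built from all layers.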
Topics
- Cross-Layer Transcoders
- Vision Transformers
- Model Interpretability
- Sparse Autoencoders
- Zero-Shot Classification
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.