Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

Cross-Layer Transcoders (CLTs) are introduced as sparse, depth-aware proxy models for MLP blocks within Vision Transformers (ViTs), aiming to enhance interpretability and trustworthiness. Unlike Sparse Autoencoders, CLTs capture cross-layer computational structure by reconstructing post-MLP activations from sparse embeddings of preceding layers. The researchers trained CLTs on CLIP ViT-B/32 and ViT-B/16 across CIFAR-100, COCO, and ImageNet-100 datasets, demonstrating high reconstruction fidelity (cosine similarities 0.92–0.97, R² 0.89–0.95) and preserved or improved CLIP zero-shot classification accuracy. CLTs provide faithful attribution through cross-layer contribution scores, revealing that the final representation is concentrated in a small subset of dominant layers. Ablation experiments showed removing the single highest-scored layer degraded accuracy by up to 5.6%, while retaining only the top-4 layers recovered near-baseline performance.
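The two fidelity metrics quoted above can be computed as in the following minimal sketch. It assumes the original and reconstructed post-MLP activations are available as arrays of shape (tokens, d_model). The function names are illustrative, and the exact averaging convention used in the paper is not stated in this summary, so the per-token mean for cosine similarity and the pooled R² below are assumptions.

```python
import numpy as np

def cosine_similarity(target, recon):
    """Mean cosine similarity between matching rows of target and recon."""
    num = (target * recon).sum(axis=-1)
    den = np.linalg.norm(target, axis=-1) * np.linalg.norm(recon, axis=-1)
    return float(np.mean(num / den))

def r_squared(target, recon):
    """Coefficient of determination: 1 - residual variance / total variance.

    Total variance is measured around the per-dimension mean of target,
    so predicting that mean everywhere gives a score of 0.
    """
    ss_res = ((target - recon) ** 2).sum()
    ss_tot = ((target - target.mean(axis=0)) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)
```

A perfect reconstruction scores 1 on both metrics. Cosine similarity ignores the scale of the reconstruction while R² does not, which is one reason to report the two together.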

Key takeaway

Research scientists working on Vision Transformer interpretability should consider adding Cross-Layer Transcoders (CLTs) to their analysis workflows. CLTs attribute the final representation to individual layers and identify the critical ones, especially for the [CLS] token, while preserving zero-shot classification accuracy in the reported experiments. This gives a finer-grained, depth-aware view of how information flows and aggregates within ViTs, which matters for building more transparent and controllable models.

Key insights

Cross-Layer Transcoders serve as depth-aware, interpretable proxies for Vision Transformer MLP blocks, with CLIP zero-shot accuracy preserved or improved in the reported experiments.

Principles

Cross-layer context improves interpretability over single-layer analysis.
Final ViT representations are shaped by a few dominant layers.
CLS tokens aggregate information broadly across layers.

Method

CLTs reconstruct post-MLP activations from sparse features of preceding layers using an encoder-decoder scheme. Three sparsity functions (JumpReLU, ReLU-Top-k, Abs-Top-k) were tested, with ReLU-Top-k often performing best.
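The scheme above can be sketched roughly as follows. This is an illustrative NumPy version, not the authors' implementation. The function names, the weight layout (`enc`, `dec`), and the choice to let a layer's own features contribute to its reconstruction are assumptions; biases and training are omitted.

```python
import numpy as np

def relu_top_k(z, k):
    """Apply ReLU, then keep the k largest values per row and zero the rest."""
    a = np.maximum(z, 0.0)
    idx = np.argpartition(a, -k, axis=-1)[..., -k:]  # positions of the top k
    out = np.zeros_like(a)
    np.put_along_axis(out, idx, np.take_along_axis(a, idx, axis=-1), axis=-1)
    return out

def abs_top_k(z, k):
    """Keep the k entries of largest magnitude per row, preserving sign."""
    idx = np.argpartition(np.abs(z), -k, axis=-1)[..., -k:]
    out = np.zeros_like(z)
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out

def jump_relu(z, theta):
    """Pass values through unchanged where they exceed theta, else zero."""
    return np.where(z > theta, z, 0.0)

def clt_forward(mlp_inputs, enc, dec, k):
    """Reconstruct each layer's post-MLP activation from sparse features.

    mlp_inputs[l]: (tokens, d_model) input to MLP block l.
    enc[l]:        (d_model, n_feat) encoder for layer l (assumed layout).
    dec[(s, t)]:   (n_feat, d_model) decoder from source layer s to
                   target layer t, defined for s <= t (assumed layout).
    """
    feats = [relu_top_k(x @ enc[l], k) for l, x in enumerate(mlp_inputs)]
    recon = []
    for t in range(len(mlp_inputs)):
        # Target layer t sums decoded features from every layer up to t.
        recon.append(sum(feats[s] @ dec[(s, t)] for s in range(t + 1)))
    return feats, recon
```

Swapping `relu_top_k` for `abs_top_k` or `jump_relu` inside `clt_forward` changes only the sparsity function. The two Top-k variants fix the number of active features per token, whereas JumpReLU relies on a threshold instead.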

In practice

Replace ViT MLP blocks with CLTs for interpretability.
Focus on [CLS] token or late-layer substitutions for minimal accuracy impact.
Use projection-based attribution scores to identify critical layers.
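One way to picture the attribution step in the list above is the following sketch. The scoring rule is an assumption, since this summary does not give the paper's exact definition: each layer's contribution vector is projected onto the summed final representation, and the scores are normalized to sum to 1. The function names are hypothetical.

```python
import numpy as np

def projection_scores(contribs):
    """Score each layer by projecting its contribution onto the final vector.

    contribs: (n_layers, d) per-layer contributions whose sum is taken to be
    the final representation. Scores sum to 1 by construction.
    """
    contribs = np.asarray(contribs, dtype=float)
    final = contribs.sum(axis=0)
    return contribs @ final / (final @ final)

def top_layers(scores, k):
    """Indices of the k highest-scoring layers, best first."""
    return np.argsort(scores)[::-1][:k]

def keep_only(contribs, layers):
    """Rebuild the final vector from a subset of layers, ablating the rest."""
    return np.asarray(contribs, dtype=float)[list(layers)].sum(axis=0)
```

With scores in hand, `top_layers(scores, 4)` followed by `keep_only` mirrors the shape of the top-4 retention experiment described in the summary, although the paper measures the effect on classification accuracy rather than on the vector itself.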

Topics

Cross-Layer Transcoders
Vision Transformers
Model Interpretability
Sparse Autoencoders
Zero-Shot Classification

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.