Linearizing Vision Transformer with Test-Time Training
Summary
A new method converts Softmax attention-based Vision Transformers to linear-complexity attention without expensive retraining. The approach, called SD3.5-T^5 (Transformer To Test Time Training), builds on Test-Time Training (TTT) because its structure aligns with Softmax attention, which lets the linearized model inherit the pretrained weights directly. To bridge the remaining representational gap, the method adds key instance normalization and a lightweight locality enhancement module. Validated by linearizing Stable Diffusion 3.5, SD3.5-T^5 reached text-to-image quality comparable to the fine-tuned Softmax model after only 1 hour of fine-tuning on four H20 GPUs. The conversion yielded inference speedups of 1.32x at 1K resolution and 1.47x at 2K resolution.
Key takeaway
For AI engineers optimizing Vision Transformer inference, this work shows a practical way to accelerate models such as Stable Diffusion 3.5. Consider using Test-Time Training together with the proposed alignment techniques to convert existing Softmax attention models: the reported result is a 1.47x speedup at 2K resolution after 1 hour of fine-tuning on four H20 GPUs, which can lower serving costs and latency.
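As a rough way to read the reported speedups, a speedup factor s shortens latency by 1 - 1/s, assuming the speedup applies uniformly to the whole inference pass. The helper below is an illustrative calculation on the two quoted figures, not a measurement; the function name is hypothetical.

```python
def latency_reduction(speedup):
    """Fraction of latency removed by a given speedup factor."""
    return 1.0 - 1.0 / speedup

# Speedup figures as quoted in the summary (1K and 2K resolution).
for resolution, speedup in [("1K", 1.32), ("2K", 1.47)]:
    print(f"{resolution}: {latency_reduction(speedup):.1%} less latency")
# prints:
# 1K: 24.2% less latency
# 2K: 32.0% less latency
```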
Key insights
Linearizing Vision Transformers with Test-Time Training enables efficient weight transfer and faster inference.
Principles
- Architectural alignment facilitates weight transfer.
- Representational alignment improves model performance.
Method
The method aligns Softmax and linear attention via Test-Time Training, key instance normalization, and a locality enhancement module to transfer pretrained weights.
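To make the structural alignment concrete, here is a minimal NumPy sketch. It is not the paper's implementation. It assumes the simplest TTT inner model: a linear map W trained with one full-batch gradient step, starting from W = 0, on the loss 0.5 * ||K W - V||^2. Under that assumption the TTT read-out Q W equals unnormalized linear attention Q (K^T V), so it consumes the same Q, K, and V projections that Softmax attention uses. The function names, the learning rate `eta`, and the tensor shapes are illustrative choices.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard Softmax attention: builds an N x N score matrix, O(N^2 d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def ttt_linear_one_step(q, k, v, eta=1.0):
    """Assumed TTT variant: one gradient step on 0.5 * ||k @ W - v||^2 from W = 0."""
    w = np.zeros((k.shape[-1], v.shape[-1]))
    grad = k.T @ (k @ w - v)   # gradient of the loss w.r.t. W, shape (d_k, d_v)
    w = w - eta * grad         # fast weights after the step: eta * k.T @ v
    return q @ w               # read-out is O(N d^2); no N x N matrix is formed

def linear_attention(q, k, v, eta=1.0):
    """Unnormalized linear attention, computed as q @ (k.T @ v)."""
    return eta * (q @ (k.T @ v))

rng = np.random.default_rng(0)
n_tokens, dim = 64, 8
q, k, v = (rng.standard_normal((n_tokens, dim)) for _ in range(3))

ttt_out = ttt_linear_one_step(q, k, v, eta=0.1)
lin_out = linear_attention(q, k, v, eta=0.1)
soft_out = softmax_attention(q, k, v)

print(np.allclose(ttt_out, lin_out))    # the two linear forms agree
print(ttt_out.shape == soft_out.shape)  # same interface as Softmax attention
```

The outputs of the linear and Softmax variants differ numerically, which is the representational gap that the summary says key instance normalization and the locality enhancement module are meant to narrow.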
In practice
- Linearize existing Softmax models for faster inference.
- Apply key instance normalization for representational alignment.
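The summary does not spell out how key instance normalization is computed. One plausible reading, offered purely as an assumption, is that each key channel is standardized across the tokens of a single image, which is the usual meaning of instance normalization. The helper name `key_instance_norm`, the `eps` value, and the tensor layout are hypothetical.

```python
import numpy as np

def key_instance_norm(k, eps=1e-6):
    """Standardize keys per channel over the tokens of each sample.

    k has shape (batch, tokens, channels). This is an assumed form of
    key instance normalization, not a formulation confirmed by the paper.
    """
    mean = k.mean(axis=1, keepdims=True)  # per-sample, per-channel mean
    var = k.var(axis=1, keepdims=True)    # per-sample, per-channel variance
    return (k - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Synthetic keys with a non-zero mean and non-unit scale.
keys = 3.0 + 2.0 * rng.standard_normal((2, 256, 16))
keys_normed = key_instance_norm(keys)

# After normalization each channel has roughly zero mean and unit
# standard deviation within every sample.
print(np.abs(keys_normed.mean(axis=1)).max())
print(keys_normed.std(axis=1).mean())
```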
Topics
- Linear-complexity Attention
- Vision Transformers
- Test-Time Training
- Weight Transfer
- Stable Diffusion 3.5
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.