Linearizing Vision Transformer with Test-Time Training

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

A new method converts Softmax-attention Vision Transformers to linear-complexity attention without expensive retraining. The approach, T^5 (Transformer To Test Time Training), uses Test-Time Training (TTT) as the replacement layer because its structure aligns with Softmax attention, so pretrained weights can be inherited directly. To close the remaining representational gap, it adds key instance normalization and a lightweight locality enhancement module. Applied to Stable Diffusion 3.5, the resulting model, SD3.5-T^5, reaches text-to-image quality comparable to the fine-tuned Softmax model after only 1 hour of fine-tuning on 4xH20 GPUs, with inference speedups of 1.32x at 1K resolution and 1.47x at 2K.
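To make "linear-complexity" concrete, the sketch below compares rough per-head operation counts for Softmax attention and a linear-attention-style layer. The head dimension and token counts are illustrative assumptions, not figures from the paper, and the counts cover the attention mixing step only, so the ratios are not a prediction of end-to-end speedup.

```python
def softmax_attention_ops(n_tokens: int, head_dim: int) -> int:
    """Rough multiply-add count for one head: Q @ K^T, then weights @ V."""
    return 2 * n_tokens * n_tokens * head_dim


def linear_attention_ops(n_tokens: int, head_dim: int) -> int:
    """Rough multiply-add count for one head: build K^T @ V (d x d), then Q @ state."""
    return 2 * n_tokens * head_dim * head_dim


HEAD_DIM = 64  # assumed head dimension, for illustration only

for n_tokens in (4096, 16384):  # assumed sequence lengths, for illustration only
    quadratic = softmax_attention_ops(n_tokens, HEAD_DIM)
    linear = linear_attention_ops(n_tokens, HEAD_DIM)
    print(f"tokens={n_tokens}: softmax/linear op ratio = {quadratic / linear:.0f}")
```

Under this count the ratio is simply tokens divided by head dimension, so the gap widens as resolution grows. That matches the direction of the reported numbers, where the speedup is larger at 2K than at 1K.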

Key takeaway

For AI engineers optimizing Vision Transformer inference, this work shows a practical way to speed up models such as Stable Diffusion 3.5 without retraining from scratch. Consider converting existing Softmax attention layers to Test-Time Training layers with the proposed alignment techniques: the reported cost is about 1 hour of fine-tuning on 4xH20 GPUs, and the reported gain is a 1.32x to 1.47x inference speedup at 1K to 2K resolution, which lowers serving cost and latency.

Key insights

Linearizing Vision Transformers with Test-Time Training enables efficient weight transfer and faster inference.

Method

The method aligns Softmax and linear attention via Test-Time Training, key instance normalization, and a locality enhancement module to transfer pretrained weights.
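The structural alignment can be illustrated with a minimal NumPy sketch. It uses a well-known special case: a TTT layer whose inner model is linear, starts from zero weights, and takes one full-batch gradient step is equivalent to unnormalized linear attention, so the same query, key, and value projections can be reused. Everything here is an assumption for illustration: the function names are hypothetical, the random matrices stand in for pretrained projections, and `instance_norm_keys` is one plausible reading of "key instance normalization" (per-channel normalization over tokens). The paper's actual inner model, update rule, and locality module may differ.

```python
import numpy as np


def instance_norm_keys(k: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Assumed form of key instance normalization: per-channel, over the token axis."""
    mean = k.mean(axis=0, keepdims=True)
    std = k.std(axis=0, keepdims=True)
    return (k - mean) / (std + eps)


def ttt_linear_one_step(q: np.ndarray, k: np.ndarray, v: np.ndarray, lr: float = 1.0) -> np.ndarray:
    """One-step TTT with a linear inner model f_W(x) = x @ W and W initialised to zero."""
    w = np.zeros((k.shape[1], v.shape[1]))
    grad = k.T @ (k @ w - v)  # gradient of 0.5 * ||k @ w - v||^2 with respect to w
    w = w - lr * grad         # equals lr * k.T @ v because w started at zero
    return q @ w              # cost grows linearly with the number of tokens


def linear_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Unnormalized linear attention, computed the quadratic way for comparison."""
    return (q @ k.T) @ v


rng = np.random.default_rng(0)
n_tokens, d_model = 32, 16
x = rng.standard_normal((n_tokens, d_model))
# Hypothetical stand-ins for the pretrained Q/K/V projections that would be inherited.
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

q, v = x @ w_q, x @ w_v
k = instance_norm_keys(x @ w_k)

out_ttt = ttt_linear_one_step(q, k, v)
out_lin = linear_attention(q, k, v)
print(np.allclose(out_ttt, out_lin))
```

The two outputs agree because `q @ (k.T @ v)` and `(q @ k.T) @ v` are the same product grouped differently. The TTT form never materialises the token-by-token matrix, which is where the linear complexity comes from.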

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.