MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis
Summary
MaMe (Matrix-Based Token Merging) is a training-free, differentiable token merging method that accelerates Vision Transformers (ViTs) by reducing the number of tokens that pass through self-attention, whose cost grows quadratically with token count. Unlike existing methods such as ToMe, MaMe uses only GPU-friendly matrix operations, avoiding inefficient sorting and scattered writes. The authors also introduce MaRe, an inverse operation for token restoration, which combines with MaMe into a MaMe+MaRe pipeline for image synthesis. Applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop, and fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. It accelerates SigLIP2-B@512 zero-shot classification by 1.3x with negligible degradation, and VideoMAE-L by 48.5% on Kinetics-400 with a 0.84% accuracy loss. For image synthesis, the MaMe+MaRe pipeline reduces Stable Diffusion v2.1 generation latency by 31% while improving quality.
Key takeaway
For AI engineers optimizing Vision Transformer inference, MaMe offers a speedup without retraining the model. Consider integrating MaMe into existing ViT pipelines to roughly double throughput at a cost of about 2% accuracy, or to gain about 1.0% accuracy at 1.1x speed by fine-tuning only the last layer. This is most relevant for large models and real-time applications. The MaMe+MaRe pipeline also provides a path to accelerating image synthesis models such as Stable Diffusion, reducing generation latency by nearly a third.
Key insights
MaMe and MaRe offer GPU-efficient, matrix-based token merging and restoration for accelerating Vision Transformers and image synthesis.
Principles
- Matrix operations enhance GPU efficiency.
- Token merging reduces quadratic complexity.
- Inverse operations enable synthesis pipelines.
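To make the second principle concrete, the short sketch below counts entries in the self-attention score matrix before and after merging. The token counts are illustrative assumptions (196 is the number of 16x16 patches in a 224x224 image), not figures taken from the paper, and the function names are hypothetical.

```python
def attention_score_entries(num_tokens: int) -> int:
    """Entries in the N x N attention score matrix for N tokens."""
    return num_tokens * num_tokens


def relative_attention_cost(tokens_before: int, tokens_after: int) -> float:
    """Ratio of score-matrix sizes after vs. before merging."""
    return attention_score_entries(tokens_after) / attention_score_entries(tokens_before)


# Illustrative numbers only: halving the token count quarters the
# pairwise attention work.
ratio = relative_attention_cost(196, 98)
print(ratio)  # 0.25
```

End-to-end speedups are typically smaller than this ratio suggests, because other parts of a transformer block, such as the MLP, scale linearly rather than quadratically with token count.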
Method
MaMe performs training-free, differentiable token merging via matrix operations. MaRe is its inverse for token restoration, forming a pipeline for tasks like image synthesis.
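The general idea of merging and restoring tokens with dense matrix products can be sketched as follows. This is not the authors' algorithm: the strided choice of destination tokens, the softmax assignment, the `temperature` parameter, and the function names `merge_tokens` and `restore_tokens` are all assumptions made for illustration. The sketch only shows that both directions can be written without sorting or scattered writes.

```python
import numpy as np


def merge_tokens(x, num_out, temperature=0.1):
    """Merge N tokens into num_out tokens using only matrix operations.

    x: array of shape (N, d), with num_out <= N.
    Returns (merged, a): merged has shape (num_out, d); a is the (N, num_out)
    soft assignment matrix, kept for later restoration.
    """
    n, _ = x.shape
    # Assumption: evenly strided tokens act as merge destinations.
    idx = np.linspace(0, n - 1, num_out).round().astype(int)
    # Cosine similarity between every token and every destination token.
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = xn @ xn[idx].T                          # (N, num_out)
    # Row-wise softmax: each token spreads its weight over destinations.
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    a = np.exp(logits)
    a = a / a.sum(axis=1, keepdims=True)
    # Normalize columns so each merged token is a weighted average.
    w = a / (a.sum(axis=0, keepdims=True) + 1e-8)
    merged = w.T @ x                              # (num_out, d)
    return merged, a


def restore_tokens(merged, a):
    """Approximate inverse: expand merged tokens back to N positions."""
    return a @ merged                             # (N, d)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
merged, assign = merge_tokens(tokens, num_out=4)
restored = restore_tokens(merged, assign)
print(merged.shape, restored.shape)
```

Because every step is a matrix product or an elementwise operation, the whole sketch is differentiable and maps directly onto batched GPU kernels, which is the property the summary attributes to MaMe and MaRe.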
In practice
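- Apply MaMe to pre-trained ViTs for about 2x throughput (ViT-B, with a 2% accuracy drop).
- Fine-tune the last layer with MaMe for accuracy gains (ViT-B: +1.0% at 1.1x speed).
- Use MaMe+MaRe for faster Stable Diffusion (v2.1: 31% lower generation latency).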
Topics
- MaMe
- MaRe
- Vision Transformers
- Token Merging
- Image Synthesis
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.