MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis
Summary
MaMe (Matrix-Based Token Merging) and MaRe (Matrix-Based Token Restoration) together form a framework for improving the efficiency of Vision Transformers (ViTs) and image synthesis models by reducing the number of tokens that self-attention, whose cost grows quadratically with token count, has to process. Unlike prior methods such as ToMe, MaMe uses only GPU-friendly matrix operations, avoiding inefficient sorting and scattered writes, and is both training-free and differentiable. Applied to pre-trained ViT-B models, MaMe doubles throughput with a 2% accuracy drop, while fine-tuning the last layer boosts accuracy by 1.0% at 1.1x speed. For SigLIP2-B@512 zero-shot classification, it provides 1.3x acceleration with minimal performance degradation. On video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. The MaMe+MaRe pipeline also improves image quality while reducing Stable Diffusion v2.1 generation latency by 31%.
Key takeaway
For AI Engineers and Research Scientists working with Vision Transformers or generative models, integrating MaMe and MaRe can raise throughput substantially (about 2x on pre-trained ViT-B) at a small accuracy cost (about 2%). Consider fine-tuning pre-trained models with MaMe, particularly with MaMe applied to the last transformer block, where the reported results show both a speedup and an accuracy improvement. The framework is also a promising approach for improving image and video generation quality, and it could potentially be adapted for KV cache reduction in LLMs.
Key insights
MaMe and MaRe offer GPU-efficient, matrix-based token merging and restoration for accelerating Vision Transformers and image synthesis.
Principles
- Matrix operations are more GPU-friendly than sorting or scattered writes.
- Adaptive token compression can act as an implicit regularizer.
- Preserving high-frequency tokens enhances visual quality.
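The first principle can be illustrated with a small NumPy sketch. It is not taken from the paper: the group assignment `assign` is a hypothetical example, and the snippet only shows that summing tokens per group with a scattered write gives the same result as one dense matrix product with a one-hot assignment matrix. It does not measure speed.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))          # 6 tokens, 4 channels
assign = np.array([0, 1, 0, 2, 1, 2])     # hypothetical target group for each token
num_groups = 3

# Scattered write: accumulate every token into the slot of its group.
scatter_sum = np.zeros((num_groups, tokens.shape[1]))
np.add.at(scatter_sum, assign, tokens)

# Matrix form: build a (num_groups, num_tokens) one-hot matrix, then one matmul.
one_hot = (np.arange(num_groups)[:, None] == assign[None, :]).astype(tokens.dtype)
matmul_sum = one_hot @ tokens

print(np.allclose(scatter_sum, matmul_sum))
```

On a GPU the second form maps to a single dense kernel, which is the property the principle refers to; the one-hot matrix here is a special case of a weighted fusion matrix.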
Method
MaMe partitions tokens into destination and source sets, computes cosine similarity, refines weights with dynamic thresholds, and aggregates tokens. MaRe reconstructs tokens using the stored fusion matrix.
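The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the even/odd destination/source split, the fixed threshold `tau` (standing in for the dynamic thresholds, whose details are not given here), the row normalization, and the function names `merge_tokens` and `restore_tokens` are all hypothetical. In this simplified version, a source token with no destination above the threshold is dropped at merge time and restored as zeros.

```python
import numpy as np

def merge_tokens(x, tau=0.8):
    """Merge (N, D) tokens into (N_dst, D); also return the (N_dst, N) fusion matrix."""
    n = x.shape[0]
    eye = np.eye(n)
    p_dst = eye[0::2]  # selection matrix: even-indexed tokens are destinations (assumed)
    p_src = eye[1::2]  # selection matrix: odd-indexed tokens are sources (assumed)

    # Cosine similarity between every destination and every source token.
    x_unit = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = (p_dst @ x_unit) @ (p_src @ x_unit).T        # (N_dst, N_src)

    # Keep only similarities at or above the (assumed fixed, positive) threshold.
    weights = np.where(sim >= tau, sim, 0.0)

    # Each destination keeps itself with weight 1 and absorbs its matched sources.
    fusion = p_dst + weights @ p_src                   # (N_dst, N)
    merged = (fusion / fusion.sum(axis=1, keepdims=True)) @ x
    return merged, fusion

def restore_tokens(merged, fusion):
    """Map (N_dst, D) merged tokens back to (N, D) using the stored fusion matrix."""
    back = fusion.T                                    # (N, N_dst)
    row_sum = back.sum(axis=1, keepdims=True)
    back = back / np.maximum(row_sum, 1e-8)            # unmatched sources stay all-zero
    return back @ merged

# Two pairs of near-duplicate tokens: each pair collapses into one merged token.
x = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
merged, fusion = merge_tokens(x)
restored = restore_tokens(merged, fusion)
print(merged.shape, restored.shape)
```

Every step is a dense matrix product or an elementwise operation, so the sketch contains no sorting and no scattered writes, and the same fusion matrix drives both merging and restoration.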
In practice
- Apply MaMe to the last transformer block for accuracy gains.
- Use a similarity threshold of 0.8 to balance speed and accuracy.
- Integrate MaMe+MaRe for improved image synthesis quality.
Topics
- Token Merging
- Token Restoration
- Vision Transformers
- Computational Efficiency
- Image Synthesis
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.