MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision, Emerging Technologies & Innovation) · Depth: Advanced, extended

Summary

MaMe (Matrix-Based Token Merging) and MaRe (Matrix-Based Token Restoration) form a framework for making Vision Transformers (ViTs) and image synthesis models more efficient. They do this by reducing the number of tokens that self-attention, whose cost grows quadratically with token count, has to process. Unlike prior methods such as ToMe, MaMe uses only GPU-friendly matrix operations, avoiding inefficient sorting and scattered writes, and it is both training-free and differentiable. Applied to pre-trained ViT-B models, MaMe doubles throughput with a 2% accuracy drop, while fine-tuning the last layer can boost accuracy by 1.0% at 1.1x speed. For SigLIP2-B@512 zero-shot classification, it provides 1.3x acceleration with minimal performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. The MaMe+MaRe pipeline also improves image quality and reduces Stable Diffusion v2.1 generation latency by 31%, which indicates that the approach carries over to both perception and synthesis tasks.
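
To make the quadratic-cost argument concrete, the sketch below estimates how the attention-score work shrinks when the token count is roughly halved. The token count of 197 (a 14x14 patch grid plus one class token, as in a ViT-B/16 at 224x224) and the width of 768 are illustrative assumptions, not figures taken from the summary, and `attention_score_flops` is a hypothetical helper that counts only the two token-by-token matrix products.

```python
def attention_score_flops(num_tokens: int, dim: int) -> int:
    """Rough multiply-add count for QK^T plus the attention-weighted sum over V.

    Each of the two products costs about num_tokens^2 * dim multiply-adds
    (summed over all heads). Projections and the MLP are ignored on purpose:
    they scale linearly in num_tokens, not quadratically.
    """
    return 2 * num_tokens * num_tokens * dim


# Assumed sizes: 197 tokens at width 768, then about half as many tokens.
full = attention_score_flops(197, 768)
merged = attention_score_flops(99, 768)
ratio = full / merged  # close to 4: halving tokens quarters the quadratic term

print(f"attention-score cost ratio: {ratio:.2f}")
```

Because the remaining layers scale only linearly with token count, an end-to-end speedup is smaller than this ratio, which is consistent with the roughly 2x throughput gain reported above.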

Key takeaway

For AI engineers and research scientists working with Vision Transformers or generative models, MaMe and MaRe offer a way to cut compute at a modest accuracy cost: about 2x throughput on a pre-trained ViT-B for a 2% accuracy drop, with no retraining. Where fine-tuning is an option, applying MaMe to the last transformer block is worth trying first, since the reported result there is a 1.0% accuracy gain at 1.1x speed. The framework also looks promising for improving image and video generation quality, and it could be adapted for KV cache reduction in LLMs.

Key insights

MaMe and MaRe offer GPU-efficient, matrix-based token merging and restoration for accelerating Vision Transformers and image synthesis.

Method

MaMe partitions tokens into destination and source sets, computes cosine similarity, refines weights with dynamic thresholds, and aggregates tokens. MaRe reconstructs tokens using the stored fusion matrix.
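
The steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the alternating destination/source split, the per-row threshold (the row mean of the similarities), the exponential weighting with a `temperature` parameter, and the normalized-transpose restoration are all choices made here for the example, and `mame_merge` and `mare_restore` are hypothetical names.

```python
import numpy as np


def mame_merge(x, temperature=0.1, eps=1e-8):
    """Merge N tokens of shape (N, D) into N_dst tokens using only matrix ops.

    Returns the merged tokens and the unnormalized assignment matrix, which
    is stored so the restoration step can reuse it.
    """
    n = x.shape[0]
    dst_idx = np.arange(0, n, 2)  # assumption: even positions are destinations
    src_idx = np.arange(1, n, 2)  # assumption: odd positions are sources

    # Cosine similarity between every source and every destination token.
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    sim = xn[src_idx] @ xn[dst_idx].T  # (n_src, n_dst)

    # Assumed "dynamic threshold": each source keeps only the destinations
    # whose similarity reaches that source's own row mean. Taking the minimum
    # with the row max guarantees at least one destination survives per row.
    tau = np.minimum(sim.mean(axis=1, keepdims=True),
                     sim.max(axis=1, keepdims=True))
    weights = np.exp(sim / temperature) * (sim >= tau)

    # Assignment matrix (n_dst, N): identity on destinations, weights on sources.
    assign = np.zeros((dst_idx.size, n))
    assign[:, dst_idx] = np.eye(dst_idx.size)
    assign[:, src_idx] = weights.T

    # Row-normalize so each merged token is a convex combination of inputs.
    fusion = assign / assign.sum(axis=1, keepdims=True)
    return fusion @ x, assign


def mare_restore(merged, assign):
    """Rebuild N tokens from the merged ones using the stored assignment."""
    back = assign.T  # (N, n_dst)
    back = back / back.sum(axis=1, keepdims=True)
    return back @ merged


rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
merged_tokens, stored = mame_merge(tokens)
restored_tokens = mare_restore(merged_tokens, stored)
print(tokens.shape, merged_tokens.shape, restored_tokens.shape)
```

In this sketch both steps reduce to dense matrix products with no sorting, which is the property the summary highlights. Restoration is lossy by construction: each destination position receives its merged token back, and each source position receives a weighted average of the merged tokens it contributed to.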

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.