MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, Emerging Technologies & Innovation · Depth: Advanced, extended

Summary

MaMe (Matrix-Based Token Merging) and MaRe (Matrix-Based Token Restoration) form a framework for making Vision Transformers (ViTs) and image synthesis models more efficient by reducing the number of tokens that pass through self-attention, whose cost grows quadratically with token count. Unlike prior methods such as ToMe, MaMe uses only GPU-friendly matrix operations, avoiding inefficient sorting and scattered writes, and it is both training-free and differentiable. Applied to a pre-trained ViT-B without further training, MaMe doubles throughput with a 2% accuracy drop; fine-tuning the last layer with MaMe instead raises accuracy by 1.0% at 1.1x speed. For SigLIP2-B@512 zero-shot classification, it provides a 1.3x acceleration with minimal performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. For image synthesis, the MaMe+MaRe pipeline reduces Stable Diffusion v2.1 generation latency by 31% while also improving image quality.
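The quadratic term is what makes token reduction pay off. The short Python calculation below is an illustrative sketch only: the token count of 196 assumes a 224x224 input split into 16x16 patches (no class token), and the helper name `attention_score_entries` is hypothetical. It counts entries in a single attention score matrix and does not attempt to reproduce the throughput figures reported above.

```python
def attention_score_entries(num_tokens: int) -> int:
    """Entries in one head's N x N attention score matrix."""
    return num_tokens * num_tokens


# Assumed setup: 14 x 14 = 196 patch tokens, then half as many after merging.
full = attention_score_entries(196)
halved = attention_score_entries(98)

print(full, halved, full / halved)  # 38416 9604 4.0
```

Halving the tokens cuts this term by 4x, but projections and MLP layers scale only linearly with token count, so the end-to-end speedup of a full model is smaller than 4x.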

Key takeaway

For AI engineers and research scientists working with Vision Transformers or generative models, MaMe and MaRe offer a way to cut compute at a small accuracy cost: roughly 2x ViT-B throughput for a 2% accuracy drop with no training. If you can afford fine-tuning, consider applying MaMe to the last transformer block, the setting reported to give both a modest speedup (1.1x) and an accuracy gain (1.0%). The MaMe+MaRe pipeline is also a promising option for faster, higher-quality image generation, and the approach may be adaptable to KV cache reduction in LLMs.

Key insights

MaMe and MaRe offer GPU-efficient, matrix-based token merging and restoration for accelerating Vision Transformers and image synthesis.

Principles

Matrix operations are more GPU-friendly than sorting or scattered writes.
Adaptive token compression can act as an implicit regularizer.
Preserving high-frequency tokens enhances visual quality.
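The first principle can be made concrete with a small NumPy sketch. It shows that aggregating tokens by index (a scattered write) gives the same result as one dense matrix multiplication with a one-hot assignment matrix. The assignment vector `assign` is made up for illustration, and NumPy on a CPU only demonstrates the equivalence; the claim that the dense form runs faster on GPUs comes from the source, not from this example.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))        # 6 tokens with 4 channels each
assign = np.array([0, 2, 1, 0, 2, 2])   # hypothetical destination index per token
num_dst = 3

# Index-based aggregation: unordered scattered writes into the output buffer.
scattered = np.zeros((num_dst, tokens.shape[1]))
np.add.at(scattered, assign, tokens)

# Matrix-based aggregation: build a one-hot (6, 3) matrix, then one matmul.
one_hot = (assign[:, None] == np.arange(num_dst)[None, :]).astype(tokens.dtype)
dense = one_hot.T @ tokens

# Both routes produce the same per-destination sums.
assert np.allclose(scattered, dense)
```

Because the dense form is an ordinary matrix product, it is also differentiable with respect to the assignment weights once they are allowed to be soft rather than one-hot.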

Method

MaMe partitions tokens into destination and source sets, computes the cosine similarity between the two sets, refines the resulting weights with dynamic thresholds, and aggregates the tokens accordingly. MaRe then reconstructs the tokens from the fusion matrix stored during merging.
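The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation and not the API of the cominder/mame repository. The function names `mame_merge` and `mare_restore`, the even/odd partition, the per-row threshold fallback, and the normalization scheme are all assumptions chosen to keep every step a dense matrix operation.

```python
import numpy as np


def mame_merge(x, tau=0.8):
    """Merge N tokens of shape (N, C) into N_d tokens; also return the fusion matrix."""
    n = x.shape[0]
    dst_idx = np.arange(0, n, 2)  # assumption: even positions are destinations
    src_idx = np.arange(1, n, 2)  # assumption: odd positions are sources

    # Cosine similarity between every source and every destination token.
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = xn[src_idx] @ xn[dst_idx].T  # (N_s, N_d)

    # Assumed threshold rule: keep weights >= tau, but lower the threshold per
    # source row so that every source keeps at least its best match.
    row_tau = np.minimum(tau, sim.max(axis=1, keepdims=True))
    w_src = np.where(sim >= row_tau, np.maximum(sim, 1e-6), 0.0)

    # Fusion matrix in original token order, built from selection matrices:
    # identity on destinations plus thresholded weights on sources.
    eye = np.eye(n)
    fusion = eye[dst_idx] + w_src.T @ eye[src_idx]       # (N_d, N)
    fusion = fusion / fusion.sum(axis=1, keepdims=True)  # rows sum to 1
    return fusion @ x, fusion


def mare_restore(merged, fusion):
    """Rebuild N tokens as weighted averages of the merged tokens they fed into."""
    unmerge = fusion.T                                       # (N, N_d)
    unmerge = unmerge / unmerge.sum(axis=1, keepdims=True)   # rows sum to 1
    return unmerge @ merged


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
merged, fusion = mame_merge(x)
restored = mare_restore(merged, fusion)
print(merged.shape, restored.shape)  # (4, 4) (8, 4)
```

In this sketch, restoration is lossy: destination positions receive their merged token back exactly, while source positions receive a blend of the merged tokens they contributed to. The input is recovered exactly only when the merged tokens were duplicates to begin with.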

In practice

Apply MaMe to the last transformer block for accuracy gains.
Use a similarity threshold of 0.8 as a balance between speed and accuracy.
Integrate MaMe+MaRe for improved image synthesis quality.

Topics

Token Merging
Token Restoration
Vision Transformers
Computational Efficiency
Image Synthesis

Code references

cominder/mame

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.