MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

2026-04-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

MaMe (Matrix-Based Token Merging) is a training-free, differentiable token merging method that accelerates Vision Transformers (ViTs) by reducing the number of tokens passed to self-attention, whose cost grows quadratically with token count. Unlike existing methods such as ToMe, MaMe uses only GPU-friendly matrix operations and avoids sorting and scattered writes, which run inefficiently on GPUs. The authors also introduce MaRe, an inverse operation that restores the merged tokens, so that a MaMe+MaRe pipeline can be used for image synthesis. Applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop, and fine-tuning the last layer can raise ViT-B accuracy by 1.0% at 1.1x speed. It accelerates SigLIP2-B@512 zero-shot classification by 1.3x with negligible degradation, and VideoMAE-L by 48.5% on Kinetics-400 with a 0.84% accuracy loss. For image synthesis, the MaMe+MaRe pipeline reduces Stable Diffusion v2.1 generation latency by 31% while improving quality.

Key takeaway

For AI engineers optimizing Vision Transformer inference, MaMe offers a substantial speedup without retraining. Consider integrating it into existing ViT pipelines either to roughly double throughput at a small accuracy cost (about 2% on ViT-B) or, with last-layer fine-tuning, to gain accuracy while still running slightly faster (1.0% at 1.1x speed on ViT-B). This is most relevant for large models and latency-sensitive applications. The MaMe+MaRe pipeline also offers a route to faster image synthesis, cutting Stable Diffusion v2.1 generation latency by 31%.

Key insights

MaMe and MaRe offer GPU-efficient, matrix-based token merging and restoration for accelerating Vision Transformers and image synthesis.

Principles

Matrix operations enhance GPU efficiency.
Token merging cuts the quadratic cost of self-attention.
Inverse operations enable synthesis pipelines.
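
The second principle can be made concrete with a back-of-the-envelope count. The sketch below assumes the standard estimate that the two attention matrix products (QK^T and the attention-weighted sum over V) each cost about N^2 * d multiply-adds, and uses ViT-B/16 figures at 224x224 resolution (196 patch tokens, width 768). The function name is illustrative and does not come from the paper.

```python
def attention_matmul_cost(num_tokens: int, dim: int) -> int:
    """Approximate multiply-adds for QK^T plus attention @ V in one head-merged layer."""
    # Each of the two products touches num_tokens x num_tokens x dim entries.
    return 2 * num_tokens ** 2 * dim


full = attention_matmul_cost(196, 768)   # all patch tokens kept
half = attention_matmul_cost(98, 768)    # half the tokens merged away

print(full / half)  # 4.0
```

Halving the token count cuts this attention term by 4x, but the MLP and projection layers scale only linearly with token count, which is one reason end-to-end speedups are smaller than the attention-only ratio suggests.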

Method

MaMe performs training-free, differentiable token merging via matrix operations. MaRe is its inverse for token restoration, forming a pipeline for tasks like image synthesis.
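
To illustrate what "merging via matrix operations" can look like, here is a minimal NumPy sketch. It is not the authors' algorithm: the summary does not specify how MaMe builds its merge matrix, so the anchor selection (evenly strided tokens), the softmax over cosine similarities, the temperature value, and all function names are assumptions made for illustration. The point is only that merging and restoration can both be written as dense matrix products, with no sorting or scattered writes.

```python
import numpy as np


def merge_matrix(x, num_merged, temperature=0.1):
    """Build an (M, N) soft assignment of N tokens to M merged slots.

    Assumption: evenly strided tokens serve as merge targets ("anchors").
    """
    n = x.shape[0]
    anchor_idx = np.linspace(0, n - 1, num_merged).round().astype(int)
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)   # unit-norm tokens
    sim = xn[anchor_idx] @ xn.T                         # (M, N) cosine similarity
    logits = sim / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=0, keepdims=True)             # each column sums to 1


def merge(x, w):
    """Merge N tokens into M as weighted averages: one matrix product."""
    row_normalized = w / w.sum(axis=1, keepdims=True)
    return row_normalized @ x                           # (M, D)


def restore(merged, w):
    """Map M merged tokens back to N positions with the transposed weights."""
    return w.T @ merged                                 # (N, D)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))        # 8 tokens of width 4
w = merge_matrix(tokens, num_merged=4)
merged = merge(tokens, w)
restored = restore(merged, w)
print(merged.shape, restored.shape)     # (4, 4) (8, 4)
```

Because every step is a differentiable dense operation, gradients can flow through both the merge and the restore. Note that the restoration in this sketch is approximate: it recovers the original sequence length, not the original token values.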

In practice

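Apply MaMe to pre-trained ViTs for up to 2x throughput (about a 2% accuracy drop on ViT-B).
Fine-tune the last layer with MaMe for accuracy gains (+1.0% on ViT-B at 1.1x speed).
Use MaMe+MaRe for faster Stable Diffusion (31% lower generation latency on v2.1).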

Topics

MaMe
MaRe
Vision Transformers
Token Merging
Image Synthesis

Code references

cominder/mame

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.