Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras

2026-04-22 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Match-Any-Events introduces the first event matching model capable of zero-shot, cross-dataset wide-baseline correspondence for event cameras, outperforming previous methods by 37.7%. Event cameras excel at capturing instantaneous motion but struggle with wide-baseline matching, because the appearance of event data changes with motion and supervision is limited. The model uses a motion-robust, computationally efficient attention backbone that learns multi-timescale features from event streams and is enhanced by sparsity-aware event token selection. This design makes large-scale training on diverse wide-baseline supervision feasible. To address data scarcity, the researchers developed a robust event motion synthesis framework that generates large event-matching datasets with augmented viewpoints, modalities, and motions. Trained on the combined large-scale dataset, the model achieves state-of-the-art results in semi-dense matching and camera pose estimation on both in-domain and unseen test data without fine-tuning.

Key takeaway

For research scientists developing event-based vision systems, Match-Any-Events demonstrates a significant leap in zero-shot wide-baseline matching. Consider adopting its separable spatial-temporal attention and sparsity-aware token selection to improve generalization and computational efficiency in your own models. Synthetic data such as E-MegaDepth can also help overcome the scarcity of real-world wide-baseline supervision, enabling more robust and adaptable event camera applications.

Key insights

A new event matching model achieves zero-shot wide-baseline correspondence by combining an efficient architecture with diverse synthetic data.

Principles

Decouple spatial and temporal aggregation for efficiency.
Adaptively prune redundant tokens to reduce computational cost.
Synthesize diverse data for wide-baseline generalization.
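
The first two principles can be illustrated with a small NumPy sketch. It is not the authors' implementation: the attention below has no learned projections, and `prune_tokens` uses a plain event-count threshold as a stand-in, since this summary does not describe the actual selection rule. The function names and the `(T, N, D)` token layout are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head attention over the second-to-last axis, with no learned weights."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def separable_attention(tokens):
    """tokens: (T, N, D) = time bins x spatial tokens x channels (assumed layout)."""
    spatial = self_attention(tokens)                    # mixes the N tokens inside each time bin
    temporal = self_attention(spatial.swapaxes(0, 1))   # (N, T, D): mixes the T bins per location
    return temporal.swapaxes(0, 1)                      # back to (T, N, D)

def attention_pairs(T, N):
    """Number of query-key pairs scored by joint vs. separable attention."""
    joint = (T * N) ** 2               # every token attends to every other token
    separable = T * N * N + N * T * T  # spatial pass plus temporal pass
    return joint, separable

def prune_tokens(tokens, event_counts, min_events=1):
    """Drop spatial tokens whose patch received fewer than `min_events` events.

    event_counts: (T, N) events per time bin and spatial token.
    Hypothetical stand-in for a sparsity-aware selection rule.
    """
    keep = event_counts.sum(axis=0) >= min_events  # (N,) boolean mask
    return tokens[:, keep], np.flatnonzero(keep)
```

For T = 4 time bins and N = 100 spatial tokens, `attention_pairs(4, 100)` returns `(160000, 41600)`, so the separable form scores roughly a quarter of the query-key pairs in this setting, and pruning empty tokens shrinks N further.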

Method

The method uses a Temporal Aggregation Transformer with separable spatial-temporal attention and Sparsity-aware Event Token Selection (SETS) on logarithmically windowed event voxels, followed by a Mutual Nearest Neighbors (MNN) matching stage.
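
Two stages of this pipeline are simple enough to sketch in NumPy: building voxels over logarithmically spaced time windows, and the final MNN matching step. The sketch assumes nested windows that end at a reference time and double in length (`base_dt * 2**k`), signed polarity accumulation, and cosine similarity between descriptors. None of these details are confirmed by this summary, and all names and parameters are hypothetical.

```python
import numpy as np

def log_window_voxels(x, y, t, p, t_ref, height, width, base_dt=0.005, num_windows=4):
    """Accumulate signed event counts over windows (t_ref - base_dt * 2**k, t_ref].

    x, y: integer pixel coordinates; t: timestamps in seconds; p: polarity (+1 / -1).
    Assumption: windows are nested and double in duration with each index k.
    """
    voxels = np.zeros((num_windows, height, width), dtype=np.float32)
    for k in range(num_windows):
        start = t_ref - base_dt * 2 ** k
        m = (t > start) & (t <= t_ref)
        # np.add.at handles repeated pixel indices correctly (unbuffered add)
        np.add.at(voxels[k], (y[m], x[m]), p[m].astype(np.float32))
    return voxels

def mutual_nearest_neighbors(desc_a, desc_b, min_sim=0.0, eps=1e-8):
    """Return an (M, 2) array of index pairs (i, j) that are each other's best match.

    desc_a: (Na, D), desc_b: (Nb, D). Cosine similarity is an assumed choice.
    """
    a = desc_a / (np.linalg.norm(desc_a, axis=1, keepdims=True) + eps)
    b = desc_b / (np.linalg.norm(desc_b, axis=1, keepdims=True) + eps)
    sim = a @ b.T                      # (Na, Nb) similarity matrix
    nn_ab = sim.argmax(axis=1)         # best B index for every A descriptor
    nn_ba = sim.argmax(axis=0)         # best A index for every B descriptor
    idx_a = np.arange(sim.shape[0])
    mutual = nn_ba[nn_ab] == idx_a     # A's choice must choose A back
    keep = mutual & (sim[idx_a, nn_ab] >= min_sim)
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)
```

The mutual check discards one-sided matches, which is why MNN is a common final filter for semi-dense matchers; the `min_sim` threshold is an optional extra that is not taken from the source.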

In practice

Use E-MegaDepth for synthetic wide-baseline training.
Use the ECM dataset for real-world hetero-stereo evaluation.
Initialize from DINO pretrained weights for faster convergence.

Topics

Event Cameras
Zero-Shot Feature Matching
Wide-Baseline Correspondence
Spatiotemporal Transformers
Sparsity-aware Token Selection

Code references

spikelab-jhu/Match-Any-Events

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.