GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

GenEraser is a framework for generalizable, high-fidelity removal of objects and their associated effects from video. It targets two problems: complex spatiotemporal ambiguities, and the failure of spatial masks to capture effects that are only weakly correlated with the masked object. It also tackles the fundamental optimization conflict between high-level semantic generalization and precise pixel-level background preservation. The system pairs a Multi-Conditional Mixture-of-Experts (MC-MoE) with Bipartite Text guidance so that Diffusion Transformers can be fully exploited to identify complex effects. A Learnable Deep "CFG" Fusion (LD-CFG) mechanism adaptively balances the mask and text conditions, and a Decoupled Expert Architecture, comprising a Locator and a Preserver, mitigates the generalization-preservation trade-off. GenEraser is reported to outperform state-of-the-art methods by 2.16 dB on the ROSE Benchmark and 1.44 dB on VOR-Eval, and to generalize robustly in open-world scenarios.

Key takeaway

For Computer Vision Engineers building video editing tools, GenEraser offers an approach to removing objects together with the effects they cause. Its architecture combines multimodal guidance with a decoupled expert system to address the trade-off between semantic generalization and pixel-level background preservation. Consider adopting similar balanced text-mask guidance and decoupled processing in your own video pipelines when both fidelity and generalization matter.

Key insights

GenEraser enhances video object removal by integrating multimodal guidance and a decoupled architecture to manage complex effects and preserve backgrounds.

Principles

Spatial masks alone miss weakly correlated effects.
Textual guidance improves complex effect identification.
Semantic generalization conflicts with pixel preservation.
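
The first principle can be made concrete with a toy example. The sketch below is illustrative only: the 4x6 grids, the shadow layout, and the function name `uncovered_effect_fraction` are hypothetical and not taken from the paper. It counts how many of the pixels altered by an object (the object itself plus its shadow) fall outside the object's spatial mask.

```python
def uncovered_effect_fraction(changed, mask):
    """Return the fraction of changed pixels that the mask does not cover.

    `changed` and `mask` are equally sized 2D lists of 0/1 values.
    Returns 0.0 when no pixel is marked as changed.
    """
    total = 0
    missed = 0
    for changed_row, mask_row in zip(changed, mask):
        for c, m in zip(changed_row, mask_row):
            if c:
                total += 1
                if not m:
                    missed += 1
    return missed / total if total else 0.0


# Hypothetical frame: the object occupies a 2x2 block.
object_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Pixels that differ from the clean background: the object plus a shadow
# trailing to its lower right (made-up layout for illustration).
changed_pixels = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
]

fraction = uncovered_effect_fraction(changed_pixels, object_mask)
print(f"Changed pixels outside the mask: {fraction:.0%}")
```

In this made-up layout half of the altered pixels lie outside the mask, which is the kind of residue a mask-only method would leave behind and that additional guidance, such as text, is meant to catch.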

Method

GenEraser combines MC-MoE with Bipartite Text guidance, LD-CFG for balancing the mask and text conditions, and a Decoupled Expert Architecture (Locator and Preserver) to mitigate the trade-off between semantic generalization and pixel-level preservation.
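
This summary does not spell out how LD-CFG fuses the two conditions. As a point of reference only, the sketch below shows a common additive, fixed-weight form of two-condition classifier-free guidance applied to the model output, which a learnable, deep variant would presumably generalize. The function name `fuse_guidance`, the weights `w_mask` and `w_text`, and the toy vectors are all assumptions made for illustration, not the paper's formulation.

```python
def fuse_guidance(eps_uncond, eps_mask, eps_text, w_mask, w_text):
    """Additive two-condition classifier-free guidance (illustrative).

    Each eps_* argument is a flat list of floats standing in for a
    denoiser prediction: unconditional, mask-conditioned, and
    text-conditioned. Each condition pushes the result away from the
    unconditional prediction, scaled by its own weight.
    """
    return [
        u + w_mask * (m - u) + w_text * (t - u)
        for u, m, t in zip(eps_uncond, eps_mask, eps_text)
    ]


# Toy predictions (made-up values).
eps_uncond = [0.0, 2.0]
eps_mask = [1.0, 2.0]   # the mask condition shifts the first component
eps_text = [0.0, 4.0]   # the text condition shifts the second component

# Hand-set weights; in a learnable scheme these would be trained.
fused = fuse_guidance(eps_uncond, eps_mask, eps_text, w_mask=2.0, w_text=0.5)
print(fused)
```

With `w_mask=1.0` and `w_text=0.0` the function returns the mask-conditioned prediction unchanged, and with both weights at zero it returns the unconditional one. How GenEraser parameterizes the balance, and at which layers the fusion happens, is not stated here.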

In practice

Removes objects and associated physical effects.
Maintains fidelity in out-of-domain videos.
Improves generalization for complex scenarios.

Topics

Video Object Removal
Diffusion Transformers
Multi-Conditional Mixture-of-Experts
Text-Mask Guidance
Decoupled Architecture
Computer Vision

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.