GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver
Summary
GenEraser is a framework for generalized, high-fidelity removal of objects and their effects from video. It targets two problems: complex spatiotemporal ambiguities, and the failure of spatial masks to capture effects that are only weakly correlated with the object. Underlying both is an optimization conflict between high-level semantic generalization and precise pixel-level background preservation. The system introduces a Multi-Conditional Mixture-of-Experts (MC-MoE) paired with Bipartite Text guidance to fully exploit Diffusion Transformers for identifying complex effects. A Learnable Deep "CFG" Fusion (LD-CFG) mechanism adaptively balances mask and textual conditions. A Decoupled Expert Architecture, comprising a Locator and a Preserver, mitigates the conflict between generalization and preservation. GenEraser outperforms state-of-the-art methods, with reported margins of 2.16 dB on the ROSE Benchmark and 1.44 dB on VOR-Eval, and demonstrates robust generalization in open-world scenarios.
Key takeaway
For Computer Vision Engineers developing advanced video editing tools, GenEraser offers a robust solution for complex object and effect removal. Its novel architecture, combining multimodal guidance and a decoupled expert system, directly addresses the trade-off between semantic generalization and pixel-level background preservation. You should consider integrating similar balanced text-mask guidance and decoupled processing to achieve superior fidelity and generalization in your video processing pipelines.
Key insights
GenEraser enhances video object removal by integrating multimodal guidance and a decoupled architecture to manage complex effects and preserve backgrounds.
Principles
- Spatial masks alone miss weakly correlated effects.
- Textual guidance improves complex effect identification.
- Semantic generalization conflicts with pixel preservation.
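A toy example can make the first and third principles concrete. The sketch below is not GenEraser's method; it is plain mask compositing, the simplest way to guarantee pixel-level background preservation. The function name `composite` and the 1x4 frame values are hypothetical. It shows that when only the object's mask is used, the background is preserved exactly, but an effect outside the mask (here, a shadow) survives.

```python
def composite(original, generated, mask):
    """Per-pixel blend: use the generated pixel where mask is 1,
    keep the original pixel where mask is 0."""
    return [
        [m * g + (1 - m) * o for o, g, m in zip(orow, grow, mrow)]
        for orow, grow, mrow in zip(original, generated, mask)
    ]

# Hypothetical 1x4 grayscale frame.
# Pixel 1 is the object (0.1); pixel 2 is its shadow (0.5); background is 0.8.
original = [[0.8, 0.1, 0.5, 0.8]]
# Pretend a generator produced a clean background everywhere.
generated = [[0.8, 0.8, 0.8, 0.8]]
# The spatial mask covers only the object, not the shadow.
object_mask = [[0, 1, 0, 0]]

result = composite(original, generated, object_mask)
object_removed = result[0][1] == 0.8   # object pixel replaced by background
shadow_remains = result[0][2] == 0.5   # shadow pixel untouched by the mask
```

Widening the mask by hand would fix this toy case, but the summary's point is that such effects are often hard to delimit spatially, which is where textual guidance comes in.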
Method
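GenEraser combines three components: MC-MoE with Bipartite Text guidance to identify complex effects, LD-CFG to adaptively balance the mask and text conditions, and a Decoupled Expert Architecture (a Locator and a Preserver) to mitigate the trade-off between semantic generalization and pixel-level background preservation.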
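For readers unfamiliar with balancing two conditions in a diffusion model, the sketch below shows a common additive form of multi-condition classifier-free guidance with fixed scalar weights. It is an illustrative baseline only: this summary does not describe how LD-CFG is formulated, beyond saying that it balances the mask and text conditions adaptively rather than with hand-set weights. The function name `dual_cfg`, the weight names, and the toy vectors are all assumptions.

```python
def dual_cfg(eps_uncond, eps_mask, eps_text, w_mask, w_text):
    """Combine three noise predictions (flat lists of floats).

    Starts from the unconditional prediction and moves along the
    mask-conditioned and text-conditioned directions, each scaled
    by its own guidance weight.
    """
    return [
        u + w_mask * (m - u) + w_text * (t - u)
        for u, m, t in zip(eps_uncond, eps_mask, eps_text)
    ]

# Hypothetical two-element noise predictions.
eps_uncond = [0.0, 1.0]
eps_mask = [1.0, 1.0]   # mask condition changes only the first element
eps_text = [0.0, 3.0]   # text condition changes only the second element

# Zero weights fall back to the unconditional prediction.
baseline = dual_cfg(eps_uncond, eps_mask, eps_text, 0.0, 0.0)
# Hand-picked weights: strong mask guidance, weaker text guidance.
guided = dual_cfg(eps_uncond, eps_mask, eps_text, 2.0, 0.5)
```

With fixed weights, the same balance is applied to every video and every denoising step; a learned fusion, as LD-CFG is described, would presumably let that balance vary.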
In practice
- Removes objects and associated physical effects.
- Maintains fidelity in out-of-domain videos.
- Improves generalization for complex scenarios.
Topics
- Video Object Removal
- Diffusion Transformers
- Multi-Conditional Mixture-of-Experts
- Text-Mask Guidance
- Decoupled Architecture
- Computer Vision
Best for: Research Scientist, AI Scientist, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.