GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

GenEraser is a framework for generalizable, high-fidelity removal of objects and their effects from video. It targets two problems: complex spatiotemporal ambiguities, and the failure of spatial masks to capture effects that are only weakly correlated with the masked object. Underlying both is an optimization conflict between high-level semantic generalization and precise pixel-level background preservation. To identify complex effects, the system pairs a Multi-Conditional Mixture-of-Experts (MC-MoE) with Bipartite Text guidance so that it can fully exploit Diffusion Transformers. A Learnable Deep "CFG" Fusion (LD-CFG) mechanism adaptively balances the mask and text conditions. A Decoupled Expert Architecture, comprising a Locator and a Preserver, mitigates the trade-off between generalization and preservation. GenEraser is reported to outperform state-of-the-art methods by 2.16 dB on the ROSE Benchmark and 1.44 dB on VOR-Eval, with robust generalization in open-world scenarios.

Key takeaway

For computer vision engineers building video editing tools, GenEraser is a strong option for removing objects together with their effects. Its architecture combines multimodal (text and mask) guidance with a decoupled expert system, which directly addresses the trade-off between semantic generalization and pixel-level background preservation. Consider adopting similar balanced text-mask guidance and decoupled processing in your own video pipelines when both removal quality and background fidelity matter.
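As a rough illustration of decoupled processing in an existing pipeline, the sketch below separates "where to edit" from "what to keep" with a simple per-pixel composite. This is not GenEraser's Locator or Preserver, which the summary describes only as experts inside the model. The function names `composite` and `max_background_change`, and the nested-list frame format, are hypothetical and chosen for illustration.

```python
def composite(original, generated, locate_mask):
    """Blend one frame per pixel.

    Where locate_mask is 0 the original pixel is kept unchanged;
    where it is 1 the generated pixel is used. Values in between blend.
    All three arguments are equally sized 2D lists of numbers.
    """
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, locate_mask)
    ]


def max_background_change(original, output, locate_mask):
    """Largest absolute pixel change outside the edited region (mask == 0)."""
    changes = [
        abs(out - o)
        for o_row, out_row, m_row in zip(original, output, locate_mask)
        for o, out, m in zip(o_row, out_row, m_row)
        if m == 0
    ]
    return max(changes, default=0.0)


# Toy 2x2 frame: only the top-right pixel is marked for removal.
frame = [[1.0, 2.0], [3.0, 4.0]]
inpainted = [[9.0, 9.0], [9.0, 9.0]]
mask = [[0, 1], [0, 0]]
result = composite(frame, inpainted, mask)
```

A check such as `max_background_change` gives a quick way to confirm that pixels outside the located region are left untouched, which is the preservation half of the trade-off.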

Key insights

GenEraser enhances video object removal by integrating multimodal guidance and a decoupled architecture to manage complex effects and preserve backgrounds.

Method

GenEraser employs MC-MoE with Bipartite Text guidance, LD-CFG for condition balancing, and a Decoupled Expert Architecture (Locator, Preserver) to resolve semantic-pixel trade-offs.
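To make "condition balancing" concrete, here is a minimal sketch of generic two-condition classifier-free guidance with learnable scales. It is an assumption-laden illustration, not the LD-CFG formulation, which this summary does not specify. The names `fuse_guidance`, `mask_logit`, and `text_logit` are hypothetical, and flat lists of floats stand in for denoiser outputs.

```python
import math


def softplus(x: float) -> float:
    """Map an unconstrained learnable scalar to a positive guidance scale."""
    return math.log1p(math.exp(x))


def fuse_guidance(eps_uncond, eps_mask, eps_text, mask_logit, text_logit):
    """Combine unconditional, mask-conditioned and text-conditioned predictions.

    Each condition contributes its offset from the unconditional prediction,
    scaled by its own weight. In training, mask_logit and text_logit would be
    parameters updated by the optimizer; here they are plain floats.
    """
    w_mask = softplus(mask_logit)
    w_text = softplus(text_logit)
    return [
        u + w_mask * (m - u) + w_text * (t - u)
        for u, m, t in zip(eps_uncond, eps_mask, eps_text)
    ]


# Toy one-element "predictions" with both logits at zero.
fused = fuse_guidance([0.0], [1.0], [2.0], mask_logit=0.0, text_logit=0.0)
```

Raising `text_logit` relative to `mask_logit` pushes the output toward the text-conditioned prediction, and the reverse favors the mask. That is the balancing knob in this simplified form.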

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.