Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework
Summary
A new study introduces HOI-Edit, a benchmark for image editing that involves complex Human-Object Interactions (HOI). Existing benchmarks conflate HOI with static attributes and lack metrics for two things: whether the edited dynamic interaction is valid, and whether the entangled human-object pair is preserved. HOI-Edit is organized into three progressive cognitive levels and comes with an automated metric, HOI-Eval, which uses VLM Q&A to evaluate interactions at the instance level. The authors argue that Image-to-Video (I2V) models are inherently suited to dynamic HOI editing because they generate content over time, and that this temporal output also makes errors diagnosable. Building on this, the paper proposes SCPE (Self-Correcting Process Editing), an agentic self-correcting framework. SCPE constrains I2V generation with iteratively refined prompts so that the generated video depicts the target HOI more accurately; the final edited image is then extracted from that video. On the HOI-Edit benchmark, SCPE performs competitively with leading editing models such as Nano Banana. Code is available at https://github.com/oceanflowlab/HOI-Edit.
Key takeaway
For Computer Vision Engineers working on complex Human-Object Interaction (HOI) editing, methods built around static attributes and global metrics are inadequate. Consider Image-to-Video (I2V) models instead: their temporal generation is inherently suited to dynamic interactions, and the intermediate frames help diagnose where an edit went wrong. Pair them with an agentic self-correcting framework such as SCPE, which iteratively refines the prompt until the generated video depicts the target HOI. On the HOI-Edit benchmark, this approach performs competitively with leading editing models such as Nano Banana.
Key insights
A new benchmark (HOI-Edit) and an agentic self-correcting framework (SCPE) built on I2V models target complex Human-Object Interaction (HOI) image editing, with SCPE performing competitively with leading editing models on the benchmark.
Principles
- HOI editing needs dynamic interaction assessment.
- I2V models suit dynamic editing via temporal generation.
- Iterative prompt refinement constrains I2V generation.
Method
The SCPE framework constrains Image-to-Video (I2V) models with iteratively refined prompts so that the generated video accurately depicts the target HOI. The final edited image is a frame extracted from that video. Evaluation uses the HOI-Edit benchmark and its VLM Q&A metric, HOI-Eval.
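The generate, check, refine loop described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function `self_correcting_edit`, the `Verdict` type, and the three injected callables (`generate_video`, `judge_frame`, `refine_prompt`) are hypothetical names introduced here, and the stopping rule (return the first frame the judge accepts, give up after `max_rounds`) is an assumption. In a real pipeline the callables would wrap an I2V model and a VLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Placeholder frame type for the sketch; a real pipeline would use image arrays.
Frame = str


@dataclass
class Verdict:
    ok: bool       # True if the frame depicts the target interaction
    feedback: str  # textual diagnosis used to refine the next prompt


def self_correcting_edit(
    image: Frame,
    target_hoi: str,
    generate_video: Callable[[Frame, str], Sequence[Frame]],  # hypothetical I2V wrapper
    judge_frame: Callable[[Frame, str], Verdict],             # hypothetical VLM checker
    refine_prompt: Callable[[str, str], str],                 # hypothetical prompt rewriter
    max_rounds: int = 3,
) -> Tuple[Optional[Frame], List[str]]:
    """Return (accepted frame or None, list of prompts tried)."""
    prompt = target_hoi
    history: List[str] = []
    for _ in range(max_rounds):
        history.append(prompt)
        frames = generate_video(image, prompt)
        feedback = "no frames generated"
        for frame in frames:
            verdict = judge_frame(frame, target_hoi)
            if verdict.ok:
                # Assumed stopping rule: the first accepted frame is the edit.
                return frame, history
            feedback = verdict.feedback
        # No frame passed: fold the last diagnosis into the next prompt.
        prompt = refine_prompt(prompt, feedback)
    return None, history
```

Passing the models in as callables keeps the control flow testable with simple stubs and leaves the choice of I2V model and judge open.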
In practice
- Apply I2V models for dynamic image editing.
- Evaluate interactions using VLM Q&A metrics.
- Refine I2V prompts iteratively for precision.
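As an illustration of the second point, a VLM Q&A metric can be approximated by asking yes/no questions about an edited image and counting how many answers match the expected ones. The sketch below is an assumption about the general shape of such a metric, not the HOI-Eval definition from the paper: `qa_score`, the `ask_vlm` callable, and the example questions are all hypothetical.

```python
from typing import Callable, List, Tuple


def qa_score(
    image: object,
    questions: List[Tuple[str, bool]],     # (yes/no question, expected answer)
    ask_vlm: Callable[[object, str], str],  # hypothetical VLM wrapper returning text
) -> float:
    """Fraction of yes/no questions the VLM answers as expected."""
    if not questions:
        raise ValueError("at least one question is required")
    correct = 0
    for question, expected in questions:
        answer = ask_vlm(image, question).strip().lower()
        said_yes = answer.startswith("yes")  # anything else counts as "no"
        if said_yes == expected:
            correct += 1
    return correct / len(questions)


# Illustrative instance-level questions for one edited image (made up for this sketch).
example_questions = [
    ("Is the person holding the cup?", True),
    ("Is the person still sitting on the chair?", False),
    ("Is the cup the same cup as in the source image?", True),
]
```

Mixing questions about the new interaction with questions about the old interaction and the identity of the human-object pair is one way to check validity and preservation together.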
Topics
- Human-Object Interaction
- Image-to-Video Models
- Agentic AI
- Image Editing
- Benchmarking
- Prompt Engineering
Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.