Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

2026-06-17 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

A new study introduces HOI-Edit, a benchmark that targets a weakness of current image editing methods: complex Human-Object Interactions (HOI). Existing benchmarks conflate HOI with static attributes and lack metrics for two things: whether the edited interaction is dynamically valid, and whether the entangled human-object pair is preserved. HOI-Edit features three progressive cognitive levels and an automated metric, HOI-Eval, which uses VLM Q&A for reliable instance-level interaction evaluation. The research identifies Image-to-Video (I2V) models as inherently suited to dynamic HOI editing because of their temporal generation capabilities, which also provide unique error diagnosability. Building on this, the paper proposes SCPE (Self-Correcting Process Editing), an agentic self-correcting framework. SCPE constrains I2V generation through iteratively refined prompts so that the generated videos represent the target HOI more accurately; the final editing results are extracted from those videos. On the HOI-Edit benchmark, SCPE achieves performance competitive with leading editing models such as Nano Banana. Code is available at https://github.com/oceanflowlab/HOI-Edit.

Key takeaway

For computer vision engineers working on complex Human-Object Interaction (HOI) editing, methods built around static attributes and global metrics fall short. Consider Image-to-Video (I2V) models: their temporal generation capabilities suit dynamic interactions and also make errors diagnosable. An agentic self-correcting framework such as SCPE, which iteratively refines the prompts given to the I2V model, can produce more accurate HOI representations. The paper reports that this approach is competitive with leading editing models on the HOI-Edit benchmark.

Key insights

A new benchmark (HOI-Edit) and an agentic self-correcting framework (SCPE) built on I2V models target complex Human-Object Interaction (HOI) image editing; SCPE performs competitively with leading editing models on the benchmark.

Principles

HOI editing needs dynamic interaction assessment.
I2V models suit dynamic editing via temporal generation.
Iterative prompt refinement constrains I2V generation.

Method

The SCPE framework constrains Image-to-Video (I2V) generation with iteratively refined prompts so that the generated video depicts the target HOI more accurately. The final edited image is extracted from the generated video. Evaluation uses the HOI-Edit benchmark and its automated metric, HOI-Eval, which is based on VLM Q&A.
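
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the summary does not specify how SCPE critiques a video or rewrites a prompt, so `generate_video`, `critique`, `refine_prompt`, `max_rounds`, and `accept_score` are all hypothetical names and parameters that the caller must supply.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Frame = str  # placeholder image type; a real pipeline would use image arrays


@dataclass
class EditResult:
    frame: Frame   # frame chosen as the edited image
    score: float   # critic score for that frame, assumed to lie in [0, 1]
    prompt: str    # prompt that produced the chosen frame
    round: int     # 1-based round in which the chosen frame was generated


def self_correcting_edit(
    image: Frame,
    instruction: str,
    generate_video: Callable[[Frame, str], Sequence[Frame]],  # hypothetical I2V wrapper
    critique: Callable[[Sequence[Frame], str], Tuple[float, int, str]],  # hypothetical critic
    refine_prompt: Callable[[str, str], str],  # hypothetical prompt rewriter
    max_rounds: int = 3,
    accept_score: float = 0.9,
) -> EditResult:
    """Generate, critique, and refine until the critic accepts or rounds run out."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    prompt = instruction
    best: Optional[EditResult] = None
    for round_idx in range(1, max_rounds + 1):
        frames = generate_video(image, prompt)
        # The critic returns (score, index of the best frame, textual feedback).
        score, frame_idx, feedback = critique(frames, instruction)
        if best is None or score > best.score:
            best = EditResult(frames[frame_idx], score, prompt, round_idx)
        if score >= accept_score:
            break
        # Feed the diagnosed error back into the next prompt.
        prompt = refine_prompt(prompt, feedback)
    return best
```

Keeping the best-scoring frame across rounds, rather than the last one, is a design choice of this sketch: a refined prompt is not guaranteed to improve on an earlier attempt.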

In practice

Apply I2V models for dynamic image editing.
Evaluate interactions using VLM Q&A metrics.
Refine I2V prompts iteratively for precision.
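
As an illustration of the VLM Q&A idea, one simple way to score a single edited instance is to ask a set of yes/no questions and count how many answers match the expected ones. This is an assumption-laden sketch: the summary does not describe HOI-Eval's actual questions or aggregation, and `ask_vlm` is a hypothetical callable wrapping whatever vision-language model is available.

```python
from typing import Callable, Iterable, Tuple


def qa_score(
    image: object,
    checks: Iterable[Tuple[str, str]],     # (question, expected "yes"/"no") pairs
    ask_vlm: Callable[[object, str], str],  # hypothetical VLM wrapper returning free text
) -> float:
    """Return the fraction of questions whose answer starts with the expected word."""
    checks = list(checks)
    if not checks:
        raise ValueError("at least one check is required")
    correct = 0
    for question, expected in checks:
        answer = ask_vlm(image, question).strip().lower()
        if answer.startswith(expected.strip().lower()):
            correct += 1
    return correct / len(checks)
```

Pairing questions about the target interaction with questions about what should stay unchanged lets one score cover both interaction validity and preservation of the human-object pair.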

Topics

Human-Object Interaction
Image-to-Video Models
Agentic AI
Image Editing
Benchmarking
Prompt Engineering

Code references

oceanflowlab/HOI-Edit

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.