A Study of Failure Modes in Two-Stage Human-Object Interaction Detection
Summary
A study by researchers from The Ohio State University, National University of Singapore, Boston University, and the University of Mississippi investigates failure modes in two-stage Human-Object Interaction (HOI) detection models. Rather than relying on aggregate metrics such as mean Average Precision (mAP), the study decomposes HOI detection into interpretable perspectives and analyzes model behavior across different human-object-interaction configurations. The team curated a subset of images from the HICO-DET benchmark and organized them by factors such as multi-person interactions and object sharing. All evaluated models (ADA-CM, CMMP, HOLa, LAIN) perform worse in multi-person scenarios than in single-person scenes, and category C (multiple people, same object label, same interaction) shows consistently lower performance. Verb prediction errors emerged as a dominant failure type and often persist even at high confidence, suggesting that models struggle to distinguish semantically related interactions.
Key takeaway
Research scientists developing HOI detection models should prioritize fine-grained evaluation beyond mAP to uncover specific failure patterns. The main target is model performance in complex multi-person scenarios, particularly those involving multiple instances of the same object class. Addressing object-conditioned verb biases and strengthening instance-level representations, a direction suggested by the lower pairing error rates of HOLa and LAIN, will be crucial for building more robust and reliable HOI systems.
Key insights
HOI models struggle with multi-person scenes and object-conditioned verb biases, leading to persistent, high-confidence errors.
Principles
- Aggregate metrics obscure specific failure modes.
- Multi-person scenes are harder than single-person scenes for every model evaluated.
- Object-conditioned biases influence verb predictions.
Method
Decompose HOI detection into interpretable configurations (e.g., single/multi-person, object sharing, interaction consistency) and analyze error types (e.g., pairing, verb classification) on a curated dataset subset.
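The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the `Triplet` record, the three scene factors, and the error labels are hypothetical names, and predictions are matched to ground truth by instance IDs rather than by box overlap, which a real evaluation would use. The annotations are toy values.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    # Hypothetical record: one human-object pair with its interaction verb.
    human_id: int
    object_id: int
    object_label: str
    verb: str


def scene_config(gt):
    """Describe a scene by three assumed factors derived from its ground truth."""
    humans = {t.human_id for t in gt}
    humans_per_object = defaultdict(set)
    for t in gt:
        humans_per_object[t.object_id].add(t.human_id)
    shared = any(len(h) > 1 for h in humans_per_object.values())
    return (
        "multi_person" if len(humans) > 1 else "single_person",
        "shared_object" if shared else "no_shared_object",
        "same_verb" if len({t.verb for t in gt}) == 1 else "mixed_verbs",
    )


def error_type(pred, gt):
    """Classify one predicted triplet against the scene's ground truth."""
    if pred in gt:
        return "correct"
    # Right human-object pair but wrong verb: a verb classification error.
    if any(g.human_id == pred.human_id and g.object_id == pred.object_id for g in gt):
        return "verb_error"
    # Otherwise the human was linked to the wrong object instance.
    return "pairing_error"


def tally(scenes):
    """Count outcome types per scene configuration."""
    counts = defaultdict(Counter)
    for gt, preds in scenes:
        cfg = scene_config(gt)
        for p in preds:
            counts[cfg][error_type(p, gt)] += 1
    return counts


if __name__ == "__main__":
    # Toy scene: two people, each riding their own bicycle.
    gt = [Triplet(0, 10, "bicycle", "ride"), Triplet(1, 11, "bicycle", "ride")]
    preds = [Triplet(0, 11, "bicycle", "ride"), Triplet(1, 11, "bicycle", "hold")]
    for cfg, outcome in tally([(gt, preds)]).items():
        print(cfg, dict(outcome))
```

Reporting counts per configuration, instead of one pooled score, is what lets a weakness in a single scene type stand out.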
In practice
- Focus on multi-person scenarios for robust HOI model development.
- Address instance-level ambiguity when several objects of the same class appear in a scene.
- Mitigate object-conditioned verb biases in training data.
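One simple way to look for object-conditioned verb bias in training labels is to estimate how verbs are distributed for each object class. The sketch below is an assumed diagnostic, not a procedure taken from the study: the function names, the `threshold` parameter, and the toy (verb, object) pairs are all hypothetical.

```python
from collections import Counter, defaultdict


def verb_given_object(annotations):
    """Estimate P(verb | object) from (verb, object) label pairs."""
    counts = defaultdict(Counter)
    for verb, obj in annotations:
        counts[obj][verb] += 1
    return {
        obj: {v: n / sum(c.values()) for v, n in c.items()}
        for obj, c in counts.items()
    }


def dominant_verbs(annotations, threshold=0.5):
    """Return objects whose most frequent verb has a share above `threshold`."""
    flagged = {}
    for obj, probs in verb_given_object(annotations).items():
        verb, share = max(probs.items(), key=lambda kv: kv[1])
        if share > threshold:
            flagged[obj] = (verb, share)
    return flagged


if __name__ == "__main__":
    # Toy labels: "ride" dominates for bicycles, cups are balanced.
    labels = [("ride", "bicycle")] * 3 + [("hold", "bicycle")]
    labels += [("hold", "cup"), ("drink_with", "cup")]
    print(dominant_verbs(labels, threshold=0.6))
```

Objects flagged this way are candidates for rebalancing or reweighting, since a model can score well on them by predicting the dominant verb regardless of what the image shows.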
Topics
- Human-Object Interaction Detection
- Two-Stage HOI Models
- Failure Mode Analysis
- HICO-DET Dataset
- Multi-Person Interaction
Best for: Research Scientist, AI Scientist, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.