A Study of Failure Modes in Two-Stage Human-Object Interaction Detection

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

A study by researchers from The Ohio State University, National University of Singapore, Boston University, and the University of Mississippi investigates failure modes in two-stage Human-Object Interaction (HOI) detection models. Rather than relying on aggregate metrics such as mean Average Precision (mAP), the study decomposes HOI detection into interpretable perspectives and analyzes model behavior across different human-object-interaction configurations. The team curated a subset of images from the HICO-DET benchmark and organized them by factors such as multi-person interactions and object sharing. All four evaluated models (ADA-CM, CMMP, HOLa, LAIN) perform worse in multi-person scenes than in single-person scenes, and the configuration the authors label category C (multiple people, same object label, same interaction) shows consistently lower performance. Verb prediction errors emerged as a dominant failure type and often persist even at high confidence, which suggests that the models struggle to distinguish semantically related interactions.

Key takeaway

Research scientists developing HOI detection models should prioritize fine-grained evaluation beyond mAP to uncover specific failure patterns. Focus on improving model performance in complex multi-person scenarios, particularly those involving multiple instances of the same object class. Addressing object-conditioned verb biases and strengthening instance-level representations, a direction suggested by the lower pairing error rates of HOLa and LAIN, will be crucial for building more robust and reliable HOI systems.

Key insights

HOI models struggle with multi-person scenes and object-conditioned verb biases, leading to persistent, high-confidence errors.

Principles

Aggregate metrics obscure specific failure modes.
Multi-person scenes are inherently more challenging.
Object-conditioned biases influence verb predictions.
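
The first principle can be illustrated with a minimal sketch. It uses plain accuracy instead of mAP to keep the arithmetic visible, and the records are synthetic toy values, not results from the study. The function name accuracy_by_group and the group names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall accuracy, per-group accuracy).

    records: iterable of (group_name, is_correct) pairs.
    Accuracy stands in for mAP here purely for simplicity.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    per_group = {group: hits[group] / totals[group] for group in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_group


# Synthetic toy records (assumption): 10 single-person and 5 multi-person cases.
toy_records = (
    [("single_person", True)] * 9
    + [("single_person", False)] * 1
    + [("multi_person", True)] * 2
    + [("multi_person", False)] * 3
)

overall, per_group = accuracy_by_group(toy_records)
print(f"overall: {overall:.2f}")
for group, value in per_group.items():
    print(f"{group}: {value:.2f}")
```

In this toy case the single overall number sits between the two group scores and hides the fact that one group fails more often than it succeeds, which is the kind of pattern a per-configuration breakdown is meant to expose.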

Method

Decompose HOI detection into interpretable configurations (e.g., single/multi-person, object sharing, interaction consistency) and analyze error types (e.g., pairing, verb classification) on a curated dataset subset.
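
A minimal sketch of this kind of decomposition follows. It is not the authors' code: the Interaction record, the configuration flags, and the two error labels are assumptions chosen for illustration, and the sketch assumes that predicted boxes have already been matched to ground-truth instance IDs (the box-matching step is omitted).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One human-object-verb triplet (hypothetical schema)."""
    human_id: int
    object_id: int
    object_label: str
    verb: str


def describe_configuration(ground_truth):
    """Summarize an image's ground-truth triplets as boolean flags."""
    humans = {gt.human_id for gt in ground_truth}
    humans_per_object = {}
    for gt in ground_truth:
        humans_per_object.setdefault(gt.object_id, set()).add(gt.human_id)
    return {
        "multi_person": len(humans) > 1,
        "shared_object": any(len(h) > 1 for h in humans_per_object.values()),
        "same_object_label": len({gt.object_label for gt in ground_truth}) == 1,
        "same_interaction": len({gt.verb for gt in ground_truth}) == 1,
    }


def classify_prediction(prediction, ground_truth):
    """Label a prediction as correct, a pairing error, or a verb error."""
    verbs_for_pair = {
        gt.verb
        for gt in ground_truth
        if (gt.human_id, gt.object_id) == (prediction.human_id, prediction.object_id)
    }
    if not verbs_for_pair:
        return "pairing_error"  # this human and object do not interact in the ground truth
    if prediction.verb not in verbs_for_pair:
        return "verb_error"  # right pair, wrong interaction
    return "correct"


# Toy scene (assumption): two people, each riding their own bicycle.
scene = [
    Interaction(0, 10, "bicycle", "ride"),
    Interaction(1, 11, "bicycle", "ride"),
]
config = describe_configuration(scene)
wrong_pair = classify_prediction(Interaction(0, 11, "bicycle", "ride"), scene)
wrong_verb = classify_prediction(Interaction(1, 11, "bicycle", "hold"), scene)
```

The toy scene has multiple people, one object label, and one interaction, so it resembles the configuration the summary calls category C. Grouping per-image error labels by these flags is one way to produce the breakdown described above.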

In practice

Focus on multi-person scenarios for robust HOI model development.
Address instance-level ambiguity in object detection.
Mitigate object-conditioned verb biases in training data.
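
One hedged way to inspect object-conditioned verb bias in training data is to estimate how verbs are distributed for each object label. The sketch below is an illustration only: the function names are hypothetical, the annotations are synthetic, and the study may quantify bias differently.

```python
from collections import Counter, defaultdict


def verb_distribution_per_object(annotations):
    """Estimate P(verb | object) from (object_label, verb) pairs."""
    counts = defaultdict(Counter)
    for object_label, verb in annotations:
        counts[object_label][verb] += 1
    distribution = {}
    for object_label, verb_counts in counts.items():
        total = sum(verb_counts.values())
        distribution[object_label] = {
            verb: count / total for verb, count in verb_counts.items()
        }
    return distribution


def dominant_verb(distribution):
    """Return the most frequent verb and its share for each object label."""
    return {
        object_label: max(probs.items(), key=lambda item: item[1])
        for object_label, probs in distribution.items()
    }


# Synthetic annotations (assumption), not taken from HICO-DET.
toy_annotations = [
    ("bicycle", "ride"),
    ("bicycle", "ride"),
    ("bicycle", "ride"),
    ("bicycle", "hold"),
    ("cup", "hold"),
    ("cup", "drink_with"),
]

distribution = verb_distribution_per_object(toy_annotations)
dominant = dominant_verb(distribution)
```

An object whose dominant verb takes a very large share is a candidate for the bias described above, because a model can score well on it by predicting that verb regardless of the visual evidence.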

Topics

Human-Object Interaction Detection
Two-Stage HOI Models
Failure Mode Analysis
HICO-DET Dataset
Multi-Person Interaction

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.