HOI-aware Adaptive Network for Weakly-supervised Action Segmentation

2026-04-30 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

AdaAct is an HOI-aware adaptive network for weakly-supervised action segmentation. It targets the ambiguity between visually similar actions, such as "pouring juice" versus "pouring coffee." Whereas prior methods apply a fixed network to local frame features, AdaAct exploits temporally global but spatially local human-object interaction (HOI) as video-level prior knowledge, and adapts its parameters to the HOI sequence of each video at test time. It has two main components: a video HOI encoder that extracts, selects, and integrates representative HOI, and a two-branch HyperNetwork that learns an adaptive temporal encoder whose parameters are adjusted using both HOI-dependent and HOI-independent knowledge. On the Breakfast and 50Salads datasets, AdaAct reports state-of-the-art action segmentation results, with improvements of 1.4% MoF and 1.2% MoF-BG on Breakfast, and 0.9% MoF and 0.5% MoF-BG on 50Salads.
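
The two metrics quoted above can be made concrete with a short sketch. It assumes the usual convention in this literature: MoF (mean over frames) is frame-wise accuracy, and MoF-BG is the same quantity computed only over frames whose ground-truth label is not background. The function name `mof`, the label strings, and the toy sequences are illustrative and do not come from the paper, and benchmark protocols may aggregate frames across videos differently.

```python
from typing import FrozenSet, Sequence


def mof(pred: Sequence[str], gt: Sequence[str],
        ignore: FrozenSet[str] = frozenset()) -> float:
    """Frame-wise accuracy, skipping frames whose ground-truth label is in `ignore`."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    # Keep only the frames that count towards the metric.
    kept = [(p, g) for p, g in zip(pred, gt) if g not in ignore]
    if not kept:
        return 0.0
    return sum(p == g for p, g in kept) / len(kept)


# Toy example: five frames, with "bg" as the assumed background label.
gt = ["bg", "pour_juice", "pour_juice", "stir", "bg"]
pred = ["bg", "pour_coffee", "pour_juice", "stir", "stir"]

mof_all = mof(pred, gt)                             # 3 of 5 frames correct
mof_no_bg = mof(pred, gt, ignore=frozenset({"bg"}))  # 2 of 3 non-background frames correct
print(f"MoF = {mof_all:.3f}, MoF-BG = {mof_no_bg:.3f}")
```

In this toy case the "pour_coffee" versus "pour_juice" confusion costs one frame under both metrics, which is the kind of error the HOI prior is meant to reduce.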

Key takeaway

For research scientists developing weakly-supervised action segmentation models, AdaAct shows that dynamic, HOI-aware context improves performance, particularly when distinguishing ambiguous actions. Consider a two-branch HyperNetwork that adapts the temporal encoder's parameters using both video-specific HOI and general characteristics of instructional videos; this approach yields state-of-the-art results on challenging datasets such as Breakfast and 50Salads.

Key insights

Leveraging human-object interaction (HOI) context dynamically improves weakly-supervised action segmentation accuracy for ambiguous actions.

Principles

Global HOI context resolves local action ambiguity.
Input-adaptive networks can outperform fixed-parameter models on diverse video content.
Combine HOI-dependent and HOI-independent knowledge for robustness.

Method

AdaAct pairs a video HOI encoder, which extracts, selects, and integrates representative HOI, with a two-branch HyperNetwork that dynamically adapts the parameters of a GRU-based temporal encoder using both HOI-dependent knowledge and general, HOI-independent video characteristics.
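
The idea of generating encoder parameters from an HOI embedding can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's architecture: the two branches are combined by simple addition, the HOI-dependent branch is a single linear map, a plain tanh recurrent cell stands in for the GRU, and all weights are random and untrained. The names `TwoBranchHyperNet`, `generate`, and `encode` are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class TwoBranchHyperNet:
    """Generates the weights of a tiny recurrent cell from an HOI embedding."""

    def __init__(self, hoi_dim: int, in_dim: int, hid_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.in_dim, self.hid_dim = in_dim, hid_dim
        n_params = hid_dim * (in_dim + hid_dim)
        # HOI-dependent branch: linear map from the HOI embedding to a weight offset.
        self.dep: Matrix = [[rng.uniform(-0.1, 0.1) for _ in range(hoi_dim)]
                            for _ in range(n_params)]
        # HOI-independent branch: weights shared by every video.
        self.indep: Vector = [rng.uniform(-0.1, 0.1) for _ in range(n_params)]

    def generate(self, hoi_embedding: Vector) -> Matrix:
        """Return a (hid_dim x (in_dim + hid_dim)) weight matrix for one video."""
        delta = matvec(self.dep, hoi_embedding)
        # Assumption: the two branches are merged by addition.
        flat = [shared + offset for shared, offset in zip(self.indep, delta)]
        cols = self.in_dim + self.hid_dim
        return [flat[r * cols:(r + 1) * cols] for r in range(self.hid_dim)]


def encode(frames: List[Vector], weights: Matrix, hid_dim: int) -> List[Vector]:
    """Run a tanh recurrent cell (a simplified stand-in for a GRU) over the frames."""
    h = [0.0] * hid_dim
    states = []
    for x in frames:
        h = [math.tanh(a) for a in matvec(weights, x + h)]  # input and state concatenated
        states.append(h)
    return states


hyper = TwoBranchHyperNet(hoi_dim=3, in_dim=2, hid_dim=4)
frames = [[0.5, -0.2], [0.1, 0.3], [0.0, 0.9]]

# Same frame features, two different (made-up) video-level HOI embeddings.
out_juice = encode(frames, hyper.generate([1.0, 0.0, 0.0]), hid_dim=4)
out_coffee = encode(frames, hyper.generate([0.0, 1.0, 0.0]), hid_dim=4)
print(len(out_juice), len(out_juice[0]))  # 3 4
```

The point of the sketch is that identical frame features are encoded differently depending on the video-level HOI embedding, while the shared branch keeps a common baseline across videos.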

In practice

Use pre-trained HOI detectors for video frame analysis.
Implement video-NMS to select top-K representative HOI bounding boxes.
Employ a ViT-based network to integrate HOI embeddings.
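
The selection step in the list above can be illustrated as follows. The summary does not specify how video-NMS works, so this sketch assumes one plausible reading: pool HOI detections from all frames, visit them in descending score order, and greedily keep a box only if it does not overlap an already kept box too heavily, stopping at K boxes. The `Detection` type, the `video_nms` signature, and the 0.5 IoU threshold are hypothetical.

```python
from typing import List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


class Detection(NamedTuple):
    frame: int    # index of the frame the detection came from
    box: Box      # HOI bounding box
    score: float  # detector confidence


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def video_nms(dets: List[Detection], k: int, iou_thresh: float = 0.5) -> List[Detection]:
    """Greedy NMS over detections pooled from the whole video, keeping at most k."""
    if k <= 0:
        return []
    kept: List[Detection] = []
    for d in sorted(dets, key=lambda det: det.score, reverse=True):
        # Keep a detection only if it is not a near-duplicate of a kept one.
        if all(iou(d.box, s.box) < iou_thresh for s in kept):
            kept.append(d)
            if len(kept) == k:
                break
    return kept


dets = [
    Detection(frame=0, box=(10, 10, 50, 50), score=0.9),
    Detection(frame=1, box=(12, 11, 52, 49), score=0.8),      # near-duplicate of frame 0
    Detection(frame=2, box=(100, 100, 140, 150), score=0.7),
    Detection(frame=3, box=(60, 10, 90, 40), score=0.6),
]
top = video_nms(dets, k=2)
print([d.frame for d in top])  # [0, 2]
```

Comparing boxes across frames treats spatially redundant detections of the same interaction as duplicates, so the K retained boxes cover distinct regions rather than repeating the highest-scoring one.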

Topics

Weakly-supervised Action Segmentation
Human-Object Interaction
Adaptive Networks
HyperNetwork Architecture
Video HOI Encoder

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.