HOI-aware Adaptive Network for Weakly-supervised Action Segmentation
Summary
AdaAct is an HOI-aware adaptive network for weakly-supervised action segmentation. It targets the ambiguity between visually similar actions such as "pouring juice" and "pouring coffee." Prior methods rely on fixed networks and local frame features. AdaAct instead uses human-object interaction (HOI) that is temporally global but spatially local as video-level prior knowledge, and at test time it adapts its parameters to the HOI sequence of the given video. It has two main components: a video HOI encoder that extracts, selects, and integrates representative HOI, and a two-branch HyperNetwork that learns an adaptive temporal encoder. The encoder adjusts its parameters automatically using both HOI-dependent and HOI-independent knowledge. Experiments on the Breakfast and 50Salads datasets show state-of-the-art action segmentation results, with improvements of 1.4% MoF and 1.2% MoF-BG on Breakfast, and 0.9% MoF and 0.5% MoF-BG on 50Salads.
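For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how they are commonly computed. MoF (mean over frames) is frame-level accuracy, and MoF-BG is the same accuracy with background frames left out. The function name `mof` and the label string `"background"` are illustrative assumptions, not taken from the paper.

```python
def mof(predicted, ground_truth, background=None):
    """Frame-level accuracy. If `background` is given, frames whose
    ground-truth label equals it are excluded (the MoF-BG variant)."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    pairs = [
        (p, g)
        for p, g in zip(predicted, ground_truth)
        if background is None or g != background
    ]
    if not pairs:
        return 0.0
    return sum(p == g for p, g in pairs) / len(pairs)


# Hypothetical per-frame labels for a five-frame video.
predicted = ["background", "pour", "pour", "stir", "background"]
ground_truth = ["background", "pour", "stir", "stir", "stir"]

print(mof(predicted, ground_truth))                           # MoF
print(mof(predicted, ground_truth, background="background"))  # MoF-BG
```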
Key takeaway
For research scientists developing weakly-supervised action segmentation models, AdaAct demonstrates that incorporating dynamic, HOI-aware contextual information significantly improves performance, especially for distinguishing ambiguous actions. You should consider integrating a two-branch HyperNetwork architecture to adapt temporal encoder parameters based on both video-specific HOI and general instructional video characteristics, as this approach yields state-of-the-art results on challenging datasets like Breakfast and 50Salads.
Key insights
Dynamically leveraging human-object interaction (HOI) context improves weakly-supervised action segmentation accuracy, particularly for ambiguous actions.
Principles
- Global HOI context resolves local action ambiguity.
- Adaptive networks outperform fixed models for diverse video content.
- Combine HOI-dependent and HOI-independent knowledge for robustness.
Method
AdaAct combines a video HOI encoder, which extracts, selects, and integrates representative HOI, with a two-branch HyperNetwork that dynamically adapts the parameters of a GRU-based temporal encoder to both the video's HOI and general video characteristics.
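The idea of generating a temporal encoder's weights from an HOI embedding can be sketched as below. This is a toy illustration under stated assumptions, not the paper's architecture: the class name `TwoBranchHyperGRU` is hypothetical, the HOI-independent branch is modeled as a shared parameter vector, the HOI-dependent branch as a single linear map from the HOI embedding to a parameter offset, and the two are combined by addition. Biases are omitted and the weights are random rather than trained.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TwoBranchHyperGRU:
    """Toy GRU whose gate weights are produced per video (assumed design)."""

    def __init__(self, in_dim, hid_dim, hoi_dim, seed=0):
        rng = random.Random(seed)
        self.in_dim, self.hid_dim = in_dim, hid_dim
        # Three gates (update, reset, candidate), each hid_dim x (in_dim + hid_dim).
        n_params = 3 * hid_dim * (in_dim + hid_dim)
        # HOI-independent branch: parameters shared by every video.
        self.shared = [rng.uniform(-0.1, 0.1) for _ in range(n_params)]
        # HOI-dependent branch: linear map from HOI embedding to an offset.
        self.hyper = [
            [rng.uniform(-0.1, 0.1) for _ in range(hoi_dim)]
            for _ in range(n_params)
        ]

    def generate(self, hoi_embedding):
        """Return (W_update, W_reset, W_candidate) for one video."""
        offset = matvec(self.hyper, hoi_embedding)
        flat = [s + d for s, d in zip(self.shared, offset)]
        width = self.in_dim + self.hid_dim
        rows = [flat[i * width:(i + 1) * width] for i in range(3 * self.hid_dim)]
        h = self.hid_dim
        return rows[:h], rows[h:2 * h], rows[2 * h:]

    def encode(self, frames, hoi_embedding):
        """Run the generated GRU over a list of frame feature vectors."""
        w_z, w_r, w_n = self.generate(hoi_embedding)
        hidden = [0.0] * self.hid_dim
        outputs = []
        for x in frames:
            z = [sigmoid(v) for v in matvec(w_z, x + hidden)]
            r = [sigmoid(v) for v in matvec(w_r, x + hidden)]
            gated = [ri * hi for ri, hi in zip(r, hidden)]
            n = [math.tanh(v) for v in matvec(w_n, x + gated)]
            hidden = [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, hidden)]
            outputs.append(hidden)
        return outputs


model = TwoBranchHyperGRU(in_dim=4, hid_dim=3, hoi_dim=2)
frames = [[0.1, 0.2, 0.3, 0.4] for _ in range(5)]
# The same frames are encoded differently under two different HOI embeddings.
states_a = model.encode(frames, hoi_embedding=[1.0, 0.0])
states_b = model.encode(frames, hoi_embedding=[0.0, 1.0])
print(states_a[-1])
print(states_b[-1])
```

The point of the sketch is that identical frame features produce different hidden states once the HOI embedding changes, which is how a video-level prior could help separate locally similar actions.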
In practice
- Use pre-trained HOI detectors for video frame analysis.
- Implement video-NMS to select top-K representative HOI bounding boxes.
- Employ a ViT-based network to integrate HOI embeddings.
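The selection step in the list above can be sketched as a greedy non-maximum suppression over detections pooled from every frame of a video. This is one plausible reading of "video-NMS," not the paper's exact procedure: the function names, the dictionary layout of a detection, and the 0.5 IoU threshold are all assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def video_nms(detections, top_k, iou_threshold=0.5):
    """Keep up to top_k high-scoring, mutually non-overlapping HOI boxes,
    comparing boxes across frames rather than within a single frame."""
    ranked = sorted(detections, key=lambda d: d["score"], reverse=True)
    kept = []
    for det in ranked:
        if len(kept) == top_k:
            break
        if all(iou(det["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


# Hypothetical detections from a pre-trained HOI detector.
detections = [
    {"frame": 0, "box": (10, 10, 50, 50), "score": 0.9},
    {"frame": 1, "box": (12, 12, 52, 52), "score": 0.8},  # near-duplicate of frame 0
    {"frame": 5, "box": (100, 100, 140, 140), "score": 0.7},
    {"frame": 9, "box": (200, 50, 240, 90), "score": 0.6},
]
for det in video_nms(detections, top_k=2):
    print(det["frame"], det["score"])
```

The kept boxes would then be embedded and passed to the integration network. Suppressing near-duplicates across frames keeps the K slots for distinct interactions instead of repeated views of the same one.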
Topics
- Weakly-supervised Action Segmentation
- Human-Object Interaction
- Adaptive Networks
- HyperNetwork Architecture
- Video HOI Encoder
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.