GRACE: Boosting Video MLLMs with Grounded Action-Centric Evidence for Viewer Sentiment Prediction
Summary
GRACE is a grounded action-centric evidence augmentation framework that improves video Multimodal Large Language Models (MLLMs) at viewer sentiment prediction for video advertisements. Standard MLLMs rely on holistic frame representations, so they struggle with fine-grained, affect-relevant events. GRACE addresses this by extracting temporally ordered subject-verb-object (SVO) triplets and auxiliary visible textual cues from action-centric video descriptions. It then grounds the subject and object entities as visual crops, so the MLLM can perform clue-enhanced emotional reasoning. The triplets clarify "what happens", and the crops anchor "who or what participates" to concrete visual evidence. Experiments show consistent improvements over Qwen2.5-VL and Qwen3-VL baselines on the Pitts dataset, with further validation on AdsQA and an emotion-focused TVQA subset.
Key takeaway
Machine Learning Engineers building video sentiment analysis models should consider grounded action-centric evidence frameworks like GRACE. The approach gives the MLLM explicit event structure and localized visual cues instead of holistic frame representations alone, and the reported gains over the Qwen2.5-VL and Qwen3-VL baselines are consistent. Adding SVO triplet extraction and visual entity grounding may help a model infer nuanced viewer emotions from video content.
Key insights
Video MLLMs predict viewer sentiment better by using grounded action-centric evidence and explicit event structures.
Principles
- Explicit event structure improves MLLM emotional reasoning.
- Grounding entities visually anchors abstract events.
- Action-centric descriptions enhance clue extraction.
Method
GRACE extracts temporally ordered SVO triplets and visible textual cues from video descriptions. It grounds subject/object entities as visual crops, then feeds these structured clues to MLLMs for enhanced emotional reasoning.
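The clue-building step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `SVOTriplet` fields, the `build_clue_prompt` helper, the crop identifiers, and the prompt wording are all assumptions made for this example, since the summary does not specify GRACE's actual data format or prompts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SVOTriplet:
    """One action-centric event. All field names are hypothetical."""
    start_sec: float                    # when the event starts in the video
    subject: str
    verb: str
    obj: str
    subject_crop: Optional[str] = None  # id of the grounded subject crop, if any
    object_crop: Optional[str] = None   # id of the grounded object crop, if any


def _with_crop(name: str, crop_id: Optional[str]) -> str:
    # Tag an entity with its crop id so the text can refer to the image crop.
    return f"{name} [{crop_id}]" if crop_id else name


def build_clue_prompt(triplets: List[SVOTriplet], visible_text: List[str]) -> str:
    """Format temporally ordered triplets and visible text as a text clue block."""
    ordered = sorted(triplets, key=lambda t: t.start_sec)
    lines = ["Events (in temporal order):"]
    for i, t in enumerate(ordered, start=1):
        subj = _with_crop(t.subject, t.subject_crop)
        obj = _with_crop(t.obj, t.object_crop)
        lines.append(f"{i}. {subj} {t.verb} {obj}")
    if visible_text:
        lines.append("Visible text: " + "; ".join(visible_text))
    # Assumed question wording; the real prompt is not given in the summary.
    lines.append("Question: What sentiment is this video likely to evoke in viewers?")
    return "\n".join(lines)


if __name__ == "__main__":
    events = [
        SVOTriplet(5.0, "child", "hugs", "dog", subject_crop="crop_2"),
        SVOTriplet(1.0, "woman", "opens", "box"),
    ]
    print(build_clue_prompt(events, ["SALE"]))
```

In a real pipeline the resulting text would be passed to the MLLM together with the video frames and the referenced crops; how GRACE interleaves them is not described here.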
In practice
- Augment video MLLMs with SVO triplets.
- Use visual entity crops for grounding.
- Apply to video ad sentiment analysis.
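As a rough illustration of the entity-crop step in the list above, the sketch below cuts a bounding box out of a frame. It assumes a grounding model has already produced pixel boxes as `(x0, y0, x1, y1)`; the `crop_entity` helper and the nested-list frame representation are hypothetical simplifications, and a real pipeline would more likely operate on image arrays.

```python
from typing import List, Tuple

Frame = List[List[int]]          # rows of pixel values (simplified, single channel)
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def crop_entity(frame: Frame, box: Box) -> Frame:
    """Return the region of `frame` inside `box`, clamped to the frame bounds."""
    height = len(frame)
    width = len(frame[0]) if height else 0
    x0, y0, x1, y1 = box
    # Clamp each coordinate so boxes that spill past the frame stay valid.
    x0, x1 = max(0, min(x0, width)), max(0, min(x1, width))
    y0, y1 = max(0, min(y0, height)), max(0, min(y1, height))
    if x1 <= x0 or y1 <= y0:
        return []  # box lies entirely outside the frame or is degenerate
    return [row[x0:x1] for row in frame[y0:y1]]


if __name__ == "__main__":
    # A 4x4 toy frame whose pixel value encodes its position.
    frame = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(crop_entity(frame, (1, 1, 3, 3)))
```

Clamping matters because detector boxes often extend slightly past the frame edge; without it, a slice could silently return a smaller or empty crop in an unexpected place.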
Topics
- Video Sentiment Prediction
- Multimodal Large Language Models
- Action-Centric Evidence
- Subject-Verb-Object (SVO) Triplets
- Visual Grounding
- Video Advertisements
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.