GRACE: Boosting Video MLLMs with Grounded Action-Centric Evidence for Viewer Sentiment Prediction

2026-06-15 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

GRACE, a grounded action-centric evidence augmentation framework, improves video Multimodal Large Language Models (MLLMs) at predicting viewer sentiment for video advertisements. Standard MLLMs struggle with fine-grained, affect-relevant events because they rely on holistic frame representations. GRACE addresses this by extracting temporally ordered subject-verb-object (SVO) triplets and auxiliary visible textual cues from action-centric video descriptions. It then grounds the subject and object entities as visual crops, so the MLLM can perform clue-enhanced emotional reasoning. The triplets clarify "what happens", and the crops anchor "who or what participates" to concrete visual evidence. Experiments show consistent improvements over Qwen2.5-VL and Qwen3-VL baselines on the Pitts dataset, with further validation on AdsQA and an emotion-focused TVQA subset.

Key takeaway

Machine learning engineers building video sentiment analysis models should consider adding grounded action-centric evidence, as GRACE does. The approach improves MLLM performance by supplying explicit event structure and localized visual cues rather than relying only on holistic frame representations. Extracting SVO triplets and grounding their entities visually can help a model infer nuanced viewer emotions from video content.

Key insights

Video MLLMs predict viewer sentiment better by using grounded action-centric evidence and explicit event structures.

Principles

Explicit event structure improves MLLM emotional reasoning.
Grounding entities visually anchors abstract events.
Action-centric descriptions enhance clue extraction.

Method

GRACE extracts temporally ordered SVO triplets and visible textual cues from video descriptions. It grounds subject/object entities as visual crops, then feeds these structured clues to MLLMs for enhanced emotional reasoning.
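
To make the textual half of this pipeline concrete, here is a minimal sketch of how temporally ordered SVO triplets and visible text might be turned into a clue block for an MLLM prompt. This is an illustration only: the `Triplet` fields, the `build_clue_prompt` helper, and the prompt wording are all assumptions, since the summary does not specify GRACE's actual data format or prompt template.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Triplet:
    """One action-centric event. All field names are hypothetical."""
    start_sec: float            # when the action begins in the video
    subject: str
    verb: str
    obj: Optional[str] = None   # some actions have no object


def build_clue_prompt(triplets: List[Triplet], visible_text: List[str]) -> str:
    """Format triplets (sorted by time) and on-screen text as a clue block."""
    ordered = sorted(triplets, key=lambda t: t.start_sec)
    lines = ["Events in temporal order:"]
    for i, t in enumerate(ordered, start=1):
        parts = [t.subject, t.verb] + ([t.obj] if t.obj else [])
        lines.append(f"{i}. [{t.start_sec:.1f}s] " + " ".join(parts))
    if visible_text:
        lines.append("Visible text: " + "; ".join(visible_text))
    # Hypothetical question wording, not taken from the paper.
    lines.append("Question: what sentiment is this ad likely to evoke in viewers?")
    return "\n".join(lines)


if __name__ == "__main__":
    events = [
        Triplet(5.0, "child", "hugs", "dog"),
        Triplet(1.5, "dog", "runs"),
    ]
    print(build_clue_prompt(events, ["Adopt today"]))
```

Sorting by start time before formatting keeps the "what happens" narrative in order even if the upstream extractor emits events out of sequence.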

In practice

Augment video MLLMs with SVO triplets.
Use visual entity crops for grounding.
Apply to video ad sentiment analysis.
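
The visual-crop step in the list above can be sketched as a bounding-box crop. This example assumes a detector or grounding model has already produced a box for the subject or object entity; the `crop_entity` helper, the `(x1, y1, x2, y2)` box convention, and the nested-list frame are stand-ins chosen to keep the sketch dependency-free, not GRACE's actual implementation.

```python
from typing import List, Tuple

Frame = List[List[int]]          # rows of pixel values; stand-in for an image array
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), with x2 and y2 exclusive


def crop_entity(frame: Frame, box: Box) -> Frame:
    """Return the region of `frame` inside `box`, clamped to the frame bounds."""
    height = len(frame)
    width = len(frame[0]) if height else 0
    x1, y1, x2, y2 = box
    # Clamp so boxes that spill past the frame edge still yield a valid crop.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    if x1 >= x2 or y1 >= y2:
        return []  # box lies entirely outside the frame
    return [row[x1:x2] for row in frame[y1:y2]]


if __name__ == "__main__":
    # A tiny 3x4 "frame" whose pixel value encodes its row and column.
    frame = [[r * 10 + c for c in range(4)] for r in range(3)]
    print(crop_entity(frame, (1, 0, 3, 2)))
```

In a real pipeline the same slicing would apply to an image array, and each crop would be passed to the MLLM alongside the triplet that names the entity.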

Topics

Video Sentiment Prediction
Multimodal Large Language Models
Action-Centric Evidence
Subject-Verb-Object (SVO) Triplets
Visual Grounding
Video Advertisements

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.