A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs
Summary
This study systematically evaluates positional bias in Multimodal Large Language Models (MLLMs) performing multi-video summarization. The bias appears as a change in per-video summary quality that depends on the input slot a video occupies, even when its content is identical. The researchers built a benchmark from ActivityNet and News videos, covering the Cooking, Domestic, Leisure, and News domains with two- and four-video inputs. Nine open-source and proprietary MLLMs were assessed using three metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). The results indicate that positional effects vary by domain and model, and that the signed directional bias can be small even when videos in middle positions are summarized worse than those at the edges. Increasing the visual or generation budget did not consistently remove this imbalance. The study concludes that multi-video summarization remains sensitive to input protocol and position, which highlights the need for more robust, order-invariant multimodal systems.
Key takeaway
If you are a machine learning engineer designing or deploying MLLMs for multi-video summarization, expect input order to influence per-video summary quality. Your models may exhibit positional bias that varies by domain and by model, and added visual or generation budget does not reliably remove it. Test for this sensitivity, explore prompt-level mitigation strategies, and prefer MLLMs that hold up under reordering so that output stays consistent across diverse video inputs.
Key insights
Positional bias affects the quality of MLLM multi-video summarization, and the effect varies by model and domain.
Principles
- Positional effects in MLLM multi-video summarization are domain- and model-dependent.
- Increasing visual or generation budget does not uniformly remove positional imbalance.
Method
Nine open-source and proprietary MLLMs were evaluated on a benchmark built from ActivityNet and News videos, with two- and four-video inputs, using the Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG) metrics.
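This summary does not give formulas for DPB or MEG, so the sketch below uses assumed definitions purely for illustration: DPB as mean Coverage in the earlier slots minus mean Coverage in the later slots, and MEG as mean Coverage at the two edge slots minus mean Coverage in the middle slots. The function names and both formulas are hypothetical and may differ from the paper's.

```python
from statistics import mean


def directional_positional_bias(coverage_by_slot):
    """Assumed definition (not from the paper): mean coverage of the
    earlier half of the slots minus mean coverage of the later half.
    Positive values favour early slots, negative values favour late slots."""
    n = len(coverage_by_slot)
    if n < 2:
        raise ValueError("need at least two slots")
    half = n // 2
    return mean(coverage_by_slot[:half]) - mean(coverage_by_slot[n - half:])


def middle_edge_gap(coverage_by_slot):
    """Assumed definition (not from the paper): mean coverage of the first
    and last slots minus mean coverage of the slots in between."""
    if len(coverage_by_slot) < 3:
        raise ValueError("need at least one middle slot")
    edges = [coverage_by_slot[0], coverage_by_slot[-1]]
    middle = coverage_by_slot[1:-1]
    return mean(edges) - mean(middle)


# Illustrative, made-up per-slot coverage for a four-video input.
example = [0.8, 0.6, 0.6, 0.8]
dpb = directional_positional_bias(example)
meg = middle_edge_gap(example)
```

Under these assumed definitions, the made-up example has a DPB of zero and an MEG of roughly 0.2, which mirrors the reported pattern of a small signed bias alongside underperforming middle positions.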
In practice
- Test MLLMs for positional bias in multi-video tasks by reordering inputs while keeping the content fixed.
- Evaluate prompt-level mitigation methods for input order sensitivity.
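One way to run such a test is to feed the same set of videos in every order and average a per-video quality score by input slot. The sketch below is a minimal illustration and not the paper's protocol: `summarize` and `score` are hypothetical callables you would supply (a model call and a per-video coverage scorer), and the toy stand-ins at the bottom exist only to make the example runnable.

```python
from collections import defaultdict
from itertools import permutations
from statistics import mean


def coverage_by_slot(videos, summarize, score):
    """Average a per-video score at each input slot over all orderings.

    videos    : list of video identifiers (any hashable objects)
    summarize : hypothetical callable, ordered list of videos -> summary
    score     : hypothetical callable, (summary, video) -> float
    """
    per_slot = defaultdict(list)
    for order in permutations(videos):
        summary = summarize(list(order))
        for slot, video in enumerate(order):
            per_slot[slot].append(score(summary, video))
    return [mean(per_slot[slot]) for slot in range(len(videos))]


# Toy stand-ins: a "model" that only describes the first video it sees,
# and a scorer that checks whether a video is mentioned in the summary.
def toy_summarize(order):
    return order[0]


def toy_score(summary, video):
    return 1.0 if video in summary else 0.0


slots = coverage_by_slot(["a", "b", "c"], toy_summarize, toy_score)
```

Because every video appears in every slot equally often, differences between entries of `slots` reflect position and not content. Exhaustive permutation grows factorially, so for more than a handful of videos a random sample of orderings is a reasonable substitute.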
Topics
- Multi-Video Summarization
- Positional Bias
- MLLMs
- ActivityNet
- Model Evaluation
- Input Protocol
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.