A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

2026-06-03 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A systematic evaluation investigated positional bias in Multimodal Large Language Models (MLLMs) performing multi-video summarization. The bias shows up as a change in per-video summary quality that depends on the video's input slot, even when the content is identical. The researchers built a benchmark from ActivityNet and News videos, covering the Cooking, Domestic, Leisure, and News domains with two- and four-video inputs. Nine open-source and proprietary MLLMs were assessed using three metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Results indicate that positional effects vary by domain and model, and that the signed directional bias can be small even when middle positions underperform. Increasing the visual or generation budget did not consistently eliminate this imbalance. The study concludes that multi-video summarization remains sensitive to input protocol and position, highlighting the need for more robust, order-invariant multimodal systems.

Key takeaway

Machine learning engineers designing or deploying MLLMs for multi-video summarization should treat input order as a variable that can change per-video summary quality. Positional bias varies by domain and by model, and a larger visual or generation budget does not reliably remove it. Test for this sensitivity, and either explore prompt-level mitigation strategies or choose models that behave more order-invariantly, so that output stays consistent across diverse video inputs.

Key insights

Positional bias significantly impacts Multimodal Large Language Models' multi-video summarization performance, varying by model and domain.

Principles

Positional effects in MLLM multi-video summarization are domain- and model-dependent.
Increasing visual or generation budget does not uniformly remove positional imbalance.

Method

Evaluated nine open-source and proprietary MLLMs on a benchmark built from ActivityNet and News videos, with two- and four-video inputs, using the Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG) metrics.
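
This summary does not give the paper's formulas for DPB or MEG, so the sketch below only illustrates the general idea under assumed definitions: a signed first-versus-last gap and an edge-versus-middle gap, both computed from per-slot mean Coverage. The function names and both formulas are hypothetical, not the authors' definitions.

```python
from statistics import mean


def directional_bias(slot_scores):
    """Assumed definition: last-slot score minus first-slot score.

    Positive values favor late slots, negative values favor early slots.
    This is NOT necessarily the paper's DPB formula.
    """
    return slot_scores[-1] - slot_scores[0]


def middle_edge_gap(slot_scores):
    """Assumed definition: mean of the two edge slots minus mean of the middle slots.

    Positive values mean the middle slots underperform the edges.
    This is NOT necessarily the paper's MEG formula.
    """
    if len(slot_scores) < 3:
        # With two videos there is no middle slot, so this gap is undefined here.
        raise ValueError("middle-edge gap needs at least three slots")
    edges = [slot_scores[0], slot_scores[-1]]
    middle = slot_scores[1:-1]
    return mean(edges) - mean(middle)


# Made-up per-slot Coverage for a four-video input (illustrative numbers only).
example_scores = [0.8, 0.6, 0.6, 0.8]
example_dpb = directional_bias(example_scores)
example_meg = middle_edge_gap(example_scores)
```

With these made-up scores the signed gap is zero while the edge-versus-middle gap is positive, which matches the pattern the summary describes: a small directional bias can coexist with underperforming middle positions.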

In practice

Test MLLMs for positional bias in multi-video tasks.
Analyze prompt-level mitigation methods for input order sensitivity.
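
One way to run such a test is to rotate the same videos through every input slot and compare mean per-slot scores. The sketch below is a minimal harness under stated assumptions: `summarize` and `score` are hypothetical callables standing in for a real MLLM call and a real quality metric, and the stubs at the bottom exist only to make the example runnable.

```python
from collections import defaultdict
from statistics import mean


def per_slot_scores(videos, summarize, score):
    """Return the mean score for each input slot across cyclic rotations.

    videos:    list of video identifiers
    summarize: hypothetical callable, ordered videos -> one summary per video
    score:     hypothetical callable, (video, summary) -> float
    Each video appears exactly once in every slot, so content is held
    constant while position varies.
    """
    n = len(videos)
    collected = defaultdict(list)
    for shift in range(n):
        ordering = list(videos[shift:]) + list(videos[:shift])
        summaries = summarize(ordering)
        for slot, (video, summary) in enumerate(zip(ordering, summaries)):
            collected[slot].append(score(video, summary))
    return [mean(collected[slot]) for slot in range(n)]


def stub_summarize(ordering):
    """Toy stand-in for a model: only the first and last slots get a summary."""
    last = len(ordering) - 1
    return [f"summary of {v}" if i in (0, last) else "" for i, v in enumerate(ordering)]


def stub_score(video, summary):
    """Toy stand-in for a metric: 1.0 if the summary mentions the video, else 0.0."""
    return 1.0 if video in summary else 0.0


slot_means = per_slot_scores(["vid_a", "vid_b", "vid_c", "vid_d"], stub_summarize, stub_score)
```

Cyclic rotations keep the number of model calls equal to the number of videos. Evaluating all permutations would also vary each video's neighbors, at a factorial cost.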

Topics

Multi-Video Summarization
Positional Bias
MLLMs
ActivityNet
Model Evaluation
Input Protocol

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.