FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

2026-01-20 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, short

Summary

The FutureOmni benchmark, introduced in a paper accepted by ICML 2026, addresses a gap in evaluating how well Multimodal Large Language Models (MLLMs) forecast future events from audio-visual context. Presented as the first benchmark of its kind, it requires models to perform cross-modal causal and temporal reasoning and to draw on internal knowledge. FutureOmni was built with an LLM-assisted, human-in-the-loop pipeline and comprises 919 videos and 1,034 multiple-choice QA pairs spanning 8 domains. Initial evaluations of 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios; the best performer, Gemini 3 Flash, reaches 64.8% accuracy. To improve performance, the researchers curated a 7K-sample instruction-tuning dataset and proposed an Omni-Modal Future Forecasting (OFF) training strategy, which improved future forecasting and generalization on FutureOmni and other benchmarks. All code and datasets are publicly available.

Key takeaway

For AI scientists and ML engineers developing Multimodal LLMs, this research highlights a critical gap in future forecasting from audio-visual data. You should integrate the FutureOmni benchmark into your evaluation pipelines to rigorously test cross-modal causal and temporal reasoning. Consider applying the Omni-Modal Future Forecasting (OFF) training strategy with its 7K-sample dataset to enhance your models' predictive capabilities, especially for speech-heavy applications, and improve generalization.

Key insights

FutureOmni is presented as the first benchmark for evaluating MLLM future forecasting from audio-visual context; the accompanying OFF training strategy is the authors' method for improving it.

Principles

MLLMs struggle with audio-visual future prediction.
Cross-modal causal and temporal reasoning is key.
Internal knowledge is crucial for future forecasting.

Method

FutureOmni is built via an LLM-assisted, human-in-the-loop pipeline. The Omni-Modal Future Forecasting (OFF) strategy uses a 7K-sample instruction-tuning dataset to enhance MLLM future prediction.

In practice

Use the FutureOmni benchmark to evaluate MLLMs on audio-visual future forecasting.
Apply the OFF strategy when fine-tuning MLLMs for future prediction.
Focus on speech-heavy scenarios, where current models struggle in particular.
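
Because FutureOmni uses multiple-choice QA pairs, the headline metric is plain accuracy, and breaking it down by group is a quick way to see where a model is weakest. The following is a minimal sketch of such scoring, not the benchmark's official evaluation code: the record fields ("domain", "answer", "prediction"), the function name, and the toy domain labels are all hypothetical and do not come from the FutureOmni repository.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key="domain"):
    """Score multiple-choice records.

    Returns (overall_accuracy, {group: accuracy}).
    Field names are assumptions made for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[group_key]
        total[group] += 1
        # Compare option letters, ignoring case and surrounding whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[group] += 1
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    per_group = {group: correct[group] / total[group] for group in total}
    return overall, per_group


# Toy records with made-up domain labels, for illustration only.
records = [
    {"domain": "speech", "answer": "A", "prediction": "a"},
    {"domain": "speech", "answer": "B", "prediction": "C"},
    {"domain": "music", "answer": "D", "prediction": "D "},
]
overall, per_group = accuracy_by_group(records)
print(f"overall accuracy: {overall:.3f}")
for group, acc in sorted(per_group.items()):
    print(f"{group}: {acc:.3f}")
```

In a real pipeline the records would come from the released benchmark files and your model's parsed answers; check the repository for the actual schema before adapting this.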

Topics

Multimodal LLMs
Future Forecasting
Audio-Visual Reasoning
FutureOmni Benchmark
Instruction Tuning
Omni-Modal Future Forecasting

Code references

OpenMOSS/FutureOmni

Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.