Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models
Summary
Facet-Probe is a new audit of order sensitivity in 18 frontier and open-weight multimodal large language models (MLLMs). It covers five ordering facets: option, evidence-chunk, document-rank, image-set, and mixed-modality. Standard benchmarks rarely test whether shuffling order-irrelevant inputs changes an MLLM's answer, even though this is a critical reliability property. The study uses a Bayesian item-response model to separate ordering noise from per-facet bias, and a same-ordering control to estimate the decoder-stochastic floor. None of the 18 MLLMs is order-invariant: screened per-facet panel-mean flip rates range from 24% to 50%, and even the best model flips on 13.4% of trials, a substantial ordering excess over decoder noise. Training-free prompt changes tested on Gemini had modality-conditional effects that did not transfer from text to visual reasoning, which suggests that prompt-level mitigation alone is insufficient for general order robustness. The authors propose cross-ordering flip rate as a standard reporting axis for MLLMs.
Key takeaway
MLOps engineers deploying multimodal LLMs must account for significant order sensitivity. Even top-performing models will likely produce different answers when input elements are merely reordered, which undermines reliability. Do not rely on prompt engineering alone for robustness, because its effects are often modality-specific. Add cross-ordering flip rate to your evaluation pipelines to check that model behavior stays consistent when the same inputs are presented in a different order.
Key insights
Multimodal LLMs exhibit significant order sensitivity, with current prompt-level mitigations proving insufficient for general robustness.
Principles
- MLLM reliability requires auditing order-irrelevant input shuffling.
- Capability does not eliminate order sensitivity in MLLMs.
- Prompt-level mitigations are modality-conditional and non-transferable.
Method
Facet-Probe audits MLLMs across five ordering facets, using a Bayesian item-response model to separate noise from bias and a same-ordering control for decoder-stochastic floor estimation.
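To make the same-ordering control concrete, the sketch below computes a flip rate from recorded answers and subtracts the decoder-stochastic floor to get an ordering excess. It is a simple point estimate, not the Bayesian item-response model the study uses. The function names, the toy data, and the definition of a flip (a repeat trial whose answer differs from the item's first trial) are assumptions made for illustration.

```python
from typing import Dict, List


def flip_rate(answers_by_item: Dict[str, List[str]]) -> float:
    """Fraction of repeat trials whose answer differs from the item's first trial.

    Assumed definition for illustration; the audit may define flips differently.
    Answers should be canonical content (not position labels such as "B").
    """
    flips = 0
    trials = 0
    for answers in answers_by_item.values():
        if len(answers) < 2:
            continue  # need a reference trial plus at least one repeat
        reference = answers[0]
        flips += sum(1 for a in answers[1:] if a != reference)
        trials += len(answers) - 1
    return flips / trials if trials else 0.0


def ordering_excess(
    shuffled: Dict[str, List[str]], same_order: Dict[str, List[str]]
) -> float:
    """Cross-ordering flip rate minus the same-ordering (decoder noise) floor."""
    return flip_rate(shuffled) - flip_rate(same_order)


# Toy data: answers per item under shuffled orderings vs. repeated identical ordering.
shuffled = {"q1": ["cat", "dog", "cat"], "q2": ["7", "7", "9"]}
same_order = {"q1": ["cat", "cat", "cat"], "q2": ["7", "7", "9"]}

print(flip_rate(shuffled))                    # 2 flips / 4 repeat trials
print(flip_rate(same_order))                  # 1 flip / 4 repeat trials
print(ordering_excess(shuffled, same_order))  # difference of the two rates
```

A positive excess indicates flips that repeated sampling alone does not explain, which is the quantity the same-ordering control is meant to isolate.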
In practice
- Audit MLLMs for order sensitivity using diverse input permutations.
- Report cross-ordering flip rates as a standard MLLM metric.
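The practices above could be wired into an evaluation pipeline roughly as follows. This is a minimal sketch for the option-ordering facet only. The `model` callable and both stub models are hypothetical stand-ins for a real MLLM client, and enumerating every permutation is practical only for small option sets.

```python
import itertools
import random
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: returns the index of the chosen option as shown.
Model = Callable[[str, Sequence[str]], int]


def answers_across_orderings(
    model: Model,
    question: str,
    options: Sequence[str],
    max_orderings: Optional[int] = None,
    seed: int = 0,
) -> List[str]:
    """Ask the same question under different option orderings.

    Answers are returned as option text rather than position, so they can be
    compared across orderings.
    """
    orderings = list(itertools.permutations(options))  # small option sets only
    if max_orderings is not None and max_orderings < len(orderings):
        orderings = random.Random(seed).sample(orderings, max_orderings)
    answers = []
    for shown in orderings:
        choice = model(question, shown)
        answers.append(shown[choice])
    return answers


def always_first(question: str, shown: Sequence[str]) -> int:
    """Stub with pure position bias: always picks the first option shown."""
    return 0


def content_based(question: str, shown: Sequence[str]) -> int:
    """Stub that is order-invariant: always picks the same content."""
    return list(shown).index("Paris")


question = "What is the capital of France?"
options = ["Paris", "Rome", "Oslo"]

biased = answers_across_orderings(always_first, question, options)
robust = answers_across_orderings(content_based, question, options)
print(len(set(biased)), len(set(robust)))  # distinct answers per stub
```

An order-invariant model yields a single distinct answer across orderings. More than one distinct answer is a flip worth counting toward the reported cross-ordering flip rate.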
Topics
- Multimodal LLMs
- Order Sensitivity
- AI Evaluation
- Model Robustness
- Facet-Probe
- Prompt Engineering
Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.