OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
Summary
OmniTrace is a lightweight, model-agnostic framework for generation-time attribution in omni-modal large language models (LLMs). It addresses the problem of identifying which interleaved input sources (text, image, audio, video) support each generated statement in autoregressive, decoder-only multimodal LLMs (MLLMs). OmniTrace formalizes attribution as a tracing problem over the causal decoding process and converts token-level signals, such as attention weights or gradient-based scores, into coherent span-level, cross-modal explanations during decoding. It requires no retraining or supervision: it traces each generated token to the multimodal inputs, aggregates the signals into semantically meaningful spans, and selects a concise set of supporting sources through confidence-weighted, temporally coherent aggregation. Evaluations on Qwen2.5-Omni and MiniCPM-o-4.5 across visual, audio, and video tasks show that OmniTrace produces more stable and interpretable explanations than naive self-attribution and embedding-based baselines.
Key takeaway
For research scientists developing or deploying omni-modal LLMs, OmniTrace offers a way to explain, at generation time, how model outputs are grounded in each input modality. Consider integrating it where generated content must be justified, such as multimodal summarization or decision support systems, since source-level explanations make outputs easier to audit and trust.
Key insights
OmniTrace provides a unified, generation-aware framework for attributing MLLM outputs to diverse input modalities.
Principles
- Attribution must be generation-aware for decoder-only MLLMs.
- Explanations require span-level semantic coherence.
- Confidence and temporal consistency improve attribution quality.
Method
OmniTrace maps generated tokens to influential input sources, aggregates attribution mass with POS-aware weighting and confidence shaping, then selects concise supporting spans using threshold filtering and run-level coherence.
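The aggregation and selection steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `attribute_span`, the `POS_WEIGHT` table, the `gamma` exponent used for confidence shaping, and the `threshold` value are all hypothetical choices made for illustration, and the per-token scores are assumed to come from attention weights or a similar signal.

```python
from collections import defaultdict

# Hypothetical weights: content-bearing parts of speech count more than function words.
POS_WEIGHT = {"NOUN": 1.0, "VERB": 1.0, "ADJ": 0.8, "NUM": 0.8}
DEFAULT_POS_WEIGHT = 0.2


def attribute_span(attn, source_of, pos_tags, token_probs, gamma=2.0, threshold=0.25):
    """Aggregate token-level scores for one generated span into source-level shares.

    attn[t][i]     : score linking generated token t to input position i
    source_of[i]   : id of the input source that owns input position i
    pos_tags[t]    : part-of-speech tag of generated token t
    token_probs[t] : model probability of generated token t (confidence proxy)
    """
    mass = defaultdict(float)
    for t, row in enumerate(attn):
        total = sum(row)
        if total <= 0:
            continue  # token carries no usable signal
        # POS-aware weight multiplied by a confidence term sharpened by gamma (assumed form).
        w = POS_WEIGHT.get(pos_tags[t], DEFAULT_POS_WEIGHT) * token_probs[t] ** gamma
        for i, score in enumerate(row):
            mass[source_of[i]] += w * score / total
    norm = sum(mass.values())
    if norm == 0:
        return {}
    shares = {src: m / norm for src, m in mass.items()}
    # Threshold filtering: keep only sources holding a sizeable share of the mass.
    return {src: share for src, share in shares.items() if share >= threshold}


# Toy span of three generated tokens over three input positions, one per source.
attn = [[0.9, 0.1, 0.0], [0.1, 0.1, 0.8], [0.3, 0.3, 0.4]]
selected = attribute_span(
    attn,
    source_of=["image", "text", "audio"],
    pos_tags=["NOUN", "VERB", "DET"],
    token_probs=[0.9, 0.8, 0.5],
)
```

In this toy example the low-confidence determiner contributes almost nothing, so the text source falls below the threshold and only the image and audio sources are kept.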
In practice
- Use OmniTrace for transparent grounding in MLLM applications.
- Apply POS-aware weighting to prioritize content-bearing tokens.
- Filter weak signals to avoid spurious cross-modal connections.
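One way to picture weak-signal filtering combined with run-level coherence is the sketch below. It assumes each generated token has already been assigned a top-scoring source and a score. The function name `coherent_sources` and the `min_score` and `min_run` parameters are hypothetical, and the summarized paper may define runs and thresholds differently.

```python
from itertools import groupby


def coherent_sources(token_sources, token_scores, min_score=0.3, min_run=2):
    """Keep sources supported by a run of consecutive, sufficiently strong tokens.

    token_sources[t] : top-scoring source id for generated token t
    token_scores[t]  : attribution score of that source for token t
    """
    # Weak tokens are masked out so they can neither start nor extend a run.
    labels = [
        src if score >= min_score else None
        for src, score in zip(token_sources, token_scores)
    ]
    kept = []
    for src, run in groupby(labels):
        length = sum(1 for _ in run)
        # Isolated hits are treated as spurious; only sustained runs count as support.
        if src is not None and length >= min_run and src not in kept:
            kept.append(src)
    return kept


sources = ["image", "image", "audio", "image", "image", "image", "text", "text"]
scores = [0.8, 0.7, 0.9, 0.6, 0.5, 0.6, 0.2, 0.4]
supported = coherent_sources(sources, scores)
```

Here the single strong audio token and the partly masked text tokens do not form a run of length two, so only the image source survives.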
Topics
- Omni-Modal LLMs
- Generation-Time Attribution
- Model Interpretability
- Cross-Modal Explanations
- Source Curation Pipeline
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.