OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

Source: cs.AI updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, AI Interpretability & Explainability · Depth: Expert, extended

Summary

OmniTrace is a lightweight, model-agnostic framework for generation-time attribution in omni-modal large language models (LLMs). It identifies which interleaved input sources (text, image, audio, video) support each generated statement in autoregressive, decoder-only multimodal LLMs (MLLMs). OmniTrace formalizes attribution as a tracing problem over the causal decoding process: token-level signals such as attention weights or gradient-based scores are converted into coherent span-level, cross-modal explanations during decoding. The framework requires no retraining or supervision. It traces each generated token to the multimodal inputs, aggregates the signals into semantically meaningful spans, and selects concise supporting sources through confidence-weighted and temporally coherent aggregation. Evaluations on Qwen2.5-Omni and MiniCPM-o-4.5 across visual, audio, and video tasks show that OmniTrace produces more stable and interpretable explanations than naive self-attribution and embedding-based baselines.

Key takeaway

For research scientists developing or deploying omni-modal LLMs, OmniTrace offers a way to add interpretability without retraining or supervision. Consider integrating it to provide generation-time explanations of how model outputs are grounded in each input modality. This improves transparency and trustworthiness, particularly in applications that must justify generated content, such as multimodal summarization or decision support systems.

Key insights

OmniTrace provides a unified, generation-aware framework for attributing MLLM outputs to diverse input modalities.

Principles

Method

OmniTrace works in three steps. It maps each generated token to its most influential input sources, aggregates the resulting attribution mass using part-of-speech (POS)-aware weighting and confidence shaping, and then selects concise supporting spans using threshold filtering and run-level coherence.
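The aggregation and selection steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the POS weight table, the `gamma` confidence exponent, the threshold values, the reading of "run-level coherence" as keeping only contiguous runs of selected segments, and all names (`TokenTrace`, `aggregate_span`, `select_sources`, `coherent_runs`) are hypothetical. The per-token source scores are assumed to have been computed already, for example from attention weights.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical POS weights: content words count more than function words.
POS_WEIGHTS = {"NOUN": 1.0, "PROPN": 1.0, "VERB": 0.8, "NUM": 0.8, "ADJ": 0.6}
DEFAULT_POS_WEIGHT = 0.2


@dataclass
class TokenTrace:
    pos: str                         # POS tag of the generated token
    confidence: float                # model probability of the generated token
    source_scores: Dict[str, float]  # attribution mass per input source id


def aggregate_span(tokens: List[TokenTrace], gamma: float = 1.0) -> Dict[str, float]:
    """Sum per-source mass over a span, weighting each token by
    POS weight * confidence**gamma, then normalise to sum to 1."""
    mass: Dict[str, float] = defaultdict(float)
    for tok in tokens:
        weight = POS_WEIGHTS.get(tok.pos, DEFAULT_POS_WEIGHT) * tok.confidence ** gamma
        for src, score in tok.source_scores.items():
            mass[src] += weight * score
    total = sum(mass.values())
    return {s: m / total for s, m in mass.items()} if total > 0 else {}


def select_sources(span_mass: Dict[str, float], threshold: float = 0.3) -> List[str]:
    """Threshold filtering: keep sources whose normalised mass is large enough."""
    return sorted(s for s, m in span_mass.items() if m >= threshold)


def coherent_runs(indices: List[int], min_run: int = 2) -> List[Tuple[int, int]]:
    """Assumed form of run-level coherence for temporal sources (frames,
    audio segments): keep only runs of consecutive indices of length
    >= min_run, returned as inclusive (start, end) pairs."""
    runs: List[Tuple[int, int]] = []
    start = prev = None
    for i in sorted(set(indices)):
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            if prev - start + 1 >= min_run:
                runs.append((start, prev))
            start = prev = i
    if start is not None and prev - start + 1 >= min_run:
        runs.append((start, prev))
    return runs


# Usage: one generated span of three tokens traced to three input sources.
span = [
    TokenTrace("DET", 0.9, {"image_0": 0.5, "text_0": 0.5}),
    TokenTrace("NOUN", 0.8, {"image_0": 0.9, "text_0": 0.1}),
    TokenTrace("VERB", 0.5, {"audio_0": 0.7, "image_0": 0.3}),
]
supporting = select_sources(aggregate_span(span))
kept_frames = coherent_runs([3, 4, 5, 9, 12, 13])
```

In this toy span the high-confidence noun dominates the aggregate, so only `image_0` survives the default threshold, while the isolated frame index 9 is dropped because it does not form a run. Lowering the threshold or `min_run` trades conciseness for coverage.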

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.