Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model

2026-06-16 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing · Depth: Expert, quick

Summary

"Plug-and-Adapt" is a method for Multimodal Coreference Resolution (MCR) that targets two limitations of existing approaches: dependence on extensive training data and reliance on inaccessible Vision-Language Large Models (VLLMs). It first pre-trains a fine-grained alignment model on vision-language alignment datasets, so that the model connects textual and visual contextual information. It then repurposes that model for MCR by aggregating similarities and fusing visual and categorical cues through evidence theory. This design removes the need to train on scarce MCR benchmark datasets. On the Coreference Image Narratives (CIN) benchmark, "Plug-and-Adapt" improves CoNLL F1 by 5.31% over state-of-the-art dedicated methods and by 2.12% over popular VLLMs. Further evaluations indicate robustness on a masked CIN dataset and generalization on a specially constructed VCR-MCR dataset.
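For readers less familiar with the metric behind these gains: CoNLL F1 is conventionally the unweighted mean of the MUC, B-cubed, and CEAF-e F1 scores. The sketch below shows that averaging step plus a small B-cubed scorer. It is an illustration only, not the paper's evaluation code: the function names and toy clusters are assumptions, and the scorer assumes the gold and predicted clusterings cover exactly the same mentions.

```python
from typing import Dict, List, Set


def b_cubed(key: List[Set[str]], response: List[Set[str]]) -> Dict[str, float]:
    """B-cubed precision, recall, and F1 for two clusterings of the same mentions."""
    # Map every mention to the cluster that contains it.
    key_of = {m: cluster for cluster in key for m in cluster}
    resp_of = {m: cluster for cluster in response for m in cluster}
    if not key_of or key_of.keys() != resp_of.keys():
        # Assumption of this sketch: gold mentions are given, so both sides match.
        raise ValueError("key and response must cover the same non-empty mention set")

    n = len(key_of)
    # Per-mention overlap between its gold cluster and its predicted cluster.
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in key_of) / n
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in key_of) / n
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def conll_f1(muc_f1: float, b_cubed_f1: float, ceaf_e_f1: float) -> float:
    """CoNLL F1: unweighted mean of the three coreference F1 scores."""
    return (muc_f1 + b_cubed_f1 + ceaf_e_f1) / 3.0


# Toy clusters (hypothetical mentions, not drawn from CIN).
gold = [{"a woman", "she", "her"}, {"the dog"}]
pred = [{"a woman", "she"}, {"her", "the dog"}]
print(b_cubed(gold, pred))
```

MUC and CEAF-e are omitted here to keep the sketch short; in practice all three scores would come from an established coreference scorer.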

Key takeaway

For NLP engineers building multimodal coreference resolution systems, "Plug-and-Adapt" offers an alternative to data-intensive training and costly VLLM APIs. Consider pairing a pre-trained alignment model with evidence-theory-based fusion to get strong zero-shot performance. Because no dataset-specific training is required, this approach can reduce annotation effort and deployment cost.

Key insights

Plug-and-Adapt repurposes a pre-trained vision-language alignment model for zero-shot multimodal coreference resolution and outperforms both dedicated MCR methods and popular VLLMs on the CIN benchmark.

Principles

Visual information enhances coreference resolution.
Pre-trained alignment models enable zero-shot MCR.
Evidence theory can fuse multimodal cues.

Method

Pre-train a fine-grained text-visual alignment model using vision-language datasets. Repurpose it for MCR by aggregating similarities and fusing visual/categorical cues via evidence theory.
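The fusion step can be illustrated with Dempster's rule of combination, the standard combination rule in evidence theory. The sketch below is a minimal example under stated assumptions, not the paper's implementation: the two-hypothesis frame, the `similarity_to_mass` discounting helper, and the `reliability` parameter are all hypothetical choices made for illustration, since the summary does not specify how masses are built.

```python
from itertools import product
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]

# Frame of discernment for one mention pair (an assumption of this sketch).
COREF = frozenset({"coref"})
NOT_COREF = frozenset({"not_coref"})
UNSURE = COREF | NOT_COREF  # mass on the whole frame expresses ignorance


def similarity_to_mass(score: float, reliability: float) -> Mass:
    """Hypothetical helper: turn a [0, 1] score into a discounted mass function."""
    if not (0.0 <= score <= 1.0 and 0.0 <= reliability <= 1.0):
        raise ValueError("score and reliability must lie in [0, 1]")
    return {
        COREF: reliability * score,
        NOT_COREF: reliability * (1.0 - score),
        UNSURE: 1.0 - reliability,
    }


def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule: multiply masses, intersect focal sets, renormalize."""
    combined: Mass = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory evidence
    if 1.0 - conflict <= 1e-12:
        raise ValueError("sources are in total conflict; the rule is undefined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


# Illustrative numbers only: a visual cue and a categorical cue for one pair.
visual = similarity_to_mass(score=0.9, reliability=0.8)
categorical = similarity_to_mass(score=0.6, reliability=0.5)
fused = combine(visual, categorical)
print({"/".join(sorted(s)): round(w, 3) for s, w in fused.items()})
```

Keeping explicit mass on `UNSURE` lets a weak cue abstain rather than vote, which is one reason evidence theory is attractive when two cues differ in reliability.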

In practice

Apply pre-trained alignment models to MCR.
Aggregate alignment similarities, then fuse visual and categorical cues with evidence theory.
Evaluate robustness on the masked CIN dataset and generalization on the VCR-MCR dataset.
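One way to picture the similarity-aggregation idea in the list above is sketched here. It assumes an alignment model has already produced token-to-region similarity scores; those scores are hard-coded toy values. Mean pooling, cosine comparison, the 0.8 threshold, and all function names are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def aggregate_mention(token_sims: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool token-to-region similarities into one mention-to-region profile."""
    n_tokens = len(token_sims)
    n_regions = len(token_sims[0])
    return [sum(row[r] for row in token_sims) / n_tokens for r in range(n_regions)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two profiles; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_mentions(profiles: Dict[str, List[float]], threshold: float = 0.8) -> List[List[str]]:
    """Link mentions with similar region profiles; return connected components."""
    names = list(profiles)
    parent = {n: n for n in names}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


# Toy scores: rows are tokens of a mention, columns are three image regions.
profiles = {
    "a woman": aggregate_mention([[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]),
    "she": aggregate_mention([[0.1, 0.8, 0.1]]),
    "the dog": aggregate_mention([[0.9, 0.1, 0.0], [0.8, 0.1, 0.1]]),
}
print(cluster_mentions(profiles))
```

In this toy case "a woman" and "she" attend to the same region and end up in one cluster, while "the dog" stays separate. A pairwise score like this could then serve as the visual cue fed into an evidence-theory fusion step.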

Topics

Multimodal Coreference Resolution
Zero-shot Learning
Vision-Language Alignment
Evidence Theory
Coreference Image Narratives

Best for: Research Scientist, AI Scientist, NLP Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.