Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model
Summary
"Plug-and-Adapt" is a method for Multimodal Coreference Resolution (MCR) that addresses two limitations of existing approaches: they either require extensive training data or rely on inaccessible Vision-Language Large Models (VLLMs). The method first pre-trains a fine-grained alignment model on vision-language alignment datasets so that it connects textual and visual contextual information. It then repurposes this alignment model for MCR by aggregating similarities and fusing visual and categorical cues through evidence theory. This design eliminates the need for training on scarce MCR benchmark datasets. On the Coreference Image Narratives (CIN) benchmark, "Plug-and-Adapt" achieves a 5.31% CoNLL F1 improvement over state-of-the-art dedicated methods and a 2.12% improvement over popular VLLMs. Further evaluations confirm its robustness on a masked CIN dataset and its generalization on a specially constructed VCR-MCR dataset.
Key takeaway
For NLP engineers building multimodal coreference resolution systems, "Plug-and-Adapt" offers an alternative to data-intensive training and costly VLLM APIs. You should consider combining a pre-trained alignment model with evidence theory-based fusion to achieve strong zero-shot performance. Because the approach needs no training on the target benchmark, it can reduce annotation effort and deployment cost.
Key insights
Plug-and-Adapt repurposes a pre-trained vision-language alignment model for zero-shot multimodal coreference resolution, outperforming both dedicated MCR methods and popular VLLMs on the CIN benchmark.
Principles
- Visual information enhances coreference resolution.
- Pre-trained alignment models enable zero-shot MCR.
- Evidence theory can fuse multimodal cues.
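
To make the last principle concrete, here is a minimal sketch of Dempster's rule of combination, the standard fusion operator in evidence (Dempster-Shafer) theory. This summary does not give the paper's exact formulation, so the two-hypothesis frame, the function name `combine_dempster`, and the mass values below are illustrative assumptions only.

```python
from itertools import product

# Assumed frame of discernment for one mention pair (hypothetical labels).
COREF = frozenset({"coref"})
NOT_COREF = frozenset({"not_coref"})
EITHER = COREF | NOT_COREF  # mass placed here expresses uncertainty


def combine_dempster(m1, m2):
    """Fuse two mass functions (dicts: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        overlap = a & b
        if overlap:
            # Agreeing evidence supports the intersection of the two sets.
            combined[overlap] = combined.get(overlap, 0.0) + wa * wb
        else:
            # Contradictory evidence is set aside as conflict.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict and cannot be fused")
    # Renormalise so the remaining masses sum to 1.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Made-up masses standing in for a visual cue and a categorical cue.
visual = {COREF: 0.6, NOT_COREF: 0.1, EITHER: 0.3}
categorical = {COREF: 0.5, NOT_COREF: 0.2, EITHER: 0.3}

fused = combine_dempster(visual, categorical)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 3))
```

When both cues lean toward the same hypothesis, the fused mass on that hypothesis exceeds either input, while the mass left on the uncertain set shrinks.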
Method
Pre-train a fine-grained text-visual alignment model using vision-language datasets. Repurpose it for MCR by aggregating similarities and fusing visual/categorical cues via evidence theory.
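
One way to picture the similarity-aggregation step is sketched below. It is a guess at the general idea, not the paper's procedure: it assumes the alignment model yields an embedding for each text mention and each image region, grounds each mention as a softmax over its similarities to the regions, and scores two mentions as coreferent when they ground to the same regions. The helper names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=0.1):
    """Turn similarity scores into a distribution (temperature is an assumed knob)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def grounding(mention_vec, region_vecs):
    """Distribution over image regions for one mention."""
    return softmax([cosine(mention_vec, r) for r in region_vecs])


def coref_score(mention_a, mention_b, region_vecs):
    """Aggregate over regions: probability that both mentions pick the same region."""
    ga = grounding(mention_a, region_vecs)
    gb = grounding(mention_b, region_vecs)
    return sum(p * q for p, q in zip(ga, gb))


# Toy 2-d embeddings standing in for alignment-model outputs.
regions = [[1.0, 0.0], [0.0, 1.0]]  # e.g. a person box and a dog box
woman, she, dog = [0.9, 0.1], [0.8, 0.2], [0.1, 0.9]

same_entity = coref_score(woman, she, regions)
different_entity = coref_score(woman, dog, regions)
print(same_entity > different_entity)
```

Pairwise scores like these could then be thresholded or clustered into coreference chains; the summary does not say how the paper does this.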
In practice
- Apply pre-trained alignment models to MCR.
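- Aggregate the alignment model's similarities to score coreference, then fuse visual and categorical cues with evidence theory.
- Evaluate robustness on the masked CIN dataset and generalization on the VCR-MCR dataset.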
Topics
- Multimodal Coreference Resolution
- Zero-shot Learning
- Vision-Language Alignment
- Evidence Theory
- Coreference Image Narratives
Best for: Research Scientist, AI Scientist, NLP Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.