SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction
Summary
Semantic Anchor-aligned Multimodal Augmentation (SAMA) is a unified framework that addresses data scarcity in Multimodal Information Extraction (MIE) tasks: Multimodal Named Entity Recognition (MNER), Multimodal Relation Extraction (MRE), and Multimodal Event Extraction (MEE). It constructs structured semantic anchors from ground-truth labels and uses them to guide both text and image generation. For text, SAMA employs a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM) with Universal and Task-Specific Adapters to generate diverse, constraint-compliant samples. For images, it uses an Anchor-Preserving Diffusion mechanism that keeps the critical semantic anchors intact while diversifying the visual context. A Dual-Constraint Filtering module then automatically selects high-fidelity synthetic samples based on cross-modal consistency and anchor fidelity. In the reported experiments, SAMA consistently outperforms state-of-the-art baselines: a 1.7% F1 improvement over GMDA on MNER (10% setting), a 2.0% F1 boost for HVPNeT on MRE (10% setting), and up to a 5.9% F1 gain on MEE (10% setting). It also yields gains in fully supervised scenarios.
Key takeaway
For AI Scientists and Machine Learning Engineers facing data scarcity in multimodal information extraction, SAMA offers a unified data augmentation approach. Consider integrating its anchor-driven generation and dual-constraint filtering to create high-fidelity synthetic data that improves generalization and F1 in low-resource MNER, MRE, and MEE tasks. The approach reduces reliance on costly manual annotation, and the reported gains extend to fully supervised settings.
Key insights
SAMA unifies multimodal data augmentation for MIE tasks using semantic anchors to guide high-fidelity text and image synthesis.
Principles
- Semantic anchors enforce strict cross-modal consistency.
- Collaborative experts balance shared and task-specific knowledge.
- Anchor-preserving diffusion prevents visual identity drift.
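One way to picture how shared and task-specific knowledge might be balanced is a frozen weight matrix that receives two low-rank updates, one from a universal adapter and one from a task-specific adapter. The sketch below is a minimal illustration in plain Python, assuming LoRA-style (B, A) factor pairs and a simple additive combination. The summary does not state how CME-MLLM actually combines its experts, so `effective_weight`, `scaling`, and the toy matrices are all hypothetical.

```python
# Illustrative only: plain-Python matrices (lists of rows), no ML framework.

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def effective_weight(frozen_w, universal, task_specific, scaling=1.0):
    """Frozen weight plus scaled low-rank updates from both adapters.

    Each adapter is a (B, A) pair with B of shape d_out x r and A of shape
    r x d_in. Summing the two updates is an assumption of this sketch.
    """
    delta = add(matmul(*universal), matmul(*task_specific))
    scaled = [[scaling * v for v in row] for row in delta]
    return add(frozen_w, scaled)


frozen = [[1.0, 0.0], [0.0, 1.0]]              # stands in for a pretrained weight
universal = ([[1.0], [0.0]], [[0.0, 2.0]])     # rank-1 update shared by all tasks
mner_adapter = ([[0.0], [1.0]], [[3.0, 0.0]])  # hypothetical task-specific update
print(effective_weight(frozen, universal, mner_adapter))
```

In this toy setup the frozen weights never change; switching tasks only swaps the task-specific pair, while the universal pair is reused across MNER, MRE, and MEE.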
Method
SAMA constructs semantic anchors from ground-truth labels, generates text with the CME-MLLM, synthesizes images with Anchor-Preserving Diffusion, and then applies Dual-Constraint Filtering for quality control.
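The anchor-construction step can be sketched as follows for the entity case. This is a minimal illustration, not the paper's implementation: the tag syntax, the prompt wording, the `EntityAnchor` class, and the substring-based `anchor_fidelity` check are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityAnchor:
    """A ground-truth entity span and its label (hypothetical structure)."""
    text: str
    label: str


def render_anchor(anchor):
    """Wrap an anchor in a structured tag; the tag format is an assumption."""
    return f'<entity type="{anchor.label}">{anchor.text}</entity>'


def build_prompt(anchors):
    """Compose a generation prompt that lists every anchor to be preserved."""
    tagged = " ".join(render_anchor(a) for a in anchors)
    return f"Write a new post that mentions every anchor verbatim: {tagged}"


def anchor_fidelity(generated, anchors):
    """Fraction of anchors whose surface text appears in the generated text."""
    if not anchors:
        return 1.0
    lowered = generated.lower()
    hits = sum(1 for a in anchors if a.text.lower() in lowered)
    return hits / len(anchors)


anchors = [EntityAnchor("Lionel Messi", "PER"), EntityAnchor("Paris", "LOC")]
print(build_prompt(anchors))
print(anchor_fidelity("Lionel Messi lands in Paris tonight", anchors))
```

A case-insensitive substring match is the simplest possible fidelity check; relation and event anchors would need richer tags (for example head, tail, and trigger fields) and a correspondingly stricter check.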
In practice
- Use structured tags for entity, relation, and event anchors.
- Employ LoRA for universal and task-specific adapters.
- Filter synthetic data with cross-modal and anchor fidelity scores.
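The filtering step in the list above can be sketched as a conjunction of two thresholds. This is a hedged example: the `SyntheticSample` fields, the score ranges, and the threshold values are hypothetical, and the summary does not say how SAMA computes or combines the two scores. In practice the cross-modal score might come from an image-text similarity model and the anchor score from a check that the labeled spans survive generation.

```python
from dataclasses import dataclass


@dataclass
class SyntheticSample:
    """A generated text-image pair with precomputed scores (hypothetical)."""
    sample_id: str
    cross_modal_score: float  # assumed in [0, 1], e.g. image-text similarity
    anchor_score: float       # assumed in [0, 1], share of anchors preserved


def dual_constraint_filter(samples, min_cross_modal=0.3, min_anchor=0.9):
    """Keep only samples that satisfy both constraints.

    The threshold defaults are placeholders, not values from the paper.
    """
    return [
        s for s in samples
        if s.cross_modal_score >= min_cross_modal and s.anchor_score >= min_anchor
    ]


candidates = [
    SyntheticSample("a", cross_modal_score=0.45, anchor_score=1.0),
    SyntheticSample("b", cross_modal_score=0.10, anchor_score=1.0),  # image drifted
    SyntheticSample("c", cross_modal_score=0.50, anchor_score=0.5),  # anchor lost
]
kept = dual_constraint_filter(candidates)
print([s.sample_id for s in kept])
```

Requiring both constraints, rather than averaging the scores, means a sample with a well-matched image cannot compensate for a dropped anchor, which would otherwise inject label noise into the augmented training set.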
Topics
- Multimodal Information Extraction
- Data Augmentation
- Large Language Models
- Diffusion Models
- Named Entity Recognition
- Relation Extraction
- Event Extraction
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.