SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

Semantic Anchor-aligned Multimodal Augmentation (SAMA) is a unified framework that addresses data scarcity in Multimodal Information Extraction (MIE) tasks: Multimodal Named Entity Recognition (MNER), Multimodal Relation Extraction (MRE), and Multimodal Event Extraction (MEE). It builds structured semantic anchors from ground-truth labels and uses them to guide both text and image generation. For text, SAMA employs a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM) with Universal and Task-Specific Adapters to generate diverse, constraint-compliant samples. For images, it uses an Anchor-Preserving Diffusion mechanism that keeps the critical semantic anchors intact while diversifying the visual context. A Dual-Constraint Filtering module then automatically selects high-fidelity synthetic samples based on cross-modal consistency and anchor fidelity. In the reported experiments, SAMA consistently outperforms state-of-the-art baselines: a 1.7% F1 improvement over GMDA on MNER (10% setting), a 2.0% F1 boost for HVPNeT on MRE (10% split), and up to a 5.9% F1 gain on MEE (10% setting). It also yields gains in fully supervised scenarios.

Key takeaway

For AI Scientists and Machine Learning Engineers facing data scarcity in multimodal information extraction, SAMA offers a unified data augmentation approach. Consider integrating its anchor-driven generation and dual-constraint filtering to create high-fidelity synthetic data and improve model generalization in low-resource MNER, MRE, and MEE tasks. The approach reduces reliance on costly manual annotation and, according to the reported results, still helps in fully supervised settings.

Key insights

SAMA unifies multimodal data augmentation for MIE tasks using semantic anchors to guide high-fidelity text and image synthesis.

Principles

Semantic anchors enforce strict cross-modal consistency.
Collaborative experts balance shared and task-specific knowledge.
Anchor-preserving diffusion prevents visual identity drift.
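
The second principle can be pictured with a small numerical sketch. It assumes LoRA-style low-rank adapters and assumes that the universal and task-specific updates are simply added to the frozen base projection. This summary does not say how SAMA actually combines them, so the additive rule, the function names, and the toy matrices below are all illustrative assumptions.

```python
# Minimal sketch: one frozen weight matrix plus a shared (universal) low-rank
# update and a per-task low-rank update. The additive combination is an
# assumption made for illustration, not the paper's confirmed design.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_delta(a, b, x, scale=1.0):
    """Low-rank update scale * B(Ax); A is r x d_in, B is d_out x r."""
    return [scale * y for y in matvec(b, matvec(a, x))]


def adapted_forward(w, x, universal, task_adapters, task):
    """Frozen base output plus universal and task-specific low-rank updates."""
    a_u, b_u = universal
    a_t, b_t = task_adapters[task]
    base = matvec(w, x)
    shared = lora_delta(a_u, b_u, x)
    specific = lora_delta(a_t, b_t, x)
    return [y + s + t for y, s, t in zip(base, shared, specific)]


# Toy example: 2-dimensional input, rank-1 adapters.
W = [[1, 0], [0, 1]]                       # frozen base weights
UNIVERSAL = ([[1, 0]], [[1], [0]])         # (A, B) shared by all tasks
TASKS = {
    "MNER": ([[0, 1]], [[0], [1]]),        # task-specific (A, B)
    "MRE": ([[0, 0]], [[0], [0]]),         # zero update for comparison
}

mner_out = adapted_forward(W, [1, 2], UNIVERSAL, TASKS, "MNER")
mre_out = adapted_forward(W, [1, 2], UNIVERSAL, TASKS, "MRE")
```

Only the small A and B matrices would be trained in such a setup, so adding a task costs one extra low-rank pair while the shared pair carries knowledge across tasks.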

Method

SAMA constructs semantic anchors from ground-truth labels, generates text with the CME-MLLM, synthesizes images with Anchor-Preserving Diffusion, and applies Dual-Constraint Filtering for quality control.
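
As an illustration of the first step, the sketch below turns BIO-labelled tokens, a common MNER annotation format, into a tagged anchor string. The tag syntax (`<PER> ... </PER>`) and the function name are hypothetical; this summary does not specify SAMA's actual anchor format.

```python
# Minimal sketch: build a structured anchor string from ground-truth BIO labels.
# The angle-bracket tag format is an assumption made for illustration.

def bio_to_tagged(tokens, labels):
    """Wrap each labelled entity span in a typed open/close tag."""
    pieces = []
    current = None  # entity type of the span that is currently open
    for token, label in zip(tokens, labels):
        prefix, _, etype = label.partition("-")
        continues = prefix == "I" and etype == current
        # Close the open span when the current token does not continue it.
        if current is not None and not continues:
            pieces.append(f"</{current}>")
            current = None
        # Open a new span on B-, or on a stray I- with no matching open span.
        if prefix == "B" or (prefix == "I" and not continues):
            pieces.append(f"<{etype}>")
            current = etype
        pieces.append(token)
    if current is not None:
        pieces.append(f"</{current}>")
    return " ".join(pieces)


tokens = ["Messi", "joins", "Inter", "Miami"]
labels = ["B-PER", "O", "B-ORG", "I-ORG"]
anchored = bio_to_tagged(tokens, labels)
print(anchored)
```

A string like this can be placed in a generation prompt so that the entities and their types are stated explicitly as constraints that the generated sample has to keep.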

In practice

Use structured tags for entity, relation, and event anchors.
Employ LoRA for universal and task-specific adapters.
Filter synthetic data with cross-modal and anchor fidelity scores.
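
The filtering step can be sketched as a two-threshold gate. The version below assumes that a cross-modal consistency score has already been computed upstream by some image-text model, and it approximates anchor fidelity as the fraction of anchor strings that survive verbatim in the generated text. Both scoring choices, the class and function names, and the threshold values are illustrative assumptions rather than SAMA's published definitions.

```python
# Minimal sketch: keep a synthetic sample only if it passes both constraints.
# Scores and thresholds are hypothetical placeholders.
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    cross_modal: float  # assumed image-text consistency score in [0, 1]


def anchor_fidelity(text: str, anchors: list[str]) -> float:
    """Fraction of anchor strings that appear verbatim in the text."""
    if not anchors:
        return 1.0
    return sum(a in text for a in anchors) / len(anchors)


def dual_constraint_filter(candidates, anchors, min_cross_modal=0.3, min_anchor=1.0):
    """Return the candidates that satisfy both thresholds."""
    return [
        c for c in candidates
        if c.cross_modal >= min_cross_modal
        and anchor_fidelity(c.text, anchors) >= min_anchor
    ]


anchors = ["Messi", "Inter Miami"]
candidates = [
    Candidate("Messi signs with Inter Miami", 0.8),  # passes both checks
    Candidate("Messi signs with a new club", 0.9),   # drops an anchor
    Candidate("Messi lands at Inter Miami", 0.1),    # weak image-text match
]
kept = dual_constraint_filter(candidates, anchors)
```

Requiring both conditions means a fluent sample that loses a labelled entity is rejected, as is a faithful caption paired with a mismatched image.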

Topics

Multimodal Information Extraction
Data Augmentation
Large Language Models
Diffusion Models
Named Entity Recognition
Relation Extraction
Event Extraction

Code references

UESTC-GQJ/SAMA

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.