Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning
Summary
"Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning" presents a retrieval-augmented captioning framework that addresses the limitations of traditional methods in generating context-rich descriptions of news images. The framework draws on external knowledge to supply details that are not directly visible in the image, such as object attributes and event context. Its hierarchical multi-modal article retrieval mechanism analyzes article structure, including weighted textual components such as headlines and body sections, together with visual placement patterns. Retrieval also relies on multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning), followed by a contextual relevance refinement stage. A Vision-Language Model (VLM) first produces a concise image description, which then guides the segmentation of relevant information from the retrieved articles. Finally, a Large Language Model (LLM) combines this description with the extracted knowledge to produce a comprehensive, contextually detailed caption. The framework placed 5th in the ACM Multimedia EVENTA 2025 Challenge with a score of 0.2824 on the OpenEvent-V1 dataset.
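The multi-faceted similarity step can be illustrated with a small sketch. This is not the authors' implementation: the `Candidate` fields, the linear weighting, and the weight values `(0.5, 0.4, 0.1)` are assumptions made for illustration, and the content-visual facet assumes that image and text embeddings live in a shared space (as with a CLIP-style encoder). The summary does not state how the paper actually combines the three facets.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Candidate:
    """Hypothetical container for one candidate article."""
    article_id: str
    text_emb: Sequence[float]   # embedding of the article text
    image_emb: Sequence[float]  # embedding of an image found in the article
    position_score: float       # assumed discourse-positioning score in [0, 1]


def fused_score(
    query_emb: Sequence[float],
    cand: Candidate,
    weights: Tuple[float, float, float] = (0.5, 0.4, 0.1),  # illustrative values
) -> float:
    """Linear combination of the three similarity facets (an assumption)."""
    w_cv, w_vv, w_dp = weights
    content_visual = cosine(query_emb, cand.text_emb)
    visual_visual = cosine(query_emb, cand.image_emb)
    return w_cv * content_visual + w_vv * visual_visual + w_dp * cand.position_score


def rank_articles(
    query_emb: Sequence[float],
    candidates: Sequence[Candidate],
    weights: Tuple[float, float, float] = (0.5, 0.4, 0.1),
) -> List[Candidate]:
    """Return candidates sorted from most to least relevant."""
    return sorted(
        candidates,
        key=lambda c: fused_score(query_emb, c, weights),
        reverse=True,
    )
```

In a real system the weights would be tuned on validation data, and the top-ranked candidates would be passed on to the relevance refinement stage.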
Key takeaway
Machine Learning Engineers building image captioning systems, especially for news media, should consider integrating hierarchical multi-modal retrieval. By exploiting article structure and multi-faceted similarity, the method adds depth to captions through external knowledge. A multi-stage generation pipeline, with a VLM for the initial description and an LLM for the final context-rich caption, helps overcome the limitations of purely visual approaches.
Key insights
The framework uses hierarchical multi-modal retrieval and LLM integration to generate context-rich news image captions from external knowledge.
Principles
- External knowledge enhances image captioning depth.
- Article structure and visual context improve retrieval.
- Multi-stage generation refines caption quality.
Method
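Articles are retrieved with a hierarchical multi-modal mechanism that considers article structure and visual placement patterns, and the candidates are then refined for contextual relevance. A VLM generates a concise image description, which guides the segmentation of relevant knowledge from the retrieved articles. An LLM then combines the description and the extracted knowledge into the final caption.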
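The stages of the method can be sketched as a pipeline with pluggable components. Everything here is hypothetical scaffolding: `generate_caption`, its parameters, and `keyword_segment` are illustrative names, the retrieval, VLM, and LLM steps are passed in as plain callables, and the keyword-overlap segmentation is a simple stand-in rather than the technique used in the paper.

```python
from typing import Callable, List


def keyword_segment(description: str, article: str) -> List[str]:
    """Stand-in segmentation: keep sentences sharing a keyword with the description."""
    keywords = {w.lower() for w in description.split() if len(w) > 3}
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    return [
        s + "."
        for s in sentences
        if keywords & {w.lower() for w in s.split()}
    ]


def generate_caption(
    image: str,
    retrieve: Callable[[str, int], List[str]],   # retrieval + refinement (assumed interface)
    describe: Callable[[str], str],              # VLM: image -> concise description
    segment: Callable[[str, str], List[str]],    # description-guided knowledge extraction
    compose: Callable[[str, List[str]], str],    # LLM: description + knowledge -> caption
    top_k: int = 3,
) -> str:
    """Run the stages in order and return the final caption."""
    articles = retrieve(image, top_k)
    description = describe(image)
    knowledge = [
        sentence
        for article in articles
        for sentence in segment(description, article)
    ]
    return compose(description, knowledge)


if __name__ == "__main__":
    # Toy stubs in place of real retrieval and model calls.
    caption = generate_caption(
        "photo.jpg",
        retrieve=lambda img, k: ["The summit opened in Geneva. Weather was mild."][:k],
        describe=lambda img: "leaders at a summit",
        segment=keyword_segment,
        compose=lambda desc, facts: desc + " | " + " ".join(facts),
    )
    print(caption)
```

Keeping each stage behind a callable makes it possible to swap a retrieval backend or model without touching the control flow.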
In practice
- Integrate structured article retrieval for richer captions.
- Combine VLM and LLM for multi-stage captioning.
- Use the OpenEvent-V1 dataset for news captioning.
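One way to approach structured article retrieval is to pool per-component embeddings with structure-dependent weights, so that a headline counts more than a body section. The sketch below is a minimal illustration under that assumption: the function name, the component names, and the 3:1 weighting in the usage note are hypothetical, not values from the paper.

```python
from typing import Dict, List, Sequence


def weighted_article_embedding(
    components: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of component embeddings (e.g. headline, body sections).

    Components without an entry in `weights` get weight 0.0.
    """
    if not components:
        raise ValueError("at least one component is required")
    total = sum(weights.get(name, 0.0) for name in components)
    if total <= 0:
        raise ValueError("weights for the given components must sum to a positive value")
    dim = len(next(iter(components.values())))
    pooled = [0.0] * dim
    for name, vec in components.items():
        if len(vec) != dim:
            raise ValueError(f"component {name!r} has a mismatched dimension")
        w = weights.get(name, 0.0) / total  # normalize so the weights sum to 1
        for i, x in enumerate(vec):
            pooled[i] += w * x
    return pooled
```

For example, `weighted_article_embedding({"headline": [1.0, 0.0], "body": [0.0, 1.0]}, {"headline": 3.0, "body": 1.0})` pools the two vectors with a 3:1 emphasis on the headline.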
Topics
- Image Captioning
- Multi-Modal Retrieval
- Knowledge-Grounded Generation
- Large Language Models
- Vision-Language Models
- News Media AI
- OpenEvent-V1 Dataset
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.