Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

2026-06-17 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing · Depth: Expert, medium

Summary

"Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning" presents a retrieval-augmented framework for news images, where captions based on pixels alone miss the surrounding context. The framework draws on external knowledge to supply details that are not directly visible in the image, such as object attributes and event context. Its hierarchical multi-modal article retrieval mechanism analyzes article structure, weighting textual components such as headlines and body sections and taking visual placement patterns into account. Candidate articles are scored with multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning) and then passed through a contextual relevance refinement stage. For generation, a Vision-Language Model (VLM) first produces a concise image description, which guides the segmentation of relevant information from the retrieved articles. A Large Language Model (LLM) then combines the description with the extracted knowledge to produce a comprehensive, contextually detailed caption. The framework placed 5th with a score of 0.2824 on the OpenEvent-V1 dataset in the ACM Multimedia EVENTA 2025 Challenge.
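As a rough illustration of how the three similarity facets might be fused into one article score, here is a minimal sketch. It is not the authors' implementation: the weights, the position heuristic, the field names text_vec and image_vecs, and the assumption that image and text embeddings share one vector space (as with a CLIP-style encoder) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def position_score(index, total):
    """Hypothetical discourse-positioning heuristic: earlier images score higher."""
    if total <= 1:
        return 1.0
    return 1.0 - index / (total - 1)


def article_score(query_vec, article, weights=(0.5, 0.4, 0.1)):
    """Weighted sum of content-visual, visual-visual, and positioning facets.

    The weights are illustrative assumptions, not values from the paper.
    """
    w_cv, w_vv, w_pos = weights
    # Content-visual: query image embedding vs. article text embedding.
    content_visual = cosine(query_vec, article["text_vec"])
    # Visual-visual: best match between the query and the article's own images.
    sims = [cosine(query_vec, v) for v in article["image_vecs"]]
    if sims:
        best = max(range(len(sims)), key=sims.__getitem__)
        visual_visual = sims[best]
        # Positioning: where the best-matching image sits in the article.
        positioning = position_score(best, len(sims))
    else:
        visual_visual = positioning = 0.0
    return w_cv * content_visual + w_vv * visual_visual + w_pos * positioning
```

Ranking a candidate pool is then a matter of sorting articles by article_score in descending order. A learned fusion or a re-ranking model could replace the fixed weights.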

Key takeaway

Machine Learning Engineers building image captioning systems, especially for news media, should consider integrating hierarchical multi-modal retrieval. By exploiting article structure and multi-faceted similarity, the method brings external knowledge into the caption and adds context that the image alone cannot provide. A multi-stage generation pipeline, with a VLM for the initial description and an LLM for the final context-rich caption, helps overcome the limitations of purely visual approaches.

Key insights

The framework uses hierarchical multi-modal retrieval and LLM integration to generate context-rich news image captions from external knowledge.

Principles

External knowledge enhances image captioning depth.
Article structure and visual context improve retrieval.
Multi-stage generation refines caption quality.
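To make the idea of weighted textual components concrete, the sketch below scores an article against a text query by token overlap, giving the headline more weight than the body. The field names (headline, lead, body), the weights, and the use of plain token overlap are assumptions for illustration only. The summary does not specify how the paper weights these components.

```python
import re

# Hypothetical weights: headline matches count more than body matches.
FIELD_WEIGHTS = {"headline": 3.0, "lead": 2.0, "body": 1.0}


def tokens(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def structured_score(query, article, field_weights=FIELD_WEIGHTS):
    """Weighted fraction of query tokens found in each article field.

    Returns a value in [0, 1]. Missing fields contribute zero.
    """
    query_tokens = tokens(query)
    if not query_tokens:
        return 0.0
    total_weight = sum(field_weights.values())
    score = 0.0
    for field, weight in field_weights.items():
        field_tokens = tokens(article.get(field, ""))
        overlap = len(query_tokens & field_tokens) / len(query_tokens)
        score += weight * overlap
    return score / total_weight
```

With these weights, an article that mentions the query terms in its headline outranks one that mentions them only in the body, which is the behaviour a structure-aware retriever is meant to capture.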

Method

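Retrieve candidate articles with hierarchical multi-modal retrieval that considers article structure and visual placement patterns. Refine the candidates for contextual relevance. Generate a concise image description with the VLM. Use that description to segment relevant knowledge from the retrieved articles. Generate the final caption with the LLM from the description and the extracted knowledge.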
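The stages above can be sketched as a small orchestration function. Every name here is hypothetical, and the toy stand-ins below replace the real retriever, VLM, and LLM with simple keyword logic so that the flow is runnable. It shows the order of the stages, not the paper's models or prompts.

```python
def caption_pipeline(image, articles, retrieve, refine, describe, segment,
                     compose, top_k=3):
    """Run the five stages in order; each stage is an injected callable."""
    candidates = retrieve(image, articles)[:top_k]   # hierarchical retrieval
    relevant = refine(image, candidates)             # relevance refinement
    description = describe(image)                    # VLM stand-in
    facts = [s for a in relevant for s in segment(description, a)]
    return compose(description, facts)               # LLM stand-in


# Toy stand-ins (assumptions for illustration; no stopword filtering).
def toy_retrieve(image, articles):
    def hits(article):
        text = article["text"].lower()
        return sum(tag in text for tag in image["tags"])
    return sorted(articles, key=hits, reverse=True)


def toy_refine(image, candidates):
    return [a for a in candidates
            if any(tag in a["text"].lower() for tag in image["tags"])]


def toy_describe(image):
    return "A photo showing " + " and ".join(image["tags"]) + "."


def toy_segment(description, article):
    # Keep sentences that share at least one word with the description.
    words = set(description.lower().replace(".", "").split())
    sentences = [s.strip() for s in article["text"].split(".") if s.strip()]
    return [s + "." for s in sentences if words & set(s.lower().split())]


def toy_compose(description, facts):
    if not facts:
        return description
    return description + " Context: " + " ".join(facts)


image = {"tags": ["flood", "boat"]}
articles = [
    {"text": "Markets rose on Tuesday. Traders cheered."},
    {"text": "The flood hit the town on Monday. Officials met later."},
]
caption = caption_pipeline(image, articles, toy_retrieve, toy_refine,
                           toy_describe, toy_segment, toy_compose)
```

Passing the stages in as callables keeps the orchestration independent of any particular model, so a real retriever, VLM, or LLM client can be swapped in without changing caption_pipeline.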

In practice

Integrate structured article retrieval for richer captions.
Combine VLM and LLM for multi-stage captioning.
Use the OpenEvent-V1 dataset to evaluate news image captioning.

Topics

Image Captioning
Multi-Modal Retrieval
Knowledge-Grounded Generation
Large Language Models
Vision-Language Models
News Media AI
OpenEvent-V1 Dataset

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.