Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing · Depth: Expert, medium

Summary

The paper "Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning" presents a retrieval-augmented captioning framework that addresses the limitations of traditional methods in generating context-rich descriptions for news images. The framework draws on external knowledge to supply details that are not directly visible in the image, such as object attributes and event context. Its hierarchical multi-modal article retrieval mechanism analyzes article structure, weighting textual components such as headlines and body sections and taking visual placement patterns into account. The system also uses multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning) followed by a contextual relevance refinement stage. For generation, a Vision-Language Model (VLM) first creates a concise image description, which then guides the segmentation of relevant information from the retrieved articles. Finally, a Large Language Model (LLM) combines this description with the extracted knowledge to produce a comprehensive, contextually detailed caption. The framework placed 5th with a score of 0.2824 on the OpenEvent-V1 dataset in the ACM Multimedia EVENTA 2025 Challenge.
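To make the retrieval scoring more concrete, the following is a minimal sketch of how weighted textual components and the three similarity facets might be fused into one ranking score. It is not the authors' implementation. The `Article` fields, the weight values, the linear fusion, and the reading of discourse positioning as "images placed nearer the top of an article score higher" are all assumptions for illustration, as is the premise that text and image embeddings share one vector space.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Article:
    """Hypothetical article record; all vectors are assumed to live in one shared space."""
    article_id: str
    headline_vec: List[float]
    body_vec: List[float]
    image_vec: List[float]
    image_position: float  # assumed convention: 0.0 = top of article, 1.0 = bottom


# Hypothetical weights; the source does not state the actual values.
COMPONENT_WEIGHTS = {"headline": 0.6, "body": 0.4}
FACET_WEIGHTS = {"content_visual": 0.5, "visual_visual": 0.4, "discourse": 0.1}


def score_article(query_image_vec: List[float], article: Article) -> float:
    """Fuse the three similarity facets into a single score for one article."""
    # Content-visual: query image against weighted textual components.
    content_visual = (
        COMPONENT_WEIGHTS["headline"] * cosine(query_image_vec, article.headline_vec)
        + COMPONENT_WEIGHTS["body"] * cosine(query_image_vec, article.body_vec)
    )
    # Visual-visual: query image against the article's own image.
    visual_visual = cosine(query_image_vec, article.image_vec)
    # Discourse positioning: assumed proxy that favors images placed early.
    discourse = 1.0 - article.image_position
    return (
        FACET_WEIGHTS["content_visual"] * content_visual
        + FACET_WEIGHTS["visual_visual"] * visual_visual
        + FACET_WEIGHTS["discourse"] * discourse
    )


def rank_articles(
    query_image_vec: List[float], articles: List[Article], top_k: int = 5
) -> List[Tuple[float, str]]:
    """Return the top_k (score, article_id) pairs, highest score first."""
    scored = [(score_article(query_image_vec, a), a.article_id) for a in articles]
    scored.sort(reverse=True)
    return scored[:top_k]
```

In a real system the vectors would come from a multi-modal encoder, and the contextual relevance refinement stage described above would re-rank the top candidates that this first pass returns.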

Key takeaway

Machine Learning Engineers building image captioning systems, especially for news media, should consider integrating hierarchical multi-modal retrieval. By exploiting article structure and multi-faceted similarity, the method adds depth to captions through external knowledge that the image alone cannot supply. A multi-stage generation pipeline, with a VLM for the initial description and an LLM for the final context-rich caption, helps overcome the limitations of purely visual approaches.

Key insights

The framework uses hierarchical multi-modal retrieval and LLM integration to generate context-rich news image captions from external knowledge.

Principles

Method

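Retrieve candidate articles with hierarchical multi-modal retrieval that considers article structure and visual patterns, then refine the candidates for contextual relevance. A VLM generates a concise image description, which guides the segmentation of relevant knowledge from the retrieved articles. An LLM then generates the final caption from the description and the extracted knowledge.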
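The stages of the method can be sketched as a single orchestration function. This is only an illustrative skeleton: the function name `caption_news_image`, its parameters, and the `top_k` cutoff are hypothetical, and each stage is passed in as a callable because the source does not specify which retriever, VLM, or LLM is used.

```python
from typing import Any, Callable, List


def caption_news_image(
    image: Any,
    retrieve: Callable[[Any], List[Any]],
    refine: Callable[[Any, List[Any]], List[Any]],
    describe: Callable[[Any], str],
    segment: Callable[[str, Any], List[str]],
    generate: Callable[[str, List[str]], str],
    top_k: int = 3,
) -> str:
    """Run the five stages in order; every stage is a caller-supplied callable."""
    # 1. Hierarchical multi-modal retrieval returns candidate articles.
    candidates = retrieve(image)
    # 2. Contextual relevance refinement re-orders candidates; keep the best top_k.
    articles = refine(image, candidates)[:top_k]
    # 3. The VLM produces a concise description of the image.
    description = describe(image)
    # 4. The description guides which passages are extracted from each article.
    knowledge = [
        passage for article in articles for passage in segment(description, article)
    ]
    # 5. The LLM fuses the description and extracted knowledge into the caption.
    return generate(description, knowledge)
```

Keeping the stages as injected callables makes it possible to swap models or test each step with simple stubs before wiring in real components.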

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.