Proxy-Pointer RAG: Multimodal Answers Without Multimodal Embeddings

Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, long

Summary

The article introduces an open-source MultiModal Proxy-Pointer RAG pipeline that lets enterprise chatbots reliably return images grounded in source documents, closing a significant gap in current text-only RAG systems. Where traditional RAG treats a document as a "bag of words," this pipeline treats it as a hierarchical tree of semantic blocks, which allows accurate image retrieval without multimodal embeddings. The prototype was built on five AI research papers containing 270 images and reached 95% image-retrieval accuracy on a 20-question benchmark. It uses the Adobe PDF Extract API for PDF parsing, `gemini-embedding-001` for text embeddings, and `gemini-3.1-flash-lite-preview` for LLM tasks (noise filtering, re-ranking, and synthesis). The core innovation is structure-guided chunking combined with pointer-based context: images are selected using the full context of a section rather than fragmented captions or ambiguous multimodal similarity.

Key takeaway

AI engineers building enterprise RAG systems can use the MultiModal Proxy-Pointer RAG pipeline to give chatbots accurate, context-grounded image responses. Teams should consider adopting this open-source, structure-aware approach to overcome the limitations of traditional chunk-based RAG, so that visual evidence stays aligned with its semantic context and users can trust multimodal answers.

Key insights

Multimodal RAG success hinges on aligning retrieval with document structure, not just embedding similarity.

Method

The MultiModal Proxy-Pointer RAG pipeline builds a hierarchical document tree, injects breadcrumbs, performs structure-guided chunking, filters noise, and uses retrieved chunks as pointers to load full sections for LLM synthesis and context-aware image selection.
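The tree, breadcrumb, chunking, and pointer steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's implementation: the names `Section`, `Chunk`, `build_index`, and `retrieve_sections` are hypothetical, a toy word-overlap score stands in for `gemini-embedding-001` similarity, and PDF parsing, noise filtering, LLM re-ranking, and synthesis are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of the document tree: a heading, its text, and its images."""
    title: str
    text: str = ""
    images: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)


@dataclass
class Chunk:
    """A small retrieval unit that points back to the section it came from."""
    text: str          # breadcrumb plus a slice of the section body
    section_path: str  # the pointer: breadcrumb of the owning section


def walk(node, trail=()):
    """Yield (breadcrumb, section) pairs for every node in the tree."""
    path = trail + (node.title,)
    yield " > ".join(path), node
    for child in node.children:
        yield from walk(child, path)


def build_index(root, max_words=40):
    """Chunk each section separately so no chunk crosses a section boundary."""
    sections, chunks = {}, []
    for crumb, sec in walk(root):
        sections[crumb] = sec
        words = sec.text.split()
        for i in range(0, len(words), max_words):
            body = " ".join(words[i:i + max_words])
            chunks.append(Chunk(f"[{crumb}] {body}", crumb))
    return sections, chunks


def score(query, text):
    """Toy lexical overlap (assumption); a real system would compare embeddings."""
    q = set(query.lower().split())
    t = set(text.lower().replace("[", " ").replace("]", " ").split())
    return len(q & t)


def retrieve_sections(query, sections, chunks, top_k=2):
    """Rank chunks, then follow their pointers to de-duplicated full sections."""
    ranked = sorted(chunks, key=lambda c: score(query, c.text), reverse=True)
    seen, result = set(), []
    for c in ranked:
        if c.section_path not in seen:
            seen.add(c.section_path)
            result.append((c.section_path, sections[c.section_path]))
        if len(result) == top_k:
            break
    return result


# Hypothetical two-section document with one image per section.
paper = Section("Paper", children=[
    Section("Architecture",
            "The encoder stacks attention layers with residual connections.",
            images=["fig1_encoder.png"]),
    Section("Results",
            "Accuracy improves on the benchmark as model size grows.",
            images=["fig2_accuracy.png"]),
])
sections, chunks = build_index(paper)
hits = retrieve_sections("how does the encoder use attention layers",
                         sections, chunks, top_k=1)
```

Each hit carries the whole section, including its image list, so a downstream LLM step would choose images with the full section in view rather than from an isolated caption.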

Best for: AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.