Proxy-Pointer RAG: Multimodal Answers Without Multimodal Embeddings
Summary
The article introduces an open-source MultiModal Proxy-Pointer RAG pipeline that lets enterprise chatbots reliably return images grounded in source documents, a significant gap in current text-only RAG systems. Traditional RAG treats a document as a flat "bag of words"; this pipeline treats it as a hierarchical tree of semantic blocks, which allows accurate image retrieval without multimodal embeddings. The prototype covers five AI research papers containing 270 images and achieved 95% image-retrieval accuracy on a 20-question benchmark. It uses the Adobe PDF Extract API for PDF parsing, `gemini-embedding-001` for text embeddings, and `gemini-3.1-flash-lite-preview` for LLM tasks, including noise filtering, re-ranking, and synthesis. The core innovation is structure-guided chunking combined with pointer-based context: images are selected using the full section they belong to rather than fragmented captions or ambiguous multimodal similarity.
Key takeaway
AI engineers building enterprise RAG systems can use the MultiModal Proxy-Pointer RAG pipeline to give chatbots accurate, context-grounded image responses. Teams should consider this open-source, structure-aware approach where traditional chunk-based RAG falls short: it keeps visual evidence aligned with its semantic context, which can improve user trust in multimodal interactions.
Key insights
Multimodal RAG success hinges on aligning retrieval with document structure, not just embedding similarity.
Principles
- Document structure is key for visual coherence.
- Chunking by semantic units prevents image misalignment.
- Contextual image selection outperforms direct similarity.
Method
The MultiModal Proxy-Pointer RAG pipeline builds a hierarchical document tree, injects breadcrumbs, performs structure-guided chunking, filters noise, and uses retrieved chunks as pointers to load full sections for LLM synthesis and context-aware image selection.
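The first three steps (tree, breadcrumbs, structure-guided chunking) can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the `Section` and `Chunk` schemas, the `build_chunks` helper, the `[A > B]` breadcrumb format, and the `max_chars` limit are all hypothetical, and the tree is built by hand rather than parsed from a PDF.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node in the document tree (hypothetical schema)."""
    node_id: str
    title: str
    text: str = ""
    image_paths: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)


@dataclass
class Chunk:
    node_id: str  # pointer back to the section that owns this chunk
    text: str     # breadcrumb plus body slice; this is what would be embedded


def build_chunks(node, ancestors=(), max_chars=200):
    """Walk the tree depth-first and emit breadcrumb-prefixed chunks.

    Chunks never cross a section boundary, so each one maps to exactly one node.
    """
    path = (*ancestors, node.title)
    breadcrumb = " > ".join(path)
    chunks, buffer = [], ""
    # Pack whole paragraphs into a chunk until max_chars would be exceeded.
    for para in filter(None, (p.strip() for p in node.text.split("\n\n"))):
        if buffer and len(buffer) + len(para) + 2 > max_chars:
            chunks.append(Chunk(node.node_id, f"[{breadcrumb}]\n{buffer}"))
            buffer = para
        else:
            buffer = f"{buffer}\n\n{para}" if buffer else para
    if buffer:
        chunks.append(Chunk(node.node_id, f"[{breadcrumb}]\n{buffer}"))
    for child in node.children:
        chunks.extend(build_chunks(child, path, max_chars))
    return chunks


# Toy document: a root with two sections, one of which owns an image.
doc = Section("0", "Paper", children=[
    Section("0.1", "Method", text="We build a tree.\n\nWe chunk by section.",
            image_paths=["figs/arch.png"]),
    Section("0.2", "Results", text="Accuracy improves."),
])
chunks = build_chunks(doc)
```

Because every chunk carries its section's `node_id`, a text-only embedding hit is enough to locate the owning section and any images stored on it.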
In practice
- Use hierarchical document parsing for RAG.
- Store image paths within document sections.
- Implement a vision filter for image refinement.
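The practices above could fit together at query time roughly like this. The sketch assumes a hypothetical in-memory `SECTIONS` store keyed by node ID, a `resolve_pointers` helper, and a `keep` callback standing in for the vision filter; a real system would call a model there, and none of these names come from the article.

```python
from typing import Callable

# Hypothetical section store: node_id -> full section text plus the image
# paths recorded for that section at parse time.
SECTIONS = {
    "0.1": {"title": "Method", "text": "Full method section ...",
            "image_paths": ["figs/arch.png", "figs/logo.png"]},
    "0.2": {"title": "Results", "text": "Full results section ...",
            "image_paths": ["figs/accuracy.png"]},
}


def resolve_pointers(ranked_node_ids, sections, max_sections=2):
    """Turn ranked chunk hits into unique full sections, preserving rank order."""
    seen, resolved = set(), []
    for node_id in ranked_node_ids:
        if node_id in seen or node_id not in sections:
            continue
        seen.add(node_id)
        resolved.append({"node_id": node_id, **sections[node_id]})
        if len(resolved) == max_sections:
            break
    return resolved


def candidate_images(resolved, keep: Callable[[str, dict], bool]):
    """Collect image paths from the resolved sections, then apply a filter.

    `keep` is a placeholder for the vision filter (an assumption of this sketch).
    """
    return [path for section in resolved
            for path in section["image_paths"] if keep(path, section)]


# Several chunk hits from the same section collapse into one pointer.
hits = ["0.1", "0.1", "0.2", "0.1"]
context = resolve_pointers(hits, SECTIONS)
# Toy filter: drop anything that looks like a logo.
images = candidate_images(context, keep=lambda path, _section: "logo" not in path)
```

Keeping the filter as a plain callable means the retrieval logic can be tested without any model call, and the filter can be swapped out independently.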
Topics
- Proxy-Pointer RAG
- Multimodal Retrieval
- Document Structure
- Semantic Chunking
- Text-only Embeddings
Best for: AI Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.