Benchmarking RAG Architectures Locally on a Real Financial PDF — Part 3: The Measurement Problem
Summary
Part 3 of this benchmarking series identifies a "measurement problem" that arises when multimodal RAG systems are evaluated on a real financial PDF. The study compared six architectures, spanning text-based, graph, and vision RAG, on 57 questions scored by a local qwen3:14b judge. Vision RAG, using a small local VLM (qwen3-vl:8b), reached 0.823 answer correctness against 0.544 for the best text pipeline. The gap was widest on chart questions, where Vision RAG scored 0.812 and the text pipeline 0.391, roughly half as much. The analysis argues that standard faithfulness metrics, which check answers against extracted text, can misrepresent multimodal RAG performance: they penalize correct answers derived from visual information that is not present in the text.
Key takeaway
If you evaluate RAG systems as an MLOps engineer or AI scientist, be aware that standard faithfulness metrics can rank multimodal architectures misleadingly. Prioritize answer correctness against an independent reference answer, especially for documents that contain non-textual elements such as charts. Relying solely on text-grounded metrics will misrepresent your system's true accuracy and may lead you to deploy a less effective solution. Match your evaluation strategy to what your RAG architecture can actually read.
Key insights
Multimodal RAG evaluation requires answer correctness against independent references, as text-grounded metrics misrepresent performance.
Principles
- Evaluation metrics must match architecture.
- Text-grounded metrics misrepresent multimodal RAG.
- Chart content is poorly captured by text extraction; vision models recover much more of it.
Method
The study benchmarked six RAG architectures (spanning text, graph, and vision approaches) on 57 questions from a financial PDF, scored by a local qwen3:14b judge, and broke results down by question type (text, table, chart).
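The per-question-type breakdown described above can be sketched as a simple grouping step. This is a minimal illustration, not the study's code: the record fields (`architecture`, `question_type`, `score`) and the toy scores are hypothetical, and it assumes the judge returns one numeric correctness score per question.

```python
from collections import defaultdict
from statistics import mean


def correctness_by_type(records):
    """Average judge scores per (architecture, question_type) pair."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["architecture"], r["question_type"])].append(r["score"])
    return {key: mean(scores) for key, scores in buckets.items()}


# Hypothetical toy records for illustration only; NOT the study's data.
records = [
    {"architecture": "vision", "question_type": "chart", "score": 1.0},
    {"architecture": "vision", "question_type": "chart", "score": 0.5},
    {"architecture": "text", "question_type": "chart", "score": 0.0},
    {"architecture": "text", "question_type": "chart", "score": 0.5},
    {"architecture": "text", "question_type": "table", "score": 1.0},
]

summary = correctness_by_type(records)
for (arch, qtype), avg in sorted(summary.items()):
    print(f"{arch:>6} / {qtype:<5} mean correctness = {avg:.2f}")
```

Reporting one row per architecture and question type, rather than a single overall mean, is what makes a chart-specific gap visible in the first place.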
In practice
- Use answer correctness for multimodal RAG.
- Treat faithfulness as a text-channel diagnostic.
- Consider local VLMs for document parsing.
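To make the first two points concrete, here is a toy sketch of why a text-grounded score and a reference-based score can disagree. It assumes crude token-overlap proxies in place of the LLM judge the study used; the function names, the caption, the answer, and the reference value are all hypothetical.

```python
import re


def tokens(text):
    """Lowercased word and number tokens; keeps decimals such as 12.4 whole."""
    return set(re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower()))


def text_grounding(answer, extracted_text):
    """Crude faithfulness proxy: share of answer tokens found in the extracted text."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(extracted_text)) / len(answer_tokens)


def matches_reference(answer, reference):
    """Crude correctness proxy: every reference token appears in the answer."""
    return tokens(reference) <= tokens(answer)


# Hypothetical example: the extracted text holds only the chart caption,
# while the value itself is visible only in the chart image.
extracted_text = "Figure 4: Quarterly revenue by segment."
answer = "Revenue in Q3 was 12.4 million"
reference = "12.4 million"

grounding = text_grounding(answer, extracted_text)
correct = matches_reference(answer, reference)
print(f"text grounding = {grounding:.2f}, matches reference = {correct}")
```

In this toy case the answer agrees with the reference but shares almost no tokens with the extracted text, so a text-grounded metric scores it poorly. A low score of that kind says the text channel did not contain the evidence, not that the answer is wrong.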
Topics
- RAG Architectures
- Multimodal RAG
- Evaluation Metrics
- Vision Language Models
- Financial Documents
- Benchmarking
Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.