Benchmarking RAG Architectures Locally on a Real Financial PDF — Part 3: The Measurement Problem

2026-06-21 · Source: LLM on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, long

Summary

Part 3 of a series benchmarking RAG architectures on a real financial PDF identifies a "measurement problem" in evaluating multimodal systems. The study compared six architectures, spanning text-based, graph, and vision RAG, on 57 questions with a local qwen3:14b judge. Vision RAG, using a small local VLM (qwen3-vl:8b), reached 0.823 answer correctness against 0.544 for the best text pipeline. The gap was widest on chart questions, where Vision RAG scored 0.812 and the text pipeline 0.391. The analysis argues that standard faithfulness metrics, which compare answers against extracted text, often misrepresent multimodal RAG performance because they penalize correct answers derived from visual information that is absent from the text.

Key takeaway

If you evaluate RAG systems as an MLOps engineer or AI scientist, be aware that standard faithfulness metrics can rank multimodal architectures misleadingly. Prioritize answer correctness against an independent reference answer, especially when documents contain non-textual elements such as charts. Relying solely on text-grounded metrics can misrepresent a system's true accuracy and lead you to deploy a less effective solution. Match your evaluation strategy to what your RAG architecture can actually read.

Key insights

Multimodal RAG evaluation requires answer correctness against independent references, as text-grounded metrics misrepresent performance.

Principles

Evaluation metrics must match architecture.
Text-grounded metrics misrepresent multimodal RAG.
Information that appears only in charts is missing from extracted text, so recovering it requires a vision model.

Method

The study benchmarked six RAG architectures (text, graph, vision) on 57 questions from a financial PDF using a local qwen3:14b judge, breaking results by question type (text, table, chart).
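
The per-question-type breakdown described above can be sketched as a simple group-and-average step. This is a minimal illustration, not the study's code: the record fields (`architecture`, `question_type`, `correctness`), the helper name `breakdown_by_type`, and the sample scores are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def breakdown_by_type(records):
    """Average answer correctness per (architecture, question type) pair."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["architecture"], r["question_type"])].append(r["correctness"])
    return {key: round(mean(scores), 3) for key, scores in buckets.items()}


# Hypothetical per-question judge scores, NOT the study's data.
records = [
    {"architecture": "vision", "question_type": "chart", "correctness": 1.0},
    {"architecture": "vision", "question_type": "chart", "correctness": 0.5},
    {"architecture": "text", "question_type": "chart", "correctness": 0.0},
    {"architecture": "text", "question_type": "chart", "correctness": 0.5},
    {"architecture": "text", "question_type": "table", "correctness": 1.0},
]

summary = breakdown_by_type(records)
for (architecture, question_type), score in sorted(summary.items()):
    print(f"{architecture:<8}{question_type:<8}{score:.3f}")
```

Reporting one number per architecture and question type, rather than a single overall mean, is what makes a chart-specific gap visible in the first place.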

In practice

Use answer correctness for multimodal RAG.
Treat faithfulness as a diagnostic for the text channel, not as a ranking metric.
Consider local VLMs for document parsing.
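
One way to use faithfulness as a diagnostic rather than a ranking metric is to flag answers that the judge marks correct but that score low on text-grounded faithfulness, since those are candidates for information that came from the visual channel. The sketch below assumes hypothetical field names, thresholds, and sample rows; none of them come from the original article.

```python
def flag_metric_disagreement(rows, correct_min=0.8, faithful_max=0.5):
    """Return ids of answers judged correct but scored low on faithfulness.

    Both thresholds are illustrative assumptions, not values from the study.
    """
    return [
        r["id"]
        for r in rows
        if r["correctness"] >= correct_min and r["faithfulness"] <= faithful_max
    ]


# Hypothetical evaluation rows, NOT the study's data.
rows = [
    {"id": "q1", "correctness": 0.9, "faithfulness": 0.9},  # correct and text-grounded
    {"id": "q2", "correctness": 1.0, "faithfulness": 0.2},  # correct, but not in extracted text
    {"id": "q3", "correctness": 0.1, "faithfulness": 0.1},  # simply wrong
]

flagged = flag_metric_disagreement(rows)
print(flagged)
```

Flagged items still need manual review: a low faithfulness score on a correct answer may point to a chart-derived fact, but it could equally reflect a judge error.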

Topics

RAG Architectures
Multimodal RAG
Evaluation Metrics
Vision Language Models
Financial Documents
Benchmarking

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.