Making a PDF’s Images Searchable for RAG, Without Paying to Read Them All
Summary
This article describes a cost-ordered cascade that makes the images inside a PDF searchable for Retrieval-Augmented Generation (RAG). Starting from the "image_df" produced by document parsing, the cascade converts only the relevant images into descriptive text, so that no paid call is made for an image that does not need one. A cheap filter first discards images by size, shape, or repetition (e.g., logos). The remaining images are classified from pixel statistics as "decorative", "text", "chart", "diagram", or "photo". "Decorative" images are skipped. "Text" images, such as screenshots or scanned tables, go to classic OCR (e.g., EasyOCR), which runs locally and for free; a confidence check sends low-confidence reads to a vision model instead. Only "chart", "diagram", and "photo" images, where the meaning is genuinely visual, are sent to a vision LLM, which is the costliest step. The resulting descriptions are written into the document's "line_df", so image content becomes searchable alongside the text.
Key takeaway
For MLOps engineers building enterprise RAG systems, keeping image-processing costs under control is crucial. Implement a multi-stage image analysis cascade so that expensive vision-model calls are not spent on irrelevant content: filter out decorative images, classify the rest by type, and use OCR for text-based images before resorting to a vision LLM for charts, diagrams, and photos. This reduces processing cost and latency while still extracting searchable content from visual elements, improving retrieval quality without overspending.
Key insights
Cost-ordered image analysis for RAG prioritizes cheap filters and OCR before costly vision models.
Principles
- Most PDF images lack retrieval value.
- Classify image types before processing.
- Prioritize cheapest analysis method first.
Method
A cascade filters images by size/shape/repetition, classifies by pixel signals, then dispatches to skip, OCR (with confidence check), or a vision LLM with type-tuned prompts.
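The dispatch step of this cascade can be sketched as follows. This is a minimal illustration, not the original article's code: the function name `describe_image`, the `OCR_MIN_CONFIDENCE` threshold, the prompt wording in `PROMPTS`, and the `ocr` and `vision` callables are all assumptions introduced here. The OCR engine and the vision LLM are passed in as plain functions so that the routing logic stays independent of any particular library.

```python
from typing import Callable, Optional, Tuple

# Hypothetical threshold: the summary does not state a value.
OCR_MIN_CONFIDENCE = 0.6

# Hypothetical type-tuned prompts; the wording is illustrative only.
PROMPTS = {
    "chart": "Describe the chart: axes, series, and the main trend.",
    "diagram": "Describe the diagram: components and how they connect.",
    "photo": "Describe what the photo shows in one or two sentences.",
}

# Assumed interfaces for the two engines.
OcrFn = Callable[[bytes], Tuple[str, float]]  # returns (text, mean confidence)
VisionFn = Callable[[bytes, str], str]        # takes (image bytes, prompt)


def describe_image(
    image: bytes, image_type: str, ocr: OcrFn, vision: VisionFn
) -> Optional[str]:
    """Route one already-classified image to the cheapest adequate step."""
    if image_type == "decorative":
        return None  # skipped: nothing worth indexing

    if image_type == "text":
        text, confidence = ocr(image)
        if text.strip() and confidence >= OCR_MIN_CONFIDENCE:
            return text  # local OCR was good enough, no paid call
        # Low-confidence read: fall back to the vision model.
        return vision(image, "Transcribe the text in this image.")

    # "chart", "diagram", "photo": meaning is visual, so use the vision LLM.
    prompt = PROMPTS.get(image_type, PROMPTS["photo"])
    return vision(image, prompt)
```

With this shape, the vision callable is reached only for visual types and for OCR reads that fail the confidence check, which is where the cost saving of the cascade comes from.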
In practice
- Filter images by size, shape, and content hash.
- Classify images using pixel dispersion and saturation.
- Store image descriptions in "line_df" text slots.
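The first two practices above can be illustrated with a short standard-library sketch. Every threshold, function name, and the coarse "visual" label below is an assumption made for illustration; the summary does not give the article's actual values or rules, and separating charts, diagrams, and photos would need more signals than the two used here. Pixels are assumed to be already decoded into a list of `(r, g, b)` tuples in the 0..255 range.

```python
import colorsys
import hashlib
from collections import Counter
from statistics import pstdev

# All thresholds are hypothetical placeholders, not the article's values.
MIN_SIDE_PX = 50         # smaller images are likely icons or bullets
MAX_ASPECT_RATIO = 8.0   # very thin images are likely rules or banners
MAX_REPEATS = 2          # identical bytes seen more often are likely logos


def count_hashes(all_image_bytes):
    """Count how often each distinct image payload occurs in the document."""
    return Counter(hashlib.sha256(data).hexdigest() for data in all_image_bytes)


def keep_image(width, height, data, hash_counts):
    """Cheap pre-filter on size, shape, and repetition."""
    if min(width, height) < MIN_SIDE_PX:
        return False
    if max(width, height) / min(width, height) > MAX_ASPECT_RATIO:
        return False
    digest = hashlib.sha256(data).hexdigest()
    return hash_counts[digest] <= MAX_REPEATS


def pixel_signals(pixels):
    """Return (brightness dispersion, mean saturation) for RGB pixels."""
    grey = [(r + g + b) / 3 for r, g, b in pixels]
    sats = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1] for r, g, b in pixels]
    return pstdev(grey), sum(sats) / len(sats)


def classify(pixels):
    """Coarse two-signal rule based on dispersion and saturation."""
    dispersion, saturation = pixel_signals(pixels)
    if dispersion < 5:
        return "decorative"  # nearly flat fill
    if saturation < 0.1:
        return "text"        # strong contrast, almost no colour
    return "visual"          # candidate for chart / diagram / photo
```

Both checks touch only image dimensions, raw bytes, and pixel statistics, so they can run on every image before any OCR or vision-model call is considered.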
Topics
- Retrieval-Augmented Generation
- PDF Image Processing
- Cost-Optimized Analysis
- Optical Character Recognition
- Vision LLMs
- Document Intelligence
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.