From 4 Weeks to 45 Minutes: Designing a Document Extraction System for 4,700+ PDFs
Summary
An engineering team faced the challenge of extracting revision numbers (REV values) from over 4,700 PDF engineering drawings for a new asset-management system, a task estimated to cost £8,000 and 160 person-hours if done manually. The drawings presented significant complexity, with 70-80% being text-based and 20-30% image-based, alongside varied REV formats (e.g., "1-0", "A", "AA") and common false positive sources like revision history tables and grid references. A hybrid pipeline was developed, combining a zero-cost, rule-based PyMuPDF extraction for text-based PDFs with GPT-4 Vision via Azure OpenAI for image-based or ambiguous cases. This approach achieved 96% accuracy on a 400-file sample, processed all 4,730 documents in 45 minutes, and incurred only $10-15 in API costs, significantly outperforming a GPT-4-only approach in speed and cost while maintaining acceptable accuracy for the use case.
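The headline figures can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted in the summary; the 40-hour working week is an assumption, introduced solely to relate 160 person-hours to the "4 weeks" in the title. The speedup compares manual person-hours with pipeline wall-clock time, so treat it as a rough ratio rather than a like-for-like measurement.

```python
# Figures quoted in the summary.
DOCUMENTS = 4_730
PIPELINE_MINUTES = 45
MANUAL_PERSON_HOURS = 160
SAMPLE_SIZE = 400
SAMPLE_ACCURACY = 0.96
API_COST_USD = (10, 15)  # low and high end of the quoted range

# Assumption (not from the source): one person working a 40-hour week.
HOURS_PER_WEEK = 40

manual_weeks = MANUAL_PERSON_HOURS / HOURS_PER_WEEK
seconds_per_document = PIPELINE_MINUTES * 60 / DOCUMENTS
speedup = MANUAL_PERSON_HOURS / (PIPELINE_MINUTES / 60)
correct_in_sample = round(SAMPLE_SIZE * SAMPLE_ACCURACY)
cost_per_document = tuple(cost / DOCUMENTS for cost in API_COST_USD)

print(f"Manual effort: {manual_weeks:.0f} weeks")
print(f"Pipeline time per document: {seconds_per_document:.2f} s")
print(f"Rough speedup over manual effort: {speedup:.0f}x")
print(f"Correct extractions in the sample: {correct_in_sample} of {SAMPLE_SIZE}")
print(f"API cost per document: ${cost_per_document[0]:.4f} to ${cost_per_document[1]:.4f}")
```

The per-document API cost is averaged over all 4,730 files, although only the image-based or ambiguous share actually reaches the model.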
Key takeaway
AI engineers building data extraction solutions should design hybrid systems that try deterministic methods before resorting to costly LLMs. Validate against a large, representative dataset to uncover production-level complexities such as document rotation and prompt bias, so the system strikes the balance of accuracy, cost, and speed that stakeholders need.
Key insights
A hybrid AI and rule-based system efficiently extracts data from complex PDFs, balancing cost, speed, and accuracy.
Principles
- Prioritize deterministic methods over LLMs when possible.
- Validate at scale to uncover real-world edge cases.
- Treat prompt engineering as a critical software component.
Method
A two-stage pipeline first attempts rule-based text extraction with PyMuPDF, then falls back to GPT-4 Vision for image-based or ambiguous PDFs. The pipeline also has to cope with rotated documents and prompt-induced hallucination.
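The routing decision between the two stages might look like the sketch below. It is illustrative only: the regular expression, the function names, the "exactly one candidate" rule, and the choice to read only the first page are all assumptions, not the original article's rules. The pattern covers the REV formats mentioned in the summary ("1-0", "A", "AA"), and any page with no text or with conflicting candidates (for example, from a revision history table) is sent to the vision stage.

```python
import re
from typing import Optional

# Hypothetical pattern: a REV label followed by "1-0"-style or one/two-letter values.
REV_PATTERN = re.compile(
    r"\bREV(?:ISION)?\.?\s*[:\-]?\s*([A-Z]{1,2}|\d{1,2}-\d{1,2})\b"
)


def extract_rev_from_text(text: str) -> Optional[str]:
    """Return the REV value only when exactly one distinct candidate is found."""
    candidates = set(REV_PATTERN.findall(text))
    if len(candidates) == 1:
        return candidates.pop()
    return None  # nothing found, or conflicting values: treat as ambiguous


def choose_stage(text: str) -> str:
    """Pick 'rules' for confident text hits, otherwise fall back to 'vision'."""
    if not text.strip():
        return "vision"  # no text layer, so the PDF is likely image-based
    return "rules" if extract_rev_from_text(text) is not None else "vision"


def first_page_text(pdf_path: str) -> str:
    """Read the first page's text layer with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return doc[0].get_text()
```

For example, `choose_stage("TITLE BLOCK REV: 1-0")` selects the rule-based stage, while an empty string or text containing both `REV A` and `REV B` is routed to the vision stage.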
In practice
- Use PyMuPDF for initial, zero-cost text extraction.
- Render PDF pages at 150 DPI for GPT-4 Vision.
- Diversify prompt examples to prevent model memorization.
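The last two points can be sketched as follows. This is a minimal illustration under stated assumptions: the example pool, the prompt wording, and both function names are hypothetical, and the Azure OpenAI request itself is deliberately omitted. The prompt builder samples its format examples from a varied pool so that no single value is repeated often enough for the model to echo it back, and the renderer uses PyMuPDF's `get_pixmap` at 150 DPI to produce a PNG for the vision model.

```python
import base64
import random
from typing import Optional, Sequence

# Illustrative pool spanning the REV formats mentioned in the summary.
EXAMPLE_REVS = ["1-0", "2-0", "3-1", "A", "C", "F", "AA", "AB"]


def build_prompt(
    examples: Sequence[str] = EXAMPLE_REVS,
    k: int = 4,
    seed: Optional[int] = None,
) -> str:
    """Build an extraction prompt with k format examples drawn from the pool."""
    rng = random.Random(seed)
    chosen = rng.sample(list(examples), k)
    lines = [
        "Find the current revision (REV) value in the title block of this drawing.",
        "Ignore revision history tables and grid references.",
        "Valid formats include, for example: " + ", ".join(chosen) + ".",
        "These are format examples only. Report the value shown, or NONE if unreadable.",
    ]
    return "\n".join(lines)


def render_first_page_png_b64(pdf_path: str, dpi: int = 150) -> str:
    """Render the first page to PNG at the given DPI and base64-encode it."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        pixmap = doc[0].get_pixmap(dpi=dpi)
        return base64.b64encode(pixmap.tobytes("png")).decode("ascii")
```

Passing a fixed `seed` makes a prompt reproducible for debugging, while leaving it unset varies the examples from call to call.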
Topics
- Engineering Drawings
- PDF Document Extraction
- Hybrid System Architecture
- PyMuPDF
- GPT-4 Vision
Best for: AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.