Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production
Summary
The article describes a microservice architecture for running Document AI pipelines (OCR followed by LLM parsing) in production, capable of processing thousands of multi-page documents per hour. The system reduced processing cost from $0.01 to $0.001 per page, a tenfold reduction, while maintaining 96% accuracy. Key design decisions include a hybrid classification strategy (CLIP-KNN with a Claude Sonnet fallback), separation of GPU-bound inference from CPU-bound orchestration, asynchronous processing, and independent horizontal scaling. The architecture comprises three microservices: a Gateway for ingestion, Workers for orchestration, and an Inference Service that runs GPU-bound OCR (e.g., DocTR) and provides VLM capabilities via cloud APIs. Batch profiling showed that OCR, not language model parsing, dominates end-to-end latency, consuming approximately two-thirds of the processing time for a typical 8-page document, and that the system's saturation point is set by GPU inference capacity.
Key takeaway
MLOps engineers deploying Document AI should expect OCR, not LLM parsing, to dominate pipeline latency and cost. Optimize the OCR inference service first, and scale its GPU resources independently of the rest of the pipeline. A microservice architecture with queue-driven communication provides fault tolerance and efficient resource utilization, especially under high document volumes or diverse model requirements.
Key insights
A microservice architecture for Document AI separates GPU inference from CPU orchestration, and batch profiling of that architecture identifies OCR as the primary latency bottleneck.
Principles
- Decouple GPU inference from CPU orchestration.
- OCR often bottlenecks document processing.
- Hybrid classification balances cost and accuracy.
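The hybrid classification principle can be illustrated with a minimal sketch: a cheap nearest-neighbour vote over embeddings handles the confident cases, and only ambiguous documents are sent to a more expensive fallback. The summary does not describe how the article implements this, so everything here is an assumption for illustration: the function names, the cosine-similarity vote, the `min_agreement` threshold, and the `fallback` callable (which stands in for a call to a hosted model such as Claude Sonnet). In a real system the vectors would come from a CLIP image encoder.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Index = List[Tuple[Vector, str]]  # (embedding, label) pairs; hypothetical layout


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_vote(query: Vector, index: Index, k: int = 3) -> Tuple[str, float]:
    """Return the majority label among the k nearest neighbours and its vote share."""
    nearest = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    label, votes = Counter(lbl for _, lbl in nearest).most_common(1)[0]
    return label, votes / len(nearest)


def classify(
    query: Vector,
    index: Index,
    fallback: Callable[[Vector], str],
    k: int = 3,
    min_agreement: float = 0.6,  # assumed threshold, not taken from the article
) -> Tuple[str, str]:
    """Use the cheap KNN label when neighbours agree; otherwise call the fallback."""
    if index:
        label, agreement = knn_vote(query, index, k)
        if agreement >= min_agreement:
            return label, "knn"
    return fallback(query), "fallback"
```

With this shape, the cost of the fallback is paid only for documents whose neighbours disagree, which is the trade-off the principle describes. The threshold controls how much traffic reaches the fallback and would need tuning against labeled data.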
Method
The system uses three microservices: Gateway for ingestion, Workers for CPU-bound orchestration, and an Inference Service for GPU-bound OCR and VLM calls, coordinated via message queues.
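The three-service layout described above can be sketched in a single process, with `queue.Queue` standing in for the message queues and threads standing in for the services. This is only an illustration of the message flow, not the article's implementation: the summary does not name the broker or the message schema, and `run_ocr`, the tuple layouts, and the `STOP` sentinel are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel used to shut each stage down


def run_ocr(page: str) -> str:
    """Placeholder for the GPU-bound OCR call (e.g. a DocTR model)."""
    return page.upper()


def gateway(documents, jobs: queue.Queue) -> None:
    """Ingestion: enqueue each document as a job, then signal end of input."""
    for doc_id, pages in documents:
        jobs.put((doc_id, pages))
    jobs.put(STOP)


def inference_service(requests: queue.Queue) -> None:
    """GPU-bound stage: OCR one page per request and answer on its reply queue."""
    while (item := requests.get()) is not STOP:
        page_no, page, reply = item
        reply.put((page_no, run_ocr(page)))


def worker(jobs: queue.Queue, requests: queue.Queue, results: dict) -> None:
    """CPU-bound orchestration: fan pages out to inference, then reassemble."""
    while (job := jobs.get()) is not STOP:
        doc_id, pages = job
        reply: queue.Queue = queue.Queue()
        for page_no, page in enumerate(pages):
            requests.put((page_no, page, reply))
        ocr = dict(reply.get() for _ in pages)
        results[doc_id] = [ocr[i] for i in range(len(pages))]
    requests.put(STOP)  # no more work for the inference stage


def process(documents) -> dict:
    """Wire the three stages together and return OCR text per document."""
    jobs, requests, results = queue.Queue(), queue.Queue(), {}
    threads = [
        threading.Thread(target=worker, args=(jobs, requests, results)),
        threading.Thread(target=inference_service, args=(requests,)),
    ]
    for t in threads:
        t.start()
    gateway(documents, jobs)
    for t in threads:
        t.join()
    return results
```

Because the stages share nothing except queues, the inference stage could be replicated without changing the Gateway or Workers, which is the property that allows independent horizontal scaling.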
In practice
- Isolate OCR inference to scale GPU resources.
- Implement hybrid classification for cost control.
- Use message queues for fault-tolerant scaling.
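Since saturation is determined by GPU inference capacity, a back-of-the-envelope sizing calculation is a reasonable first step when planning how far to scale the OCR service. The helper below is a rough sketch under stated assumptions: OCR time scales linearly with page count, each replica processes one page at a time, and replicas are kept below a target utilization to leave headroom. The function name and all input values are hypothetical; the summary does not report per-page OCR latency.

```python
import math


def gpu_replicas_needed(
    docs_per_hour: int,
    pages_per_doc: int,
    ocr_seconds_per_page: float,
    target_utilization: float = 0.7,  # assumed headroom, not from the article
) -> int:
    """Estimate how many OCR replicas a given document volume requires."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Total GPU-seconds of OCR work arriving per hour.
    gpu_seconds = docs_per_hour * pages_per_doc * ocr_seconds_per_page
    # GPU-seconds one replica can absorb per hour at the target utilization.
    usable_seconds = 3600 * target_utilization
    return math.ceil(gpu_seconds / usable_seconds)


# Hypothetical load: 2,000 eight-page documents per hour at 0.5 s of OCR per page.
print(gpu_replicas_needed(2000, 8, 0.5))  # prints 4
```

Measured per-page latency from batch profiling should replace the placeholder value before the estimate is used for planning.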
Topics
- Document AI
- Microservices Architecture
- OCR Optimization
- LLM Pipelines
- GPU Inference
- Message Queues
Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.