How Grab Built a Vision LLM to Scan Images
Summary
Grab's engineering team developed a specialized, lightweight Vision Large Language Model (LLM) to improve document information extraction, particularly for diverse Southeast Asian languages and document formats. Traditional OCR systems and proprietary LLMs struggled with accuracy, latency, and language support, while open-source Vision LLMs lacked production-grade precision. Grab first fine-tuned the Qwen2-VL 2B model, trying both LoRA and full-parameter fine-tuning; the latter significantly improved accuracy for non-Latin scripts. Ultimately, Grab built a custom 1-billion-parameter Vision LLM that combines Qwen2-VL's vision encoder with Qwen2.5 0.5B's language decoder. This model was trained in a four-stage process that relied on synthetic data generation and an auto-labeling framework called Documint. It achieved accuracy comparable to the 2B model while running 48% to 56% faster at the P50, P90, and P99 latency percentiles.
Key takeaway
For MLOps Engineers deploying document processing solutions in multilingual regions, consider developing specialized Vision LLMs rather than relying solely on general-purpose models. Your team should invest in generating high-quality synthetic and auto-labeled datasets and prioritize base models with native language and dynamic resolution support to achieve superior accuracy and latency, especially for non-Latin scripts.
Key insights
Custom-built, lightweight Vision LLMs can surpass larger general-purpose models for specialized document processing tasks.
Principles
- Full-parameter fine-tuning outperforms LoRA for non-Latin scripts.
- Native language support in base models is crucial for regional success.
- Dynamic resolution support improves OCR accuracy significantly.
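The difference in trainable capacity between LoRA and full-parameter fine-tuning can be illustrated with simple arithmetic. The sketch below counts trainable weights for a single linear layer under each approach. The layer shape and rank are hypothetical values chosen for illustration, not figures from Grab's setup, and the count says nothing by itself about why one approach was more accurate.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a LoRA update W + B @ A,
    where A is (rank x d_in) and B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer shape and LoRA rank (assumptions, not from the article).
d_in, d_out, rank = 1024, 1024, 8

full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(f"full-parameter: {full} trainable weights")
print(f"LoRA (rank {rank}): {lora} trainable weights")
print(f"LoRA / full: {lora / full:.4f}")
```

With these assumed numbers, LoRA updates well under 2% of the weights in the layer, which is what makes it cheap and also what limits how far it can move the model.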
Method
Grab's method involved a four-stage training process: projector alignment, vision tower enhancement, language-specific visual training using synthetic OCR data, and task-centric full-parameter fine-tuning on curated document datasets.
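The four stages can be written down as a simple training schedule. This is a minimal sketch, not Grab's configuration: the component names are hypothetical, and which components are unfrozen in stages 1 to 3 is an assumption based on common practice. Only the stage order, the use of synthetic OCR data in stage 3, and full-parameter training on curated documents in stage 4 come from the summary above.

```python
from dataclasses import dataclass

# Hypothetical names for the three parts of a vision LLM.
COMPONENTS = ("vision_encoder", "projector", "language_decoder")


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: frozenset  # components whose weights are updated


STAGES = [
    # ASSUMPTION: only the projector is trained during alignment.
    Stage("projector_alignment", "not specified in this summary",
          frozenset({"projector"})),
    # ASSUMPTION: the vision side is unfrozen, the decoder stays frozen.
    Stage("vision_tower_enhancement", "not specified in this summary",
          frozenset({"vision_encoder", "projector"})),
    # ASSUMPTION: all components are trained; the data source is from the summary.
    Stage("language_specific_visual_training", "synthetic OCR data",
          frozenset(COMPONENTS)),
    # From the summary: full-parameter fine-tuning on curated documents.
    Stage("task_centric_finetuning", "curated document datasets",
          frozenset(COMPONENTS)),
]


def frozen_components(stage: Stage) -> list:
    """Return the components that stay frozen in the given stage."""
    return sorted(c for c in COMPONENTS if c not in stage.trainable)


for i, stage in enumerate(STAGES, start=1):
    print(f"stage {i}: {stage.name}")
    print(f"  data:      {stage.data}")
    print(f"  trainable: {sorted(stage.trainable)}")
    print(f"  frozen:    {frozen_components(stage)}")
```

In a real training loop, the `trainable` set would decide which parameter groups have gradients enabled before each stage starts.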
In practice
- Generate synthetic OCR datasets for diverse language coverage.
- Implement auto-labeling frameworks like Documint for real documents.
- Prioritize models supporting native image resolution for OCR.
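To see why native resolution matters for OCR, compare a model that squashes every image to a fixed square with one that keeps the original size and aspect ratio. The sketch below is illustrative arithmetic only: the 28-pixel area per visual token, the 448-pixel fixed input size, and the document dimensions are all hypothetical values, not the parameters of any specific model.

```python
import math


def visual_tokens(width: int, height: int, pixels_per_token: int = 28) -> int:
    """Visual tokens needed if each token covers a
    pixels_per_token x pixels_per_token square (hypothetical value)."""
    cols = math.ceil(width / pixels_per_token)
    rows = math.ceil(height / pixels_per_token)
    return cols * rows


def fixed_resize_scale(width: int, height: int, target: int = 448):
    """Horizontal and vertical scale factors when an image is
    squashed to a target x target square."""
    return target / width, target / height


# Hypothetical wide document scan, e.g. an ID card strip.
doc_w, doc_h = 1680, 560

native = visual_tokens(doc_w, doc_h)       # keeps every pixel and the aspect ratio
fixed = visual_tokens(448, 448)            # fixed-size input, regardless of document
sx, sy = fixed_resize_scale(doc_w, doc_h)  # unequal factors mean distorted glyphs

print(f"native resolution: {native} visual tokens")
print(f"fixed 448x448:     {fixed} visual tokens")
print(f"fixed resize scales width by {sx:.3f} and height by {sy:.3f}")
```

Under these assumptions, the fixed resize shrinks the width three times more than the height, so small characters are compressed horizontally before the model ever sees them. Native resolution avoids that distortion at the cost of more visual tokens per image.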
Topics
- Vision LLMs
- Document Processing
- eKYC
- Model Fine-tuning
- Southeast Asian Languages
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.