How Grab Built a Vision LLM to Scan Images

2025-12-15 · Source: ByteByteGo Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, long

Summary

Grab's engineering team developed a specialized, lightweight Vision Large Language Model (LLM) to improve document information extraction, particularly for diverse Southeast Asian languages and document formats. Traditional OCR systems and proprietary LLMs struggled with accuracy, latency, and language support, while open-source Vision LLMs lacked production-grade precision. Grab first fine-tuned the Qwen2-VL 2B model, using both LoRA and full-parameter fine-tuning, which significantly improved accuracy for non-Latin scripts. Ultimately, Grab built a custom 1-billion-parameter Vision LLM, combining Qwen2-VL's vision encoder with Qwen2.5 0.5B's language decoder. This model was trained through a four-stage process that included synthetic data generation and an auto-labeling framework called Documint. It achieved accuracy comparable to the 2B model while running 48% to 56% faster at P50, P90, and P99 latency.
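To make the latency claim concrete, here is a minimal sketch of how a per-percentile comparison could be computed, assuming "faster" means a relative reduction in latency at each percentile. The sample values are made up for illustration and are not Grab's measurements; the function names are hypothetical.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def latency_reduction(baseline, candidate, p):
    """Percent reduction in the p-th percentile latency of candidate relative to baseline."""
    base = percentile(baseline, p)
    cand = percentile(candidate, p)
    return 100 * (base - cand) / base

# Made-up latencies in milliseconds, for illustration only (not Grab's data).
baseline_ms = [100, 120, 140, 160, 180, 200, 220, 240, 260, 400]
candidate_ms = [50, 60, 70, 80, 90, 100, 110, 120, 130, 200]

for p in (50, 90, 99):
    reduction = latency_reduction(baseline_ms, candidate_ms, p)
    print(f"P{p}: {reduction:.0f}% lower latency")
```

Reporting P90 and P99 alongside P50 matters for a production service because tail latency often degrades differently from the median.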

Key takeaway

If you are an MLOps engineer deploying document processing in multilingual regions, consider developing a specialized Vision LLM rather than relying solely on general-purpose models. Invest in high-quality synthetic and auto-labeled datasets, and prioritize base models with native language support and dynamic resolution support to improve accuracy and latency, especially for non-Latin scripts.

Key insights

Custom-built, lightweight Vision LLMs can surpass larger general-purpose models for specialized document processing tasks.

Principles

Full-parameter fine-tuning outperforms LoRA for non-Latin scripts.
Native language support in base models is crucial for regional success.
Dynamic resolution support improves OCR accuracy significantly.
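The first principle can be put in perspective with a rough parameter count. This summary does not explain why full-parameter fine-tuning did better, so the sketch below only shows the difference in trainable capacity for a single weight matrix. The hidden size and rank are hypothetical values chosen for illustration, not figures from Grab's setup.

```python
def lora_trainable_params(d_in, d_out, rank):
    """LoRA trains two low-rank matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def full_trainable_params(d_in, d_out):
    """Full fine-tuning updates every entry of the d_out x d_in weight matrix."""
    return d_in * d_out

# Hypothetical sizes for one square projection matrix (assumptions, not from the source).
d_in = d_out = 1536
rank = 16

lora = lora_trainable_params(d_in, d_out, rank)
full = full_trainable_params(d_in, d_out)
share = 100 * lora / full

print(f"LoRA trains {lora:,} of {full:,} weights ({share:.2f}%)")
```

With these assumed sizes, LoRA updates roughly 2% of the weights in that matrix. That saves memory, but it also limits how far the model can move away from its pretrained behavior, which is one plausible reason scripts that are under-represented in pretraining may benefit from full updates.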

Method

Grab's method involved a four-stage training process: projector alignment, vision tower enhancement, language-specific visual training using synthetic OCR data, and task-centric full-parameter fine-tuning on curated document datasets.
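The four stages can be sketched as a simple schedule that records which components are trainable at each step. Only the stage names, the synthetic OCR data in stage three, and the full-parameter fine-tuning in stage four come from the summary above. The trainable sets for stages one to three and all identifiers are assumptions made for illustration.

```python
from dataclasses import dataclass

COMPONENTS = ("vision_encoder", "projector", "language_decoder")

@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components whose weights are updated in this stage
    data: str

STAGES = [
    # Assumption: only the projector is trained while it is aligned.
    Stage("projector_alignment", frozenset({"projector"}), "not specified in this summary"),
    # Assumption: the vision encoder is unfrozen together with the projector.
    Stage("vision_tower_enhancement", frozenset({"vision_encoder", "projector"}),
          "not specified in this summary"),
    # Assumption: all components are trained on the synthetic OCR data.
    Stage("language_specific_visual_training", frozenset(COMPONENTS), "synthetic OCR data"),
    # From the summary: full-parameter fine-tuning on curated documents.
    Stage("task_centric_fine_tuning", frozenset(COMPONENTS), "curated document datasets"),
]

def frozen_components(stage):
    """Return the components that stay frozen during the given stage."""
    return [c for c in COMPONENTS if c not in stage.trainable]

for index, stage in enumerate(STAGES, start=1):
    frozen = ", ".join(frozen_components(stage)) or "none"
    print(f"Stage {index}: {stage.name} | frozen: {frozen} | data: {stage.data}")
```

In a real training loop, each stage would map to setting gradients on or off for the matching parameter groups before training on that stage's data.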

In practice

Generate synthetic OCR datasets for diverse language coverage.
Implement auto-labeling frameworks like Documint for real documents.
Prioritize models supporting native image resolution for OCR.
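To illustrate why native image resolution matters for OCR, the sketch below estimates how many visual tokens an image yields when it is encoded near its original size rather than squashed to a small fixed square. The patch size of 14 with 2x2 token merging follows how Qwen2-VL is commonly described, but treat both as assumed parameters here. The function name and the example image size are hypothetical, and real implementations also clamp the total pixel count, which this sketch omits.

```python
def visual_tokens(width, height, patch=14, merge=2):
    """Estimate visual tokens after snapping each side to a multiple of patch * merge.

    Simplified: ignores the minimum and maximum pixel budgets real models apply.
    """
    unit = patch * merge  # each output token covers a unit x unit pixel area
    snapped_w = max(unit, round(width / unit) * unit)
    snapped_h = max(unit, round(height / unit) * unit)
    return (snapped_w // unit) * (snapped_h // unit)

# Hypothetical scan of an ID card, in pixels.
native = visual_tokens(1008, 644)
# The same card forced into a fixed 224 x 224 input.
fixed = visual_tokens(224, 224)

print(f"native resolution: {native} tokens")
print(f"fixed 224x224:     {fixed} tokens")
```

Under these assumptions the native-resolution input carries roughly thirteen times as many tokens, so small printed characters are far less likely to be blurred away before the model sees them. The trade-off is more compute per image.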

Topics

Vision LLMs
Document Processing
eKYC
Model Fine-tuning
Southeast Asian Languages

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.