How Grab Built a Vision LLM to Scan Images

Source: ByteByteGo Newsletter · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering) · Depth: Intermediate, long

Summary

Grab's engineering team developed a specialized, lightweight Vision Large Language Model (LLM) to improve document information extraction, particularly for the diverse languages and document formats of Southeast Asia. Traditional OCR systems and proprietary LLMs fell short on accuracy, latency, and language support, while open-source Vision LLMs lacked production-grade precision. Grab first fine-tuned the Qwen2-VL 2B model, trying both LoRA and full-parameter fine-tuning, which significantly improved accuracy on non-Latin scripts. Grab then built a custom 1-billion-parameter Vision LLM that combines Qwen2-VL's vision encoder with the Qwen2.5 0.5B language decoder and trained it from scratch. Trained through a four-stage process that draws on synthetic data generation and an auto-labeling framework called Documint, this model achieved accuracy comparable to the 2B model while running 48% to 56% faster at the P50, P90, and P99 latency percentiles.
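The latency comparison above can be reproduced on any pair of models with a few lines of code. The sketch below is illustrative only: the function names, the nearest-rank percentile definition, and the sample latencies are assumptions and do not come from Grab's benchmark. It also assumes that "faster" means a percent reduction in latency at each percentile, which is one plausible reading of the reported figures.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_reduction(baseline, candidate, percentiles=(50, 90, 99)):
    """Percent reduction in latency of `candidate` relative to `baseline`
    at each requested percentile."""
    result = {}
    for p in percentiles:
        base = percentile(baseline, p)
        cand = percentile(candidate, p)
        result[f"P{p}"] = round(100 * (base - cand) / base, 1)
    return result


# Made-up per-request latencies in milliseconds (hypothetical data).
baseline_ms = [100, 120, 140, 160, 180, 200, 220, 240, 260, 400]
candidate_ms = [50, 60, 70, 80, 90, 100, 110, 120, 130, 200]

print(latency_reduction(baseline_ms, candidate_ms))
```

Reporting P90 and P99 alongside the median matters for a production extraction service, because a smaller model that is faster on average can still lose its advantage on the slowest requests.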

Key takeaway

For MLOps engineers deploying document processing in multilingual regions: consider building a specialized Vision LLM rather than relying solely on general-purpose models. Invest in high-quality synthetic and auto-labeled datasets, and choose base models with native support for your target languages and for dynamic image resolution. These choices drive both accuracy and latency, especially on non-Latin scripts.

Key insights

Custom-built, lightweight Vision LLMs can surpass larger general-purpose models for specialized document processing tasks.

Method

Grab trained the model in four stages: projector alignment, vision tower enhancement, language-specific visual training on synthetic OCR data, and task-centric full-parameter fine-tuning on curated document datasets.
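One way to picture the four stages is as a schedule that records, for each stage, which parts of the model receive gradient updates. The sketch below is a hypothetical illustration, not Grab's training code. The component names, the `Stage` class, and the trainable sets for stages 1 to 3 are assumptions inferred from the stage names; the summary only confirms that the last stage is full-parameter, and it names the training data only for the last two stages.

```python
from dataclasses import dataclass

# Hypothetical component names for a vision encoder + projector + decoder model.
COMPONENTS = ("vision_encoder", "projector", "language_decoder")


@dataclass(frozen=True)
class Stage:
    name: str
    data: str             # training data, where the summary names it
    trainable: frozenset  # components that receive gradient updates


STAGES = [
    # Assumption: alignment trains only the projector.
    Stage("projector alignment", "not stated", frozenset({"projector"})),
    # Assumption: this stage updates only the vision encoder.
    Stage("vision tower enhancement", "not stated", frozenset({"vision_encoder"})),
    # Assumption: all components are updated here; the summary does not say.
    Stage("language-specific visual training", "synthetic OCR data",
          frozenset(COMPONENTS)),
    # Stated in the summary: full-parameter fine-tuning.
    Stage("task-centric fine-tuning", "curated document datasets",
          frozenset(COMPONENTS)),
]


def frozen_components(stage):
    """Components whose weights stay fixed during the given stage."""
    return sorted(set(COMPONENTS) - stage.trainable)


for number, stage in enumerate(STAGES, start=1):
    print(f"Stage {number}: {stage.name}")
    print(f"  data:      {stage.data}")
    print(f"  trainable: {sorted(stage.trainable)}")
    print(f"  frozen:    {frozen_components(stage)}")
```

In a real training loop, the frozen list for each stage would typically be applied by disabling gradients on those modules before the optimizer is built.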

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.