Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

2026-05-19 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Meta AI researchers introduce an OCR-aware multilingual multimodal training framework that makes multimodal large language models (MLLMs) more robust at reading text in real-world images. The framework combines large-scale synthetic OCR-to-translation data generation, OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and structured visual chain-of-thought (CoT) prompting. Built on a LLaMA-based multimodal architecture, the approach is reported to improve OCR completeness, multilingual translation accuracy, and resilience to degraded visual conditions such as blur, occlusion, and small fonts. Experiments on diverse multilingual documents, including receipts, menus, and handwritten text, show stronger visual-text grounding than baseline models. The authors also report qualitative improvements over frontier systems such as GPT-5-class and Gemini-family models, particularly in reducing hallucination and extracting challenging text.

Key takeaway

For AI engineers building multimodal systems that process real-world images, the paper makes a case for OCR-aware post-training. The reported gains are higher OCR completeness, better multilingual translation accuracy, and less hallucination under visual degradation. The practical advice is data-centric: generate synthetic data with realistic visual degradations, and consider structured visual CoT prompting to improve robustness and visual grounding, especially for small, blurred, or occluded text.

Key insights

OCR-aware data curation and fine-tuning significantly improve MLLM robustness for multilingual text in degraded images.

Principles

Data-centric post-training enhances MLLM OCR robustness.
Explicit visual reasoning reduces hallucination under uncertainty.
Modular generative translation improves text replacement fidelity.
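
The second principle, explicit visual reasoning, can be made concrete with a prompt template. The sketch below is hypothetical: the function name `build_visual_cot_prompt`, the wording of each step, and the `[unreadable]` marker are illustrative assumptions, not the prompt used in the paper.

```python
def build_visual_cot_prompt(target_language: str) -> str:
    """Return a structured visual chain-of-thought prompt (hypothetical wording)."""
    steps = [
        "List every text region you can see, including small, blurred, or partly occluded ones.",
        "Transcribe each region exactly as written, in its original language.",
        # Asking for an explicit marker gives the model an alternative to guessing.
        "Mark any characters you cannot read as [unreadable] instead of guessing.",
        f"Translate each transcribed region into {target_language}.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        "You are reading text from a real-world image.\n"
        "Work through the following steps in order and show each step's output:\n"
        f"{numbered}"
    )


if __name__ == "__main__":
    print(build_visual_cot_prompt("English"))
```

Separating transcription from translation in the prompt means an error in the final answer can be traced to either the reading step or the translation step.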

Method

The framework combines synthetic OCR data generation, LoRA-based supervised fine-tuning, and structured visual chain-of-thought prompting to train LLaMA-based MLLMs for improved OCR and multilingual understanding.
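
As background on the LoRA component: standard LoRA keeps a pretrained weight matrix W frozen and learns a low-rank update, so a layer computes W x + (alpha / r) * B (A x), with B initialised to zero. The following is a minimal pure-Python sketch of that general idea, not the paper's training code; the class name `LoRALinear` and the toy dimensions are assumptions chosen for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer with a trainable low-rank update (illustrative)."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weights, d_out x d_in
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts with small random values, B with zeros, so the layer
        # initially behaves exactly like the frozen base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        """Count only the entries of A and B; the base weight stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], rank=1, alpha=2.0)
    print(layer.forward([3.0, 4.0]), layer.trainable_parameters())
```

In practice this is usually done with an existing library rather than by hand; the sketch only shows why the number of trainable parameters scales with the rank rather than with the full weight matrix.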

In practice

Generate synthetic OCR data with realistic degradations.
Apply LoRA for parameter-efficient OCR-aware fine-tuning.
Use structured CoT prompts for explicit visual reasoning.
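
The first item above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's pipeline: the image is assumed to be an already rendered grayscale grid (a list of rows of floats), and the function names, the three degradations, and their parameters are hypothetical choices that mirror the blur, occlusion, and small-font conditions named in the summary.

```python
import random


def box_blur(image, radius=1):
    """Average each pixel with its neighbours in a (2*radius+1) square window."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def occlude(image, top, left, height, width, fill=0.0):
    """Return a copy of the image with a rectangle overwritten by `fill`."""
    out = [row[:] for row in image]
    for y in range(top, min(len(out), top + height)):
        for x in range(left, min(len(out[0]), left + width)):
            out[y][x] = fill
    return out


def downsample(image, factor=2):
    """Keep every `factor`-th pixel, a crude stand-in for small fonts."""
    return [row[::factor] for row in image[::factor]]


def make_sample(image, source_text, target_text, rng):
    """Apply one random degradation and pair the result with its labels."""
    kind = rng.choice(["blur", "occlusion", "small_font"])
    if kind == "blur":
        degraded = box_blur(image, radius=1)
    elif kind == "occlusion":
        degraded = occlude(image, top=0, left=0, height=1, width=len(image[0]) // 2)
    else:
        degraded = downsample(image, factor=2)
    return {
        "image": degraded,
        "degradation": kind,
        "ocr_text": source_text,     # ground-truth transcription
        "translation": target_text,  # ground-truth translation
    }
```

Because the text is known before the image is degraded, each sample carries exact transcription and translation labels regardless of how hard the degraded image is to read.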

Topics

Multimodal Large Language Models
OCR-Aware Fine-Tuning
Chain-of-Thought Prompting
Multilingual OCR
Synthetic Data Generation

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.