Comparative Evaluation of Machine Translation Systems on Images with Text
Summary
A comparative evaluation published on 2026-05-28 assessed machine translation systems for images containing text, a task that combines computer vision and natural language processing. The study compared three paradigms: modular pipelines that use docTR for OCR followed by multilingual LLMs such as Llama and EuroLLM; multi-modal large language models (MLLMs), including Gemini 2.5 configurations; and the end-to-end Translatotron-V model. In experiments on parallel multilingual datasets, scored with BLEU, chrF, and TER, the modular pipelines outperformed the end-to-end approach, and the MLLMs achieved the best overall performance, which the authors attribute to greater flexibility and contextual understanding. The findings support multi-modal reasoning as an effective approach to translating text in images and lay groundwork for future research on multilingual visual-language integration.
Key takeaway
NLP engineers building systems that translate text in images should consider multi-modal large language models (MLLMs) before modular or end-to-end approaches. In this evaluation, MLLMs such as Gemini 2.5 produced the best translation quality, which the study links to their flexibility and contextual understanding. Modular OCR-plus-LLM pipelines ranked second, ahead of the end-to-end model, so they remain a reasonable alternative where an MLLM is not an option. The benefit of MLLMs is likely to matter most for complex visual documents, where surrounding context helps disambiguate the text.
Key insights
MLLMs excel in image-to-text translation, outperforming other paradigms through enhanced contextual understanding.
Principles
- Multi-modal reasoning significantly improves image-to-text translation accuracy.
- Modular pipelines combining OCR and LLMs outperformed the end-to-end Translatotron-V model in this study.
- Contextual understanding is critical for effective image-based text translation.
Method
This evaluation compared modular pipelines (docTR + Llama/EuroLLM), MLLMs (Gemini 2.5), and Translatotron-V on multilingual datasets, using BLEU, chrF, and TER metrics.
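To make the chrF metric named above more concrete, here is a minimal sketch of a character n-gram F-score in plain Python. It is a simplified illustration, not the scoring code used in the study: the function names `char_ngrams` and `simple_chrf` are hypothetical, the defaults (n-gram orders 1 to 6, beta of 2) are assumptions based on common chrF settings, and the result will not match an established implementation exactly.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF (assumed defaults, not the study's setup).

    Precision and recall are averaged over n-gram orders, then combined
    into an F-beta score; beta > 1 weights recall more than precision.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        # Clipped overlap: each n-gram counts at most as often as in both.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("the cat sat", "the cat sits"))
```

Higher scores indicate closer character-level agreement with the reference. For reportable numbers, an established metrics library is the safer choice, since tokenization and averaging details affect the scores.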
In practice
- Use MLLMs such as Gemini 2.5 to translate text in images.
- Combine OCR (docTR) with multilingual LLMs when a modular solution is preferred.
- Use BLEU, chrF, and TER to evaluate image translation quality.
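The modular option in the list above can be sketched as a two-stage function that takes the OCR step and the translation step as pluggable callables. This is an illustrative skeleton under stated assumptions, not the pipeline from the study: `translate_image_text`, `TranslatedRegion`, `fake_ocr`, and `fake_translate` are hypothetical names, and the stand-in components only return canned values. In a real system the `ocr` callable would wrap an OCR library such as docTR and the `translate` callable would wrap a multilingual LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TranslatedRegion:
    source_text: str
    translated_text: str


def translate_image_text(
    image_bytes: bytes,
    ocr: Callable[[bytes], List[str]],
    translate: Callable[[str, str], str],
    target_lang: str,
) -> List[TranslatedRegion]:
    """Run OCR on the image, then translate each non-empty recognized line."""
    results: List[TranslatedRegion] = []
    for line in ocr(image_bytes):
        text = line.strip()
        if not text:
            continue  # drop blank lines the OCR stage may emit
        results.append(TranslatedRegion(text, translate(text, target_lang)))
    return results


# Hypothetical stand-ins so the sketch runs without any model or OCR engine.
def fake_ocr(image_bytes: bytes) -> List[str]:
    return ["Sortie", "  ", "Billets"]


def fake_translate(text: str, target_lang: str) -> str:
    glossary = {"Sortie": "Exit", "Billets": "Tickets"}
    return glossary.get(text, text)


if __name__ == "__main__":
    for region in translate_image_text(b"", fake_ocr, fake_translate, "en"):
        print(region.source_text, "->", region.translated_text)
```

Keeping the two stages behind plain callables makes it straightforward to swap OCR engines or translation models and to score each configuration with the same metrics.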
Topics
- Machine Translation
- Multi-modal LLMs
- Image-to-Text Translation
- Optical Character Recognition
- Computer Vision
- Natural Language Processing
Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.