Comparative Evaluation of Machine Translation Systems on Images with Text

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Computer Vision · Depth: Advanced, quick

Summary

A comparative evaluation published on 2026-05-28 assessed machine translation systems for images containing text, a task that combines computer vision and natural language processing. The study compared three paradigms: modular pipelines that pair docTR for OCR with multilingual LLMs such as Llama and EuroLLM; multi-modal large language models (MLLMs), including Gemini 2.5 configurations; and the end-to-end Translatotron-V model. In experiments on parallel multilingual datasets scored with BLEU, chrF, and TER, the modular pipelines outperformed the end-to-end approach, and the MLLMs achieved the best overall performance, showing greater flexibility and contextual understanding. The findings point to the effectiveness of multi-modal reasoning for translating text in images and lay groundwork for future research on multilingual visual-language integration.
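
As a rough illustration of the modular paradigm described above, the sketch below chains an OCR step and a translation step behind two plain callables. The names `translate_image_text`, `fake_ocr`, and `fake_translate` are hypothetical stand-ins invented for this example; they are not docTR, Llama, or EuroLLM calls, and this is not the study's code. A real pipeline would plug an OCR model and a multilingual LLM into the same two slots.

```python
from typing import Callable, List


def translate_image_text(
    image_path: str,
    ocr: Callable[[str], List[str]],
    translate: Callable[[str, str], str],
    target_lang: str,
) -> List[str]:
    """Run OCR on the image, then translate each non-empty recognized line."""
    lines = ocr(image_path)
    return [translate(line, target_lang) for line in lines if line.strip()]


# Hypothetical stand-in components so the sketch runs without any model.
def fake_ocr(image_path: str) -> List[str]:
    # Pretend the image is a sign with two lines of German text.
    return ["Ausgang", "", "Notausgang"]


def fake_translate(text: str, target_lang: str) -> str:
    # A toy glossary lookup in place of a multilingual LLM.
    glossary = {"Ausgang": "Exit", "Notausgang": "Emergency exit"}
    return glossary.get(text, text)


result = translate_image_text("sign.png", fake_ocr, fake_translate, "en")
```

In this arrangement the translation step receives only the recognized strings, never the image itself.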

Key takeaway

NLP engineers building systems that translate text in images should consider prioritizing multi-modal large language models (MLLMs) over modular or end-to-end approaches. In this evaluation, MLLMs such as Gemini 2.5 showed greater flexibility and contextual understanding, which led to better translation quality. Integrating them into existing workflows may improve accuracy and efficiency, especially for complex visual documents, and make multilingual visual-language applications more robust.

Key insights

MLLMs excel in image-to-text translation, outperforming other paradigms through enhanced contextual understanding.

Method

This evaluation compared modular pipelines (docTR + Llama/EuroLLM), MLLMs (Gemini 2.5), and Translatotron-V on multilingual datasets, using BLEU, chrF, and TER metrics.
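
To make one of the metrics above concrete, here is a minimal sketch of a chrF-style score: an F-beta measure over character n-grams, averaged across n-gram orders. The function `simple_chrf` is a simplified illustration written for this summary, assuming the common defaults of n-grams up to length 6 and beta = 2. It is not the scoring code used in the study, and its values will not necessarily match those of standard evaluation toolkits.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


score = simple_chrf("Emergency exit", "Emergency exit")
```

Higher is better for chrF and BLEU, whereas TER counts edits and is better when lower, so the three metrics should be read together rather than interchangeably.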

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.