Comparative Evaluation of Machine Translation Systems on Images with Text

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Computer Vision · Depth: Advanced, quick

Summary

A comparative evaluation published on 2026-05-28 assessed machine translation systems for images containing text, a task that combines computer vision and natural language processing. The study analyzed three paradigms: modular pipelines that use docTR for OCR followed by multilingual LLMs such as Llama and EuroLLM; multi-modal large language models (MLLMs), including Gemini 2.5 configurations; and the end-to-end Translatotron-V model. Experiments on parallel multilingual datasets, scored with BLEU, chrF, and TER, showed that modular pipelines surpassed the end-to-end approach and that MLLMs achieved the best overall performance, which the study attributes to greater flexibility and contextual understanding. These findings highlight the effectiveness of multi-modal reasoning for image-to-text translation and lay groundwork for future research on multilingual visual-language integration.

Key takeaway

NLP engineers developing image-to-text translation systems should consider prioritizing multi-modal large language models (MLLMs) over modular or end-to-end approaches. In this evaluation, MLLMs such as Gemini 2.5 showed greater flexibility and contextual understanding, which led to better translation quality. Consider integrating MLLMs into your workflows, especially for complex visual documents, where they may improve accuracy and efficiency. This shift may also make multilingual visual-language applications more robust.

Key insights

MLLMs excel in image-to-text translation, outperforming other paradigms through enhanced contextual understanding.

Principles

Multi-modal reasoning significantly improves image-to-text translation accuracy.
Modular pipelines combining OCR and LLMs outperform end-to-end image translation.
Contextual understanding is critical for effective image-based text translation.

Method

This evaluation compared modular pipelines (docTR + Llama/EuroLLM), MLLMs (Gemini 2.5), and Translatotron-V on multilingual datasets, using BLEU, chrF, and TER metrics.
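The modular paradigm described above can be pictured as two interchangeable stages: OCR first, then text translation. The following is a minimal sketch under stated assumptions, not the study's implementation. The names `ImageTranslationPipeline`, `fake_ocr`, and `fake_translate` are hypothetical, and the stand-in functions exist only so the example runs without any model; in a real system the OCR stage might wrap docTR and the translation stage might wrap an LLM such as Llama or EuroLLM.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces: any OCR engine or translation model can be plugged in.
OcrFn = Callable[[bytes], List[str]]      # image bytes -> recognized text lines
TranslateFn = Callable[[str, str], str]   # (text, target language) -> translation


@dataclass
class ImageTranslationPipeline:
    """Two-stage pipeline: recognize text, then translate each line."""
    ocr: OcrFn
    translate: TranslateFn

    def run(self, image: bytes, target_lang: str) -> List[str]:
        # Stage 1: OCR, with surrounding whitespace trimmed from each line.
        lines = [line.strip() for line in self.ocr(image)]
        # Stage 2: translate every non-empty line independently.
        return [self.translate(line, target_lang) for line in lines if line]


def fake_ocr(image: bytes) -> List[str]:
    """Stand-in OCR (assumption): treats the bytes as UTF-8 text lines."""
    return image.decode("utf-8").splitlines()


def fake_translate(text: str, target_lang: str) -> str:
    """Stand-in translator (assumption): tiny glossary lookup, else passthrough."""
    glossary = {("EXIT", "fr"): "SORTIE", ("OPEN", "fr"): "OUVERT"}
    return glossary.get((text, target_lang), text)


pipeline = ImageTranslationPipeline(ocr=fake_ocr, translate=fake_translate)
result = pipeline.run(b"EXIT\n\n  OPEN  \n", "fr")
print(result)
```

Because each stage is just a callable, either one can be swapped without touching the other. Note that this sketch translates each line in isolation, so it has no access to layout or visual context, which is one plausible reason a pipeline of this kind could trail an MLLM that sees the whole image.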

In practice

Use MLLMs such as Gemini 2.5 for image-to-text translation.
Pair OCR (docTR) with multilingual LLMs when a robust modular solution is needed.
Use BLEU, chrF, and TER to evaluate image translation quality.
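To make one of these metrics concrete, here is a simplified, sentence-level sketch of chrF, the character n-gram F-score. It assumes the original formulation (precision and recall averaged over n-gram orders 1 to 6, combined with beta = 2) and strips whitespace before counting. The function names `char_ngrams` and `simple_chrf` are hypothetical, and the scores are not guaranteed to match an established toolkit such as sacreBLEU, which should be preferred for reported results.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


score = simple_chrf("la sortie est ici", "la sortie est la")
print(round(score, 2))
```

Higher is better for chrF and BLEU, while TER is an edit rate, so lower is better; keeping that direction in mind helps when comparing systems across all three metrics.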

Topics

Machine Translation
Multi-modal LLMs
Image-to-Text Translation
Optical Character Recognition
Computer Vision
Natural Language Processing

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.