Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen

2026-02-27 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

A study evaluated how well commercial large language models (LLMs) translate Ancient Greek technical prose, using passages from two works by the physician Galen (c. 129-216 CE). Researchers assessed 60 translations of 20 paragraph-length passages, generated by ChatGPT, Claude, and Gemini. Quality was measured with seven automated metrics and with a modified Multidimensional Quality Metrics (MQM) framework applied by domain specialists. For an expository text with existing English translations, the LLMs achieved a mean MQM score of 95.2/100. For a previously untranslated pharmacological text, the mean fell to 79.9/100, and two passages with extreme terminological density failed catastrophically. Terminology rarity, measured by corpus frequency, was identified as the primary predictor of translation failure (r = -.97). Automated metrics correlated only moderately with human judgment and could not distinguish among high-quality translations.

Key takeaway

If you are a research scientist or NLP engineer working with low-resource or ancient languages, expect LLMs to translate expository texts well but to struggle with highly specialized, rare terminology. Build expert human review into your translation workflow, especially for texts with dense or unusual vocabulary, because automated metrics may not reliably flag subtle errors in high-quality output. Consider pre-processing texts to identify, and possibly pre-translate, rare terms.

Key insights

LLMs translate expository Ancient Greek technical prose with high accuracy, but terminology rarity strongly predicts where they fail.

Principles

Terminology rarity predicts LLM translation failure.
Automated metrics cannot reliably distinguish among high-quality translations.
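
The first principle is a correlation claim, so a small worked example may help make it concrete. The following is a minimal sketch of a Pearson correlation between a per-passage rarity measure and a human quality score. The function name `pearson_r`, the variable names, and every number in the example are invented for illustration. They are not the study's data, and the source does not say how r was computed beyond naming corpus frequency as the basis for rarity.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, NOT taken from the study:
# share of rare terms per passage and the matching human quality score.
rare_share = [0.05, 0.10, 0.20, 0.40, 0.60]
mqm_scores = [97.0, 95.0, 90.0, 70.0, 40.0]

r = pearson_r(rare_share, mqm_scores)
print(f"r = {r:.2f}")
```

A strongly negative r means that quality drops as the share of rare terms in a passage rises.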

Method

The study used a modified Multidimensional Quality Metrics (MQM) framework with domain specialists for reference-free human evaluation of LLM translations, alongside automated metrics, to assess Ancient Greek technical prose.
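
The source does not describe how the modified MQM framework turns annotated errors into a score out of 100. As a rough sketch of how MQM-style scoring commonly works, the example below sums severity-weighted penalties and normalizes by the length of the translation. The severity weights (1, 5, 25), the normalization, the clamp at zero, and the names `SEVERITY_WEIGHTS` and `mqm_style_score` are all assumptions made here for illustration, not the study's scoring rules.

```python
# Assumed severity weights: a common MQM-style choice, not confirmed by the source.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 25}

def mqm_style_score(errors, word_count, weights=SEVERITY_WEIGHTS):
    """Convert annotated errors into a 0-100 quality score.

    errors: list of (category, severity) pairs marked by a human annotator.
    word_count: length of the evaluated translation in words.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    penalty = sum(weights[severity] for _category, severity in errors)
    score = 100.0 - 100.0 * penalty / word_count
    return max(0.0, score)  # clamp so very dense errors cannot go below zero

# Hypothetical annotations for one 120-word translation.
annotations = [("terminology", "major"), ("accuracy", "minor")]
print(mqm_style_score(annotations, word_count=120))
```

Under a scheme like this, a few critical terminology errors in a short passage are enough to pull the score down sharply, which matches the kind of failure the summary describes for terminologically dense passages.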

In practice

Identify passages with high terminological density.
Prioritize human review for texts with rare vocabulary.
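
The two steps above can be sketched as a simple pre-processing pass: flag tokens whose corpus frequency is low, compute the share of such tokens per passage, and route dense passages to an expert. This is a minimal sketch under stated assumptions. The function names, the thresholds `max_count` and `density_threshold`, the frequency counts, and the transliterated placeholder lemmas are all hypothetical; a real pipeline would lemmatize the Greek and take counts from an actual corpus.

```python
def rare_term_density(tokens, corpus_freq, max_count=5):
    """Return (density, rare_terms) for one passage.

    A token counts as rare when its corpus count is at or below max_count.
    Tokens missing from corpus_freq are treated as having a count of 0.
    """
    if not tokens:
        return 0.0, []
    rare = [t for t in tokens if corpus_freq.get(t, 0) <= max_count]
    return len(rare) / len(tokens), sorted(set(rare))

def needs_expert_review(tokens, corpus_freq, max_count=5, density_threshold=0.2):
    """True when the share of rare tokens reaches the (assumed) threshold."""
    density, _rare_terms = rare_term_density(tokens, corpus_freq, max_count)
    return density >= density_threshold

# Hypothetical lemma counts and passages, for illustration only.
corpus_freq = {"kai": 12000, "to": 9500, "esti": 8000,
               "pharmakon": 300, "chalkitis": 4, "misy": 2}
expository = ["to", "pharmakon", "esti", "kai"]
recipe = ["kai", "chalkitis", "misy", "sory"]

for name, passage in [("expository", expository), ("recipe", recipe)]:
    density, rare_terms = rare_term_density(passage, corpus_freq)
    flag = needs_expert_review(passage, corpus_freq)
    print(name, round(density, 2), rare_terms, flag)
```

The list of flagged terms can double as a worklist for a specialist glossary, which is one way to approach the pre-translation of rare terms suggested in the takeaway.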

Topics

LLM Translation
Ancient Greek Translation
Machine Translation Quality
Low-Resource Languages
Galen's Medical Texts

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.