Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen
Summary
A study evaluated how well commercial large language models (LLMs) translate Ancient Greek technical prose, using passages from two works by the physician Galen (c. 129-216 CE). Researchers assessed 60 translations of 20 paragraph-length passages, generated by ChatGPT, Claude, and Gemini. Quality was measured with seven automated metrics and a modified Multidimensional Quality Metrics (MQM) framework applied by domain specialists. For an expository text with existing English translations, the LLMs achieved a mean MQM score of 95.2/100. For a previously untranslated pharmacological text, the mean fell to 79.9/100, and two passages with extreme terminological density failed catastrophically. Terminology rarity, measured by corpus frequency, was the primary predictor of translation failure (r = -.97). Automated metrics correlated only moderately with human judgment and could not differentiate among high-quality translations.
Key takeaway
Research scientists and NLP engineers working with low-resource or ancient languages should expect LLMs to translate expository texts well but to struggle with highly specialized, rare terminology. Translation workflows should include expert human review, especially for texts with dense or unique vocabulary, because automated metrics may not flag subtle errors in high-quality outputs. Consider pre-processing texts to identify and, where possible, pre-translate rare terms.
Key insights
LLMs translate expository Ancient Greek technical prose with high accuracy, but the rarity of a passage's terminology strongly predicts failure.
Principles
- Terminology rarity predicts LLM translation failure.
- Automated metrics cannot reliably distinguish among high-quality translations.
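The first principle rests on a correlation between term rarity and human quality score. As a rough illustration of that kind of statistic, here is a minimal Pearson correlation sketch in plain Python. The function name `pearson_r` and the toy numbers are assumptions made for this example. They are not the study's data or code, and the summary does not say how rarity was scored per passage.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Invented toy values (NOT from the study): a per-passage rarity score
# and a per-passage quality score that falls as rarity rises.
toy_rarity = [0.0, 0.1, 0.2, 0.3]
toy_quality = [100.0, 90.0, 80.0, 70.0]
r = pearson_r(toy_rarity, toy_quality)
print(round(r, 2))
```

A strongly negative value, like the reported r = -.97, means that passages with rarer terminology tend to receive lower quality scores.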
Method
Domain specialists carried out reference-free human evaluation of the LLM translations of Ancient Greek technical prose using a modified Multidimensional Quality Metrics (MQM) framework. Seven automated metrics were computed alongside the human scores.
In practice
- Identify passages with high terminological density.
- Prioritize human review for texts with rare vocabulary.
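The two steps above can be sketched as a simple pre-processing pass that flags passages whose share of rare tokens is high. This is a minimal sketch under stated assumptions, not the study's method. The names `rare_term_density` and `flag_for_review`, the `max_count` and `density_threshold` defaults, and the transliterated toy tokens are all hypothetical. Real Ancient Greek would also need lemmatization before frequency lookup.

```python
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer (assumption): lowercase, keep runs of letters only.
    return re.findall(r"[^\W\d_]+", text.lower())

def rare_term_density(passage, corpus_counts, max_count=2):
    """Return (share of tokens seen <= max_count times in the corpus, rare tokens)."""
    tokens = tokenize(passage)
    if not tokens:
        return 0.0, []
    rare = [t for t in tokens if corpus_counts.get(t, 0) <= max_count]
    return len(rare) / len(tokens), sorted(set(rare))

def flag_for_review(passages, corpus_counts, max_count=2, density_threshold=0.3):
    """List (index, density, rare_terms) for passages at or above the threshold."""
    flagged = []
    for idx, passage in enumerate(passages):
        density, rare = rare_term_density(passage, corpus_counts, max_count)
        if density >= density_threshold:
            flagged.append((idx, density, rare))
    return flagged

# Hypothetical reference-corpus counts and passages (transliterated placeholders).
corpus_counts = Counter({"kai": 500, "to": 400, "de": 300, "pharmakon": 40})
passages = [
    "kai to pharmakon de",               # common vocabulary only
    "kai skammonia kastorion opopanax",  # three tokens absent from the corpus
]
flagged = flag_for_review(passages, corpus_counts)
print(flagged)
```

Flagged passages can then go to expert reviewers first, and their rare terms can seed a glossary for pre-translation. Both thresholds are arbitrary here and would need tuning against human judgments.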
Topics
- LLM Translation
- Ancient Greek Translation
- Machine Translation Quality
- Low-Resource Languages
- Galen's Medical Texts
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.