Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models
Summary
The AG-MG Parallel Corpus is a new resource that addresses the low-resource challenge of Ancient Greek (AG) to Modern Greek (MG) machine translation. It comprises 132,481 sentence-aligned pairs drawn from literary, historical, and biblical texts. The corpus was built with a multi-stage pipeline: web-scraped texts were sentence-aligned with VecAlign, using LaBSE embeddings fine-tuned on a manually aligned AG-MG subset, and then passed through an LLM-based error correction phase using Gemini 2.5 Flash. The research also establishes the first comprehensive benchmark for modern MT models on this task, evaluating NMT models (NLLB, M2M100) and the Llama-Krikri-8B Greek LLM. Fine-tuning yielded gains of up to 10.3 BLEU points, with Llama-Krikri-8B achieving the highest BLEU score of 13.16.
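Since the results above are reported in BLEU, a minimal sketch of how the metric is computed may help in reading them. This is a simplified corpus-level BLEU that assumes whitespace tokenization, one reference per hypothesis, and no smoothing. It is not the scorer used in the paper, and the function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions: whitespace tokenization, a single reference per
    hypothesis, and no smoothing. A sketch of the metric only.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the references.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)
```

Because the score is a geometric mean of 1-gram to 4-gram precisions, it falls quickly when longer n-grams rarely match, so absolute BLEU values are best compared only across systems scored with the same tokenization and references.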
Key takeaway
For research scientists developing machine translation systems for low-resource language pairs, this work demonstrates that a carefully constructed parallel corpus, combined with fine-tuned embeddings and LLM-based error correction, can significantly improve translation quality. Consider full-parameter fine-tuning of a large language model such as Llama-Krikri-8B for the best performance, or QLoRA for competitive results, with substantial relative gains on NMT models such as M2M100-1.2B.
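To make the trade-off between full-parameter fine-tuning and QLoRA concrete, the arithmetic can be sketched as follows. LoRA-style adapters add two low-rank matrices to a frozen weight matrix, and QLoRA additionally stores the frozen weights in 4-bit precision. The layer size, the rank, and the round figure of 8 billion parameters are illustrative assumptions, not the configuration reported in the paper.

```python
def full_params(d_in, d_out):
    """Trainable parameters when one d_out x d_in weight matrix is fully fine-tuned."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters a rank-r adapter adds to the same matrix.

    The adapter is a pair of matrices of shapes (rank x d_in) and (d_out x rank).
    """
    return rank * (d_in + d_out)


def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in decimal gigabytes.

    Ignores activations, optimizer state, and quantization metadata.
    """
    return n_params * bits_per_param / 8 / 1e9


# Illustrative values only (assumptions, not taken from the paper).
d_in, d_out, rank = 4096, 4096, 16
ratio = lora_params(d_in, d_out, rank) / full_params(d_in, d_out)
print(f"adapter trains {ratio:.2%} of this matrix's parameters")

n_params = 8e9  # rough size suggested by the "8B" in the model name
print(f"16-bit weights: ~{weight_memory_gb(n_params, 16):.0f} GB")
print(f"4-bit weights:  ~{weight_memory_gb(n_params, 4):.0f} GB")
```

Under these assumptions the adapter trains under one percent of the matrix, which is why QLoRA fits on much smaller hardware, while full-parameter fine-tuning updates every weight.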
Key insights
A new AG-MG parallel corpus and benchmark significantly advance Ancient Greek to Modern Greek machine translation.
Principles
- Low-resource MT benefits from specialized parallel corpora.
- LLMs can refine sentence alignment quality.
- Fine-tuning improves MT performance on specific tasks.
Method
The corpus creation pipeline combines web-scraping, multi-stage sentence alignment using VecAlign with fine-tuned LaBSE embeddings, and LLM-based error correction via Gemini 2.5 Flash.
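The core idea behind embedding-based sentence alignment can be sketched as a dynamic program over a similarity matrix. The toy version below handles only one-to-one matches and skips, and uses hand-written vectors in place of LaBSE embeddings. It is not VecAlign's actual algorithm, which also handles merged sentences and scales more efficiently. The function names and the skip penalty are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs, skip_penalty=0.3):
    """Monotonic one-to-one alignment by dynamic programming.

    Matching source i with target j earns their cosine similarity;
    leaving a sentence unaligned costs skip_penalty (a hypothetical value).
    Returns a list of (src_index, tgt_index) pairs in document order.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = score[i - 1][0] - skip_penalty, "skip_src"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = score[0][j - 1] - skip_penalty, "skip_tgt"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = cosine(src_vecs[i - 1], tgt_vecs[j - 1])
            options = [
                (score[i - 1][j - 1] + sim, "match"),
                (score[i - 1][j] - skip_penalty, "skip_src"),
                (score[i][j - 1] - skip_penalty, "skip_tgt"),
            ]
            # On ties, max keeps the first option, so a match is preferred.
            score[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Walk the back-pointers from the end to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy embeddings: the target side lacks a counterpart for source sentence 1.
source = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
target = [[1, 0, 0], [0, 0, 1]]
print(align(source, target))
```

The quality of the alignment depends on how well the similarity scores separate true pairs from near misses, which is consistent with the pipeline fine-tuning the embeddings on manually aligned AG-MG data.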
In practice
- Use VecAlign with fine-tuned embeddings for alignment.
- Employ LLMs like Gemini 2.5 Flash for alignment correction.
- Apply full-parameter fine-tuning for optimal LLM MT.
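One way to apply an LLM to alignment quality control is to ask it for a verdict on each pair and set aside the pairs it rejects. The sketch below is a filtering variant under stated assumptions: the prompt wording, the JSON reply format, and the `ask_model` callable are all hypothetical, and the correction step used for the corpus may have worked differently. No real API is called; a stub stands in for the model.

```python
import json


def build_prompt(ancient, modern):
    """Build a hypothetical verification prompt for one aligned pair."""
    return (
        "You are checking a sentence-aligned Ancient Greek / Modern Greek corpus.\n"
        'Reply with JSON only: {"aligned": true} or {"aligned": false}.\n'
        f"Ancient Greek: {ancient}\n"
        f"Modern Greek: {modern}\n"
    )


def parse_verdict(reply):
    """Return True or False from the model reply, or None if it is unusable."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    verdict = data.get("aligned")
    return verdict if isinstance(verdict, bool) else None


def filter_pairs(pairs, ask_model):
    """Split pairs into (kept, flagged) using the model's verdicts.

    ask_model is any callable mapping a prompt string to a reply string;
    a real client would wrap an API call here. Unparseable replies are
    flagged rather than silently kept.
    """
    kept, flagged = [], []
    for ancient, modern in pairs:
        verdict = parse_verdict(ask_model(build_prompt(ancient, modern)))
        (kept if verdict is True else flagged).append((ancient, modern))
    return kept, flagged


def stub_model(prompt):
    """Stand-in for an LLM: accepts pairs tagged 'good', returns junk otherwise."""
    return '{"aligned": true}' if "good" in prompt else "not valid json"


kept, flagged = filter_pairs(
    [("good AG sentence", "MG sentence"), ("AG sentence", "unrelated MG sentence")],
    stub_model,
)
print(len(kept), "kept,", len(flagged), "flagged for review")
```

Routing unusable replies to the flagged list keeps a malformed model response from quietly admitting a bad pair into the corpus.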
Topics
- Ancient Greek to Modern Greek MT
- Low-Resource NLP
- Parallel Corpus Creation
- LLM Fine-tuning
- Llama-Krikri-8B
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.