Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models

2026-05-18 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

The AG-MG Parallel Corpus is a new resource that addresses the low-resource challenge of Ancient Greek (AG) to Modern Greek (MG) machine translation. It comprises 132,481 sentence-aligned pairs drawn from literary, historical, and biblical texts. The corpus was built with a multi-stage pipeline: web-scraped data was aligned with VecAlign using LaBSE embeddings fine-tuned on a manually aligned AG-MG subset, followed by an LLM-based error correction phase using Gemini 2.5 Flash. The research also establishes the first comprehensive benchmark of modern MT models on this task, evaluating NMT models (NLLB, M2M100) and the Greek LLM Llama-Krikri-8B. Fine-tuning yielded gains of up to +10.3 BLEU points, and Llama-Krikri-8B achieved the highest BLEU score, 13.16.
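Since the results above are reported in BLEU, a minimal sketch of how a corpus-level BLEU score is computed may help in reading them. This is an illustration under simplifying assumptions (whitespace tokenization, one reference per hypothesis, no smoothing) and is not necessarily the evaluation tooling used in the original work; the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumption: plain whitespace tokenization and no smoothing, so any
    n-gram order with zero matches drives the score to 0.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: a hypothesis n-gram counts at most as often
            # as it appears in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than their references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)

score = corpus_bleu(["a b c d x"], ["a b c d e"])
```

Because the score is a geometric mean over 1-gram to 4-gram precisions, a single wrong token in a short sentence lowers it noticeably, which is one reason absolute BLEU values on a hard, low-resource pair can stay low even after large gains.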

Key takeaway

For research scientists developing machine translation systems for low-resource language pairs, this work demonstrates that a carefully constructed parallel corpus, built with fine-tuned embeddings and LLM-based error correction, can significantly improve translation quality. Consider full-parameter fine-tuning of an LLM such as Llama-Krikri-8B for the best performance, or QLoRA, which gives competitive results and substantial relative gains on NMT models such as M2M100-1.2B.

Key insights

A new AG-MG parallel corpus and benchmark significantly advance Ancient Greek to Modern Greek machine translation.

Principles

Low-resource MT benefits from specialized parallel corpora.
LLMs can refine sentence alignment quality.
Fine-tuning improves MT performance on specific tasks.

Method

The corpus creation pipeline combines web-scraping, multi-stage sentence alignment using VecAlign with fine-tuned LaBSE embeddings, and LLM-based error correction via Gemini 2.5 Flash.
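One way to picture the hand-off between embedding-based alignment and LLM correction is a similarity triage step. The sketch below is not VecAlign and is not the authors' implementation: it only scores already-paired sentences by cosine similarity and flags low-scoring pairs as candidates for a correction pass. The function names, the toy vectors, and the 0.7 threshold are all hypothetical; in a real setup the `embed` callable would wrap a sentence encoder such as the fine-tuned LaBSE model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def triage_pairs(pairs, embed, threshold=0.7):
    """Split candidate (ag, mg) pairs into accepted and flagged lists.

    `embed` is any callable mapping a sentence to a vector.
    `threshold` is a hypothetical cutoff, not a value from the source.
    """
    accepted, flagged = [], []
    for ag, mg in pairs:
        score = cosine(embed(ag), embed(mg))
        (accepted if score >= threshold else flagged).append((ag, mg, score))
    return accepted, flagged

# Toy stand-in embeddings so the sketch runs without downloading a model.
TOY_VECTORS = {
    "ag_1": [1.0, 0.0, 0.0],
    "mg_1": [0.9, 0.1, 0.0],
    "ag_2": [0.0, 1.0, 0.0],
    "mg_2": [0.0, 0.0, 1.0],
}

def toy_embed(sentence):
    return TOY_VECTORS[sentence]

accepted, flagged = triage_pairs([("ag_1", "mg_1"), ("ag_2", "mg_2")], toy_embed)
```

Pairs in `flagged` would be the ones worth sending to an LLM for review, which keeps the expensive correction step focused on likely misalignments.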

In practice

Use VecAlign with fine-tuned embeddings for alignment.
Employ LLMs like Gemini 2.5 Flash for alignment correction.
Apply full-parameter fine-tuning for optimal LLM MT.
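To make the cost side of the fine-tuning choice concrete, the sketch below compares trainable parameter counts for one weight matrix under full-parameter fine-tuning and under a LoRA-style adapter (the adapter scheme QLoRA uses on top of a 4-bit quantized base model). The layer shape and rank are hypothetical illustrations, not values reported in the source.

```python
def full_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable weights for a low-rank update B @ A, where
    A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)

# Hypothetical projection layer and adapter rank, chosen for illustration.
D_IN, D_OUT, RANK = 4096, 4096, 16

full = full_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.4%}")
```

Under these assumed shapes the adapter trains well under one percent of the weights in that layer, which is why adapter-based methods are attractive when compute is limited, while full-parameter fine-tuning remains the option to try when the best score matters most.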

Topics

Ancient Greek to Modern Greek MT
Low-Resource NLP
Parallel Corpus Creation
LLM Fine-tuning
Llama-Krikri-8B

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.