PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks
Summary
PaliBench introduces a multi-reference benchmark and methodology for evaluating machine translation of classical languages, focusing on Pali-to-English translation of Buddhist canonical texts. Standard benchmarks rely on a single reference translation, which makes them ill-suited to classical texts that often support several faithful interpretations. PaliBench addresses this by drawing passages from the Sutta Pitaka and aligning them with independent English translations by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi. The workflow combines LLM-assisted alignment, automated verification, quality filtering, deduplication of formulaic repetitions, and multi-metric evaluation against these human references. The resulting benchmark contains 1,700 passages, comprising 8,389 segments and approximately 345,000 tokens. An evaluation of ten large language models with semantic, lexical, and neural metrics found strong cross-metric concordance in system rankings. Top models achieved high semantic similarity (e.g., Gemini 3 Pro at 0.946) and low outlier rates (3.4%), but reliability and semantic divergence varied substantially across systems.
Key takeaway
AI scientists and NLP engineers who develop or evaluate machine translation systems for classical or low-resource languages should account for interpretive plurality. Adopt multi-reference evaluation frameworks like PaliBench to assess model performance against diverse, valid human translations. Current LLMs may converge on a "centroid-style" output that flattens interpretive diversity, so consider curating training data that covers a broader range of translation traditions to preserve scholarly nuance.
Key insights
Multi-reference benchmarks are essential for evaluating classical language translation due to inherent interpretive plurality.
Principles
- Translating ancient texts is interpretation, not transcription.
- Single-reference evaluation is biased toward one interpretation of a classical text.
- Semantic embeddings tolerate stylistic variation in translation.
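To make the single-reference bias concrete, here is a minimal sketch of multi-reference scoring. It assumes a simple bag-of-words F1 as the lexical metric and takes the best score over all references; the function names, the choice of metric, and the example strings are illustrative assumptions, not the metrics or data used in PaliBench.

```python
from collections import Counter


def token_f1(hypothesis: str, reference: str) -> float:
    """Bag-of-words F1 between two strings (lowercased, whitespace-split)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis: str, references: list[str]) -> float:
    """Score against every reference and keep the best match."""
    if not references:
        raise ValueError("at least one reference is required")
    return max(token_f1(hypothesis, ref) for ref in references)


# Illustrative strings only; these are not passages from the benchmark.
references = [
    "all conditioned things are impermanent",
    "all fabrications are inconstant",
]
hypothesis = "all fabrications are impermanent"

single = token_f1(hypothesis, references[0])            # one reference only
multi = multi_reference_score(hypothesis, references)   # best over all references
print(f"single-reference F1: {single:.3f}, multi-reference F1: {multi:.3f}")
```

A hypothesis that mixes vocabulary from two valid renderings is penalized when only the first reference is available, and scored more fairly when both are. Taking the maximum is one common aggregation; averaging over references is another reasonable choice.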
Method
The PaliBench workflow constructs multi-reference benchmarks by aligning independently segmented expert translations, verifying extraction, filtering for quality, deduplicating repetitions, and evaluating with complementary semantic, lexical, and neural metrics.
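The deduplication step can be sketched as follows. This summary does not say how PaliBench detects formulaic repetitions, so the sketch assumes a simple approach: normalize each passage, then drop any passage whose character-level similarity to an earlier one exceeds a cutoff. The `threshold` value and the example passages are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def deduplicate(passages: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first passage of any group of near-identical passages.

    `threshold` is a hypothetical similarity cutoff, not a value from the paper.
    """
    kept: list[str] = []
    kept_norm: list[str] = []
    for passage in passages:
        norm = normalize(passage)
        is_duplicate = any(
            SequenceMatcher(None, norm, seen).ratio() >= threshold
            for seen in kept_norm
        )
        if not is_duplicate:
            kept.append(passage)
            kept_norm.append(norm)
    return kept


# Illustrative passages only; not drawn from the benchmark.
passages = [
    "So I have heard. At one time the Buddha was staying near Savatthi.",
    "So I have heard: at one time the Buddha was staying near Savatthi",
    "Mendicants, there are these four noble truths.",
]
unique = deduplicate(passages)
print(len(unique))
```

The pairwise comparison is quadratic in the number of passages, which is acceptable for a sketch but would likely need hashing or blocking at the scale of a full corpus.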
In practice
- Use LLM-assisted alignment to reconcile translations that were segmented independently.
- Implement multi-stage verification to detect LLM extraction errors.
- Combine semantic and lexical metrics for robust evaluation.
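As an illustration of multi-stage verification, the sketch below runs two checks on LLM-extracted segments. The specific stages are assumptions made for this example, since the summary does not list the checks PaliBench uses: stage 1 confirms that each extracted segment occurs verbatim in its source translation, and stage 2 flags alignments whose word counts differ implausibly. All names, the `max_ratio` cutoff, and the sample texts are hypothetical.

```python
def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not affect matching."""
    return " ".join(text.split())


def is_verbatim(extracted: str, source: str) -> bool:
    """Stage 1: the extracted span must occur in the source text."""
    needle = collapse_whitespace(extracted)
    return bool(needle) and needle in collapse_whitespace(source)


def length_ratio_ok(segments: list[str], max_ratio: float = 3.0) -> bool:
    """Stage 2: aligned translations of one passage should have comparable length.

    `max_ratio` is a hypothetical cutoff chosen for illustration.
    """
    lengths = [len(s.split()) for s in segments]
    if not lengths or min(lengths) == 0:
        return False
    return max(lengths) / min(lengths) <= max_ratio


def verify_alignment(extracted: dict[str, str], sources: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the alignment passed."""
    problems = []
    for name, segment in extracted.items():
        if not is_verbatim(segment, sources.get(name, "")):
            problems.append(f"{name}: segment not found verbatim in source")
    if not length_ratio_ok(list(extracted.values())):
        problems.append("length mismatch across translations")
    return problems


# Made-up texts for illustration; not taken from any published translation.
sources = {
    "translator_a": "Thus it was said.   The teacher then spoke to the assembly.",
    "translator_b": "So it was spoken. Then the teacher addressed the gathering.",
}
good = {
    "translator_a": "The teacher then spoke to the assembly.",
    "translator_b": "Then the teacher addressed the gathering.",
}
bad = {
    "translator_a": "The teacher then talked to the assembly.",  # paraphrased, not extracted
    "translator_b": "Then the teacher addressed the gathering.",
}

good_problems = verify_alignment(good, sources)
bad_problems = verify_alignment(bad, sources)
print(good_problems, bad_problems)
```

The verbatim check catches a common failure mode in which the model silently paraphrases instead of copying, which would otherwise contaminate the human references with machine-generated text.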
Topics
- PaliBench
- Multi-Reference Translation
- Classical Language Translation
- Machine Translation Evaluation
- Large Language Models
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.