PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks

2026-05-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

PaliBench is a multi-reference benchmark and methodology for evaluating machine translation of classical languages, focused on Pali-to-English translation of Buddhist canonical texts. Standard benchmarks rely on a single reference translation, which suits classical texts poorly because a passage often supports several faithful interpretations. PaliBench instead draws passages from the Sutta Pitaka and aligns each with independent English translations by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi.

The workflow combines LLM-assisted alignment, automated verification, quality filtering, deduplication of formulaic repetitions, and multi-metric evaluation against the human references. The resulting benchmark contains 1,700 passages, comprising 8,389 segments and approximately 345,000 tokens.

Ten large language models were evaluated with semantic, lexical, and neural metrics. System rankings showed strong cross-metric concordance, and top models achieved high semantic similarity (e.g., Gemini 3 Pro at 0.946) and low outlier rates (3.4%). The evaluation also revealed substantial variation in reliability and semantic divergence.
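
The cross-metric concordance mentioned in the summary can be illustrated with a rank correlation between the system rankings that each metric produces. The sketch below is a minimal example using Spearman's rho in plain Python. The source does not say which concordance statistic the authors used, so the choice of Spearman is an assumption, and the scores are made-up numbers for four hypothetical systems, not results from the paper.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks (highest value gets rank 1), averaging ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Assumes neither list is constant (otherwise the variance is zero).
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def pairwise_concordance(metric_scores):
    """Spearman's rho for every pair of metrics over the same systems."""
    return {
        (a, b): spearman(metric_scores[a], metric_scores[b])
        for a, b in combinations(metric_scores, 2)
    }


# Illustrative, made-up scores: one entry per hypothetical system.
scores = {
    "semantic": [0.94, 0.91, 0.88, 0.80],
    "lexical": [0.41, 0.38, 0.39, 0.25],
    "neural": [0.86, 0.84, 0.80, 0.70],
}

if __name__ == "__main__":
    for pair, rho in pairwise_concordance(scores).items():
        print(pair, round(rho, 3))
```

Values near 1 for every metric pair would indicate that the metrics agree on which systems are better, even if their absolute scales differ.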

Key takeaway

AI scientists and NLP engineers who build or evaluate machine translation systems for classical or low-resource languages need to account for interpretive plurality. Multi-reference evaluation frameworks such as PaliBench assess model output against several diverse, valid human translations rather than one. Current LLMs may also converge on a "centroid-style" output that flattens interpretive diversity, so consider curating training data that covers a broader range of translation traditions to preserve scholarly nuance.

Key insights

Multi-reference benchmarks are essential for evaluating classical language translation due to inherent interpretive plurality.

Principles

Translating ancient texts is interpretation, not transcription.
Single-reference evaluation biases scores toward one interpretation of a classical text.
Embedding-based semantic metrics tolerate stylistic variation between translations.

Method

The PaliBench workflow constructs multi-reference benchmarks by aligning independently segmented expert translations, verifying extraction, filtering for quality, deduplicating repetitions, and evaluating with complementary semantic, lexical, and neural metrics.
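
The deduplication step in the workflow above can be sketched as follows. This is a minimal illustration, not the PaliBench implementation: it assumes a repetition is an exact match after Unicode normalization, case folding, and punctuation removal, and the segment layout (a dict with a `pali` field) is hypothetical. The actual pipeline may use a different criterion, such as fuzzy matching.

```python
import re
import unicodedata


def normalize(text):
    """Canonical key for a segment: NFC form, case-folded, punctuation dropped.

    Diacritics are kept on purpose, since they distinguish Pali words.
    """
    composed = unicodedata.normalize("NFC", text).casefold()
    return " ".join(re.findall(r"\w+", composed))


def deduplicate(segments):
    """Keep the first occurrence of each normalized Pali segment."""
    seen = set()
    kept = []
    for seg in segments:
        key = normalize(seg["pali"])  # "pali" is a hypothetical field name
        if key in seen:
            continue
        seen.add(key)
        kept.append(seg)
    return kept


if __name__ == "__main__":
    sample = [
        {"id": 1, "pali": "Evaṃ me sutaṃ."},
        {"id": 2, "pali": "evaṃ me   sutaṃ"},
        {"id": 3, "pali": "Ekaṃ samayaṃ bhagavā"},
    ]
    print([seg["id"] for seg in deduplicate(sample)])
```

Keeping only the first occurrence prevents stock formulas that recur across many suttas from dominating the benchmark scores.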

In practice

Use LLM-assisted alignment to reconcile translations that were segmented differently.
Implement multi-stage verification to detect LLM extraction errors.
Combine semantic and lexical metrics for robust evaluation.
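
The last recommendation above, combining semantic and lexical metrics, might look like the sketch below when several references are available. Everything here is an assumption made for illustration: token-level F1 stands in for the lexical metric, a character n-gram vector (`toy_embed`) stands in for a real sentence-embedding model, scores are aggregated by taking the best match over the references, and the outlier threshold of 0.5 is arbitrary. The source does not specify the authors' metrics, aggregation rule, or outlier definition.

```python
from collections import Counter
from math import sqrt


def token_f1(hyp, ref):
    """Lexical stand-in: F1 over lowercased whitespace tokens."""
    h, r = Counter(hyp.lower().split()), Counter(ref.lower().split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def toy_embed(text, n=3):
    """Toy stand-in for a sentence embedding: character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_segment(hyp, refs, embed=toy_embed):
    """Score a hypothesis against its closest reference under each metric."""
    hyp_vec = embed(hyp)
    return {
        "lexical": max(token_f1(hyp, ref) for ref in refs),
        "semantic": max(cosine(hyp_vec, embed(ref)) for ref in refs),
    }


def outlier_rate(semantic_scores, threshold=0.5):
    """Fraction of segments whose semantic score falls below the threshold."""
    return sum(s < threshold for s in semantic_scores) / len(semantic_scores)


if __name__ == "__main__":
    # Illustrative reference renderings of one segment, not benchmark data.
    refs = ["So I have heard.", "Thus have I heard.", "I have heard that"]
    print(score_segment("Thus I have heard.", refs))
```

Taking the maximum over references means a hypothesis is not penalized for following one valid translation tradition rather than another. In a real setup, `embed` would be replaced by a proper embedding model passed in through the same parameter.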

Topics

PaliBench
Multi-Reference Translation
Classical Language Translation
Machine Translation Evaluation
Large Language Models

Code references

MateMetzger/palibench

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.