CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia
Summary
CzechDocs is a multiway parallel dataset of formatted documents for evaluating how well machine translation systems preserve document formatting, with a focus on minority languages in Czechia. It covers Czech and minority languages, primarily Ukrainian and English, with smaller portions of Vietnamese, Russian, and others. The dataset comprises 316 parallel language mutations (language versions) of 77 unique documents in HTML, DOCX, and PDF formats, totaling 60,153 translatable segments, 271,111 words, and 126,833 markup tags. Markup is dense: 90.3% of segments include at least one tag, and segments average 2.1 tags each. The authors compared detag-and-project methods against direct tagged input using the LLMs Aya-expanse-8b and gpt-4.1-nano. Both LLMs were able to transfer markup, but detagging and reinsertion slightly improved tagged BLEU scores, and prompts that explicitly ask for tag preservation improved gpt-4.1-nano's performance. A validation split and evaluation toolkit are publicly available at https://github.com/cepin19/CzechDocs.
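The per-segment averages follow from the totals quoted above. The short check below is only arithmetic on those reported figures; the variable names are illustrative, and the count of tagged segments is an approximation derived from the rounded 90.3% share rather than a number stated in the source.

```python
# Totals as reported in the summary above.
segments = 60_153
words = 271_111
tags = 126_833
tagged_share = 0.903  # rounded share of segments with at least one tag

tags_per_segment = tags / segments
words_per_segment = words / segments
# Approximate, because the share itself is rounded to one decimal place.
tagged_segments = round(tagged_share * segments)

print(f"{tags_per_segment:.1f} tags per segment")    # 2.1 tags per segment
print(f"{words_per_segment:.1f} words per segment")  # 4.5 words per segment
print(f"about {tagged_segments:,} segments with at least one tag")
```

With roughly two tags for every four to five words, segments are short and tag placement affects most of them.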
Key takeaway
Machine learning engineers building document translation systems can use the CzechDocs dataset to evaluate format-preserving machine translation. Neither detag-and-project nor direct tagged input with an LLM is uniformly better: performance varies, so let the characteristics of your documents inform the choice. Explicitly instructing the LLM to preserve markup can improve tag accuracy, as observed with gpt-4.1-nano, and gives better structural fidelity in the translated output.
Key insights
A new multiway parallel dataset facilitates evaluating machine translation systems' ability to preserve document formatting, particularly for minority languages.
Principles
- Document structure preservation is crucial for real-world MT deployment.
- LLMs can translate tagged inputs, with explicit prompts improving tag preservation.
- Detag-and-project and direct tagged input give comparable results overall, but which one performs better varies with the document.
Method
The evaluation compared three setups: detag-and-project, direct tagged input, and direct tagged input with a prompt that explicitly emphasizes tag preservation. It used the Aya-expanse-8b and gpt-4.1-nano LLMs and measured both tagged and detagged BLEU scores.
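As a rough illustration of the detag-and-project idea, the sketch below strips tags from a source segment, then reinserts them into a translation using a word alignment. It is a minimal sketch under stated assumptions, not the procedure from the paper or the CzechDocs toolkit: the function names `detag` and `project` are hypothetical, the translation and alignment are supplied by hand where a real system would call an MT model and a word aligner, and tokens are split on whitespace.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")

def detag(segment):
    """Split a tagged segment into plain tokens and (token_index, tag) pairs.

    token_index is the number of plain tokens seen before the tag.
    """
    tokens, tags = [], []
    for piece in re.findall(r"<[^>]+>|[^<\s]+", segment):
        if TAG_RE.fullmatch(piece):
            tags.append((len(tokens), piece))
        else:
            tokens.append(piece)
    return tokens, tags

def project(target_tokens, tags, alignment):
    """Reinsert tags into target tokens via a source->target index alignment.

    Opening tags go before the target token aligned to the source token that
    followed them; closing tags go after the target token aligned to the
    source token that preceded them.
    """
    before, after = {}, {}
    for src_idx, tag in tags:
        if tag.startswith("</"):
            tgt = alignment.get(src_idx - 1, len(target_tokens) - 1)
            after.setdefault(tgt, []).append(tag)
        else:
            tgt = alignment.get(src_idx, 0)
            before.setdefault(tgt, []).append(tag)
    out = []
    for i, tok in enumerate(target_tokens):
        out.append("".join(before.get(i, [])) + tok + "".join(after.get(i, [])))
    return " ".join(out)

# Hand-made example: Czech source, English translation, manual alignment.
source = "auto <b>mého bratra</b>"
src_tokens, src_tags = detag(source)
translation = ["my", "brother's", "car"]  # would come from an MT system
alignment = {0: 2, 1: 0, 2: 1}            # would come from a word aligner
print(project(translation, src_tags, alignment))
```

This simple projection can produce misplaced or crossing tags when the words inside a tagged span are reordered relative to each other or left unaligned, which is one reason tag-aware evaluation is needed.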
In practice
- Utilize the CzechDocs dataset and toolkit for tag-aware MT research.
- Experiment with explicit tag preservation prompts for LLM-based translation.
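One way to experiment with explicit tag preservation prompts is sketched below. The prompt wording, the function names `build_prompt` and `tags_preserved`, and the example strings are all assumptions made for illustration; they are not the prompts or checks used in the paper. The check only compares which tags occur and how often, not where they are placed.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")

def build_prompt(segment, src_lang, tgt_lang, emphasize_tags=True):
    """Build a translation prompt, optionally stressing markup preservation."""
    prompt = f"Translate the following text from {src_lang} to {tgt_lang}."
    if emphasize_tags:
        # Hypothetical wording; adjust to the model being tested.
        prompt += (
            " Keep every markup tag exactly as it appears and place it"
            " around the corresponding translated words."
        )
    return f"{prompt}\n\n{segment}"

def tags_preserved(source, translation):
    """Return True if both strings contain the same tags the same number of times."""
    return Counter(TAG_RE.findall(source)) == Counter(TAG_RE.findall(translation))

source = "auto <b>mého bratra</b>"
print(build_prompt(source, "Czech", "English"))

# Example model outputs, written by hand for illustration.
print(tags_preserved(source, "<b>my brother's</b> car"))  # True
print(tags_preserved(source, "my brother's car"))         # False
```

Running the same segments with `emphasize_tags` set to `False` and `True` and comparing the share of outputs that pass the check gives a quick first signal before computing tagged BLEU with the toolkit.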
Topics
- Machine Translation
- Document Formatting
- Parallel Datasets
- LLM Translation
- Markup Preservation
- Minority Languages
Code references
- cepin19/CzechDocs
- ArtifexSoftware/pdf2docx
- ufal/lindat-translation
- achimr/m4loc
- kukas/document-translation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.