CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Intermediate, long

Summary

CzechDocs is a multiway parallel dataset of formatted documents, built to evaluate how well machine translation systems preserve document formatting, with a focus on minority languages in Czechia. It covers Czech alongside minority languages, primarily Ukrainian and English, with smaller portions of Vietnamese, Russian, and others. The dataset comprises 316 parallel language versions of 77 unique documents in HTML, DOCX, and PDF formats, totaling 60,153 translatable segments, 271,111 words, and 126,833 markup tags. Notably, 90.3% of segments contain at least one markup tag, and segments average 2.1 tags each. The researchers compared a detag-and-project approach (strip the tags, translate the plain text, then reinsert the tags) against feeding tagged input directly to the LLMs Aya-expanse-8b and gpt-4.1-nano. The LLMs were able to transfer markup on their own, but detagging and reinsertion slightly improved tagged BLEU scores, and prompts that explicitly asked for tag preservation improved gpt-4.1-nano's performance. A validation split and evaluation toolkit are publicly available at https://github.com/cepin19/CzechDocs.
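The markup statistics above can be reproduced on any list of segments with a few lines of Python. This is a minimal sketch, not the dataset's own toolkit: the regex `TAG_RE`, the function `markup_stats`, and the toy segments are assumptions made for illustration, and the actual CzechDocs tag format may differ.

```python
import re

# Assumption: markup tags look like XML/HTML tags, e.g. <b>, </b>, <a href="x">.
TAG_RE = re.compile(r"<[^<>]+>")

def markup_stats(segments):
    """Count tags per segment and summarize how tag-dense the data is."""
    tag_counts = [len(TAG_RE.findall(s)) for s in segments]
    n = len(segments)
    tagged = sum(1 for c in tag_counts if c > 0)
    return {
        "segments": n,
        "tags": sum(tag_counts),
        "share_tagged": tagged / n if n else 0.0,
        "tags_per_segment": sum(tag_counts) / n if n else 0.0,
    }

# Toy segments (hypothetical, not taken from the dataset).
toy = ["<b>Dobrý den</b>", "Bez značek", '<a href="x">odkaz</a> a <i>text</i>']
stats = markup_stats(toy)

# Sanity check on the reported totals: 126,833 tags over 60,153 segments.
reported_avg = 126_833 / 60_153
print(stats, round(reported_avg, 1))
```

Dividing the reported tag total by the reported segment total gives roughly 2.1, which agrees with the per-segment average quoted in the summary.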

Key takeaway

Machine learning engineers building document translation systems should consider using CzechDocs to evaluate format-preserving machine translation. The choice between detag-and-project and direct tagged input to an LLM should depend on the characteristics of the documents involved, because performance varies between the two. Explicitly instructing the LLM to preserve markup can noticeably improve tag accuracy, especially with models like gpt-4.1-nano, and so yields better structural fidelity in the translated output.
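As a rough illustration of explicit tag-preservation prompting, the sketch below builds a prompt and checks whether a model's output kept the same tags as the source. The prompt wording, `build_prompt`, and `tag_diff` are hypothetical and are not the prompts or metrics used in the paper; the model call itself is left out, and the example output is hand-written.

```python
import re
from collections import Counter

# Assumption: tags are XML/HTML-like, as in the HTML and DOCX-derived segments.
TAG_RE = re.compile(r"<[^<>]+>")

def build_prompt(segment, src_lang, tgt_lang):
    """Hypothetical prompt that explicitly asks the model to keep markup."""
    return (
        f"Translate the following {src_lang} text into {tgt_lang}. "
        "Keep every markup tag exactly as it appears; "
        "do not add, drop, or alter tags.\n\n"
        f"{segment}"
    )

def tag_diff(source, output):
    """Compare tag multisets; empty lists mean every tag was carried over."""
    src = Counter(TAG_RE.findall(source))
    out = Counter(TAG_RE.findall(output))
    return {
        "missing": sorted((src - out).elements()),
        "extra": sorted((out - src).elements()),
    }

source = "<b>Ahoj</b> <i>světe</i>"
prompt = build_prompt(source, "Czech", "English")
# Hand-written stand-in for a model response that lost the italic tags.
diff = tag_diff(source, "<b>Hello</b> world")
print(diff)
```

A check like this only confirms that the tags survived, not that they wrap the right words, so it complements a tagged BLEU score but does not replace it.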

Key insights

A new multiway parallel dataset facilitates evaluating machine translation systems' ability to preserve document formatting, particularly for minority languages.

Method

The evaluation compared detag-and-project, direct tagged input, and prompt-emphasized tag preservation using Aya-expanse-8b and gpt-4.1-nano LLMs, measuring tagged and detagged BLEU scores.
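The detag-and-project idea can be sketched as follows. This is a deliberately naive version under stated assumptions: tags are reinserted at proportionally scaled word positions, whereas a real system would more likely rely on word alignment, and the paper's actual projection method is not described in this summary. The names `detag` and `project` are hypothetical, and the translation step is replaced by a hand-written target sentence.

```python
import re

# Tokens are either a tag or a run of non-space, non-bracket characters.
TOKEN_RE = re.compile(r"<[^<>]+>|[^<>\s]+")

def detag(segment):
    """Split a segment into plain words and (word_index, tag) pairs."""
    words, tags = [], []
    for tok in TOKEN_RE.findall(segment):
        if tok.startswith("<"):
            tags.append((len(words), tok))  # tag sits before word at this index
        else:
            words.append(tok)
    return words, tags

def project(tags, src_len, tgt_words):
    """Reinsert tags into the target at proportionally scaled positions."""
    tgt_len = len(tgt_words)
    by_pos = {}
    for idx, tag in tags:
        pos = round(idx * tgt_len / src_len) if src_len else 0
        by_pos.setdefault(pos, []).append(tag)
    out = []
    for i in range(tgt_len + 1):
        out.extend(by_pos.get(i, []))
        if i < tgt_len:
            out.append(tgt_words[i])
    return " ".join(out)  # note: whitespace around tags is normalized

src_words, src_tags = detag("<b>Dobrý den</b> všem")
# Stand-in for the MT step: a hand-written translation of the detagged text.
tgt_words = "Good day everyone".split()
projected = project(src_tags, len(src_words), tgt_words)
print(projected)
```

Scoring `projected` against a tagged reference corresponds to the tagged BLEU setting, while scoring the plain `tgt_words` against a detagged reference corresponds to the detagged one.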

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.