A Dataset for Probing Translationese Preferences in English-to-Swedish Translation
Summary
A new English-to-Swedish dataset has been released to investigate "translationese" preferences in language models. It is the first freely available resource that contrasts translationese sentences with more idiomatic Swedish alternatives, and it includes error tags and problem descriptions for the original translations. Experiments on this dataset with smaller Swedish and multilingual large language models (LLMs) revealed a consistent preference for translationese phrasing. When the English source sentence was removed, models selected the human-preferred idiomatic alternatives more frequently, suggesting that exposure to the source language biases models towards literal translations. Even without source context, however, LLMs often favored the translationese variant, highlighting a persistent challenge in generating natural non-English output.
Key takeaway
AI engineers developing multilingual LLMs should consider adding this English-to-Swedish dataset to their evaluation pipelines. Benchmarking models against it helps identify tendencies towards "translationese" output, which is the first step towards mitigating them and producing more natural, idiomatic Swedish. Prioritize training strategies that reduce source-language bias to improve translation quality.
Key insights
Language models often prefer literal "translationese" over idiomatic phrasing, especially when exposed to the source language.
Principles
- Exposure to the source language biases LLMs towards literal translations.
- The preference for translationese persists even when the source sentence is removed.
Method
The dataset contrasts translationese with idiomatic alternatives, including error tags, to probe LLM preferences for English-to-Swedish translation.
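As a rough illustration of what one entry in such a contrastive dataset might look like, here is a minimal sketch of a record type. The field names, the tag value, and the example sentences are assumptions made for this sketch; they are not taken from the released dataset, whose actual schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastPair:
    """One contrastive item (hypothetical schema, not the dataset's own)."""

    source_en: str                    # English source sentence
    translationese_sv: str            # original, more literal Swedish translation
    idiomatic_sv: str                 # human-preferred idiomatic alternative
    error_tags: tuple[str, ...] = ()  # labels for what is wrong with the original
    problem_description: str = ""     # free-text explanation of the problem


# Invented example for illustration only; it does not come from the dataset.
example = ContrastPair(
    source_en="I spent the weekend with my family.",
    translationese_sv="Jag spenderade helgen med min familj.",
    idiomatic_sv="Jag tillbringade helgen med min familj.",
    error_tags=("anglicism",),
    problem_description="'spendera' is used for time, mirroring English 'spend'.",
)
```

Keeping both Swedish variants next to the source makes it straightforward to present the same item to a model with or without the English sentence.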
In practice
- Use the dataset to benchmark LLM idiomaticity.
- Test LLM output with and without source context.
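The two checks above can be sketched as a small evaluation loop that measures how often a model prefers the translationese variant, once with the English source in the context and once without it. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the `Pair` type, the prompt template, and the `score(context, candidate)` interface are hypothetical, and `toy_score` is a placeholder heuristic standing in for a real model's log-probability.

```python
from typing import Callable, Iterable, NamedTuple


class Pair(NamedTuple):
    source_en: str
    translationese_sv: str
    idiomatic_sv: str


# Hypothetical interface: (context, candidate) -> score, higher means preferred.
# In a real setup this would wrap a model's log-probability of the candidate.
Scorer = Callable[[str, str], float]


def translationese_preference_rate(
    pairs: Iterable[Pair], score: Scorer, with_source: bool
) -> float:
    """Fraction of pairs where the translationese variant scores strictly higher."""
    items = list(pairs)
    if not items:
        raise ValueError("need at least one pair")
    wins = 0
    for p in items:
        # Assumed prompt template; an empty context means no source exposure.
        context = f"English: {p.source_en}\nSwedish: " if with_source else ""
        # Ties count as "no preference for translationese".
        if score(context, p.translationese_sv) > score(context, p.idiomatic_sv):
            wins += 1
    return wins / len(items)


def toy_score(context: str, candidate: str) -> float:
    """Stand-in scorer (NOT a language model): counts candidate words that
    share a 4-character prefix with some word in the context."""
    prefixes = {w.lower()[:4] for w in context.split()}
    return float(sum(w.lower()[:4] in prefixes for w in candidate.split()))


# Invented example pair for illustration only.
pairs = [
    Pair(
        "I spent the weekend with my family.",
        "Jag spenderade helgen med min familj.",
        "Jag tillbringade helgen med min familj.",
    )
]
rate_with_source = translationese_preference_rate(pairs, toy_score, with_source=True)
rate_without_source = translationese_preference_rate(pairs, toy_score, with_source=False)
```

Comparing the two rates for the same model gives a direct view of how much of its translationese preference is tied to seeing the source sentence. Swapping `toy_score` for a scorer backed by an actual LLM is the part that depends on your own model and tooling.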
Topics
- Translationese
- English-to-Swedish Translation
- Language Models
- Linguistic Dataset
- Idiomatic Translation
Best for: AI Engineer, Machine Learning Engineer, AI Scientist, AI Researcher, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.