A Dataset for Probing Translationese Preferences in English-to-Swedish Translation

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

A new English-to-Swedish dataset has been released to probe "translationese" preferences in language models. It is presented as the first freely available resource that contrasts translationese sentences with more idiomatic Swedish alternatives, and it includes error tags and problem descriptions for the original translations. Experiments with smaller Swedish and multilingual large language models (LLMs) on this dataset showed a consistent preference for the translationese phrasing. When the English source sentence was removed, the models chose the human-preferred idiomatic alternative more often, which suggests that exposure to the source sentence biases models toward literal translations. Even without the source, however, the LLMs often favored the translationese variant, so generating natural non-English output remains a persistent challenge.
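The with-source versus without-source comparison described above can be sketched as a small evaluation loop. This is a minimal sketch under stated assumptions, not the authors' code: the field names (`source`, `translationese`, `idiomatic`), the `Scorer` signature, the toy scorer, and the placeholder strings are all hypothetical. In a real setup the scorer would likely return a model's log-probability for the candidate, optionally conditioned on the English source.

```python
from typing import Callable, Optional, Sequence

# Hypothetical scorer signature: returns a score (for example a log-probability)
# for `candidate`, optionally conditioned on the English `source`.
Scorer = Callable[[str, Optional[str]], float]


def translationese_preference_rate(
    pairs: Sequence[dict], score: Scorer, use_source: bool
) -> float:
    """Fraction of pairs where the translationese variant scores strictly higher."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    preferred = 0
    for pair in pairs:
        source = pair["source"] if use_source else None
        if score(pair["translationese"], source) > score(pair["idiomatic"], source):
            preferred += 1
    return preferred / len(pairs)


def toy_scorer(candidate: str, source: Optional[str]) -> float:
    """Stand-in for a real model: penalizes length, rewards word overlap with the source."""
    base = -0.1 * len(candidate.split())
    if source is None:
        return base
    overlap = len(set(candidate.lower().split()) & set(source.lower().split()))
    return base + overlap


# Placeholder strings, not real dataset entries.
TOY_PAIRS = [
    {"source": "alpha beta gamma", "translationese": "alpha beta gamma", "idiomatic": "delta epsilon"},
    {"source": "one two three", "translationese": "one two", "idiomatic": "five six seven"},
]

with_source = translationese_preference_rate(TOY_PAIRS, toy_scorer, use_source=True)
without_source = translationese_preference_rate(TOY_PAIRS, toy_scorer, use_source=False)
print(f"with source: {with_source:.2f}, without source: {without_source:.2f}")
```

Passing the scorer in as a function keeps the loop independent of any particular model library. Swapping `toy_scorer` for a wrapper around a real model is the only change needed to run the same comparison on actual data.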

Key takeaway

AI engineers developing multilingual LLMs should consider adding this English-to-Swedish dataset to their evaluation pipelines. Benchmarking models against it can help identify tendencies toward "translationese" output, a first step toward more natural and idiomatic translations in non-English languages. Training strategies that reduce source-language bias are worth prioritizing to improve translation quality.

Key insights

Language models often prefer literal "translationese" over idiomatic phrasing, especially when exposed to the source language.

Principles

Method

The dataset pairs each translationese sentence with a more idiomatic Swedish alternative and annotates the original translation with error tags and a problem description, so that LLM preferences in English-to-Swedish translation can be probed.
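One way to picture such a record, and to use the error tags when analysing results, is sketched below. The class name, field names, and tag values (`word_order`, `lexical_choice`) are assumptions made for illustration; the dataset's actual schema and tag inventory may differ, and the text fields hold placeholders rather than real entries.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class ContrastivePair:
    """Hypothetical record layout; not the dataset's confirmed schema."""
    source_en: str
    translationese_sv: str
    idiomatic_sv: str
    error_tags: List[str] = field(default_factory=list)
    problem_description: str = ""


def translationese_rate_by_tag(
    records: Sequence[ContrastivePair], chose_translationese: Sequence[bool]
) -> Dict[str, float]:
    """Per error tag, the share of records where the model chose translationese."""
    if len(records) != len(chose_translationese):
        raise ValueError("records and choices must have the same length")
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for record, chose in zip(records, chose_translationese):
        for tag in record.error_tags:
            totals[tag] += 1
            hits[tag] += int(chose)
    return {tag: hits[tag] / totals[tag] for tag in totals}


# Placeholder records; tags and text are illustrative only.
EXAMPLE_RECORDS = [
    ContrastivePair("<en 1>", "<literal sv 1>", "<idiomatic sv 1>", ["word_order"]),
    ContrastivePair("<en 2>", "<literal sv 2>", "<idiomatic sv 2>", ["word_order", "lexical_choice"]),
    ContrastivePair("<en 3>", "<literal sv 3>", "<idiomatic sv 3>", ["lexical_choice"]),
]
EXAMPLE_CHOICES = [True, True, False]  # hypothetical model decisions

print(translationese_rate_by_tag(EXAMPLE_RECORDS, EXAMPLE_CHOICES))
```

Breaking the preference rate down by tag is one possible use of the annotations: it would show which kinds of translation problems a model is most likely to accept.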

Best for: AI Engineer, Machine Learning Engineer, AI Scientist, AI Researcher, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.