Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

An in-context learning approach significantly improves low-resource machine translation from Coptic to English by augmenting prompts with syntactic information from Universal Dependencies (UD) parses. Researchers from Georgetown University's Corpling Lab found that dictionary-based glosses (LEX) are effective on their own, but that combining them with syntactic data (SYN) yields new state-of-the-art results for Coptic translation across model sizes, including Gemma and GPT-4.1. The method explores several syntactic representations: raw parser outputs (CoNLL-U), verbalized dependency relations (DEP), and targeted instructions for difficult constructions (CON). The combined approach consistently outperforms lexicon-only augmentation. High-quality automatic parses are sufficient to obtain these gains, and the benefits extend to out-of-domain texts such as Coptic ostraca.

Key takeaway

Research scientists developing machine translation systems for low-resource languages should consider integrating syntactic information from Universal Dependencies parses alongside traditional dictionary-based augmentation. The LEX+SYN combination in particular achieves statistically significant improvements in translation quality for Coptic. Syntactic data can be supplied as raw CoNLL-U output, verbalized dependency relations, and targeted instructions for difficult grammatical constructions, and automatically generated parses are sufficient for these gains.

Key insights

Syntactic augmentation, combined with lexical information, significantly boosts low-resource machine translation quality.

Method

The method augments LLM prompts with dictionary entries and with Universal Dependencies syntactic parses in several forms, including raw CoNLL-U, verbalized dependencies, and targeted construction-specific instructions.
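
The prompt augmentation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the prompt layout, the wording of the verbalized relations, the glossary, and the construction note are all hypothetical, and the parsed sentence is an English stand-in rather than Coptic data. The only fixed assumption is the standard 10-column CoNLL-U layout, where column 2 is FORM, column 7 is HEAD, and column 8 is DEPREL.

```python
from typing import Dict, List, NamedTuple, Optional


class Token(NamedTuple):
    idx: int      # CoNLL-U ID column
    form: str     # CoNLL-U FORM column
    head: int     # CoNLL-U HEAD column (0 marks the root)
    deprel: str   # CoNLL-U DEPREL column


def parse_conllu(block: str) -> List[Token]:
    """Read the ID, FORM, HEAD and DEPREL columns of one CoNLL-U sentence."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        tokens.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens


def verbalize(tokens: List[Token]) -> List[str]:
    """Turn each dependency edge into a plain-language sentence (DEP-style)."""
    forms = {t.idx: t.form for t in tokens}
    lines = []
    for t in tokens:
        if t.head == 0:
            lines.append(f'"{t.form}" is the root of the sentence.')
        else:
            lines.append(
                f'"{t.form}" depends on "{forms[t.head]}" via the relation "{t.deprel}".'
            )
    return lines


def build_prompt(
    source: str,
    conllu: str,
    glossary: Dict[str, str],
    construction_notes: Optional[List[str]] = None,
) -> str:
    """Assemble a LEX+SYN style prompt; the layout is an assumption."""
    tokens = parse_conllu(conllu)
    lex = [f"{t.form}: {glossary[t.form]}" for t in tokens if t.form in glossary]
    parts = [
        "Translate the following sentence into English.",
        f"Sentence: {source}",
        "Dictionary entries (LEX):",
        *lex,
        "Dependency parse (CoNLL-U):",
        conllu.strip(),
        "Dependency relations in words (DEP):",
        *verbalize(tokens),
    ]
    if construction_notes:
        parts += ["Notes on difficult constructions (CON):", *construction_notes]
    return "\n".join(parts)


# English stand-in sentence, NOT Coptic data; a real pipeline would use parser output.
TOY_CONLLU = "\n".join([
    "# text = the dog barked",
    "1\tthe\tthe\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tbarked\tbark\tVERB\t_\t_\t0\troot\t_\t_",
])
# Hypothetical glossary standing in for dictionary lookups.
TOY_GLOSSARY = {"dog": "a domesticated canine", "barked": "past tense of bark"}

if __name__ == "__main__":
    print(build_prompt("the dog barked", TOY_CONLLU, TOY_GLOSSARY,
                       ["(hypothetical) The subject precedes the verb here."]))
```

Keeping the three syntactic forms as separate prompt sections makes it easy to switch each one on or off when comparing a lexicon-only prompt against a combined one.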

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.