CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

CLewR (Curriculum Learning with Restarts) is a curriculum learning strategy that plugs into existing preference optimization (PO) algorithms to improve machine translation (MT). It addresses catastrophic forgetting by restarting an easy-to-hard data curriculum in every training epoch. Applied with PO techniques such as DPO, CPO, and ARPO, it yields consistent and statistically significant MT gains across several large language model (LLM) families, including Gemma2, Qwen2.5, and Llama3.1. The work also introduces CLewR-z, which derives its curriculum score from the ARPO distance, and an enhanced ARPO variant that incorporates external semantic signals from MT metrics such as BLEU and COMET-22. The code for CLewR is publicly available on GitHub.

Key takeaway

Research scientists fine-tuning LLMs for machine translation should consider CLewR. Integrating this curriculum-with-restarts strategy into preference optimization algorithms such as DPO, CPO, or ARPO yields consistent and statistically significant gains, particularly for generic LLMs. Revisiting the easy-to-hard curriculum in every epoch mitigates catastrophic forgetting, so models retain what they learned from easier examples while learning harder ones.

Key insights

CLewR improves machine translation by integrating curriculum learning with restarts into preference optimization to mitigate catastrophic forgetting.

Principles

Repeating an easy-to-hard curriculum mitigates forgetting.
Data ordering significantly impacts model performance.
External MT metrics can enhance preference optimization.

Method

CLewR sorts preference triplets by a similarity score derived from MT metrics (BLEU, COMET-22, METEOR). Training proceeds in easy-to-hard order, and the same ordering is replayed from the start in every epoch to mitigate catastrophic forgetting.
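
The ordering-with-restarts idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Triplet`, `token_overlap`, `build_curriculum`, and `epochs_with_restarts` are hypothetical, the token-overlap score is a toy stand-in for BLEU, COMET-22, or METEOR, and the sketch assumes that a triplet whose chosen and rejected translations are more similar counts as harder.

```python
from typing import Callable, Iterator, List, NamedTuple, Sequence, Tuple


class Triplet(NamedTuple):
    """One preference example: a source sentence and two candidate translations."""
    source: str
    chosen: str
    rejected: str


def token_overlap(a: str, b: str) -> float:
    """Toy similarity in [0, 1]: Jaccard overlap of whitespace tokens.

    Stand-in for a real MT metric, which would be plugged in here instead.
    """
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def build_curriculum(
    triplets: Sequence[Triplet],
    similarity: Callable[[str, str], float] = token_overlap,
) -> List[Triplet]:
    """Sort triplets easy-to-hard.

    Assumption (not confirmed by the summary): a higher chosen/rejected
    similarity means the pair is harder to tell apart, so it comes later.
    """
    return sorted(triplets, key=lambda t: similarity(t.chosen, t.rejected))


def epochs_with_restarts(
    curriculum: Sequence[Triplet], num_epochs: int, batch_size: int
) -> Iterator[Tuple[int, List[Triplet]]]:
    """Yield (epoch, batch) pairs, restarting from the easiest batch every epoch."""
    for epoch in range(num_epochs):
        # No reshuffling: the same easy-to-hard permutation is reused each epoch.
        for start in range(0, len(curriculum), batch_size):
            yield epoch, list(curriculum[start:start + batch_size])


data = [
    Triplet("s1", "the cat sat", "the cat sat down"),
    Triplet("s2", "hello world", "goodbye moon"),
    Triplet("s3", "a b c d", "a b x y"),
]
curriculum = build_curriculum(data)
schedule = list(epochs_with_restarts(curriculum, num_epochs=2, batch_size=2))
```

In a real training loop, each yielded batch would be passed to the PO loss (DPO, CPO, or ARPO) in place of a randomly shuffled batch; the only change relative to standard training is the fixed, sorted data order.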

In practice

Apply CLewR to DPO, CPO, and ARPO for MT gains.
Consider CLewR-z for ARPO to use its internal distance.
Integrate BLEU/COMET into ARPO's distance function.
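
As a rough illustration of the last point, the sketch below blends an existing distance value with an external metric signal. The summary does not give the paper's formula, so everything here is an assumption: the function name `blended_distance`, the convex combination, the `alpha` weight, and the requirement that both the base distance and the metric lie in [0, 1] are all hypothetical choices made for the example.

```python
from typing import Callable


def blended_distance(
    chosen: str,
    rejected: str,
    base_distance: float,
    metric: Callable[[str, str], float],
    alpha: float = 0.5,
) -> float:
    """Hypothetical blend of an internal distance with an external MT metric.

    base_distance: distance already computed by the PO method, assumed in [0, 1].
    metric: similarity between the two translations, assumed in [0, 1]
            (for example a rescaled BLEU or COMET-22 score).
    alpha: weight given to the external signal; 0 ignores it, 1 uses only it.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if not 0.0 <= base_distance <= 1.0:
        raise ValueError("base_distance must be in [0, 1]")
    # Clamp the metric defensively, then turn similarity into a distance.
    similarity = min(1.0, max(0.0, metric(chosen, rejected)))
    return (1.0 - alpha) * base_distance + alpha * (1.0 - similarity)


def exact_match(a: str, b: str) -> float:
    """Toy metric used only to exercise the function."""
    return 1.0 if a == b else 0.0


same = blended_distance("guten Tag", "guten Tag", 0.4, exact_match, alpha=0.5)
different = blended_distance("guten Tag", "hallo", 0.4, exact_match, alpha=0.5)
```

With this toy metric, identical translations pull the blended distance below the base value and fully different ones push it above, which is the qualitative behavior one would want from an external semantic signal.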

Topics

CLewR
Curriculum Learning
Preference Optimization
Machine Translation
Large Language Models

Code references

alexandra-dragomir/CLewR

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.