CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning
Summary
CLewR (Curriculum Learning with Restarts) is a curriculum learning strategy that plugs into existing preference optimization (PO) algorithms to improve machine translation (MT) performance. It mitigates catastrophic forgetting by restarting the same easy-to-hard data curriculum in every training epoch. Applied with PO techniques such as DPO, CPO, and ARPO, CLewR yields consistent and statistically significant MT gains across several large language model (LLM) families, including Gemma2, Qwen2.5, and Llama3.1. The work also introduces CLewR-z, which derives its curriculum score from the ARPO distance, and an enhanced ARPO variant that incorporates external semantic signals from MT metrics such as BLEU and COMET-22. The code for CLewR is publicly available on GitHub.
Key takeaway
Research scientists fine-tuning LLMs for machine translation should consider implementing CLewR. Integrating this curriculum-with-restarts strategy into preference optimization algorithms such as DPO, CPO, or ARPO yields consistent and statistically significant gains, particularly for generic LLMs. The restarts mitigate catastrophic forgetting, so the model retains what it learned from easier examples while it learns harder ones.
Key insights
CLewR improves machine translation by integrating curriculum learning with restarts into preference optimization to mitigate catastrophic forgetting.
Principles
- Repeated easy-to-hard curriculum prevents forgetting.
- Data ordering significantly impacts model performance.
- External MT metrics can enhance preference optimization.
Method
CLewR sorts preference triplets based on a similarity score derived from MT metrics (BLEU, COMET-22, METEOR). Training proceeds in an easy-to-hard order, with this permutation reused in every epoch to mitigate catastrophic forgetting.
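The ordering-and-restart idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `Triplet`, `token_overlap`, `curriculum_order`, and `epochs_with_restarts` are hypothetical names, the token-overlap score is only a toy stand-in for a real MT metric such as BLEU or COMET-22, and treating a low chosen/rejected similarity as "easy" is an assumption about the score's direction.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence, Tuple


@dataclass(frozen=True)
class Triplet:
    """One preference example: source text, preferred and dispreferred translation."""
    source: str
    chosen: str
    rejected: str


def token_overlap(a: str, b: str) -> float:
    """Toy stand-in for an MT metric: Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def curriculum_order(
    triplets: Sequence[Triplet],
    similarity: Callable[[str, str], float] = token_overlap,
) -> List[int]:
    """Return indices sorted easy-to-hard.

    Assumption: a pair whose chosen and rejected translations are very
    different (low similarity) is easy, and a near-identical pair is hard.
    """
    scores = [similarity(t.chosen, t.rejected) for t in triplets]
    return sorted(range(len(triplets)), key=lambda i: scores[i])


def epochs_with_restarts(
    triplets: Sequence[Triplet],
    order: Sequence[int],
    num_epochs: int,
    batch_size: int,
) -> Iterator[Tuple[int, List[Triplet]]]:
    """Yield (epoch, batch) pairs, replaying the same permutation each epoch.

    There is no shuffling: every epoch restarts at the easiest examples.
    """
    for epoch in range(num_epochs):
        for start in range(0, len(order), batch_size):
            yield epoch, [triplets[i] for i in order[start:start + batch_size]]
```

In a real training run, each yielded batch would be passed to the DPO, CPO, or ARPO loss in place of a shuffled batch. The key design point is that the permutation is computed once and reused, so the data loader's per-epoch shuffling has to be turned off.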
In practice
- Apply CLewR to DPO, CPO, and ARPO for MT gains.
- With ARPO, consider CLewR-z, which derives its curriculum score from the ARPO distance.
- Integrate external semantic signals such as BLEU or COMET-22 into ARPO's distance function.
Topics
- CLewR
- Curriculum Learning
- Preference Optimization
- Machine Translation
- Large Language Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.