Steerable Cultural Preference Optimization of Reward Models
Summary
Steerable Cultural Preference Optimization (SCPO) is a reward model training algorithm designed to align large language models (LLMs) with diverse cultural sub-communities. It addresses a limitation of current LLM alignment research, which often focuses on predicting a single unified preference, by enabling reward models to represent sub-community preferences accurately without excessive bias. SCPO improves minority reward models by up to 7 points over baseline models across two datasets, PRISM and GlobalOpinionQA, and seven countries (Chile, South Africa, New Zealand, Australia, Mexico, Israel, Canada). It is also up to 280% more data-efficient in training than full-data finetuning. SCPO uses a "global" reward model in two ways: to filter out universal preferences, and to assign lower training weights to highly divergent preferences. Together these steps mitigate bias and emphasize subtle cultural distinctions.
Key takeaway
For Machine Learning Engineers developing LLMs for global markets, SCPO offers a method for pluralistic alignment. Consider using a global reward model to filter and weight minority preference data when training culturally aware RMs. The reported benefits are better minority preference alignment, reduced bias, and up to 280% higher training data efficiency than full-data finetuning.
Key insights
SCPO aligns LLMs with diverse cultural preferences by filtering out universal preference data and down-weighting highly divergent preferences.
Principles
- LLM alignment requires representing diverse cultural sub-communities.
- Global reward models can identify distinct cultural preferences.
- Down-weighting highly divergent preferences mitigates bias.
Method
SCPO first filters out minority preference pairs on which a global reward model already agrees with the minority label. It then trains on the remaining pairs with a weighted loss, in which a pair's weight decreases as its divergence from the global model increases.
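The two steps can be illustrated with a minimal sketch. It assumes the global reward model is available as a function returning a scalar score for a prompt and response. The names `PreferencePair` and `filter_and_weight`, the use of the score margin as the divergence measure, and the weight form `1 / (1 + divergence)` are all illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response preferred by the minority community
    rejected: str  # response the community did not prefer


def filter_and_weight(
    pairs: List[PreferencePair],
    global_score: Callable[[str, str], float],
) -> List[Tuple[PreferencePair, float]]:
    """Drop pairs the global RM already agrees with; weight the rest."""
    kept = []
    for pair in pairs:
        # Non-negative margin: the global RM ranks the pair the same way
        # as the community (ties are treated as agreement in this sketch).
        margin = (global_score(pair.prompt, pair.chosen)
                  - global_score(pair.prompt, pair.rejected))
        if margin >= 0:
            continue  # universal preference, filtered out
        divergence = -margin  # how strongly the global RM disagrees
        # Assumed weight form: larger divergence gives a smaller weight.
        weight = 1.0 / (1.0 + divergence)
        kept.append((pair, weight))
    return kept


# Toy usage with a hypothetical lookup-table scorer standing in for a real RM.
toy_scores = {"a": 1.0, "b": 0.0, "c": 3.0}
toy_pairs = [
    PreferencePair("p", "a", "b"),  # global RM agrees: dropped
    PreferencePair("p", "b", "a"),  # mild disagreement: kept
    PreferencePair("p", "b", "c"),  # strong disagreement: kept, lower weight
]
weighted = filter_and_weight(toy_pairs, lambda prompt, resp: toy_scores[resp])
```

In this toy run only the two pairs that disagree with the global scorer survive, and the more divergent pair receives the smaller weight.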
In practice
- Use an existing global RM (e.g., OpenAssistant, Tülu 3) as the reference model.
- Filter training data to retain culturally distinctive preferences.
- Apply inverse weighting to dampen highly divergent preferences.
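As a rough sketch of how per-pair weights could enter training, the function below applies them to the standard pairwise (Bradley-Terry) reward model loss, `-log(sigmoid(r_chosen - r_rejected))`. The summary does not state which base loss SCPO uses, so the choice of loss, the normalization by the sum of weights, and the name `weighted_pairwise_loss` are assumptions made for illustration.

```python
import math
from typing import Sequence


def weighted_pairwise_loss(
    chosen_rewards: Sequence[float],
    rejected_rewards: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Weighted mean of -log(sigmoid(r_chosen - r_rejected)) over pairs."""
    if not (len(chosen_rewards) == len(rejected_rewards) == len(weights)):
        raise ValueError("all inputs must have the same length")
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    total = 0.0
    for r_c, r_r, w in zip(chosen_rewards, rejected_rewards, weights):
        margin = r_c - r_r
        # -log(sigmoid(m)) = log(1 + exp(-m)), computed in a stable form.
        nll = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
        total += w * nll
    return total / weight_sum
```

Normalizing by the sum of weights keeps the loss scale comparable across batches with different weight totals; a plain sum or a per-batch mean would be equally reasonable choices.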
Topics
- LLM Alignment
- Cultural Preferences
- Reward Models
- Preference Optimization
- Data Efficiency
- Bias Mitigation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.