FSPO: Few-Shot Optimization of Synthetic Preferences Personalizes to Real Users
Summary
Few-Shot Preference Optimization (FSPO) is a framework for personalizing Large Language Models (LLMs) in open-ended question answering by reframing reward modeling as a meta-learning problem. Given only a few labeled preference examples from a user, the LLM adapts quickly to that user, in effect constructing a personalized reward function. To address the scarcity of real-world preference data, FSPO generates over 1 million synthetic personalized preferences using publicly available LLMs, and the work emphasizes that high diversity and a coherent, self-consistent structure are important for successful transfer to real users. The framework was evaluated across three domains: movie reviews, pedagogical adaptation based on educational background (ELIX), and general question answering (Roleplay), involving up to 1,500 synthetic users. FSPO achieved an 87% AlpacaEval win rate on average for synthetic users and a 72% win rate in a controlled human study for open-ended question answering.
Key takeaway
For AI Engineers and Research Scientists developing personalized LLMs, FSPO offers a framework for user-specific adaptation from only a few preference examples. Consider implementing meta-learning objectives with synthetically generated, diverse, and structured preference datasets to overcome data scarcity and improve personalization. This approach can enhance user satisfaction and inclusivity in open-ended generation tasks, as suggested by its 72% win rate with real human users.
Key insights
FSPO enables LLMs to personalize responses by meta-learning user preferences from few-shot synthetic data.
Principles
- Personalization requires modeling a distribution of reward functions.
- Synthetic data for meta-learning needs diversity and coherent structure.
- Few-shot preferences can represent user personas for rapid adaptation.
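To make the last principle concrete, here is a minimal sketch of how a handful of a user's labeled preferences could be packed into a prompt that stands in for the user's persona. The `Preference` class, the `build_few_shot_prompt` function, and the prompt layout are illustrative assumptions, not the format used by the FSPO authors.

```python
from dataclasses import dataclass


@dataclass
class Preference:
    """One labeled comparison from a single user (hypothetical schema)."""
    question: str
    chosen: str
    rejected: str


def build_few_shot_prompt(shots, new_question):
    """Serialize a user's few-shot preferences, then append the new question.

    The layout below is an assumption made for illustration only.
    """
    parts = []
    for i, pref in enumerate(shots, start=1):
        parts.append(
            f"Example {i}\n"
            f"Question: {pref.question}\n"
            f"Preferred: {pref.chosen}\n"
            f"Dispreferred: {pref.rejected}"
        )
    parts.append(f"New question: {new_question}")
    return "\n\n".join(parts)


shots = [
    Preference("What is a neuron?", "A tiny cell that passes signals.", "An excitable cell with axons and dendrites."),
    Preference("What is DNA?", "A recipe book for your body.", "A double-helix polymer of nucleotides."),
]
prompt = build_few_shot_prompt(shots, "What is gravity?")
print(prompt)
```

A model conditioned on this prompt sees only the user's choices, never an explicit persona description, which is what lets the same model serve many users.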
Method
FSPO reframes reward modeling as a black-box meta-learning problem: an LLM is fine-tuned with a preference-learning objective (e.g., IPO) over many user-specific preference datasets, so that each user acts as a separate task. It can optionally use a two-step User Description Chain-of-Thought (CoT) for enhanced reward modeling.
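As a rough illustration of the preference-learning objective mentioned above, the sketch below computes the commonly cited IPO squared-error loss for one preference pair. It assumes the inputs are summed sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, each conditioned on the user's few-shot prompt; the function name, argument names, and the default `tau` are assumptions, and the summary does not state the exact loss configuration the authors used.

```python
def ipo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, tau=0.1):
    """IPO loss for a single (chosen, rejected) pair.

    h is the gap between the policy-vs-reference log-ratios of the chosen
    and rejected responses; IPO regresses h toward 1 / (2 * tau).
    """
    h = (policy_logp_chosen - ref_logp_chosen) - (policy_logp_rejected - ref_logp_rejected)
    target = 1.0 / (2.0 * tau)
    return (h - target) ** 2


# Policy identical to the reference: h = 0, so the loss is target ** 2.
loss_at_init = ipo_loss(-1.0, -2.0, -1.0, -2.0, tau=0.5)

# Policy already separates the pair by exactly the target margin: loss is 0.
loss_at_target = ipo_loss(-1.0, -3.0, -1.5, -2.5, tau=0.5)
print(loss_at_init, loss_at_target)
```

In a meta-learning setup of the kind described, this loss would be averaged over held-out pairs from many users, with each user's few-shot examples supplied as context.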
In practice
- Generate synthetic preference data using LLMs for scalability.
- Employ domain randomization and iterative persona refinement for data quality.
- Use AI Feedback (GPT-4o) for consistent preference labeling.
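The labeling step in the list above can be sketched as a loop that asks a judge which of two candidate responses a given persona would prefer. In the source the judge role is played by GPT-4o; here `toy_judge` is a deterministic stand-in so the example runs offline, and `label_pairs`, the record fields, and the length-based rule are all hypothetical.

```python
def label_pairs(persona, candidates, judge):
    """Turn candidate response pairs into preference records for one persona.

    `candidates` maps each question to a (response_a, response_b) tuple.
    `judge` returns "a" or "b" for the response the persona would prefer.
    """
    rows = []
    for question, (resp_a, resp_b) in candidates.items():
        winner = judge(persona, question, resp_a, resp_b)
        chosen, rejected = (resp_a, resp_b) if winner == "a" else (resp_b, resp_a)
        rows.append({"persona": persona, "question": question,
                     "chosen": chosen, "rejected": rejected})
    return rows


def toy_judge(persona, question, resp_a, resp_b):
    """Stand-in for an LLM judge: 'concise' personas prefer the shorter reply."""
    prefers_short = "concise" in persona
    a_is_shorter = len(resp_a) <= len(resp_b)
    return "a" if a_is_shorter == prefers_short else "b"


candidates = {
    "What is gravity?": (
        "Mass attracts mass.",
        "Gravity is the curvature of spacetime produced by mass and energy.",
    ),
}
concise_rows = label_pairs("a concise expert", candidates, toy_judge)
detailed_rows = label_pairs("a student who likes detail", candidates, toy_judge)
print(concise_rows[0]["chosen"])
print(detailed_rows[0]["chosen"])
```

Using one judge with a fixed rubric for every pair is what keeps each synthetic user's labels self-consistent; swapping the toy rule for a real model call would not change the surrounding loop.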
Topics
- Few-Shot Preference Optimization
- LLM Personalization
- Meta-Learning
- Synthetic Preference Data
- Reward Modeling
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.