FSPO: Few-Shot Optimization of Synthetic Preferences Personalizes to Real Users

2025-02-15 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

Few-Shot Preference Optimization (FSPO) is a framework for personalizing Large Language Models (LLMs) in open-ended question answering by reframing reward modeling as a meta-learning problem. Given only a few labeled preferences from a user, the LLM adapts quickly to that user and in effect constructs a personalized reward function. Because real-world personalized preference data is scarce, the authors generate over 1 million synthetic personalized preferences with publicly available LLMs, and they emphasize that high diversity and a coherent, self-consistent structure in this data are important for successful transfer to real users. FSPO was evaluated in three domains with up to 1,500 synthetic users: movie reviews, pedagogical adaptation based on educational background (ELIX), and general question answering (Roleplay). It achieved an average 87% AlpacaEval win rate on synthetic users and a 72% win rate in a controlled human study on open-ended question answering.

Key takeaway

For AI engineers and research scientists building personalized LLMs, FSPO offers a concrete recipe for user-specific adaptation: train with a meta-learning objective on synthetically generated preference datasets that are diverse and coherently structured. This sidesteps the scarcity of real preference data and can improve user satisfaction and inclusivity in open-ended generation tasks, as indicated by the 72% win rate with real human users.

Key insights

FSPO enables LLMs to personalize responses by meta-learning user preferences from few-shot synthetic data.

Principles

Personalization requires modeling a distribution of reward functions.
Synthetic data for meta-learning needs diversity and coherent structure.
Few-shot preferences can represent user personas for rapid adaptation.
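
To make the last principle concrete, here is a minimal sketch of how a user's few labeled preferences might be serialized into a prompt, so the model sees the "persona" only through examples. The `Preference` class, the `build_fewshot_prompt` helper, and the prompt wording are all hypothetical; the source does not specify its prompt template.

```python
from dataclasses import dataclass


@dataclass
class Preference:
    """One labeled comparison from a single user (hypothetical schema)."""
    question: str
    chosen: str
    rejected: str


def build_fewshot_prompt(shots, query):
    """Serialize a user's labeled preferences, then append the new question."""
    parts = ["Below are responses this user preferred and rejected."]
    for i, p in enumerate(shots, start=1):
        parts.append(
            f"Example {i}\n"
            f"Question: {p.question}\n"
            f"Preferred: {p.chosen}\n"
            f"Rejected: {p.rejected}"
        )
    # The new question comes last so the model answers conditioned on the shots.
    parts.append(f"Now answer for this user.\nQuestion: {query}")
    return "\n\n".join(parts)


shots = [
    Preference(
        "What is a derivative?",
        "Think of it as the slope of a curve at one point.",
        "It is the limit of the difference quotient as h approaches 0.",
    ),
    Preference(
        "What is a prime number?",
        "A number you can only split evenly by 1 and itself.",
        "An integer greater than 1 with exactly two positive divisors.",
    ),
]
prompt = build_fewshot_prompt(shots, "What is an integral?")
print(prompt)
```

Here the two shots implicitly describe a user who prefers intuitive explanations over formal ones, without that preference ever being stated outright.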

Method

FSPO reframes reward modeling as a black-box meta-learning problem: an LLM is fine-tuned with a preference-learning objective (e.g., Identity Preference Optimization, IPO) over user-specific preference datasets, with each user treated as a separate task. It can optionally use a two-step User Description Chain-of-Thought (CoT) for enhanced reward modeling.
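
As an illustration of the objective named above, the following is a minimal sketch of the standard IPO loss for one preference pair, written with plain floats rather than tensors. It assumes the four inputs are summed sequence log-probabilities computed with the user's few-shot preferences already in the prompt; the function names and the `tau` value of 0.1 are illustrative assumptions, not settings from the source.

```python
import math


def ipo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, tau=0.1):
    """Standard IPO loss for one pair, given sequence log-probabilities.

    pol_* come from the model being trained, ref_* from the frozen reference.
    tau is the regularization strength (0.1 is an arbitrary example value).
    """
    # h: how much more the policy favors chosen over rejected than the reference does.
    h = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # IPO regresses h toward the fixed margin 1 / (2 * tau).
    return (h - 1.0 / (2.0 * tau)) ** 2


def mean_ipo_loss(batch, tau=0.1):
    """Average the loss over (pol_chosen, pol_rejected, ref_chosen, ref_rejected) rows."""
    return sum(ipo_loss(*row, tau=tau) for row in batch) / len(batch)


# Policy identical to the reference: h = 0, so the loss is the squared target margin.
untrained = ipo_loss(-10.0, -12.0, -10.0, -12.0, tau=0.1)

# Policy whose log-ratio gap equals the target margin of 5: the loss vanishes.
on_target = ipo_loss(-8.0, -13.0, -10.0, -10.0, tau=0.1)

print(untrained, on_target)
```

In a meta-learning setup, each batch row would come from a different user's dataset, so that minimizing the average loss pushes the model to infer preferences from the in-context examples rather than memorize any one user.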

In practice

Generate synthetic preference data using LLMs for scalability.
Employ domain randomization and iterative persona refinement for data quality.
Use AI feedback from GPT-4o to label preferences consistently.
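
The steps above can be sketched as a small pipeline: sample a persona from randomized attributes, then have a judge label response pairs for that persona. Everything here is a toy stand-in: the attribute pools, `sample_persona`, `label_pairs`, and the length-based `toy_judge` are hypothetical, and in the described setup the responses would come from LLMs and the labels from GPT-4o. Iterative persona refinement is not covered.

```python
import random

# Hypothetical attribute pools; the source does not list its persona schema.
BACKGROUNDS = ["5th grader", "high-school student", "PhD student"]
TONES = ["concise", "detailed"]


def sample_persona(rng):
    """Domain randomization in miniature: draw each attribute independently."""
    return {"background": rng.choice(BACKGROUNDS), "tone": rng.choice(TONES)}


def toy_judge(persona, question, a, b):
    """Stand-in for an LLM judge. Returns 0 if response a is preferred, else 1."""
    wants_short = persona["tone"] == "concise"
    a_is_shorter = len(a) <= len(b)
    return 0 if a_is_shorter == wants_short else 1


def label_pairs(persona, items, judge):
    """Turn (question, a, b) triples into chosen/rejected records for one persona."""
    records = []
    for question, a, b in items:
        winner = judge(persona, question, a, b)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        records.append(
            {"persona": persona, "question": question,
             "chosen": chosen, "rejected": rejected}
        )
    return records


rng = random.Random(0)  # fixed seed so the sketch is reproducible
persona = sample_persona(rng)
items = [
    (
        "Why is the sky blue?",
        "Air scatters blue light the most.",
        "Sunlight is scattered by air molecules, and shorter blue wavelengths "
        "scatter far more strongly than longer red ones.",
    )
]
dataset = label_pairs(persona, items, toy_judge)
print(dataset[0]["chosen"])
```

Passing the judge in as a function keeps the labeling rule swappable, which matters because label consistency within a persona is what gives the dataset its self-consistent structure.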

Topics

Few-Shot Preference Optimization
LLM Personalization
Meta-Learning
Synthetic Preference Data
Reward Modeling

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.