Personalized Benchmarking: Evaluating LLMs by Individual Preferences
Summary
A study by Gârbacea, Wang, and Tan from the University of Chicago introduces personalized benchmarking for Large Language Models (LLMs), challenging the traditional aggregate evaluation methods that overlook individual user preferences. Analyzing 115 active Chatbot Arena users, the researchers computed personalized model rankings using ELO ratings and Bradley-Terry coefficients. They found that individual LLM rankings diverge significantly from aggregate rankings, with Bradley-Terry correlations averaging only ρ=0.04 (57% of users showing near-zero or negative correlation) and ELO ratings showing moderate correlation (ρ=0.43). The research identified substantial heterogeneity in user query topics and writing styles, which influence model preferences. Furthermore, a compact combination of topic and style features proved useful for predicting user-specific model rankings, demonstrating a 35% improvement in MAE for ELO and 12% for Bradley-Terry over a mean-predictor baseline.
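The correlation figures above compare each user's personalized ranking with the aggregate ranking. A minimal sketch of that comparison follows, assuming ρ denotes a Spearman rank correlation (the summary does not name the coefficient). The model names and scores are hypothetical and are not taken from the study.

```python
from statistics import mean

def ranks(values):
    """Return average ranks (1 = smallest value); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores for four models (illustrative only).
aggregate = {"model_a": 1250, "model_b": 1180, "model_c": 1100, "model_d": 1050}
user = {"model_a": 1020, "model_b": 1210, "model_c": 1150, "model_d": 1090}

models = sorted(aggregate)
rho = spearman([aggregate[m] for m in models], [user[m] for m in models])
print(f"rho = {rho:.2f}")
```

In this toy case the user ranks the aggregate favorite last, so ρ comes out slightly negative, the kind of divergence the summary reports for many users.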
Key takeaway
Research scientists evaluating LLMs should move beyond one-size-fits-all aggregate benchmarks. Individual user preferences, shaped by each user's query topics and writing style, can produce model rankings that differ sharply from the aggregate. Analyzing user query patterns to predict individual model alignment helps LLM deployments meet diverse user needs instead of relying on potentially misleading global leaderboards.
Key insights
Individual LLM preferences diverge significantly from aggregate benchmarks, driven by unique user query topics and writing styles.
Principles
- Aggregate LLM benchmarks can misrepresent what individual users prefer.
- User query characteristics predict model preference.
- Bradley-Terry is more sensitive to individual preferences than ELO (average correlation with the aggregate ranking of ρ=0.04 versus ρ=0.43).
Method
Personalized LLM rankings are computed per user with ELO and Bradley-Terry models. User query patterns are analyzed with FastTopic for topics and LISA/HypoGeniC for style, and the resulting topic and style features are combined to predict each user's rankings.
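To make the two rating schemes concrete, here is a minimal sketch of both, applied to one user's pairwise votes. It is not the authors' code: the K-factor, the base rating, the MM (Zermelo) fitting procedure, and the vote data are all assumptions chosen for illustration.

```python
from collections import defaultdict

def elo_ratings(battles, k=32.0, base=1000.0):
    """Sequential ELO updates over (winner, loser) pairs; the result depends on vote order."""
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        # Probability the winner was expected to win under the logistic ELO curve.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)

def bradley_terry(battles, iters=200):
    """Fit Bradley-Terry strengths with the MM (Zermelo) iteration.

    Assumes every model has at least one win and one loss; otherwise
    the maximum-likelihood estimate does not exist.
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # games[(a, b)] = number of battles between a and b
    models = set()
    for winner, loser in battles:
        wins[winner] += 1.0
        games[(winner, loser)] += 1.0
        games[(loser, winner)] += 1.0
        models.update((winner, loser))
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                if other != m and games[(m, other)] > 0:
                    denom += games[(m, other)] / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Rescale so the strengths sum to the number of models.
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength

# Hypothetical votes from a single user: (preferred model, other model).
votes = [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"),
         ("b", "c"), ("b", "c"), ("c", "b"), ("a", "c")]

elo = elo_ratings(votes)
bt = bradley_terry(votes)
print(sorted(elo, key=elo.get, reverse=True))
print(sorted(bt, key=bt.get, reverse=True))
```

Running the same two functions on all users' votes pooled together would give the aggregate ranking against which each personalized ranking is compared.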
In practice
- Infer user profiles from a few queries.
- Match users to models using the preferences of similar users.
- Report separate model rankings for each user type.
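One way to picture the matching step is a nearest-neighbor lookup over user profiles, scored against a mean-predictor baseline with MAE. The sketch below is an assumption-laden illustration, not the study's predictor: the profile features, the cosine similarity, the choice of k, and all numbers are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_ratings(new_profile, known_users, k=2):
    """Average the model ratings of the k known users most similar to the new profile."""
    nearest = sorted(known_users,
                     key=lambda u: cosine(new_profile, u["profile"]),
                     reverse=True)[:k]
    return {m: sum(u["ratings"][m] for u in nearest) / len(nearest)
            for m in nearest[0]["ratings"]}

def mean_predictor(known_users):
    """Baseline: predict every user's rating as the average over all known users."""
    return {m: sum(u["ratings"][m] for u in known_users) / len(known_users)
            for m in known_users[0]["ratings"]}

def mae(pred, truth):
    """Mean absolute error across models."""
    return sum(abs(pred[m] - truth[m]) for m in truth) / len(truth)

# Hypothetical profiles: [coding share, creative-writing share, formality score].
known_users = [
    {"profile": [0.9, 0.1, 0.8], "ratings": {"model_a": 1100, "model_b": 950}},
    {"profile": [0.8, 0.2, 0.7], "ratings": {"model_a": 1080, "model_b": 970}},
    {"profile": [0.1, 0.9, 0.2], "ratings": {"model_a": 940, "model_b": 1090}},
]
new_profile = [0.85, 0.15, 0.75]                 # inferred from a few queries
true_ratings = {"model_a": 1095, "model_b": 955}  # held out for evaluation

personalized = predict_ratings(new_profile, known_users)
baseline = mean_predictor(known_users)
improvement = 1.0 - mae(personalized, true_ratings) / mae(baseline, true_ratings)
print(f"relative MAE improvement over the mean predictor: {improvement:.0%}")
```

The relative improvement here is computed the same way as the percentages quoted in the summary, but the toy value says nothing about the study's actual results.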
Topics
- Personalized LLM Benchmarking
- User Preference Heterogeneity
- ELO Rating System
- Bradley-Terry Model
- Query Topic Modeling
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Product Manager
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.