Personalized Benchmarking: Evaluating LLMs by Individual Preferences

2026-04-22 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A study by Gârbacea, Wang, and Tan from the University of Chicago introduces personalized benchmarking for Large Language Models (LLMs), challenging the traditional aggregate evaluation methods that overlook individual user preferences. Analyzing 115 active Chatbot Arena users, the researchers computed personalized model rankings using ELO ratings and Bradley-Terry coefficients. They found that individual LLM rankings diverge significantly from aggregate rankings, with Bradley-Terry correlations averaging only ρ=0.04 (57% of users showing near-zero or negative correlation) and ELO ratings showing moderate correlation (ρ=0.43). The research identified substantial heterogeneity in user query topics and writing styles, which influence model preferences. Furthermore, a compact combination of topic and style features proved useful for predicting user-specific model rankings, demonstrating a 35% improvement in MAE for ELO and 12% for Bradley-Terry over a mean-predictor baseline.
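The two headline metrics above, rank correlation against the aggregate and MAE improvement over a mean predictor, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' evaluation code: it assumes ρ denotes Spearman rank correlation and that the mean-predictor baseline always predicts the mean of the training targets. The function names and example ratings are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mae(pred, truth):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(p - t) for p, t in zip(pred, truth))


def relative_mae_improvement(pred, truth, train_targets):
    """Fractional MAE reduction versus always predicting the training mean (assumed baseline)."""
    baseline = [mean(train_targets)] * len(truth)
    return 1 - mae(pred, truth) / mae(baseline, truth)


# Hypothetical ratings for four models, listed in the same model order.
aggregate_ratings = [1200, 1100, 1000, 900]
user_ratings = [950, 1150, 1050, 1000]
rho = spearman(aggregate_ratings, user_ratings)
```

Averaging `rho` over users would give a per-method summary comparable in spirit to the reported figures, and a value of 0.35 from `relative_mae_improvement` would correspond to a 35% improvement.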

Key takeaway

Research scientists evaluating LLMs should move beyond one-size-fits-all aggregate benchmarks. Individual user preferences, shaped by each user's query topics and writing style, produce model rankings that differ sharply from the aggregate. Personalized benchmarking analyzes user query patterns to predict how well each model aligns with an individual, so that LLM deployments meet diverse user needs rather than relying on potentially misleading global leaderboards.

Key insights

Individual LLM preferences diverge significantly from aggregate benchmarks, driven by unique user query topics and writing styles.

Principles

Aggregate LLM rankings often fail to reflect how individual users rank models.
User query characteristics predict model preference.
Bradley-Terry coefficients expose more individual divergence from the aggregate (mean ρ=0.04) than ELO ratings (ρ=0.43).

Method

Personalized LLM rankings are computed per user with ELO and Bradley-Terry models. User query patterns are analyzed with FastTopic for topics and LISA/HypoGeniC for style, and the resulting topic and style features are combined to predict user-specific rankings.
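The per-user rating step can be sketched as follows. This is an illustrative sketch, not the study's implementation: the record format `(user, winner, loser)`, the ELO constants (`k=32`, base 1000), the minorization-maximization update for Bradley-Terry, and the `prior` smoothing term are all assumptions chosen to keep the example self-contained. Ties are ignored and a model is assumed never to battle itself.

```python
import math
from collections import defaultdict


def elo_ratings(battles, k=32.0, base=1000.0):
    """Sequential ELO over (winner, loser) pairs; k and base are assumed defaults."""
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        # Expected score of the winner given the current rating gap.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


def bradley_terry(battles, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths with the MM update; returns log-strengths.

    `prior` adds a fractional win in both directions for each observed pair,
    an assumed smoothing that keeps strengths finite on sparse per-user data.
    """
    wins = defaultdict(float)  # wins[(i, j)] = number of times i beat j
    for winner, loser in battles:
        wins[(winner, loser)] += 1.0
    for i, j in {tuple(sorted(b)) for b in battles}:
        wins[(i, j)] += prior
        wins[(j, i)] += prior

    models = sorted({m for b in battles for m in b})
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(wins[(i, j)] for j in models if j != i)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = total_wins / denom
        # Rescale so the strengths keep a fixed mean of 1.
        norm = sum(updated.values())
        strength = {m: v * len(models) / norm for m, v in updated.items()}
    return {m: math.log(s) for m, s in strength.items()}


def personalized_rankings(records):
    """Group (user, winner, loser) records by user and fit both models per user."""
    by_user = defaultdict(list)
    for user, winner, loser in records:
        by_user[user].append((winner, loser))
    return {
        user: {"elo": elo_ratings(b), "bt": bradley_terry(b)}
        for user, b in by_user.items()
    }


# Hypothetical battle log: (user, winning model, losing model).
records = [
    ("u1", "A", "B"),
    ("u1", "A", "B"),
    ("u1", "B", "C"),
    ("u2", "C", "A"),
]
rankings = personalized_rankings(records)
```

ELO depends on the order of battles while the Bradley-Terry fit does not, which is one reason the two can rank the same user's models differently when each user contributes only a handful of comparisons.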

In practice

Infer user profiles from a few queries.
Match users to models using the preferences of similar users.
Report separate model rankings for user types.
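One way to picture the first two steps above is a nearest-neighbor lookup over user profiles. The sketch below is a swapped-in illustration, not the predictor used in the study: it assumes each user is summarized by a hypothetical numeric vector of topic and style features, and it predicts a new user's ratings as a cosine-similarity-weighted average over the most similar known users. All names and numbers are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_ratings(new_profile, known_profiles, known_ratings, top_k=2):
    """Similarity-weighted average of the ratings of the top_k most similar users."""
    scored = sorted(
        ((cosine(new_profile, p), user) for user, p in known_profiles.items()),
        reverse=True,
    )[:top_k]
    models = {m for _, user in scored for m in known_ratings[user]}
    predicted = {}
    for model in models:
        num = den = 0.0
        for sim, user in scored:
            weight = max(sim, 0.0)  # ignore dissimilar (negative) neighbors
            if model in known_ratings[user]:
                num += weight * known_ratings[user][model]
                den += weight
        if den > 0:
            predicted[model] = num / den
    return predicted


# Hypothetical profiles: share of queries about [coding, creative writing, other].
known_profiles = {"coder": [0.9, 0.1, 0.0], "writer": [0.1, 0.8, 0.1]}
known_ratings = {
    "coder": {"A": 1100.0, "B": 900.0},
    "writer": {"A": 950.0, "B": 1050.0},
}
new_user = [0.8, 0.2, 0.0]  # inferred from a few queries
predicted = predict_ratings(new_user, known_profiles, known_ratings)
```

Grouping users by profile and reporting one ranking per group would cover the third step, separate rankings for user types.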

Topics

Personalized LLM Benchmarking
User Preference Heterogeneity
ELO Rating System
Bradley-Terry Model
Query Topic Modeling

Code references

tatsu-lab/alpaca_eval

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Product Manager

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.