Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
Summary
A study re-evaluates psychometric methods for Large Language Models (LLMs), focusing on the reliability of self-reports (SR) in predicting behavior for safe deployment. Researchers contrasted the broad Big 5 personality framework with the Theory of Planned Behavior (TPB), which targets specific behavioral intentions. Experiments involved 11 frontier LLMs across four tasks: risk-taking, sycophancy, honesty, and implicit bias, varying session context and identity induction. Findings indicate that SR–behavior coherence is selective: within shared conversations, TPB achieved human-level coherence (mean r=+0.40, excluding IAT), unlike Big 5. Across separate conversations, coherence persisted only for behaviors anchored outside the immediate prompt, such as implicit bias and honesty, but collapsed for context-primed behaviors like sycophancy. Persona prompting improved SR consistency across sessions but failed to align behavior. This suggests Big 5 is inadequate for deployment behavior testing, necessitating task-specific instruments.
Key takeaway
MLOps engineers evaluating LLM behavior for high-stakes deployments should prioritize task- and behavior-specific psychometric instruments such as the Theory of Planned Behavior (TPB), and avoid relying on broad personality frameworks such as Big 5, which fail to predict specific actions. When assessing context-sensitive behaviors like sycophancy, also elicit self-reports and behavior in separate sessions, so that in-conversation priming is not mistaken for a true tendency. Misreading within-session coherence as a stable disposition can lead to false assurances about model safety.
Key insights
LLM self-reports predict behavior selectively: coherence depends on fine-grained, task-specific instruments and, for context-primed behaviors such as sycophancy, on a shared conversational context.
Principles
- Framework granularity significantly impacts LLM self-report to behavior coherence.
- Cross-session coherence in LLMs is task-dependent, not a universal property.
- Persona prompting enhances self-report consistency but does not align behavior.
Method
A 2x2x2 factorial design varied psychometric framework (Big 5 vs. TPB), session context (shared vs. separate), and identity induction (parameter grid vs. persona prompting) across 4 behavioral tasks and 11 frontier LLMs.
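The design described above can be sketched as a simple enumeration. This is a minimal illustration, not the authors' code: the factor levels and task names come from the summary, while the dictionary keys, the function names `build_conditions` and `build_runs`, and the placeholder model names are assumptions made for the example. It also assumes every design cell is crossed with every task and model, which the summary does not state explicitly.

```python
from itertools import product

# Factor levels as named in the summary; the keys are illustrative labels.
FACTORS = {
    "framework": ["Big 5", "TPB"],
    "session_context": ["shared", "separate"],
    "identity_induction": ["parameter grid", "persona prompting"],
}
TASKS = ["risk-taking", "sycophancy", "honesty", "implicit bias"]


def build_conditions(factors=FACTORS):
    """Return one dict per cell of the full factorial design."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def build_runs(model_names, tasks=TASKS, factors=FACTORS):
    """Cross every design cell with every task and every model (assumed full crossing)."""
    return [
        {"model": model, "task": task, **cell}
        for model in model_names
        for task in tasks
        for cell in build_conditions(factors)
    ]


if __name__ == "__main__":
    # Placeholder model names; the summary does not list the 11 models.
    models = [f"model_{i}" for i in range(11)]
    print(len(build_conditions()))   # number of design cells
    print(len(build_runs(models)))   # cells x tasks x models
```

Under the full-crossing assumption, two levels per factor give 8 cells, and 8 cells by 4 tasks by 11 models give 352 runs.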
In practice
- Employ Theory of Planned Behavior (TPB) for task-specific LLM behavioral predictions.
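- Elicit self-reports and behavior within a single conversation when the goal is in-session prediction; repeat in separate sessions to check whether the coherence reflects a stable disposition.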
- Focus on behaviors anchored outside the immediate prompt for stable dispositional assessment.
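A hedged sketch of how SR–behavior coherence could be compared across session contexts: it computes a Pearson correlation between self-report scores and behavior scores separately for shared and separate sessions. The record layout (`session_context`, `self_report`, `behavior`), the function names, and the toy numbers are all invented for illustration; they are not data or code from the study, and the summary does not say how the authors computed their r values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def coherence_by_context(records):
    """Group records by session context and correlate self-report with behavior."""
    grouped = {}
    for rec in records:
        grouped.setdefault(rec["session_context"], []).append(rec)
    return {
        context: pearson_r(
            [r["self_report"] for r in recs],
            [r["behavior"] for r in recs],
        )
        for context, recs in grouped.items()
    }


if __name__ == "__main__":
    # Toy records, invented purely to exercise the functions.
    toy = [
        {"session_context": "shared", "self_report": 1, "behavior": 1},
        {"session_context": "shared", "self_report": 2, "behavior": 2},
        {"session_context": "shared", "self_report": 3, "behavior": 3},
        {"session_context": "separate", "self_report": 1, "behavior": 2},
        {"session_context": "separate", "self_report": 2, "behavior": 1},
        {"session_context": "separate", "self_report": 3, "behavior": 2},
    ]
    print(coherence_by_context(toy))
```

A large gap between the shared-session and separate-session coefficients for a given task would be the pattern the summary describes for context-primed behaviors such as sycophancy.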
Topics
- LLM Evaluation
- Psychometrics
- Theory of Planned Behavior
- Big Five Personality
- AI Safety
- Behavioral Auditing
- Persona Prompting
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.