Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

A study re-evaluates psychometric methods for Large Language Models (LLMs), focusing on the reliability of self-reports (SR) in predicting behavior for safe deployment. Researchers contrasted the broad Big 5 personality framework with the Theory of Planned Behavior (TPB), which targets specific behavioral intentions. Experiments involved 11 frontier LLMs across four tasks: risk-taking, sycophancy, honesty, and implicit bias, varying session context and identity induction. Findings indicate that SR–behavior coherence is selective: within shared conversations, TPB achieved human-level coherence (mean r=+0.40, excluding IAT), unlike Big 5. Across separate conversations, coherence persisted only for behaviors anchored outside the immediate prompt, such as implicit bias and honesty, but collapsed for context-primed behaviors like sycophancy. Persona prompting improved SR consistency across sessions but failed to align behavior. This suggests Big 5 is inadequate for deployment behavior testing, necessitating task-specific instruments.

Key takeaway

For MLOps Engineers evaluating LLM behavior for high-stakes deployments, you should prioritize task- and behavior-specific psychometric instruments like the Theory of Planned Behavior. Avoid relying on broad personality frameworks such as Big 5, which fail to predict specific actions. When assessing context-sensitive behaviors like sycophancy, ensure self-reports and behavior are elicited in separate sessions to prevent priming effects from masking true tendencies. Misinterpreting within-session coherence as stable disposition can lead to false assurances regarding model safety.

Key insights

LLM self-reports predict behavior selectively, requiring fine-grained instruments and shared context for coherence.

Principles

Framework granularity significantly impacts LLM self-report to behavior coherence.
Cross-session coherence in LLMs is task-dependent, not a universal property.
Persona prompting enhances self-report consistency but does not align behavior.

Method

A 2x2x2 factorial design varied psychometric framework (Big 5 vs. TPB), session context (shared vs. separate), and identity induction (parameter grid vs. persona prompting) across 4 behavioral tasks and 11 frontier LLMs.

In practice

Employ Theory of Planned Behavior (TPB) for task-specific LLM behavioral predictions.
Conduct self-report and behavioral elicitation within a single conversational session.
Focus on behaviors anchored outside the immediate prompt for stable dispositional assessment.

Topics

LLM Evaluation
Psychometrics
Theory of Planned Behavior
Big Five Personality
AI Safety
Behavioral Auditing
Persona Prompting

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.