HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?
Summary
HEART-Bench is a benchmark for evaluating whether Large Language Model (LLM) agents can simulate coherent, human-like psychology. Introduced on 2026-05-28, it constructs 11 distinct human characters, each grounded in orthogonal Big Five personality traits and given 1,000 structured, autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To assess how that psychology shows up in behavior, HEART-Bench uses a curated suite of 64 decision-making scenarios guided by the DIAMONDS taxonomy, which characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. The benchmark tests whether agents can consolidate their innate personality traits and memories to make behavioral decisions consistent with their specific psychological profiles. After human validation, it comprises 673 multiple-choice questions.
Key takeaway
For AI Scientists and Research Scientists developing or evaluating LLM agents, HEART-Bench provides a principled testbed for assessing human-like psychological consistency. Consider integrating structured personality traits and autobiographical memories into your agent designs, and use the DIAMONDS taxonomy to craft diverse decision-making scenarios that check whether your agents' behavioral outputs align with their defined psychological profiles. This approach helps validate both the emotional dimensions of an agent and its value-consistent decision-making.
Key insights
HEART-Bench evaluates LLM agents' human-like psychological consistency using structured personality profiles and autobiographical memories.
Principles
- Human psychology in LLMs requires emotional dimensions.
- Personality traits and memories drive consistent decisions.
- Situational context shapes behavioral manifestations.
Method
The benchmark constructs 11 characters based on Big Five personality traits, each with 1,000 episodic memories. It uses 64 scenarios guided by the DIAMONDS taxonomy to produce 673 human-validated multiple-choice questions (MCQs) for evaluating behavioral consistency.
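The setup described above can be pictured as a small data model plus an accuracy score. The following is a minimal sketch, not the benchmark's actual schema or scoring code: the class names `Character` and `Question`, the "high"/"low" trait encoding, and the function `consistency_accuracy` are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# The five traits of the Big Five model.
BIG_FIVE = ("openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism")


@dataclass
class Character:
    """Hypothetical character profile: traits plus episodic memories."""
    name: str
    traits: dict[str, str]  # trait -> "high" or "low" (assumed encoding)
    memories: list[str] = field(default_factory=list)


@dataclass
class Question:
    """Hypothetical multiple-choice item tied to one character and scenario."""
    character: str
    scenario: str
    options: list[str]
    answer: int  # index of the option consistent with the character's profile


def consistency_accuracy(questions: list[Question], predictions: list[int]) -> float:
    """Fraction of questions where the agent chose the profile-consistent option."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    if not questions:
        return 0.0
    correct = sum(q.answer == p for q, p in zip(questions, predictions))
    return correct / len(questions)
```

In this sketch an agent would be prompted with a character's traits and memories, asked each question, and scored on how often its chosen option index matches `answer`.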
In practice
- Test LLM agents for personality consistency.
- Integrate autobiographical memories into agent profiles.
- Design scenarios using the DIAMONDS taxonomy.
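When designing your own scenarios with the DIAMONDS taxonomy, one simple sanity check is whether the scenario set touches all eight dimensions. The sketch below assumes each scenario is tagged with a set of dimension names; that tagging scheme and the helper `uncovered_dimensions` are illustrative assumptions, not part of HEART-Bench.

```python
from __future__ import annotations

# The eight DIAMONDS dimensions, as listed in the summary above.
DIAMONDS = ("Duty", "Intellect", "Adversity", "Mating",
            "pOsitivity", "Negativity", "Deception", "Sociality")


def uncovered_dimensions(scenarios: dict[str, set[str]]) -> list[str]:
    """Return the DIAMONDS dimensions that no scenario is tagged with.

    `scenarios` maps a scenario id to the set of dimensions it is tagged with
    (a hypothetical representation chosen for this example).
    """
    covered: set[str] = set()
    for tags in scenarios.values():
        unknown = tags - set(DIAMONDS)
        if unknown:
            raise ValueError(f"unknown dimension(s): {sorted(unknown)}")
        covered |= tags
    # Preserve the canonical DIAMONDS order in the result.
    return [d for d in DIAMONDS if d not in covered]


example = {
    "missed_deadline": {"Duty", "Negativity"},
    "surprise_party": {"pOsitivity", "Sociality"},
}
missing = uncovered_dimensions(example)
```

Here `missing` lists the dimensions the two example scenarios leave untested, which indicates where further scenarios would be needed.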
Topics
- LLM Agents
- Personality Simulation
- Benchmarking
- Autobiographical Memory
- DIAMONDS Taxonomy
- Big Five Personality Traits
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.