HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

HEART-Bench is a benchmark for systematically evaluating whether Large Language Model (LLM) agents can simulate coherent, human-like psychology. Introduced on 2026-05-28, it constructs 11 distinct human characters, each grounded in orthogonal Big Five personality traits and equipped with 1,000 structured, autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To assess how these profiles manifest in behavior, HEART-Bench uses a curated suite of 64 decision-making scenarios guided by the DIAMONDS taxonomy, which characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. Agents are evaluated on whether they can combine their innate personality traits and memories to make behavioral decisions consistent with their specific psychological profile. After human validation, the benchmark comprises 673 multiple-choice questions.

Key takeaway

For AI Scientists and Research Scientists developing or evaluating LLM agents, HEART-Bench provides a principled testbed for assessing human-like psychological consistency. Consider integrating structured personality traits and autobiographical memories into your agent designs, and use the DIAMONDS taxonomy to craft diverse decision-making scenarios. You can then check whether an agent's behavioral outputs stay aligned with its defined psychological profile, including its emotional dimensions and value-consistent decision-making.

Key insights

HEART-Bench evaluates LLM agents' human-like psychological consistency using structured personality profiles and autobiographical memories.

Principles

Human-like psychology in LLM agents requires emotional dimensions.
Personality traits and memories drive consistent decisions.
Situational context shapes behavioral manifestations.

Method

The benchmark constructs 11 characters based on Big Five personality traits, each with 1,000 episodic memories. It uses 64 scenarios built on the DIAMONDS taxonomy to generate multiple-choice questions (MCQs) that test behavioral consistency; 673 MCQs remain after human validation.
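
To make the setup above concrete, here is a minimal Python sketch of how a character profile and its multiple-choice questions could be represented and scored. It is an illustration only, not the benchmark's actual code or data format: the class names, the 0 to 1 trait scale, the single-correct-option convention, and the agent callable are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

BIG_FIVE = ("openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism")


@dataclass
class Character:
    """Hypothetical character profile: trait scores plus episodic memories."""
    name: str
    traits: Dict[str, float]                 # assumed scale: 0.0 (low) to 1.0 (high)
    memories: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        missing = set(BIG_FIVE) - set(self.traits)
        if missing:
            raise ValueError(f"missing Big Five traits: {sorted(missing)}")


@dataclass
class Question:
    """Hypothetical MCQ: one scenario posed to one character."""
    character: str        # name of the character the question targets
    scenario: str
    options: List[str]
    answer: int           # index of the profile-consistent option (assumed unique)


# An agent maps (character, question) to the index of the option it chooses.
Agent = Callable[[Character, Question], int]


def accuracy_by_character(agent: Agent,
                          characters: List[Character],
                          questions: List[Question]) -> Dict[str, float]:
    """Fraction of questions per character where the agent picks the labeled option."""
    by_name = {c.name: c for c in characters}
    correct = {name: 0 for name in by_name}
    total = {name: 0 for name in by_name}
    for q in questions:
        character = by_name[q.character]
        total[character.name] += 1
        if agent(character, q) == q.answer:
            correct[character.name] += 1
    # Characters with no questions are omitted rather than reported as 0.
    return {name: correct[name] / total[name] for name in total if total[name]}
```

In use, the agent callable would wrap an LLM prompted with the character's traits and memories; reporting accuracy per character rather than a single pooled number makes it easier to see whether some profiles are harder to simulate than others.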

In practice

Test LLM agents for personality consistency.
Integrate autobiographical memories into agent profiles.
Design scenarios using the DIAMONDS taxonomy.
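
When designing scenarios with the DIAMONDS taxonomy, a simple coverage check can flag dimensions that a scenario set leaves untested. The sketch below is a hypothetical helper, not part of HEART-Bench: it assumes each scenario is tagged with a single dominant dimension, which the source does not specify.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

# The eight DIAMONDS dimensions, spelled as in the taxonomy's acronym.
DIAMONDS = ("Duty", "Intellect", "Adversity", "Mating",
            "pOsitivity", "Negativity", "Deception", "Sociality")


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record tagged with one dominant dimension."""
    text: str
    dimension: str  # assumption: exactly one DIAMONDS label per scenario

    def __post_init__(self) -> None:
        if self.dimension not in DIAMONDS:
            raise ValueError(f"unknown DIAMONDS dimension: {self.dimension!r}")


def coverage(scenarios: List[Scenario]) -> Dict[str, int]:
    """Count scenarios per dimension, including dimensions with zero scenarios."""
    counts = Counter(s.dimension for s in scenarios)
    return {d: counts.get(d, 0) for d in DIAMONDS}


def uncovered(scenarios: List[Scenario]) -> List[str]:
    """Dimensions that no scenario in the set exercises."""
    return [d for d, n in coverage(scenarios).items() if n == 0]
```

Running `uncovered(...)` on a draft scenario set before writing questions helps keep the evaluation from concentrating on a few situation types.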

Topics

LLM Agents
Personality Simulation
Benchmarking
Autobiographical Memory
DIAMONDS Taxonomy
Big Five Personality Traits

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.