HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

HEART-Bench is a benchmark for systematically evaluating whether Large Language Model (LLM) agents can simulate coherent, human-like psychology. Introduced on 2026-05-28, it constructs 11 distinct human characters, each grounded in orthogonal Big Five personality traits and paired with 1,000 structured, autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To assess how that psychology shows up in behavior, HEART-Bench uses a curated suite of 64 decision-making scenarios guided by the DIAMONDS taxonomy, which characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. The benchmark tests whether an agent can combine its assigned personality traits and memories to make behavioral decisions consistent with its specific psychological profile. After human validation, the suite comprises 673 multiple-choice questions.

Key takeaway

For AI Scientists and Research Scientists developing or evaluating LLM agents, HEART-Bench provides a principled testbed for assessing human-like psychological consistency. Consider grounding your agent designs in structured personality traits and autobiographical memories, and use the DIAMONDS taxonomy to craft diverse decision-making scenarios so that you can check whether an agent's behavioral outputs align with its defined psychological profile. This approach helps validate both the emotional dimensions of an agent's behavior and whether its decisions stay consistent with its values.
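
One way to act on the scenario-crafting advice is to tag each scenario with the DIAMONDS dimensions it exercises and check which dimensions remain uncovered. The sketch below is illustrative only: the `Scenario` record, the `uncovered_dimensions` helper, and the two example scenarios are assumptions made for this example, not part of HEART-Bench or its released data.

```python
from dataclasses import dataclass
from enum import Enum


class Diamonds(Enum):
    """The eight DIAMONDS situation dimensions listed in the summary."""
    DUTY = "Duty"
    INTELLECT = "Intellect"
    ADVERSITY = "Adversity"
    MATING = "Mating"
    POSITIVITY = "pOsitivity"
    NEGATIVITY = "Negativity"
    DECEPTION = "Deception"
    SOCIALITY = "Sociality"


@dataclass(frozen=True)
class Scenario:
    # Hypothetical record; the benchmark's actual schema is not described here.
    text: str
    dimensions: frozenset


def uncovered_dimensions(scenarios):
    """Return the DIAMONDS dimensions that no scenario is tagged with."""
    covered = set()
    for scenario in scenarios:
        covered |= scenario.dimensions
    return set(Diamonds) - covered


# Made-up scenarios, used only to exercise the helper.
scenarios = [
    Scenario("A colleague asks you to cover their shift before a deadline.",
             frozenset({Diamonds.DUTY, Diamonds.SOCIALITY})),
    Scenario("You suspect a seller is misrepresenting a used car.",
             frozenset({Diamonds.DECEPTION, Diamonds.NEGATIVITY})),
]
missing = uncovered_dimensions(scenarios)
print(sorted(d.value for d in missing))
```

A coverage check like this makes gaps visible early, before any agent is evaluated on the scenario set.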

Key insights

HEART-Bench evaluates LLM agents' human-like psychological consistency using structured personality profiles and autobiographical memories.

Principles

Method

The benchmark constructs 11 characters based on Big Five personality traits, each with 1,000 episodic memories. It uses 64 scenarios guided by the DIAMONDS taxonomy to generate multiple-choice questions (MCQs) for evaluating behavioral consistency; 673 MCQs remain after human validation.
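
The structure described above (a character with trait scores and memories, plus multiple-choice questions keyed to the profile-consistent option) can be sketched as a small data model with a scoring function. Everything here is a hypothetical illustration: the `Character` and `MCQ` fields, the trait scale, and the plain-accuracy metric are assumptions, since the summary does not specify the benchmark's schema or scoring rule.

```python
from dataclasses import dataclass, field

BIG_FIVE = ("openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism")


@dataclass
class Character:
    # Hypothetical schema: one score per Big Five trait plus free-text memories.
    name: str
    traits: dict
    memories: list = field(default_factory=list)

    def __post_init__(self):
        missing = set(BIG_FIVE) - set(self.traits)
        if missing:
            raise ValueError(f"missing traits: {sorted(missing)}")


@dataclass
class MCQ:
    character: str
    question: str
    options: list
    answer: int  # index of the option assumed to match the character's profile


def consistency_accuracy(questions, choose):
    """Fraction of questions where choose(q) returns the keyed option index."""
    if not questions:
        raise ValueError("no questions to score")
    correct = sum(1 for q in questions if choose(q) == q.answer)
    return correct / len(questions)


# Made-up character and questions, used only to exercise the scorer.
alex = Character("Alex", {trait: 0.5 for trait in BIG_FIVE},
                 ["Moved to a new city at age nine."])
questions = [
    MCQ("Alex", "A friend cancels plans at the last minute. What do you do?",
        ["Stay home and read", "Call someone else to go out"], 0),
    MCQ("Alex", "You find a wallet on the street. What do you do?",
        ["Keep walking", "Hand it in at the nearest station"], 1),
]


def always_first(question):
    """Stand-in for an agent: always picks the first option."""
    return 0


score = consistency_accuracy(questions, always_first)
print(score)
```

In a real evaluation, `always_first` would be replaced by a call to the agent under test, prompted with the character's traits and memories.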

Topics

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.