NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

NICE is a theory-grounded diagnostic benchmark for measuring the social intelligence of large language models (LLMs), a capability that matters for applications such as emotional companionship and customer service. Existing benchmarks often lack a unified framework for fine-grained diagnosis of social abilities. To address this, the researchers built a social intelligence framework through literature review and expert validation; it comprises 4 categories and 11 dimensions, each with fine-grained capability facets. The framework underpins NICE (Norm, Interaction, Cognition, Experience), a diagnostic benchmark of 137 items operationalized in Chinese contexts. Evaluations of 5 frontier LLMs and a human reference group showed that the models achieve higher aggregate accuracy yet are consistently weak in "Communication," specifically in multi-turn communication, nonverbal communication, and synchrony. NICE thus offers a theory-grounded way to diagnose socially consequential weaknesses in LLMs.

Key takeaway

AI scientists and ML engineers deploying LLMs in social interaction contexts should treat aggregate accuracy as insufficient for assessing social intelligence. Theory-grounded diagnostic benchmarks like NICE can pinpoint specific weaknesses, such as those in multi-turn or nonverbal communication. That diagnosis supports targeted model improvements and safer, higher-quality human-AI interactions.

Key insights

NICE is a theory-grounded diagnostic benchmark that identifies specific social intelligence weaknesses in LLMs, which matters when the models are deployed in social contexts.

Principles

Method

Construct a social intelligence framework via literature review and expert validation, then operationalize it into a diagnostic benchmark with context-specific items for fine-grained LLM evaluation.
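The scoring side of this method can be sketched in a few lines of Python: tag each item with its place in the framework, then report accuracy per dimension alongside the aggregate. This is a minimal illustration, not the authors' implementation. The `Item` schema, the function names, the placeholder dimensions, and the assignment of "Communication" to the "Interaction" category are all assumptions, and the toy responses are synthetic rather than results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One benchmark item tagged with its position in the framework (hypothetical schema)."""
    item_id: str
    category: str   # one of the 4 categories
    dimension: str  # one of the 11 dimensions
    facet: str      # fine-grained capability facet


def aggregate_accuracy(items, correct):
    """Fraction of all items answered correctly."""
    return sum(correct[i.item_id] for i in items) / len(items)


def accuracy_by(items, correct, key):
    """Accuracy grouped by a framework level: 'category', 'dimension', or 'facet'."""
    totals = defaultdict(lambda: [0, 0])  # group -> [n_correct, n_total]
    for item in items:
        bucket = totals[getattr(item, key)]
        bucket[0] += int(correct[item.item_id])
        bucket[1] += 1
    return {group: c / n for group, (c, n) in totals.items()}


# Synthetic toy data, NOT taken from the paper. Placing "Communication"
# under "Interaction" is an assumption; the other dimensions are placeholders.
ITEMS = [
    Item("q1", "Interaction", "Communication", "multi-turn communication"),
    Item("q2", "Interaction", "Communication", "nonverbal communication"),
    Item("q3", "Norm", "placeholder-dimension-A", "placeholder-facet"),
    Item("q4", "Cognition", "placeholder-dimension-B", "placeholder-facet"),
    Item("q5", "Experience", "placeholder-dimension-C", "placeholder-facet"),
]
CORRECT = {"q1": False, "q2": False, "q3": True, "q4": True, "q5": True}

if __name__ == "__main__":
    print("aggregate:", aggregate_accuracy(ITEMS, CORRECT))
    for dim, acc in sorted(accuracy_by(ITEMS, CORRECT, "dimension").items()):
        print(f"{dim}: {acc:.2f}")
```

In this toy run the aggregate score looks moderate while the per-dimension view shows one dimension failing entirely, which is the kind of gap a single accuracy number hides.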

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.