NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs
Summary
NICE is a theory-grounded diagnostic benchmark for measuring the social intelligence of large language models (LLMs), a capability that matters for applications such as emotional companionship and customer service. Existing benchmarks often lack a unified framework for fine-grained diagnosis of social abilities. To address this, the authors built a social intelligence framework through literature review and expert validation. It comprises 4 categories and 11 dimensions, each with fine-grained capability facets. The framework underpins NICE (Norm, Interaction, Cognition, Experience), a diagnostic benchmark of 137 items operationalized in Chinese contexts. Evaluations of 5 frontier LLMs against a human reference group showed that the models achieve higher aggregate accuracy yet are consistently weak in "Communication," specifically in multi-turn communication, nonverbal communication, and synchrony. NICE thus offers a method for theory-grounded diagnosis of socially consequential LLM weaknesses.
Key takeaway
AI scientists and ML engineers deploying LLMs in social interaction contexts should not rely on aggregate accuracy alone to assess social intelligence. Theory-grounded diagnostic benchmarks like NICE can pinpoint specific weaknesses, such as those in multi-turn or nonverbal communication. This enables targeted model improvements and supports safety and quality in human-AI interactions.
Key insights
NICE provides a theory-grounded diagnostic benchmark for identifying specific social intelligence weaknesses in LLMs, which matters when those models are deployed in social contexts.
Principles
- Social intelligence evaluation needs a unified, theory-grounded framework.
- Fine-grained diagnosis identifies specific LLM social capability deficits.
- Psychometric principles guide robust framework construction.
Method
Construct a social intelligence framework via literature review and expert validation, then operationalize it into a diagnostic benchmark with context-specific items for fine-grained LLM evaluation.
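To make "fine-grained evaluation" concrete, here is a minimal sketch of scoring items by facet rather than only in aggregate. The record layout (`facet`, `correct`), the helper `accuracy_by`, and the toy records are all assumptions made for illustration; they are not the NICE data format or results. Only the three facet names are taken from the summary above.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Group graded items by `key` and return {group: accuracy}.

    Each record is a dict holding the grouping field and a boolean 'correct'.
    (Hypothetical layout, not the benchmark's actual schema.)
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy graded items, invented for this sketch (not results from the paper).
records = [
    {"facet": "multi-turn communication", "correct": False},
    {"facet": "multi-turn communication", "correct": True},
    {"facet": "nonverbal communication", "correct": False},
    {"facet": "synchrony", "correct": False},
    {"facet": "placeholder facet", "correct": True},
    {"facet": "placeholder facet", "correct": True},
    {"facet": "placeholder facet", "correct": True},
    {"facet": "placeholder facet", "correct": True},
]

# A single aggregate number hides where the errors are concentrated.
aggregate = sum(r["correct"] for r in records) / len(records)

# The per-facet breakdown exposes the weak spots.
per_facet = accuracy_by(records, "facet")

print(f"aggregate: {aggregate:.3f}")
for facet, acc in per_facet.items():
    print(f"{facet}: {acc:.2f}")
```

The same helper can group by any label attached to an item, for example a category or dimension field, if the items carry one.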
In practice
- Diagnose LLM weaknesses in multi-turn communication.
- Identify deficits in nonverbal communication understanding.
- Pinpoint issues related to synchrony in interactions.
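One way to act on the points above is to compare per-facet model accuracy with the human reference and flag the facets where the model trails. The sketch below assumes per-facet accuracies have already been computed. The function `facet_gaps`, the `threshold` parameter, and every number are hypothetical and chosen only to illustrate the pattern described in the summary (higher average, weaker on specific facets); none of it comes from the paper.

```python
def facet_gaps(model_acc, human_acc, threshold=0.0):
    """Return (facet, gap) pairs where the model trails the human reference.

    gap = model accuracy - human accuracy; only gaps below -threshold are
    kept, sorted so the largest deficit comes first.
    """
    gaps = {
        facet: model_acc[facet] - human_acc[facet]
        for facet in model_acc
        if facet in human_acc
    }
    flagged = [(facet, gap) for facet, gap in gaps.items() if gap < -threshold]
    return sorted(flagged, key=lambda pair: pair[1])


# Invented per-facet accuracies for illustration only.
model_acc = {
    "multi-turn communication": 0.5,
    "nonverbal communication": 0.25,
    "synchrony": 0.5,
    "placeholder facet A": 1.0,
    "placeholder facet B": 1.0,
}
human_acc = {
    "multi-turn communication": 0.75,
    "nonverbal communication": 0.75,
    "synchrony": 0.75,
    "placeholder facet A": 0.25,
    "placeholder facet B": 0.25,
}

# Unweighted mean over facets: the model can lead here and still trail
# on individual facets.
model_mean = sum(model_acc.values()) / len(model_acc)
human_mean = sum(human_acc.values()) / len(human_acc)

flagged = facet_gaps(model_acc, human_acc)
for facet, gap in flagged:
    print(f"{facet}: model trails human by {-gap:.2f}")
```

Raising `threshold` restricts the report to larger deficits, which may help when per-facet item counts are small and minor gaps are noisy.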
Topics
- Large Language Models
- Social Intelligence
- Diagnostic Benchmarks
- Human-AI Interaction
- Psychometrics
- Communication Skills
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.