Automated Creativity Evaluation of Language Models Across Open-Ended Tasks
Summary
A new automated, domain-agnostic framework quantifies large language model (LLM) creativity across open-ended tasks. It separates the measurement from the creative task itself, which makes assessment scalable. Divergent creativity is measured with semantic entropy, a reference-free metric for novelty and diversity that was validated against human annotations and LLM-based judgments. Convergent creativity is assessed with a novel retrieval-based multi-agent judge framework, which evaluates context-sensitive task fulfillment and improves efficiency by over 60%. The framework was validated in three distinct domains, problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results, published on 2026-06-10, show that it reliably captures novelty, diversity, and task fulfillment, and reveal how LLM properties such as size, temperature, recency, and reasoning influence creative performance.
Key takeaway
For AI Scientists and NLP Engineers who evaluate LLM creative capabilities, this framework provides a standardized, scalable approach. Adopt its domain-agnostic metrics, semantic entropy for divergent thinking and the multi-agent judge for convergent task fulfillment, to make assessments comparable across tasks. This lets you benchmark models systematically and see how properties such as temperature or size affect creative performance.
Key insights
The framework offers a scalable, domain-agnostic method for evaluating LLM creativity by separating the measurement from the task.
Principles
- Creativity evaluation needs task-agnostic metrics.
- Divergent and convergent creativity require distinct measures.
- Model properties influence creative output.
Method
Divergent creativity is measured with semantic entropy, which captures novelty and diversity without reference answers. Convergent creativity is assessed with a retrieval-based multi-agent judge that evaluates context-sensitive task fulfillment and improves efficiency by over 60%.
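To make the idea of semantic entropy concrete, here is a minimal sketch, not the framework's actual implementation. It assumes the common formulation in which sampled responses are grouped into clusters of equivalent meaning and Shannon entropy is computed over the cluster sizes. The summary does not describe how the framework groups responses, so the token-overlap similarity, the 0.5 threshold, and the function names `jaccard_similarity`, `cluster_responses`, and `semantic_entropy` are all illustrative assumptions; a real setup would likely swap in an embedding model or an entailment check.

```python
import math
from typing import Callable, List


def jaccard_similarity(a: str, b: str) -> float:
    """Toy stand-in for a semantic-equivalence check (token overlap).

    Assumption: a real pipeline would use embeddings or entailment instead.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cluster_responses(
    responses: List[str],
    similar: Callable[[str, str], float] = jaccard_similarity,
    threshold: float = 0.5,  # hypothetical cutoff, chosen for illustration
) -> List[List[str]]:
    """Greedy clustering: each response joins the first cluster whose
    representative (first member) it matches, else starts a new cluster."""
    clusters: List[List[str]] = []
    for response in responses:
        for cluster in clusters:
            if similar(response, cluster[0]) >= threshold:
                cluster.append(response)
                break
        else:
            clusters.append([response])
    return clusters


def semantic_entropy(
    responses: List[str],
    similar: Callable[[str, str], float] = jaccard_similarity,
    threshold: float = 0.5,
) -> float:
    """Shannon entropy (in nats) over the distribution of meaning clusters."""
    if not responses:
        return 0.0
    clusters = cluster_responses(responses, similar, threshold)
    total = len(responses)
    return sum(
        -(len(c) / total) * math.log(len(c) / total) for c in clusters
    )


if __name__ == "__main__":
    samples = [
        "use the tape to seal the leak",
        "use the tape to seal the leak now",
        "melt the candle wax over the crack",
    ]
    print(f"semantic entropy: {semantic_entropy(samples):.4f}")
```

Under these assumptions, a set of responses that all say the same thing scores 0, and the score grows as the responses spread across more distinct meanings, which is why the metric needs no reference answers.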
In practice
- Validate LLM creativity across diverse domains.
- Assess novelty, diversity, and task fulfillment.
- Analyze how temperature impacts creative output.
Topics
- LLM Creativity Evaluation
- Semantic Entropy
- Multi-Agent Systems
- Open-Ended Tasks
- AI Benchmarking
- Natural Language Generation
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.