Automated Creativity Evaluation of Language Models Across Open-Ended Tasks
Summary
Researchers introduced an automated, domain-agnostic framework for quantifying the creativity of large language models (LLMs) across open-ended tasks, addressing the limitations of existing task-specific metrics. The framework separates the measurement apparatus from the creative task, which enables scalable assessment. Divergent creativity is evaluated with semantic entropy, a reference-free metric for novelty and diversity that was validated against human annotations and LLM-based judgments. Convergent creativity is assessed with a novel retrieval-based multi-agent judge framework, which is reported to improve efficiency by over 60% when evaluating context-sensitive task fulfillment. The framework was validated across three distinct domains, problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results confirm its reliability in capturing novelty, diversity, and task fulfillment, and reveal how model properties such as size, temperature, recency, and reasoning influence creative performance. This work establishes a generalizable standard for automated LLM creativity evaluation.
Key takeaway
NLP engineers developing or deploying large language models often find that traditional creativity metrics are task-specific and do not scale. Consider adopting this automated, domain-agnostic framework to evaluate LLM creativity systematically across diverse open-ended tasks. It provides a reproducible standard for assessing novelty, diversity, and task fulfillment, which supports more robust benchmarking of creative AI applications. Use it to understand how model properties such as temperature influence creative outputs.
Key insights
A new domain-agnostic framework automates large language model creativity evaluation across diverse open-ended tasks.
Principles
- Separate creativity measurement from task specifics.
- Distinguish divergent from convergent creativity.
- Model properties impact creative output.
Method
Divergent creativity is measured with semantic entropy; convergent creativity is assessed with a retrieval-based multi-agent judge that evaluates task fulfillment.
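To make the divergent-creativity metric concrete, here is a minimal sketch of semantic entropy. It assumes the common formulation in which several sampled responses to one prompt are grouped into meaning-equivalent clusters and Shannon entropy is computed over the cluster frequencies. The summary does not say how the paper clusters responses, so the token-overlap similarity (`jaccard`), the greedy clustering, and the `threshold` value are illustrative assumptions, not the authors' method.

```python
import math
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for a semantic-equivalence score (assumption):
    overlap of lowercase token sets. A real setup would likely use
    an entailment model or sentence embeddings instead."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def cluster_responses(
    responses: List[str],
    similar: Callable[[str, str], float] = jaccard,
    threshold: float = 0.5,  # hypothetical cut-off for "same meaning"
) -> List[List[str]]:
    """Greedy clustering: each response joins the first cluster whose
    representative (first member) it matches, else starts a new one."""
    clusters: List[List[str]] = []
    for response in responses:
        for cluster in clusters:
            if similar(response, cluster[0]) >= threshold:
                cluster.append(response)
                break
        else:
            clusters.append([response])
    return clusters


def semantic_entropy(
    responses: List[str],
    similar: Callable[[str, str], float] = jaccard,
    threshold: float = 0.5,
) -> float:
    """Shannon entropy (in nats) of the empirical distribution of
    responses over semantic clusters. Higher means more diverse."""
    if not responses:
        return 0.0
    clusters = cluster_responses(responses, similar, threshold)
    n = len(responses)
    return sum(-(len(c) / n) * math.log(len(c) / n) for c in clusters)
```

Under these assumptions, a set of samples that all say the same thing scores 0, while a set of n mutually distinct samples scores log(n), the maximum for n samples. Because no reference answers are needed, the same function can be applied to outputs from any open-ended task.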
In practice
- Evaluate LLMs on problem-solving, research ideation, and creative writing tasks.
- Analyze how LLM size, temperature, recency, and reasoning affect creative performance.
Topics
- LLM Creativity Evaluation
- Automated Evaluation Frameworks
- Semantic Entropy
- Multi-Agent Judge
- Open-Ended Tasks
- Divergent Creativity
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.