Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

2026-06-10 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, medium

Summary

Researchers introduced an automated, domain-agnostic framework for quantifying large language model (LLM) creativity across open-ended tasks, addressing the limitations of existing task-specific metrics. The framework separates the measurement apparatus from the creative task, which makes assessment scalable. Divergent creativity is evaluated using semantic entropy, a reference-free metric for novelty and diversity that was validated against human annotations and LLM-based judgments. Convergent creativity is assessed with a novel retrieval-based multi-agent judge framework, which the authors report is over 60% more efficient at evaluating context-sensitive task fulfillment. The framework was validated across three distinct domains, problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results confirm its reliability in capturing novelty, diversity, and task fulfillment, and show how model properties such as size, temperature, recency, and reasoning influence creative performance. The work proposes a generalizable standard for automated LLM creativity evaluation.

Key takeaway

NLP engineers developing or deploying large language models often find that traditional creativity metrics are task-specific and hard to scale. Consider adopting this automated, domain-agnostic framework to evaluate LLM creativity systematically across diverse open-ended tasks. It provides a reproducible standard for assessing novelty, diversity, and task fulfillment, which supports more robust benchmarking of creative AI applications. It can also show how model parameters such as temperature influence creative outputs.

Key insights

A new domain-agnostic framework automates large language model creativity evaluation across diverse open-ended tasks.

Principles

Separate creativity measurement from task specifics.
Distinguish divergent from convergent creativity.
Model properties such as size, temperature, recency, and reasoning influence creative output.

Method

Divergent creativity uses semantic entropy; convergent creativity uses a retrieval-based multi-agent judge for task fulfillment.
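
As a rough illustration of the divergent-creativity side, the sketch below computes an entropy over groups of semantically similar responses: sample several outputs for one prompt, group them into clusters, and take the Shannon entropy of the cluster proportions. It is a minimal sketch under stated assumptions, not the paper's implementation. The function names are hypothetical, and the token-overlap check (jaccard_equivalent) is only a toy stand-in for whatever semantic-equivalence model an actual evaluation would use.

```python
import math
from typing import Callable, List


def jaccard_equivalent(a: str, b: str, threshold: float = 0.5) -> bool:
    """Toy equivalence check (assumption): token-set Jaccard overlap.

    A real setup would likely use an embedding or entailment model here.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return True
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b) >= threshold


def cluster_responses(
    responses: List[str],
    equivalent: Callable[[str, str], bool],
) -> List[List[str]]:
    """Greedily assign each response to the first cluster whose
    representative (its first member) is judged equivalent."""
    clusters: List[List[str]] = []
    for response in responses:
        for cluster in clusters:
            if equivalent(response, cluster[0]):
                cluster.append(response)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([response])
    return clusters


def semantic_entropy(
    responses: List[str],
    equivalent: Callable[[str, str], bool] = jaccard_equivalent,
) -> float:
    """Shannon entropy (in nats) of the cluster-size distribution."""
    if not responses:
        return 0.0
    clusters = cluster_responses(responses, equivalent)
    total = len(responses)
    entropy = 0.0
    for cluster in clusters:
        p = len(cluster) / total
        entropy -= p * math.log(p)
    return entropy


if __name__ == "__main__":
    samples = [
        "use duct tape to seal the leak",
        "use duct tape to seal the pipe",
        "melt candle wax over the crack",
        "wrap a rubber glove around it",
    ]
    print(f"semantic entropy: {semantic_entropy(samples):.3f} nats")
```

Under this sketch, a model that keeps producing the same idea scores near zero, while one that spreads its samples evenly over k distinct ideas approaches log(k). Because the score depends only on the sampled outputs, no reference answers are needed, which matches the reference-free property described above.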

In practice

Evaluate LLMs on problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA) tasks.
Analyze how LLM size, temperature, recency, and reasoning affect creative performance.

Topics

LLM Creativity Evaluation
Automated Evaluation Frameworks
Semantic Entropy
Multi-Agent Judge
Open-Ended Tasks
Divergent Creativity

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.