Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

The article introduces an automated, domain-agnostic framework for quantifying the creativity of large language models (LLMs) on open-ended tasks. The framework separates measurement from the creative task itself, which makes assessment scalable. Divergent creativity is measured with semantic entropy, a reference-free metric of novelty and diversity that is validated against human annotations and LLM-based judgments. Convergent creativity is assessed with a novel retrieval-based multi-agent judge framework, which evaluates context-sensitive task fulfillment and is reported to improve efficiency by over 60%. The framework is validated in three distinct domains, problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. The empirical results, published on 2026-06-10, indicate that it reliably captures novelty, diversity, and task fulfillment, and show how LLM properties such as size, temperature, recency, and reasoning influence creative performance.
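To make the semantic entropy idea concrete, here is a minimal Python sketch under stated assumptions. It groups sampled responses into clusters of equivalent meaning and takes the Shannon entropy of the cluster sizes, so a model that keeps repeating one idea scores near zero and a model that produces many distinct ideas scores higher. The function names, the greedy clustering, and the word-overlap equivalence test are all hypothetical stand-ins; the source does not say how the framework decides that two responses mean the same thing.

```python
import math
import re


def tokens(text):
    """Lowercased word set; a crude stand-in for a real semantic representation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similar(a, b, threshold=0.5):
    """Hypothetical equivalence test: Jaccard overlap of the two word sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return ta == tb
    return len(ta & tb) / len(ta | tb) >= threshold


def cluster(responses, threshold=0.5):
    """Greedy clustering: a response joins the first cluster whose first
    member it matches, otherwise it starts a new cluster."""
    clusters = []
    for response in responses:
        for members in clusters:
            if similar(response, members[0], threshold):
                members.append(response)
                break
        else:
            clusters.append([response])
    return clusters


def semantic_entropy(responses, threshold=0.5):
    """Shannon entropy (in nats) of the cluster-size distribution:
    H = -sum_c p(c) * ln p(c), where p(c) = |cluster c| / number of responses."""
    if not responses:
        return 0.0
    n = len(responses)
    sizes = [len(members) for members in cluster(responses, threshold)]
    return -sum((s / n) * math.log(s / n) for s in sizes)


# Four sampled answers to one prompt; the first two express the same idea.
samples = [
    "Use the belt as a tourniquet",
    "Use the belt as a tourniquet",
    "Tie the rope to the door handle",
    "Burn the newspaper for warmth",
]
entropy = semantic_entropy(samples)
```

In a real setting the `similar` function would likely be replaced by an embedding-based or entailment-based check, and the responses would be repeated samples from the model under evaluation.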

Key takeaway

For AI scientists and NLP engineers who evaluate the creative capabilities of LLMs, this framework provides a standardized, scalable approach. Adopt its domain-agnostic metrics, namely semantic entropy for divergent thinking and the multi-agent judge for convergent task fulfillment, to assess models robustly. You can then benchmark models systematically and see how parameters such as temperature or model size affect creative performance, which in turn supports faster progress in creative AI development.

Key insights

The framework offers a scalable, domain-agnostic method to evaluate LLM creativity by separating measurement from task.

Method

Divergent creativity is measured with semantic entropy, which captures novelty and diversity. Convergent creativity is assessed with a retrieval-based multi-agent judge for context-sensitive task fulfillment, which improves efficiency by over 60%.
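The convergent side can be illustrated with a small Python sketch, again under stated assumptions. The source does not describe the judge's internals, so everything here is hypothetical: the `retrieve` step picks the reference snippets most relevant to the task by word overlap, several judge callables each score the response against that retrieved context, and the scores are averaged into a pass or fail decision. In practice each judge would presumably be an LLM call; plain functions stand in for them so the example runs on its own.

```python
import re
from statistics import mean


def words(text):
    """Lowercased word set used for the toy retrieval and toy judge."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(task, corpus, k=2):
    """Return the k corpus entries sharing the most words with the task
    (a stand-in for a real retriever such as an embedding index)."""
    query = words(task)
    ranked = sorted(corpus, key=lambda doc: len(query & words(doc)), reverse=True)
    return ranked[:k]


def overlap_judge(task, response, context):
    """Toy judge: fraction of retrieved-context words the response mentions."""
    context_words = set().union(*(words(doc) for doc in context)) if context else set()
    if not context_words:
        return 0.0
    return len(context_words & words(response)) / len(context_words)


def judge_response(task, response, corpus, judges, k=2, threshold=0.5):
    """Retrieve context once, let every judge score the response in [0, 1],
    then average the scores and compare against a pass threshold."""
    context = retrieve(task, corpus, k)
    scores = [judge(task, response, context) for judge in judges]
    score = mean(scores)
    return score, score >= threshold


# Hypothetical reference snippets and task, for illustration only.
corpus = [
    "rubber band adds grip when opening a jar lid",
    "a candle can be used as a light source",
    "duct tape can seal a leaking pipe",
]
task = "open a sealed jar using a rubber band"
response = "wrap the rubber band around the jar lid for grip"
score, passed = judge_response(task, response, corpus, [overlap_judge], k=1)
```

Retrieving the context once and sharing it across judges keeps the per-judge input small; whether that is how the reported efficiency gain arises is not stated in the source.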

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.