CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

2026-03-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

CreativeBench is a benchmark for quantitatively evaluating machine creativity in code generation, motivated by the lack of rigorous assessment for evolutionary systems such as AlphaEvolve. It has two subsets, CreativeBench-Combo for combinatorial creativity and CreativeBench-Explore for exploratory creativity, and relies on an automated pipeline that uses reverse engineering and self-play. To distinguish creativity from hallucination, it uses a unified metric defined as the product of quality and novelty. Analysis of state-of-the-art models yielded three findings: scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; larger models exhibit "convergence-by-scaling," becoming more correct but less divergent; and reasoning primarily benefits constrained exploration. The paper also proposes EvoRePE, a plug-and-play inference-time steering strategy designed to consistently enhance machine creativity.

Key takeaway

For machine learning engineers building creative code generation systems, CreativeBench offers a way to measure creativity rather than assume it. Use its combinatorial and exploratory subsets to assess model performance and pinpoint specific creativity limitations. Consider the EvoRePE steering strategy to enhance creative output, particularly when scaling makes results less divergent, so that your systems retain both correctness and novelty.

Key insights

CreativeBench provides a quantitative benchmark for machine creativity in code generation, filling the gap left by the lack of rigorous assessment for evolutionary systems such as AlphaEvolve.

Principles

Quantitative evaluation is crucial for evolutionary AI systems.
Creativity is scored as the product of quality and novelty, which separates it from hallucination.
Scaling significantly improves combinatorial creativity but yields diminishing returns for exploratory creativity.
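
To see why a product metric separates creativity from hallucination, consider a minimal Python sketch. Only the rule "creativity = quality x novelty" comes from the summary above. The two component scores are assumptions made for illustration: quality is taken as the fraction of tests passed, and novelty as one minus the textual similarity to a reference solution. The paper's actual definitions of quality and novelty may differ, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def quality(passed: int, total: int) -> float:
    """Assumed quality proxy: fraction of test cases passed, in [0, 1]."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return passed / total


def novelty(candidate: str, reference: str) -> float:
    """Assumed novelty proxy: 1 minus textual similarity to a reference.

    This is an illustrative stand-in, not the benchmark's own measure.
    """
    return 1.0 - SequenceMatcher(None, candidate, reference).ratio()


def creativity(q: float, n: float) -> float:
    """Unified score as described in the summary: quality times novelty."""
    if not (0.0 <= q <= 1.0 and 0.0 <= n <= 1.0):
        raise ValueError("quality and novelty must lie in [0, 1]")
    return q * n
```

Under these assumptions, an output that is novel but fails every test (quality 0) scores 0, and so does a correct output identical to the reference (novelty 0). Only outputs that are both correct and different earn a positive score, which is the property a sum or average of the two terms would not have.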

Method

CreativeBench relies on an automated pipeline that uses reverse engineering and self-play to evaluate creativity in code generation. EvoRePE is a plug-and-play steering strategy applied at inference time.

In practice

Benchmark code generation models with CreativeBench-Combo and -Explore.
Apply EvoRePE to improve model creativity during inference.
Analyze model behavior using the quality-novelty metric.
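
One way to analyze model behavior with the quality-novelty metric is to keep the two components separate as well as multiplied, so that a pattern like "more correct but less divergent" stays visible. The sketch below assumes each model output has already been scored as a (quality, novelty) pair in [0, 1]. The function names and data layout are hypothetical and are not part of CreativeBench.

```python
from statistics import mean
from typing import Dict, List, Tuple

# One scored output: (quality, novelty), each assumed to lie in [0, 1].
Sample = Tuple[float, float]


def summarize(samples: List[Sample]) -> Dict[str, float]:
    """Mean quality, mean novelty, and mean per-sample creativity."""
    if not samples:
        raise ValueError("samples must not be empty")
    return {
        "quality": mean(q for q, _ in samples),
        "novelty": mean(n for _, n in samples),
        # Creativity is computed per output, then averaged.
        "creativity": mean(q * n for q, n in samples),
    }


def more_correct_less_divergent(
    small: Dict[str, float], large: Dict[str, float]
) -> bool:
    """True if the larger model gains quality while losing novelty."""
    return (
        large["quality"] > small["quality"]
        and large["novelty"] < small["novelty"]
    )
```

Comparing summaries this way shows why the product alone can hide a trade-off: two models can reach the same mean creativity score while one gets there through correctness and the other through divergence.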

Topics

Machine Creativity
Code Generation
AI Benchmarking
Evolutionary Systems
Evaluation Metrics
Inference Steering

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.