CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
Summary
CreativeBench is a benchmark for quantitatively evaluating machine creativity in code generation, addressing the lack of rigorous assessment for evolutionary systems such as AlphaEvolve. It has two subsets: CreativeBench-Combo for combinatorial creativity and CreativeBench-Explore for exploratory creativity. The benchmark uses an automated pipeline based on reverse engineering and self-play. To distinguish creativity from hallucination, it scores outputs with a unified metric defined as the product of quality and novelty. Analysis of state-of-the-art models yielded three findings: scaling significantly improves combinatorial creativity but gives diminishing returns for exploratory creativity; larger models exhibit "convergence-by-scaling", becoming more correct but less divergent; and reasoning primarily benefits constrained exploration. The paper also proposes EvoRePE, a plug-and-play inference-time steering strategy that the authors report consistently enhances machine creativity.
Key takeaway
For machine learning engineers developing creative code generation systems, CreativeBench offers a way to measure creativity rather than judge it informally. Use its combinatorial and exploratory subsets to assess model performance and pinpoint where creativity falls short. Consider adding the EvoRePE steering strategy at inference time, particularly when scaling makes outputs less divergent, so that gains in correctness do not come at the cost of novelty.
Key insights
CreativeBench provides a quantitative benchmark for machine creativity in code generation, filling the gap left by the lack of rigorous evaluation for evolutionary systems.
Principles
- Quantitative evaluation is crucial for evolutionary AI systems.
- Creativity is scored as the product of quality and novelty, which separates it from hallucination.
- Scaling significantly improves combinatorial creativity but yields diminishing returns for exploratory creativity.
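A minimal sketch of the quality-times-novelty idea, to show why a product separates creativity from hallucination. The summary only states that the metric is the product of the two terms; the function name `creativity_score`, the assumption that both terms lie in [0, 1], and the example values are illustrative and not taken from the paper.

```python
def creativity_score(quality: float, novelty: float) -> float:
    """Return quality * novelty.

    Assumption: both inputs are normalized to [0, 1]. How the paper
    actually measures each term is not described in this summary.
    """
    if not (0.0 <= quality <= 1.0 and 0.0 <= novelty <= 1.0):
        raise ValueError("quality and novelty must be in [0, 1]")
    return quality * novelty


# Hypothetical cases (made-up values, not benchmark results):
hallucination = creativity_score(quality=0.0, novelty=0.9)  # novel but incorrect
plain_copy = creativity_score(quality=1.0, novelty=0.0)     # correct but not novel
creative = creativity_score(quality=1.0, novelty=0.5)       # correct and novel
```

Because the terms are multiplied, an output that is novel but incorrect scores zero, as does one that is correct but not novel. Only outputs that do well on both terms score high.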
Method
CreativeBench employs reverse engineering and self-play for automated code generation evaluation, while EvoRePE offers an inference-time steering strategy.
In practice
- Benchmark code generation models with CreativeBench-Combo and CreativeBench-Explore.
- Apply EvoRePE to improve model creativity during inference.
- Analyze model behavior using the quality-novelty metric.
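One way to analyze model behavior with the metric is to report mean quality, mean novelty, and mean creativity side by side, so a drop in divergence is not hidden behind a gain in correctness. The sketch below assumes per-sample `(quality, novelty)` pairs in [0, 1]. The helper `summarize` and both sample lists are hypothetical, with made-up numbers chosen only to illustrate the "convergence-by-scaling" pattern.

```python
from statistics import mean


def summarize(samples):
    """Aggregate per-sample (quality, novelty) pairs into mean scores."""
    return {
        "quality": mean(q for q, _ in samples),
        "novelty": mean(n for _, n in samples),
        # Creativity is averaged per sample, not computed from the two means.
        "creativity": mean(q * n for q, n in samples),
    }


# Made-up illustrative data, not results from the paper.
small_model = [(1.0, 0.5), (0.0, 1.0), (0.5, 0.5), (0.5, 1.0)]
large_model = [(1.0, 0.25), (1.0, 0.25), (1.0, 0.5), (1.0, 0.0)]

small = summarize(small_model)
large = summarize(large_model)

for name, stats in (("small", small), ("large", large)):
    print(name, stats)
```

In this toy data the larger model is more correct but less divergent, and its mean creativity ends up lower. A quality-only leaderboard would not show that trade-off.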
Topics
- Machine Creativity
- Code Generation
- AI Benchmarking
- Evolutionary Systems
- Evaluation Metrics
- Inference Steering
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.