CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

CDR-Bench is a benchmark for evaluating whether large language models (LLMs) can faithfully execute compositional, order-sensitive data refinement recipes. It addresses a gap in existing evaluations by targeting multi-step text processing in which both the composition of operators and their execution order determine the outcome. The benchmark comprises 3,462 high-quality tasks across four real-world data refinement domains and uses 29 distinct operators. Deterministic reference outputs allow exact evaluation in three settings: atomic, order-agnostic, and order-sensitive. Experiments on more than 10 state-of-the-art LLMs consistently show significant performance degradation in compositional settings and a collapse in success rates on order-sensitive recipes, indicating that current LLMs lack the procedural faithfulness needed for reliable compositional data refinement.

Key takeaway

Machine learning engineers designing data refinement pipelines should recognize that current LLMs have significant limitations in executing compositional, order-sensitive text processing tasks. Relying on LLMs for multi-step data refinement, especially where operator sequence matters, is likely to produce unreliable results. Prioritize robust validation, and consider alternative or hybrid approaches for tasks that require high procedural faithfulness, because LLM performance collapses in these scenarios.

Key insights

Current LLMs struggle with the procedural faithfulness required for compositional, order-sensitive data refinement recipes.

Principles

Composition and order are critical in data refinement.
LLM performance degrades sharply in compositional tasks.
Order-sensitive recipe execution is a major LLM weakness.
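
To make the distinction between order-agnostic and order-sensitive recipes concrete, here is a minimal Python sketch. The three operators (lowercase, collapse_whitespace, truncate_10) and the helper names are hypothetical illustrations, not operators taken from CDR-Bench, and the check is relative to a single input string rather than a general proof of commutativity.

```python
from functools import reduce
from itertools import permutations


def lowercase(text: str) -> str:
    """Hypothetical operator: convert all characters to lowercase."""
    return text.lower()


def collapse_whitespace(text: str) -> str:
    """Hypothetical operator: squeeze whitespace runs into single spaces."""
    return " ".join(text.split())


def truncate_10(text: str) -> str:
    """Hypothetical operator: keep only the first 10 characters."""
    return text[:10]


def run_recipe(text: str, operators) -> str:
    """Apply the operators to the text from left to right."""
    return reduce(lambda acc, op: op(acc), operators, text)


def is_order_sensitive(text: str, operators) -> bool:
    """True if some reordering of the operators changes the output for this input."""
    outputs = {run_recipe(text, list(p)) for p in permutations(operators)}
    return len(outputs) > 1


sample = "Data   Refinement  Recipe"

# Lowercasing and whitespace collapsing commute on this input.
print(is_order_sensitive(sample, [lowercase, collapse_whitespace]))

# Truncating before or after collapsing whitespace yields different strings.
print(run_recipe(sample, [collapse_whitespace, truncate_10]))
print(run_recipe(sample, [truncate_10, collapse_whitespace]))
```

A model that silently swaps the last two steps would produce a plausible-looking but wrong string, which is the kind of failure an order-sensitive evaluation is meant to expose.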

Method

CDR-Bench evaluates LLMs using 3,462 tasks across 4 domains and 29 operators, assessing atomic, order-agnostic, and order-sensitive settings with deterministic reference outputs.
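
Because the reference outputs are deterministic, scoring can in principle reduce to exact string comparison grouped by setting. The sketch below illustrates that idea under stated assumptions: the Task fields, the setting labels, and the model callable are hypothetical, and the summary does not describe the benchmark's actual schema or scoring code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions, not the benchmark schema."""
    setting: str        # "atomic", "order_agnostic", or "order_sensitive"
    instruction: str    # natural-language recipe given to the model
    input_text: str     # raw text to refine
    reference: str      # deterministic reference output


def exact_match_rates(
    tasks: Iterable[Task],
    model: Callable[[str, str], str],
) -> Dict[str, float]:
    """Return the fraction of tasks per setting whose output equals the reference exactly."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for task in tasks:
        prediction = model(task.instruction, task.input_text)
        totals[task.setting] += 1
        hits[task.setting] += int(prediction == task.reference)
    return {setting: hits[setting] / totals[setting] for setting in totals}


def toy_model(instruction: str, input_text: str) -> str:
    """Stand-in for an LLM call: it only lowercases and ignores the recipe."""
    return input_text.lower()


tasks = [
    Task("atomic", "Lowercase the text.", "Hello", "hello"),
    Task(
        "order_sensitive",
        "Collapse whitespace, then truncate to 10 characters.",
        "Data   Refinement",
        "Data Refin",
    ),
]

rates = exact_match_rates(tasks, toy_model)
print(rates)
```

Exact match is deliberately strict: a single misplaced character counts as a failure, which suits deterministic references but gives no partial credit for nearly correct outputs.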

Topics

CDR-Bench
Large Language Models
Data Refinement
Compositional Reasoning
Procedural Faithfulness
Benchmark Evaluation

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.