CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

CombEval is a dynamic benchmark designed to evaluate combinatorial counting capabilities in large language models. It generates natural-language problems from typed Cofola specifications, ensuring solver-verified exact answers and enabling systematic control over problem difficulty by varying object type, entity scale, constraint count, and reasoning depth. The framework was used to evaluate 11 LLMs, including open-source models like LLaMA-3-8B-Instruct and closed-source models such as gpt-5.5 and gemini-3-flash-preview-thinking. Results indicate that while larger models show improved accuracy, all models remain brittle on tasks involving ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies. Error analysis highlights failures in constraint interpretation and fundamental counting principles, confirming CombEval's utility as a diagnostic testbed.

Key takeaway

AI scientists and machine learning engineers evaluating LLMs for mathematical reasoning should prioritize benchmarks like CombEval that generate problems dynamically, which guards against data contamination and spurious reasoning. Focus model development on robustness to ordered objects, indistinguishable elements, and multi-step dependencies, where current state-of-the-art models remain brittle. Consider code-augmented reasoning as a diagnostic tool, but recognize its limitations for smaller models.

Key insights

Despite general gains in mathematical reasoning, LLMs still struggle with combinatorial counting, especially under complex constraints and nested reasoning.

Principles

Combinatorial problem difficulty rises predictably with entity scale, constraint count, and reasoning depth.
LLM performance on combinatorial tasks degrades with ordered objects and multi-step dependencies.
Dynamic benchmark generation mitigates data contamination and spurious reasoning issues.
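
To make the "indistinguishable elements" failure mode concrete, the sketch below counts arrangements of a word with repeated letters in two ways: a closed-form multinomial count and a brute-force enumeration. It is an illustrative example only and is not taken from CombEval; the function names and the word "BANANA" are assumptions chosen for the demonstration.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def multiset_arrangements(items: str) -> int:
    """Count distinct orderings when equal symbols are indistinguishable."""
    total = factorial(len(items))
    for multiplicity in Counter(items).values():
        # Divide out the orderings of each group of identical symbols.
        total //= factorial(multiplicity)
    return total


def brute_force_arrangements(items: str) -> int:
    """Enumerate every ordering and deduplicate; feasible only for short inputs."""
    return len(set(permutations(items)))


word = "BANANA"
naive = factorial(len(word))           # wrongly treats all six letters as distinct
exact = multiset_arrangements(word)    # 6! / (3! * 2! * 1!)
checked = brute_force_arrangements(word)

print(naive, exact, checked)  # 720 60 60
```

Answering 720 instead of 60 is the kind of error a model makes when it applies the ordered-objects rule without accounting for indistinguishable elements.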

Method

CombEval generates problems from typed Cofola specifications and verbalizes them with templates. A solver based on weighted first-order model counting (WFOMC) verifies the exact answer for each problem, and varying the specification parameters gives dynamic control over difficulty.
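
The generate, verbalize, and verify loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not CombEval's implementation: `SeatingSpec` is a hypothetical stand-in for a typed specification (it is not Cofola syntax), the template is invented, and the exact answer comes from brute-force enumeration rather than a WFOMC solver.

```python
from dataclasses import dataclass
from itertools import permutations
from math import factorial
from typing import Tuple


@dataclass(frozen=True)
class SeatingSpec:
    """Hypothetical problem spec (not Cofola syntax): seat people in a row."""
    people: Tuple[str, ...]
    apart: Tuple[str, str]  # these two people must not sit next to each other


def verbalize(spec: SeatingSpec) -> str:
    """Render the spec as a natural-language question from a fixed template."""
    names = ", ".join(spec.people)
    a, b = spec.apart
    return (
        f"{len(spec.people)} people ({names}) sit in a row. "
        f"In how many ways can they be seated if {a} and {b} are not adjacent?"
    )


def solve(spec: SeatingSpec) -> int:
    """Exact answer by enumeration; a stand-in for a real model-counting solver."""
    a, b = spec.apart
    count = 0
    for order in permutations(spec.people):
        if abs(order.index(a) - order.index(b)) != 1:
            count += 1
    return count


def generate(n: int) -> Tuple[str, int]:
    """Scale difficulty through the number of entities n (requires n >= 2)."""
    people = tuple(f"P{i}" for i in range(1, n + 1))
    spec = SeatingSpec(people=people, apart=(people[0], people[1]))
    return verbalize(spec), solve(spec)


question, answer = generate(5)
# Cross-check against the closed form n! - 2 * (n - 1)! for this constraint.
assert answer == factorial(5) - 2 * factorial(4)
print(question)
print(answer)  # 72
```

Because the question text and the answer are both derived from the same specification, a fresh instance can be produced for every evaluation run, which is what makes a dynamic benchmark resistant to memorized answers.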

In practice

Use CombEval to diagnose LLM weaknesses in combinatorial reasoning.
Test LLMs in code-augmented settings on complex combinatorial structures.
Focus on improving LLM handling of ordered objects and nested dependencies.
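
One way to act on these points is to score model outputs by exact match against solver answers and break accuracy down by problem category. The sketch below assumes a simple convention (the last integer in the response is the model's final answer) and uses invented toy responses; the function names, bucket labels, and records are all hypothetical and do not reproduce CombEval's scoring code or any reported results.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def extract_final_integer(response: str) -> Optional[int]:
    """Return the last integer in a model response, or None if there is none."""
    matches = re.findall(r"-?\d[\d,]*", response)
    if not matches:
        return None
    # Strip thousands separators such as "1,200" before converting.
    return int(matches[-1].replace(",", ""))


def accuracy_by_bucket(
    records: Iterable[Tuple[str, str, int]]
) -> Dict[str, float]:
    """records holds (bucket, model_response, solver_answer) triples."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for bucket, response, gold in records:
        totals[bucket] += 1
        if extract_final_integer(response) == gold:
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Toy records invented for illustration only.
records = [
    ("distinct", "There are 5! = 120 orderings, so the answer is 120.", 120),
    ("indistinguishable", "6!/(3!*2!) gives 60", 60),
    ("indistinguishable", "All 6 letters differ, so 720", 60),
]
report = accuracy_by_bucket(records)
print(report)  # {'distinct': 1.0, 'indistinguishable': 0.5}
```

Grouping by category rather than reporting a single score is what turns the benchmark into a diagnostic: a drop confined to one bucket points at a specific weakness such as indistinguishable elements.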

Topics

Large Language Models
Combinatorial Counting
Dynamic Benchmarking
Cofola Language
Mathematical Reasoning
Constraint Satisfaction

Code references

YuxuZhou-CN/combination-problem-generation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.