CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models
Summary
CombEval is a dynamic benchmark designed to evaluate combinatorial counting capabilities in large language models. It generates natural-language problems from typed Cofola specifications, ensuring solver-verified exact answers and enabling systematic control over problem difficulty by varying object type, entity scale, constraint count, and reasoning depth. The framework was used to evaluate 11 LLMs, including open-source models like LLaMA-3-8B-Instruct and closed-source models such as gpt-5.5 and gemini-3-flash-preview-thinking. Results indicate that while larger models show improved accuracy, all models remain brittle on tasks involving ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies. Error analysis highlights failures in constraint interpretation and fundamental counting principles, confirming CombEval's utility as a diagnostic testbed.
Key takeaway
AI scientists and machine learning engineers evaluating LLMs for mathematical reasoning should prioritize benchmarks like CombEval that generate problems dynamically, which reduces the risk of data contamination and spurious reasoning. Focus model development on robustness to ordered objects, indistinguishable elements, and multi-step dependencies, where current state-of-the-art models remain brittle. Consider code-augmented reasoning as a diagnostic tool, but recognize its limitations for smaller models.
Key insights
Despite general gains in mathematical reasoning, LLMs still struggle with combinatorial counting, especially under complex constraints and nested reasoning.
Principles
- Combinatorial problem difficulty scales predictably with entity scale, constraint count, and reasoning depth.
- LLM performance on combinatorial tasks degrades with ordered objects and multi-step dependencies.
- Dynamic benchmark generation mitigates data contamination and spurious reasoning issues.
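To make the first two principles concrete, here is a small worked example (not taken from the paper) showing how the same five entities yield very different counts once order matters or some elements become indistinguishable. The function names are illustrative only; the closed-form count is cross-checked against brute-force enumeration.

```python
from collections import Counter
from itertools import permutations
from math import comb, factorial, perm


def count_distinct_arrangements(items):
    """Multinomial count: n! divided by the factorial of each multiplicity."""
    total = factorial(len(items))
    for multiplicity in Counter(items).values():
        total //= factorial(multiplicity)
    return total


def brute_force_arrangements(items):
    """Enumerate every ordering and keep only the distinct ones."""
    return len(set(permutations(items)))


# Choosing 3 of 5 distinct people as an unordered committee.
unordered = comb(5, 3)   # 10
# Seating 3 of 5 distinct people in a row: order now matters.
ordered = perm(5, 3)     # 60
# Arranging the letters A, A, B, B, C: some elements are indistinguishable.
multiset = count_distinct_arrangements("AABBC")  # 5! / (2! * 2! * 1!) = 30

print(unordered, ordered, multiset)
```

A model that ignores order or treats repeated elements as distinct lands on a different number in each case, which is the kind of error these principles describe.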
Method
CombEval generates problems from typed Cofola specifications, verbalizes them with templates, and verifies exact answers with a solver based on weighted first-order model counting (WFOMC). Varying the specification gives dynamic control over difficulty.
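The generate, verbalize, and verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the framework's implementation: `LineupSpec` is a hypothetical stand-in for a typed specification and is not Cofola syntax, the template is invented for this example, and brute-force enumeration replaces the WFOMC-based solver (so it only works at small entity scales).

```python
import random
from dataclasses import dataclass
from itertools import permutations

NAMES = ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank", "Grace", "Heidi"]


@dataclass(frozen=True)
class LineupSpec:
    """Hypothetical spec: n_people in a line plus 'a left of b' constraints."""
    n_people: int          # entity scale
    left_of: tuple = ()    # pairs (a, b): person a stands left of person b


def generate_spec(n_people, n_constraints, rng):
    """Sample a spec at the requested difficulty (always satisfiable)."""
    max_pairs = n_people * (n_people - 1) // 2
    if not 2 <= n_people <= len(NAMES) or not 0 <= n_constraints <= max_pairs:
        raise ValueError("unsupported difficulty setting")
    pairs = set()
    while len(pairs) < n_constraints:
        # Sorting each pair keeps the lineup 0, 1, 2, ... valid for every spec.
        a, b = sorted(rng.sample(range(n_people), 2))
        pairs.add((a, b))
    return LineupSpec(n_people, tuple(sorted(pairs)))


def verbalize(spec):
    """Render the spec as a natural-language problem from a fixed template."""
    people = ", ".join(NAMES[: spec.n_people])
    text = f"{spec.n_people} people ({people}) stand in a line."
    for a, b in spec.left_of:
        text += f" {NAMES[a]} must stand somewhere to the left of {NAMES[b]}."
    return text + " How many different lineups are possible?"


def exact_answer(spec):
    """Brute-force stand-in for the solver: count lineups meeting every constraint."""
    count = 0
    for lineup in permutations(range(spec.n_people)):
        position = {person: i for i, person in enumerate(lineup)}
        if all(position[a] < position[b] for a, b in spec.left_of):
            count += 1
    return count


spec = LineupSpec(n_people=4, left_of=((0, 1), (1, 2)))
print(verbalize(spec))
print(exact_answer(spec))  # 4: only 4!/3! lineups keep Alice, Bob, Carol in order
```

Because problems are sampled rather than stored, calling `generate_spec(5, 3, random.Random(seed))` with fresh seeds produces new instances on demand, and raising `n_people` or `n_constraints` raises difficulty along one axis at a time.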
In practice
- Use CombEval to diagnose LLM weaknesses in combinatorial reasoning.
- Test LLMs in code-augmented settings on complex combinatorial structures.
- Focus on improving LLM handling of ordered objects and nested dependencies.
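One way to act on the points above is to score model responses by exact match against the verified answer and break accuracy down by a difficulty axis such as object type. The sketch below assumes a hypothetical record format (`response`, `answer`, plus one key per axis) and a simple "last integer in the response" extraction rule; neither comes from the paper, and the records are toy data, not reported results.

```python
import re
from collections import defaultdict


def extract_final_integer(text):
    """Return the last integer in a response, or None if there is none."""
    matches = re.findall(r"-?\d[\d,]*", text)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def accuracy_by_axis(records, axis):
    """Exact-match accuracy grouped by the value of one difficulty axis."""
    tallies = defaultdict(lambda: [0, 0])  # axis value -> [correct, total]
    for record in records:
        tally = tallies[record[axis]]
        tally[0] += extract_final_integer(record["response"]) == record["answer"]
        tally[1] += 1
    return {value: correct / total for value, (correct, total) in tallies.items()}


# Toy records for illustration only.
records = [
    {"object_type": "set", "answer": 10, "response": "There are 10 ways."},
    {"object_type": "sequence", "answer": 60, "response": "I get 5*4*3 = 20."},
    {"object_type": "sequence", "answer": 30, "response": "The answer is 30"},
]

print(accuracy_by_axis(records, "object_type"))  # {'set': 1.0, 'sequence': 0.5}
```

Grouping by a single axis at a time makes it easier to see whether failures concentrate on ordered objects, larger entity scales, or deeper dependencies, rather than reading one aggregate score.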
Topics
- Large Language Models
- Combinatorial Counting
- Dynamic Benchmarking
- Cofola Language
- Mathematical Reasoning
- Constraint Satisfaction
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.