CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

Source: cs.CL updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

CombEval is a dynamic benchmark for evaluating combinatorial counting in large language models. It generates natural-language problems from typed Cofola specifications, so every problem comes with a solver-verified exact answer, and it gives systematic control over difficulty by varying object type, entity scale, constraint count, and reasoning depth. The framework was used to evaluate 11 LLMs, including open-source models such as LLaMA-3-8B-Instruct and closed-source models such as gpt-5.5 and gemini-3-flash-preview-thinking. Results indicate that larger models are more accurate, but all models remain brittle on tasks involving ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies. Error analysis traces failures to misread constraints and misapplied fundamental counting principles, supporting CombEval's use as a diagnostic testbed.

Key takeaway

AI scientists and machine learning engineers who evaluate LLMs for mathematical reasoning should prioritize benchmarks like CombEval that generate problems dynamically, which reduces the risk of data contamination and of models succeeding through spurious reasoning. Focus model development on robustness to ordered objects, indistinguishable elements, and multi-step dependencies, where current state-of-the-art models remain brittle. Code-augmented reasoning is worth considering as a diagnostic tool, though it has limitations for smaller models.

Key insights

Despite general gains in mathematical reasoning, LLMs still struggle with combinatorial counting, especially when problems involve complex constraints and nested reasoning.
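
To make the "indistinguishable elements" failure mode concrete, here is a small illustrative sketch in Python. It is not taken from the paper: the function names and the example multiset are assumptions chosen for illustration. It shows how the number of arrangements changes when repeated items are treated as identical, using the standard multinomial formula n! / (k1! · k2! · ...) and checking it against brute-force enumeration.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def arrangements_distinguishable(items):
    """Count orderings when every item is treated as unique: n!."""
    return factorial(len(items))


def arrangements_indistinguishable(items):
    """Count distinct orderings when equal labels are identical.

    Uses the multinomial formula n! / (k1! * k2! * ...), where each k
    is the multiplicity of one label.
    """
    total = factorial(len(items))
    for multiplicity in Counter(items).values():
        total //= factorial(multiplicity)
    return total


def arrangements_by_enumeration(items):
    """Brute-force check: a set collapses orderings that look the same."""
    return len(set(permutations(items)))


letters = "AABBC"  # illustrative multiset: two A's, two B's, one C
print(arrangements_distinguishable(letters))    # 120
print(arrangements_indistinguishable(letters))  # 30
print(arrangements_by_enumeration(letters))     # 30
```

Treating the two A's and two B's as distinct overcounts by a factor of 2! · 2! = 4, which is the kind of slip in basic counting principles that the error analysis describes.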

Method

CombEval generates each problem from a typed Cofola specification, verbalizes it into natural language with templates, and verifies the exact answer with a WFOMC-based solver. Because problems are generated rather than collected, their difficulty can be controlled dynamically.
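
The generate, verbalize, and verify loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not CombEval's implementation: the `Spec` class is a hypothetical stand-in and does not use Cofola syntax, the template is invented, and brute-force enumeration is swapped in for the WFOMC-based solver, so it only works for small instances.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Spec:
    """Hypothetical stand-in for a typed specification (not Cofola syntax)."""
    items: tuple         # multiset of colors; repeated colors are identical balls
    not_adjacent: tuple  # pair of colors that may not sit next to each other


def verbalize(spec):
    """Render the specification as a natural-language problem via a template."""
    counts = {}
    for color in spec.items:
        counts[color] = counts.get(color, 0) + 1
    inventory = ", ".join(
        f"{n} {color} ball{'s' if n > 1 else ''}"
        for color, n in sorted(counts.items())
    )
    a, b = spec.not_adjacent
    return (
        f"A row is formed from {inventory}. Balls of the same color are "
        f"identical. How many distinct rows have no {a} ball next to a {b} ball?"
    )


def exact_count(spec):
    """Exact answer by enumeration (a stand-in for a real counting solver)."""
    a, b = spec.not_adjacent
    rows = set(permutations(spec.items))  # set merges identical-looking rows

    def satisfies(row):
        # Reject any neighboring pair made of exactly the two forbidden colors.
        return all({x, y} != {a, b} for x, y in zip(row, row[1:]))

    return sum(1 for row in rows if satisfies(row))


spec = Spec(items=("red", "red", "blue", "green"), not_adjacent=("red", "blue"))
print(verbalize(spec))
print(exact_count(spec))  # 2
```

Here only the rows blue-green-red-red and red-red-green-blue satisfy the constraint. In this sketch, difficulty could be raised by adding items (entity scale) or extending `Spec` with more constraints, which loosely mirrors the difficulty axes the summary lists.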

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.