QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation
Summary
QMFOL is an automated framework for generating quantifiable Monadic First-Order Logic (MFOL) reasoning tasks. It addresses two limitations of existing benchmarks: weak control over logical complexity, and difficulty balancing semantic diversity with logical consistency. The framework constructs formal logical structures using conjunction and disjunction patterns, then translates them into natural language via LLMs, with round-trip verification by an external prover. QMFOLBench, built with QMFOL, comprises 2880 instances across 960 configurations, spanning four depth levels, four width levels, three label types, five distractor levels, and four topic domains. Evaluations on six Large Reasoning Models (LRMs) and two LLMs, including Gemini-3.1-Pro and GPT-5.4-High, showed that model performance degrades and computational overhead increases as logical complexity rises. Models performed better on True-labeled tasks than on False- or Unknown-labeled ones, and were sensitive to semantic variation.
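The configuration count follows from multiplying the five dimensions: 4 × 4 × 3 × 5 × 4 = 960. The sketch below checks that arithmetic by enumerating the grid. The concrete level values and topic names are placeholders, since the summary gives only how many levels each dimension has, and the figure of three instances per configuration is inferred from 2880 / 960 rather than stated.

```python
from itertools import product

# Placeholder level values: the summary states only the number of levels
# per dimension, not what the levels are.
depths = range(1, 5)                              # four depth levels
widths = range(1, 5)                              # four width levels
labels = ("True", "False", "Unknown")             # three label types
distractor_levels = range(5)                      # five distractor levels
topics = tuple(f"topic_{i}" for i in range(4))    # four topic domains (names hypothetical)

# Inferred from 2880 / 960; not stated explicitly in the summary.
INSTANCES_PER_CONFIG = 3

configs = list(product(depths, widths, labels, distractor_levels, topics))

print(len(configs))                          # 960
print(len(configs) * INSTANCES_PER_CONFIG)   # 2880
```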
Key takeaway
If you are an AI scientist or machine learning engineer evaluating LLM reasoning capabilities, adopt benchmarks that offer fine-grained control over logical complexity, such as QMFOL. This lets you pinpoint model weaknesses along specific dimensions such as reasoning depth, width, and label type (True, False, Unknown). Incorporate automated logical consistency verification into your data generation pipelines to ensure reliable evaluation, especially when translating formal logic into natural language.
Key insights
QMFOL provides a verifiable framework for generating logically complex, semantically diverse deductive reasoning benchmarks for LLMs.
Principles
- Logical complexity can be precisely controlled via depth, width, and distractors.
- Round-trip verification ensures logical consistency in NL translation.
- Model performance varies significantly with logical complexity and label type.
Method
QMFOL constructs MFOL tasks by expanding depth and width, generating fact-conclusion pairs, and adding distractors. It then translates these tasks into natural language (NL) using LLMs, and verifies consistency by converting the NL back to first-order logic (NL2FOL) and checking the result with an external theorem prover.
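To make the depth, width, distractor, and label knobs concrete, here is a deliberately simplified sketch and not QMFOL's actual generator. It assumes a single entity, treats "depth" as the number of chained rule layers and "width" as the number of conjoined premises per rule, models negation as a separate `~`-prefixed atom, and uses naive forward chaining as a stand-in for the external theorem prover. All function and predicate names are hypothetical.

```python
def build_task(depth, width, n_distractors, target="True"):
    """Toy task: `depth` rule layers, each rule with `width` conjoined premises.

    Atoms are plain strings about one implicit entity; this interpretation
    of depth/width is an assumption made for illustration only.
    """
    facts = {f"P0_{j}" for j in range(width)}
    rules = []
    for d in range(depth):
        premises = frozenset(f"P{d}_{j}" for j in range(width))
        # For a False-labeled task, the last layer derives the negated atoms.
        negate = target == "False" and d == depth - 1
        for j in range(width):
            head = f"P{d + 1}_{j}"
            rules.append((premises, "~" + head if negate else head))
    if target == "Unknown":
        facts.discard("P0_0")  # break the chain so the goal is underivable
    for k in range(n_distractors):
        # Distractors fire, but only derive atoms irrelevant to the goal.
        facts.add(f"D{k}")
        rules.append((frozenset({f"D{k}"}), f"E{k}"))
    return facts, rules, f"P{depth}_0"


def forward_chain(facts, rules):
    """Apply rules until no new atom can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, head in rules:
            if premises <= known and head not in known:
                known.add(head)
                changed = True
    return known


def label(facts, rules, goal):
    """Stand-in for the prover: True, False, or Unknown for the goal atom."""
    known = forward_chain(facts, rules)
    if goal in known:
        return "True"
    if "~" + goal in known:
        return "False"
    return "Unknown"


for target in ("True", "False", "Unknown"):
    print(target, label(*build_task(depth=3, width=2, n_distractors=2, target=target)))
```

In a round-trip setting, the same `label` check would be run on the formal structure recovered from the NL text, and the instance kept only if the recovered label matches the intended one.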
In practice
- Use MFOL for fine-grained control over reasoning task difficulty.
- Implement round-trip verification for LLM-generated logical content.
- Evaluate models on True, False, and Unknown labels separately.
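Reporting accuracy per gold label, as the last point suggests, takes only a few lines. This is a minimal sketch assuming predictions are available as (gold, predicted) pairs; the function name and the sample records are illustrative, not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_label(records):
    """Return accuracy per gold label from (gold, predicted) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for gold, predicted in records:
        totals[gold] += 1
        correct[gold] += int(gold == predicted)
    return {lab: correct[lab] / totals[lab] for lab in totals}


# Made-up records, purely for illustration.
records = [
    ("True", "True"),
    ("True", "True"),
    ("False", "True"),
    ("False", "False"),
    ("Unknown", "Unknown"),
    ("Unknown", "False"),
]

for lab, acc in accuracy_by_label(records).items():
    print(f"{lab}: {acc:.2f}")
```

A single aggregate accuracy over these records would be 0.67, which hides the gap between the True label and the other two.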
Topics
- Large Language Models
- Deductive Reasoning
- Monadic First-Order Logic
- Benchmark Generation
- Logical Complexity
- Automated Theorem Provers
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.