QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

2026-06-19 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

QMFOL is an automated framework for generating quantifiable Monadic First-Order Logic (MFOL) reasoning tasks. It targets two limitations of existing benchmarks: weak control over logical complexity, and difficulty balancing semantic diversity with logical consistency. The framework constructs formal logical structures from conjunction and disjunction patterns, then translates them into natural language with LLMs, using round-trip verification by an external prover. The resulting benchmark, QMFOLBench, comprises 2880 instances across 960 configurations (four depth levels × four width levels × three label types × five distractor levels × four topic domains). Evaluations on six Large Reasoning Models (LRMs) and two LLMs, including Gemini-3.1-Pro and GPT-5.4-High, showed that accuracy degrades and computational overhead increases as logical complexity rises. Models performed better on True-labeled tasks than on False- or Unknown-labeled ones, and were sensitive to semantic variation.
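The configuration count follows directly from the five dimensions listed above. The sketch below reproduces that arithmetic; only the number of levels per dimension comes from the summary, while the concrete level values and topic names are placeholders assumed for illustration.

```python
from itertools import product

# Level counts are taken from the summary; the values themselves are assumed placeholders.
depths = range(1, 5)                                    # four depth levels
widths = range(1, 5)                                    # four width levels
labels = ("True", "False", "Unknown")                   # three label types
distractor_levels = range(5)                            # five distractor levels
topics = ("topic_a", "topic_b", "topic_c", "topic_d")   # four topic domains (names hypothetical)

# Every combination of one level per dimension is one configuration.
configs = list(product(depths, widths, labels, distractor_levels, topics))
n_configs = len(configs)

# 2880 instances spread evenly over the grid would give this many per configuration.
instances_per_config = 2880 // n_configs

print(n_configs, instances_per_config)
```

If the instances are spread evenly, this works out to three per configuration; the summary itself does not state how they are distributed.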

Key takeaway

AI Scientists and Machine Learning Engineers evaluating LLM reasoning capabilities should adopt benchmarks that offer fine-grained control over logical complexity, such as QMFOL. This control makes it possible to pinpoint model weaknesses along dimensions such as reasoning depth, reasoning width, and label type (True, False, Unknown). Incorporate automated logical consistency verification into your data generation pipelines to ensure reliable evaluation, especially when translating formal logic into natural language.

Key insights

QMFOL provides a verifiable framework for generating logically complex, semantically diverse deductive reasoning benchmarks for LLMs.

Principles

Logical complexity can be precisely controlled via depth, width, and distractors.
Round-trip verification ensures logical consistency in natural-language translation.
Model performance varies significantly with logical complexity and label type.
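The round-trip principle can be sketched as a filter over candidate translations. This is one plausible reading, not the paper's pipeline: it assumes a candidate is kept only when the text, converted back to logic, leads the prover to the intended label. The names `filter_consistent`, `nl_to_fol`, and `prove` are hypothetical, and the stubs stand in for an LLM-based converter and an external theorem prover.

```python
def filter_consistent(candidates, nl_to_fol, prove):
    """Keep candidates whose back-translated logic yields the intended label."""
    kept = []
    for cand in candidates:
        recovered = nl_to_fol(cand["nl"])  # natural language -> formula (hypothetical callable)
        if recovered is not None and prove(recovered) == cand["label"]:
            kept.append(cand)
    return kept

# Stubs for illustration only: dictionary lookups replace the real converter and prover.
stub_nl_to_fol = {"faithful text": "F1", "drifted text": "F2"}.get
stub_prove = {"F1": "True", "F2": "Unknown"}.get

candidates = [
    {"nl": "faithful text", "label": "True"},
    {"nl": "drifted text", "label": "True"},   # translation changed the meaning
    {"nl": "unparseable text", "label": "False"},
]
kept = filter_consistent(candidates, stub_nl_to_fol, stub_prove)
print([c["nl"] for c in kept])
```

Only the first candidate survives: the second no longer supports its label after back-translation, and the third cannot be converted at all.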

Method

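QMFOL constructs MFOL tasks by expanding a logical structure to a target depth and width, generating fact-conclusion pairs, and adding distractors. It then translates the tasks into natural language using LLMs, and verifies consistency by converting the text back to first-order logic (NL2FOL) and checking the result with an external theorem prover.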
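To make the roles of depth, width, and distractors concrete, here is a toy generator under strong simplifying assumptions. It is not the paper's algorithm: it covers only conjunctive rules over a single individual, reads depth as the length of the rule chain and width as the number of conjuncts per rule, and replaces the external theorem prover with simple forward chaining. All names (`build_task`, `forward_chain`, `label`, and the predicate names) are hypothetical.

```python
import random

def build_task(depth, width, n_distractors, seed=0):
    """Build facts and rules; each rule (body, head) reads: for all x, body(x) implies head(x)."""
    rng = random.Random(seed)
    facts = {"C0"}
    rules = []
    for i in range(1, depth + 1):
        # Each chain step needs the previous conclusion plus (width - 1) side facts.
        side = [f"S{i}_{j}" for j in range(1, width)]
        facts.update(side)
        rules.append(((f"C{i - 1}", *side), f"C{i}"))
    for k in range(1, n_distractors + 1):
        # Distractor rules never fire because their premise U_k is not a fact.
        rules.append(((f"U{k}",), f"V{k}"))
    rng.shuffle(rules)  # mix distractors in with the relevant rules
    return facts, rules, f"C{depth}"

def forward_chain(facts, rules):
    """Apply rules until no new predicate can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in derived and all(p in derived for p in body):
                derived.add(head)
                changed = True
    return derived

def label(pred, positive, facts, rules):
    """Label a query literal (pred, or its negation when positive is False)."""
    if pred in forward_chain(facts, rules):
        return "True" if positive else "False"
    return "Unknown"  # with positive-only rules, underived predicates are undetermined

facts, rules, goal = build_task(depth=3, width=2, n_distractors=2)
print(label(goal, True, facts, rules), label(goal, False, facts, rules), label("V1", True, facts, rules))
```

The three queries illustrate the three label types: the chain's final conclusion is True, its negation is False, and a distractor's conclusion is Unknown because nothing in the premises settles it either way.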

In practice

Use MFOL for fine-grained control over reasoning task difficulty.
Implement round-trip verification for LLM-generated logical content.
Evaluate models on True, False, and Unknown labels separately.
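Reporting accuracy separately per gold label, as the last item suggests, can be done with a small helper like the one below. The function name and the record format are assumptions, and the sample records are invented purely to exercise the code; they are not results from the paper.

```python
from collections import defaultdict

def accuracy_by_label(records):
    """records: iterable of (gold_label, predicted_label) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in records:
        total[gold] += 1
        correct[gold] += int(gold == pred)
    return {lab: correct[lab] / total[lab] for lab in total}

# Invented sample predictions, for illustration only.
records = [
    ("True", "True"), ("True", "True"),
    ("False", "True"), ("False", "False"),
    ("Unknown", "True"), ("Unknown", "Unknown"),
]
scores = accuracy_by_label(records)
print(scores)
```

A single pooled accuracy would hide the pattern this breakdown exposes, such as a model that defaults to answering True.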

Topics

Large Language Models
Deductive Reasoning
Monadic First-Order Logic
Benchmark Generation
Logical Complexity
Automated Theorem Provers

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.