QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

· Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

QMFOL is an automated framework for generating Monadic First-Order Logic (MFOL) reasoning tasks with quantifiable complexity. It targets two limitations of existing benchmarks: weak control over logical complexity, and difficulty balancing semantic diversity with logical consistency. The framework builds formal logical structures from conjunction and disjunction patterns, then translates them into natural language with LLMs, using round-trip verification by an external prover to check the translation. QMFOLBench, built with QMFOL, comprises 2880 instances across 960 configurations (4 depth levels × 4 width levels × 3 label types × 5 distractor levels × 4 topic domains). Evaluations of six Large Reasoning Models (LRMs) and two LLMs, including Gemini-3.1-Pro and GPT-5.4-High, showed that model performance degrades and computational overhead increases as logical complexity rises. Models also performed better on True-labeled tasks than on False- or Unknown-labeled ones, and were sensitive to semantic variation.
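
The configuration count follows directly from the five dimensions listed above. The sketch below enumerates the grid to make the arithmetic explicit. The level values and topic names are placeholders (the summary gives only the number of levels per dimension), and the uniform three instances per configuration is inferred from 2880 / 960 rather than stated in the source.

```python
from itertools import product

# Placeholder level values: only the number of levels per dimension
# comes from the summary; the concrete values here are assumptions.
DIMENSIONS = {
    "depth": [1, 2, 3, 4],
    "width": [1, 2, 3, 4],
    "label": ["True", "False", "Unknown"],
    "distractors": [0, 1, 2, 3, 4],
    "topic": ["topic_a", "topic_b", "topic_c", "topic_d"],
}


def enumerate_configurations(dimensions):
    """Return one dict per point in the Cartesian product of all levels."""
    names = list(dimensions)
    return [dict(zip(names, values)) for values in product(*dimensions.values())]


configs = enumerate_configurations(DIMENSIONS)

# Assumes instances are spread evenly over configurations.
INSTANCES_PER_CONFIG = 2880 // len(configs)

print(len(configs), INSTANCES_PER_CONFIG)
```

Under these assumptions the grid has 4 × 4 × 3 × 5 × 4 = 960 cells, and an even split gives three instances per cell.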

Key takeaway

AI scientists and machine learning engineers evaluating LLM reasoning should adopt benchmarks that offer fine-grained control over logical complexity, such as those generated with QMFOL. Controlling each dimension separately makes it possible to pinpoint model weaknesses by reasoning depth, reasoning width, and label type (True, False, Unknown). When a data generation pipeline translates formal logic into natural language, add automated logical consistency verification so that the resulting evaluation is reliable.
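
As one way to act on this, per-dimension accuracy can be computed by grouping evaluation records on a single benchmark dimension. This is a minimal sketch: the record fields (`depth`, `label`, `predicted`) and the helper `accuracy_by` are hypothetical, and the records are made-up placeholders, not results from the paper.

```python
from collections import defaultdict


def accuracy_by(records, dimension):
    """Return {level: accuracy} for one benchmark dimension."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        level = rec[dimension]
        totals[level] += 1
        correct[level] += int(rec["predicted"] == rec["label"])
    return {level: correct[level] / totals[level] for level in totals}


# Made-up placeholder records for illustration only.
records = [
    {"depth": 1, "label": "True", "predicted": "True"},
    {"depth": 1, "label": "Unknown", "predicted": "Unknown"},
    {"depth": 4, "label": "False", "predicted": "True"},
    {"depth": 4, "label": "True", "predicted": "True"},
]

by_depth = accuracy_by(records, "depth")
by_label = accuracy_by(records, "label")
print(by_depth)
print(by_label)
```

Running the same grouping once per dimension gives a separate accuracy profile for depth, width, label type, distractor level, and topic.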

Key insights

QMFOL provides a verifiable framework for generating logically complex, semantically diverse deductive reasoning benchmarks for LLMs.

Method

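QMFOL constructs MFOL tasks in three steps: it expands the logical structure to the target depth and width, generates fact-conclusion pairs, and adds distractors. It then uses LLMs to translate the tasks into natural language (NL) and verifies consistency by converting the NL back to first-order logic (NL2FOL) and checking the result with an external theorem prover.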
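
The labeling and round-trip idea can be illustrated with a toy checker. This is a sketch under stated assumptions, not the paper's implementation: it considers a single individual, so each monadic predicate reduces to a propositional atom, and it replaces the external theorem prover with brute-force enumeration of truth assignments. The formula encoding, the function names, and the hand-written "recovered" premises standing in for NL2FOL output are all hypothetical.

```python
from itertools import product

# Formulas are atom names (str) or tuples: ("not", f), ("and", f, g, ...),
# ("or", f, g, ...), ("implies", f, g).


def atoms(f):
    """Collect the atom names used in a formula."""
    if isinstance(f, str):
        return {f}
    return set().union(*(atoms(sub) for sub in f[1:]))


def holds(f, world):
    """Evaluate a formula under a truth assignment (dict: atom -> bool)."""
    if isinstance(f, str):
        return world[f]
    op, *args = f
    if op == "not":
        return not holds(args[0], world)
    if op == "and":
        return all(holds(a, world) for a in args)
    if op == "or":
        return any(holds(a, world) for a in args)
    if op == "implies":
        return (not holds(args[0], world)) or holds(args[1], world)
    raise ValueError(f"unknown operator: {op}")


def label(premises, conclusion):
    """Return "True", "False", or "Unknown" by checking every model of the premises."""
    names = sorted(set().union(atoms(conclusion), *(atoms(p) for p in premises)))
    outcomes = set()
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if all(holds(p, world) for p in premises):
            outcomes.add(holds(conclusion, world))
    if not outcomes:
        raise ValueError("premises are inconsistent")
    if outcomes == {True}:
        return "True"
    if outcomes == {False}:
        return "False"
    return "Unknown"


def round_trip_consistent(premises, conclusion, recovered_premises, recovered_conclusion):
    """Accept a translation only if the recovered logic yields the same label."""
    return label(premises, conclusion) == label(recovered_premises, recovered_conclusion)


premises = [
    "Student",
    "Diligent",
    ("implies", ("and", "Student", "Diligent"), "Passes"),
    ("implies", "Tall", "Athlete"),  # distractor: irrelevant to "Passes"
]

print(label(premises, "Passes"))
print(label(premises, ("not", "Passes")))
print(label(premises, "Athlete"))

# A hypothetical faulty translation that lost the fact "Diligent".
lossy = [p for p in premises if p != "Diligent"]
print(round_trip_consistent(premises, "Passes", lossy, "Passes"))
```

Comparing labels is a weaker test than checking that the original and recovered formulas are logically equivalent; it is used here only to keep the example short. Enumeration is also exponential in the number of atoms, which is why a real pipeline would rely on a dedicated prover.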

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.