The Path Not Taken: Duality in Reasoning about Program Execution

2025-08-20 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

DexBench is a new benchmark that evaluates how well Large Language Models (LLMs) understand program execution by testing "duality" in reasoning. It moves beyond single-path evaluations, which are prone to data contamination and offer only a narrow view of dynamic code reasoning. DexBench comprises 445 paired instances derived from real-world Python programs in the CruxEval, HumanEval, and PythonSaga datasets. Each pair poses two complementary tasks: forward reasoning (predicting the observed behavior of a program on a given input, specifically its code coverage) and backward reasoning (inferring the input mutation needed to achieve a counterfactual behavioral objective, such as reaching an uncovered branch). Experiments with 13 LLMs, including Jamba Reasoning-3B, Llama-3.1, Gemini 2.5 Flash, and GPT-5 Mini, indicate that dual-path reasoning is a robust and discriminative proxy for dynamic code understanding, and that strong performance on either task in isolation does not guarantee success in joint evaluation.

Key takeaway

Research scientists evaluating coding LLMs should adopt dual-path reasoning benchmarks like DexBench to gain a more robust and holistic picture of model capabilities. Relying solely on single-path evaluations risks overlooking critical gaps in causal reasoning, because strong performance on isolated tasks does not necessarily translate to joint success. Prioritize benchmarks that require models to reason about both observed execution and counterfactual scenarios, so that the evaluation probes a deeper, state-aware comprehension of program flow.
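The gap between isolated and joint performance can be illustrated with a little arithmetic. The sketch below is not DexBench's scoring code: it assumes, purely for illustration, that a pair counts as jointly solved only when both its forward and backward tasks are answered correctly. The `PairResult` class, the `accuracies` function, and the sample results are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Outcome of one paired instance (hypothetical record, not DexBench's format)."""
    forward_ok: bool   # forward task (coverage prediction) answered correctly
    backward_ok: bool  # backward task (input mutation) answered correctly


def accuracies(results):
    """Return (forward, backward, joint) accuracy over a list of PairResult.

    Assumption: joint success means both tasks of the same pair are correct.
    """
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    forward = sum(r.forward_ok for r in results) / n
    backward = sum(r.backward_ok for r in results) / n
    joint = sum(r.forward_ok and r.backward_ok for r in results) / n
    return forward, backward, joint


# Made-up outcomes for four pairs, chosen only to show the effect.
results = [
    PairResult(True, True),
    PairResult(True, False),
    PairResult(False, True),
    PairResult(True, True),
]
forward, backward, joint = accuracies(results)
print(forward, backward, joint)  # 0.75 0.75 0.5
```

In this toy data each task looks 75% solved on its own, yet only half of the pairs are solved jointly, because the failures fall on different pairs.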

Key insights

True program execution understanding requires dual-path reasoning, evaluating both observed and counterfactual behaviors.

Principles

Single-path evaluations offer a narrow view of program understanding.
Dual-path reasoning provides a robust proxy for causal code understanding.
Model scaling and reasoning-focused training do not guarantee dual-path reasoning improvements.

Method

DexBench evaluates LLMs using paired forward (code coverage prediction) and backward (branch-targeted input mutation) reasoning tasks, requiring models to maintain a consistent causal representation of program behavior across execution and counterfactual paths.
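To make the two tasks concrete, here is a minimal sketch of how a forward and a backward answer could be checked against a real execution. It is an illustration under stated assumptions, not the DexBench harness: the toy program `classify`, the helpers `covered_lines`, `forward_correct`, and `backward_correct`, and the use of line offsets as coverage units are all hypothetical. Coverage is collected with Python's standard `sys.settrace` hook.

```python
import sys


def classify(n):            # offset 0 (toy program, not from the benchmark)
    if n < 0:               # offset 1
        return "negative"   # offset 2
    if n % 2 == 0:          # offset 3
        return "even"       # offset 4
    return "odd"            # offset 5


def covered_lines(func, *args):
    """Run func(*args) and return the executed line offsets, relative to its def line."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace frames of other functions
        if event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return hit


def forward_correct(func, given_input, predicted_offsets):
    """Forward task: the predicted coverage must match the observed coverage."""
    return set(predicted_offsets) == covered_lines(func, given_input)


def backward_correct(func, original_input, mutated_input, target_offset):
    """Backward task: the mutated input must reach a line the original input missed."""
    return (target_offset not in covered_lines(func, original_input)
            and target_offset in covered_lines(func, mutated_input))


# Forward: for input 3, a model should predict that offsets 1, 3 and 5 execute.
print(forward_correct(classify, 3, {1, 3, 5}))   # True
# Backward: mutating the input 3 into 4 reaches the uncovered "even" branch.
print(backward_correct(classify, 3, 4, 4))       # True
```

Both checks run the same program, so a model that answers them consistently has to track the same branch conditions in both directions, which is the property the paired design targets.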

In practice

Use dual-path benchmarks for comprehensive LLM code reasoning evaluation.
Focus on causal understanding of execution flow, not just output prediction.
Consider input mutation tasks over input generation for deeper reasoning assessment.

Topics

LLM Code Reasoning
Program Execution Duality
DexBench Benchmark
Forward Reasoning
Counterfactual Reasoning

Code references

sail-ucf/dexbench

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.