The Path Not Taken: Duality in Reasoning about Program Execution
Summary
DexBench is a new benchmark that evaluates how well Large Language Models (LLMs) understand program execution by focusing on "duality" in reasoning. This approach moves beyond single-path evaluations, which are prone to data contamination and offer a narrow view of dynamic code reasoning. DexBench comprises 445 paired instances derived from real-world Python programs in the CruxEval, HumanEval, and PythonSaga datasets. It evaluates LLMs on two complementary tasks: forward reasoning (predicting the observed behavior of a program for a given input, specifically its code coverage) and backward reasoning (inferring the input mutations required to achieve a specific counterfactual behavioral objective, such as reaching an uncovered branch). Experiments with 13 LLMs, including Jamba Reasoning-3B, Llama-3.1, Gemini 2.5 Flash, and GPT-5 Mini, indicate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding, and that strong performance on isolated tasks does not guarantee success in joint evaluation.
Key takeaway
Research scientists evaluating coding LLMs should adopt dual-path reasoning benchmarks like DexBench to gain a more robust and holistic understanding of model capabilities. Relying solely on single-path evaluations risks overlooking critical causal reasoning gaps, because strong performance on isolated tasks does not guarantee joint success. Prioritize benchmarks that require models to reason about both observed execution and counterfactual scenarios, which tests a deeper, state-aware comprehension of program flow.
Key insights
True program execution understanding requires dual-path reasoning, evaluating both observed and counterfactual behaviors.
Principles
- Single-path evaluations offer a narrow view of program understanding.
- Dual-path reasoning provides a robust proxy for causal code understanding.
- Model scaling and reasoning-focused training do not guarantee dual-path reasoning improvements.
Method
DexBench evaluates LLMs using paired forward (code coverage prediction) and backward (branch-targeted input mutation) reasoning tasks, requiring models to maintain a consistent causal representation of program behavior across execution and counterfactual paths.
In practice
- Use dual-path benchmarks for comprehensive LLM code reasoning evaluation.
- Focus on causal understanding of execution flow, not just output prediction.
- Consider input mutation tasks over input generation for deeper reasoning assessment.
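One way to act on the first recommendation is to report a joint score next to the per-task scores, counting a paired instance as solved only when both its forward and backward answers are correct. The sketch below assumes that scoring rule for illustration; the summary does not specify the paper's exact metric, and the `PairResult` type, the `summarize` helper, and the four sample results are made up rather than taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    instance_id: str   # hypothetical identifier for a paired instance
    forward_ok: bool   # coverage prediction was correct
    backward_ok: bool  # input mutation reached the target branch


def summarize(results):
    """Return per-task accuracy and joint accuracy over paired instances."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    forward = sum(r.forward_ok for r in results) / n
    backward = sum(r.backward_ok for r in results) / n
    # Joint success requires both tasks to pass on the same instance.
    joint = sum(r.forward_ok and r.backward_ok for r in results) / n
    return {"forward": forward, "backward": backward, "joint": joint}


# Made-up results for illustration only (not data from the paper).
results = [
    PairResult("a", True, False),
    PairResult("b", False, True),
    PairResult("c", True, True),
    PairResult("d", True, False),
]
scores = summarize(results)
print(scores)
```

In this made-up sample the forward and backward accuracies are 0.75 and 0.5, while the joint accuracy is only 0.25, which illustrates how isolated-task scores can overstate joint performance.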
Topics
- LLM Code Reasoning
- Program Execution Duality
- DexBench Benchmark
- Forward Reasoning
- Counterfactual Reasoning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.