Can Large Language Models Reason About Complex Execution Paths? An Empirical Study on Python

Source: cs.SE updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

The study investigates whether large language models (LLMs) can solve the complex execution path constraints that arise in symbolic execution, a critical software analysis technique. The researchers conducted an empirical study on Python, evaluating LLMs on two tasks: generating test inputs that follow a specific execution path, and classifying path feasibility for bug detection. New benchmarks were built from 210 competition-level programs and 11 real-world repositories. State-of-the-art LLMs, particularly Large Reasoning Models (LRMs) such as o4-mini, achieved up to 65.6% path accuracy in test case generation and 82.9% classification accuracy in detecting division-by-zero bugs. LLMs also improved test coverage in real-world repositories, reaching 83.2% overall line coverage. However, LRMs showed "overthinking" issues in classification, and most LLMs were slower than traditional symbolic execution tools such as CrossHair (13.0 seconds per path for CrossHair versus 15-60 seconds for LLMs).
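To make the path feasibility task concrete, here is a toy sketch. The two functions and the helper `find_zero_division_input` are hypothetical illustrations, not taken from the study's benchmark, and the bounded brute-force search stands in for the reasoning an LLM or a constraint solver would perform. It labels a division-by-zero path as feasible when some input in a small integer domain triggers it.

```python
from itertools import product


def guarded_ratio(a, b):
    # The division is only reached when b != 0, so the
    # division-by-zero path is infeasible.
    if b != 0:
        return a / b
    return 0.0


def scaled_gap(a, b):
    # The division is reached when a > b. The denominator is zero
    # when a == 2 * b, which is compatible with a > b whenever b > 0,
    # so the division-by-zero path is feasible.
    if a > b:
        return 1 / (a - 2 * b)
    return 0.0


def find_zero_division_input(func, domain=range(-3, 4)):
    """Return the first (a, b) in the domain that raises ZeroDivisionError, else None."""
    for args in product(domain, repeat=2):
        try:
            func(*args)
        except ZeroDivisionError:
            return args
    return None


print(find_zero_division_input(guarded_ratio))  # None
print(find_zero_division_input(scaled_gap))     # (2, 1)
```

A witness input proves the path feasible. The absence of a witness in a bounded domain only suggests infeasibility, which is why path constraints are normally solved symbolically rather than by enumeration.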

Key takeaway

Software engineers and QA professionals struggling with complex path constraints in Python should consider integrating LLMs into their testing workflow. LLMs demonstrate strong capabilities in generating test cases for intricate execution paths and classifying potential bugs such as division-by-zero errors, especially where traditional symbolic execution tools fail. Although current LLMs may be slower than traditional tools on simple cases, their ability to handle complex, real-world scenarios and improve test coverage makes them a valuable complement to existing software analysis.

Key insights

Large Language Models can effectively solve complex execution path constraints, enhancing symbolic execution for software testing and debugging.

Method

The study involved constructing benchmarks from competition problems and real-world repositories, then evaluating LLMs on test case generation (using a customized Python trace library) and path classification (using CFG traversal and manual annotation).
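Checking whether a generated test input actually follows a target path can be sketched with the standard library's `sys.settrace`. This is a minimal illustration under stated assumptions: the study's customized trace library is not described here, and the names `record_path`, `follows_path`, and `classify` are hypothetical. Paths are expressed as line offsets relative to the function's `def` line.

```python
import sys


def record_path(func, *args):
    """Run func(*args) and return the executed line offsets inside func."""
    code = func.__code__
    first_line = code.co_firstlineno
    executed = []

    def tracer(frame, event, arg):
        # Only follow frames that execute the function under test.
        if frame.f_code is not code:
            return None
        if event == "line":
            executed.append(frame.f_lineno - first_line)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        # Restore whatever tracer (debugger, coverage tool) was active.
        sys.settrace(previous)
    return executed


def classify(x):               # offset 0
    if x < 0:                  # offset 1
        return "negative"      # offset 2
    if x % 2 == 0:             # offset 3
        return "even"          # offset 4
    return "odd"               # offset 5


def follows_path(func, args, target_path):
    """True if running func on args executes exactly the target line offsets."""
    return record_path(func, *args) == target_path


print(record_path(classify, -3))              # [1, 2]
print(follows_path(classify, (4,), [1, 3, 4]))  # True
print(follows_path(classify, (7,), [1, 3, 4]))  # False
```

Under this reading, path accuracy would be the fraction of target paths for which a generated input makes `follows_path` return true. That definition is an assumption for illustration, not a quotation of the study's metric.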

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.