Can Large Language Models Reason About Complex Execution Paths? An Empirical Study on Python
Summary
The study investigates whether large language models (LLMs) can solve the complex execution path constraints that arise in symbolic execution, a critical software analysis technique. The researchers conducted an empirical study on Python, evaluating LLMs on two tasks: generating test inputs that follow a specific execution path, and classifying path feasibility for bug detection. New benchmarks were built from 210 competition-level programs and 11 real-world repositories. State-of-the-art LLMs, particularly Large Reasoning Models (LRMs) such as o4-mini, achieved up to 65.6% path accuracy in test case generation and 82.9% classification accuracy in detecting division-by-zero bugs. LLMs also improved test coverage in real-world repositories, reaching 83.2% overall line coverage. However, LRMs showed "overthinking" issues in classification, and most LLMs were less time-efficient than traditional symbolic execution tools such as CrossHair (13.0 seconds per path for CrossHair vs. 15-60 seconds per path for most LLMs).
Key takeaway
Software engineers and QA professionals struggling with complex path constraints in Python should consider integrating LLMs into their testing workflow. LLMs demonstrate strong capabilities in generating test cases for intricate execution paths and in classifying potential bugs such as division-by-zero errors, especially where traditional symbolic execution tools fail. Although current LLMs may be slower than traditional tools on simple cases, their ability to handle complex, real-world scenarios and improve test coverage makes them a valuable complement to traditional tools for robust software analysis.
Key insights
Large Language Models can effectively solve complex execution path constraints, enhancing symbolic execution for software testing and debugging.
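To make "path constraint" and "path feasibility" concrete, here is a minimal sketch. The function `scaled`, the two hand-written path constraints, and the bounded brute-force search are all illustrative assumptions, not the study's benchmark or method. Note that a bounded search can find a witness input but cannot prove that a path is infeasible.

```python
from itertools import product

def scaled(a, b):
    d = a - b
    if a > b:
        return 100 // d          # path A: a > b implies d > 0
    if a % 2 == 0:
        return 100 // (b - a)    # path B: a <= b and a is even
    return 0

# For each path: (constraint on the inputs, divisor evaluated on that path).
# Both are written by hand here; a symbolic executor would derive them.
PATHS = {
    "A": (lambda a, b: a > b, lambda a, b: a - b),
    "B": (lambda a, b: a <= b and a % 2 == 0, lambda a, b: b - a),
}

def find_zero_divisor(constraint, divisor, domain=range(-3, 4)):
    """Search a small input domain for a witness that follows the path
    and makes the divisor zero. None means no witness in this domain."""
    for a, b in product(domain, repeat=2):
        if constraint(a, b) and divisor(a, b) == 0:
            return (a, b)
    return None

witnesses = {name: find_zero_divisor(c, d) for name, (c, d) in PATHS.items()}
print(witnesses)
```

On path A the constraint `a > b` rules out a zero divisor, so no witness exists. On path B any even `a` with `a == b` triggers the division by zero. Classifying paths in this way is the kind of question the feasibility task poses, only for far more intricate constraints.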
Principles
- LRMs consistently outperform non-reasoning LLMs in test case generation.
- LLMs can significantly improve test coverage in real-world software.
- Extensive LRM reasoning (overthinking) can degrade classification accuracy.
Method
The study involved constructing benchmarks from competition problems and real-world repositories, then evaluating LLMs on test case generation (using a customized Python trace library) and path classification (using CFG traversal and manual annotation).
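The study's customized trace library is not described here, so the following is only a sketch of the general idea: run a candidate test input, record which lines execute, and compare them with the target path. It uses the standard `sys.settrace` hook. The helper names `trace_lines` and `follows_path`, the example function `classify`, and the choice to express a path as line offsets from the `def` line are all assumptions made for illustration.

```python
import sys

def trace_lines(func, *args):
    """Run func(*args) and return the executed line offsets,
    measured relative to the line holding the def statement."""
    code = func.__code__
    first = code.co_firstlineno
    executed = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None              # do not trace other functions
        if event == "line":
            executed.append(frame.f_lineno - first)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)       # always restore the old tracer
    return executed

def follows_path(func, inputs, target_offsets):
    """True if running func on inputs executes exactly the target path."""
    return trace_lines(func, *inputs) == list(target_offsets)

def classify(x, y):
    if x > y:                # offset 1
        if x - y > 10:       # offset 2
            return "far"     # offset 3
        return "near"        # offset 4
    return "below"           # offset 5

# A candidate input (for example, one proposed by an LLM) is accepted
# only if execution actually follows the requested path.
print(follows_path(classify, (20, 5), [1, 2, 3]))
print(follows_path(classify, (6, 5), [1, 2, 3]))
```

Checking candidates by execution means a wrong answer from a model is rejected rather than silently counted as covering the path.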
In practice
- Use LLMs for generating test cases covering complex control flows.
- Employ LLMs to classify execution paths for bug detection.
- Integrate LLMs with traditional tools for efficiency on complex paths.
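One way to combine the practices above is a fallback chain: try the faster traditional tool first, turn to an LLM only when it yields nothing, and verify every candidate by running it. The sketch below assumes this ordering; the summary does not prescribe a specific integration. `symbolic_stub` and `llm_stub` are hypothetical placeholders returning canned values, not calls to CrossHair or to any model API.

```python
from typing import Callable, Optional, Sequence, Tuple

Inputs = Tuple[int, ...]
Generator = Callable[[], Optional[Inputs]]

def first_valid_input(
    generators: Sequence[Tuple[str, Generator]],
    reaches_target: Callable[[Inputs], bool],
) -> Optional[Tuple[str, Inputs]]:
    """Try each generator in order and return (name, inputs) for the
    first candidate that is confirmed by actually executing the code."""
    for name, generate in generators:
        candidate = generate()
        if candidate is not None and reaches_target(candidate):
            return name, candidate
    return None

def target(x, y):
    # A branch whose condition mixes nonlinear and modular arithmetic.
    if (x * x + y) % 97 == 13 and x > y:
        return "hit"
    return "miss"

def symbolic_stub() -> Optional[Inputs]:
    return None        # hypothetical: stands in for a tool that timed out

def llm_stub() -> Optional[Inputs]:
    return (4, -3)     # hypothetical: stands in for a model's proposal

result = first_valid_input(
    [("symbolic", symbolic_stub), ("llm", llm_stub)],
    lambda inputs: target(*inputs) == "hit",
)
print(result)
```

Ordering the generators from cheapest to most expensive keeps simple paths fast, and the execution check keeps unverified model output out of the test suite.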
Topics
- Symbolic Execution
- Large Language Models
- Software Testing
- Path Constraint Solving
- Test Case Generation
- Bug Detection
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.