Code Reasoning for Software Engineering Tasks: A Survey and A Call to Action
Summary
The paper "Code Reasoning for Software Engineering Tasks: A Survey and A Call to Action" presents the first dedicated survey of code reasoning techniques for software engineering (SWE) tasks. It examines how large language models (LLMs) perform complex tasks such as code generation, translation, summarization, and repair, with particular attention to real-world GitHub issue resolution. The survey introduces a taxonomy of techniques covering Code Chain-of-Thought (CoT) reasoning, execution-based reasoning, and inference scaling, and it distinguishes agentic from non-agentic SWE tasks. It also gives an overview of performance on common benchmarks such as APPS, HumanEval, MBPP, and SWE-bench, and it highlights under-explored benchmarks and open research gaps.
Key takeaway
AI scientists and machine learning engineers developing code-generating LLMs should prioritize combining modular Chain-of-Thought (CoT) prompting with execution-based feedback and inference scaling. According to the survey, this hybrid approach, especially within agentic frameworks, improves performance on complex software engineering tasks such as GitHub issue resolution and on competitive programming benchmarks. To address current limitations and improve generalizability, focus on multilingual reasoning and explore code-specific plans for agents.
Key insights
LLM code reasoning improves through structured CoT, execution feedback, and inference scaling, culminating in agentic systems.
Principles
- Modular CoT outperforms structure-aware and plan-based methods.
- Execution-aware strategies enhance code correctness via deterministic checks.
- Inference scaling with search counteracts model rigidity.
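The search principle can be illustrated with a minimal sketch. It assumes a toy three-operation language in place of an LLM that proposes code steps, and it uses deterministic input/output checks as the score that guides a best-first search. The operation names (inc, double, dec3), the scoring rule, and the examples are hypothetical and are not taken from the survey.

```python
import heapq
from itertools import count

# Hypothetical toy DSL: each operation transforms an integer.
OPS = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "dec3": lambda x: x - 3,
}

def run(program, x):
    """Apply each operation in the program to x, in order."""
    for name in program:
        x = OPS[name](x)
    return x

def score(program, examples):
    """Deterministic check: number of input/output examples the program satisfies."""
    return sum(run(program, x) == y for x, y in examples)

def best_first_search(examples, max_depth=3):
    """Expand the highest-scoring partial program first, up to max_depth operations."""
    tie = count()  # tiebreaker so the heap never compares programs directly
    frontier = [(0, next(tie), ())]  # entries: (negative score, tiebreak, program)
    while frontier:
        neg_score, _, program = heapq.heappop(frontier)
        if program and -neg_score == len(examples):
            return list(program)
        if len(program) < max_depth:
            for name in OPS:  # stand-in for an LLM proposing next steps
                child = program + (name,)
                heapq.heappush(frontier, (-score(child, examples), next(tie), child))
    return None

EXAMPLES = [(1, 4), (3, 8)]  # consistent with f(x) = (x + 1) * 2
program = best_first_search(EXAMPLES)
print(program)
```

In a real system the enumeration over OPS would be replaced by model-proposed continuations, and the score could come from unit tests or a learned verifier. The frontier logic stays the same.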
Method
The paper categorizes code reasoning techniques into three groups: Code CoT reasoning (plan-based, structure-based, and fine-tuning), execution-based reasoning (self-evaluation, training with feedback, and automated test generation), and inference scaling (sampling and search).
In practice
- Implement modular CoT for complex code generation problems.
- Integrate execution feedback loops for self-debugging LLM-generated code.
- Employ tree-search algorithms for exploring diverse code solution paths.
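A self-debugging loop driven by execution feedback might look like the sketch below. It assumes a hypothetical `generate(task, feedback)` callable standing in for an LLM call; `stub_model`, the task text, and the tests are illustrative only. Candidate code is run with `exec` for brevity, which is only acceptable for trusted code. A real system would run candidates in a sandbox.

```python
import traceback
from typing import Callable, Optional, Tuple

def run_tests(source: str, tests: str) -> Optional[str]:
    """Run candidate code and assert-style tests; return None on success, else the error text."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumption: trusted code; sandbox this in practice
        exec(tests, namespace)
    except Exception:
        return traceback.format_exc(limit=1)
    return None

def self_debug(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    tests: str,
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask for a candidate, execute it, and feed any failure back until the tests pass."""
    feedback: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        candidate = generate(task, feedback)
        feedback = run_tests(candidate, tests)
        if feedback is None:
            return candidate, round_idx
    return None, max_rounds

def stub_model(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: buggy first attempt, fixed once it sees feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"

TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
solution, rounds = self_debug(stub_model, "Write add(a, b).", TESTS)
print(rounds)
```

The same loop structure accommodates richer feedback, such as failing test names or runtime traces, by changing only what `run_tests` returns.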
Topics
- Code Reasoning
- Large Language Models
- Software Engineering Agents
- Chain-of-Thought Prompting
- Execution-based Feedback
- Code Generation Benchmarks
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.