Unlocking LLM Code Correction with Iterative Feedback Loops
Summary
This study systematically investigates how well Large Language Models (LLMs) self-correct code when given iterative execution feedback. The researchers evaluated four state-of-the-art LLMs (DeepSeek-R1, DeepSeek-V3, GPT-o4-mini, GPT-4.1-mini) in two programming languages (Python, Java) on real-world LeetCode problems. The iterative refinement framework fed compiler errors and test-case results back to the model for up to 10 iterations. Reasoning models such as DeepSeek-R1 and GPT-o4-mini improved consistently across iterations and made significantly better use of feedback than non-reasoning models. Syntactic and runtime errors were far more tractable (fix rates >80%) than logical or algorithmic failures (fix rates <35%), which points to current limits in LLMs' deep algorithmic reasoning. The study also introduces new metrics such as Iterative Success Rate (ISR@k) and Median Iterations to Solve (MIS) for a more realistic evaluation.
Key takeaway
Machine Learning Engineers developing LLM-driven code generation systems should integrate iterative feedback loops to improve code correctness beyond single-attempt performance. Focus on providing clear execution feedback for syntactic and runtime errors, as these are the most amenable to LLM self-correction. Deep algorithmic or logical errors remain challenging and may require alternative strategies or human intervention. Consider using metrics like ISR@k and MIS for a more comprehensive evaluation of your models' real-world utility.
Key insights
Iterative feedback loops significantly enhance LLM code correction, especially for reasoning models and for syntactic and runtime errors.
Principles
- Reasoning capacity improves feedback utilization.
- Explicit prompt guidance enhances code efficiency.
- Error type dictates correction tractability.
Method
An automated feedback loop executes LLM-generated code, constructs prompts with failure messages, and provides this execution feedback to the LLM for iterative refinement over up to 10 turns.
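The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's actual harness: `generate` is a hypothetical stand-in for an LLM call, `solve` is an assumed entry-point name, the prompt wording is invented, and the candidate is run in-process with `exec`, which a real system would replace with a sandboxed subprocess or container.

```python
from typing import Callable, List, Optional, Tuple

MAX_ITERS = 10  # the study allows up to 10 refinement iterations

TestCase = Tuple[tuple, object]  # (positional arguments, expected result)


def run_candidate(code: str, tests: List[TestCase], entry: str = "solve") -> Optional[str]:
    """Return None if every test passes, otherwise a failure message."""
    namespace: dict = {}
    try:
        # Compile and load the candidate; syntax errors surface here.
        exec(compile(code, "<candidate>", "exec"), namespace)
    except SyntaxError as err:
        return f"Compile error: {err}"
    except Exception as err:
        return f"Runtime error while loading code: {type(err).__name__}: {err}"

    func = namespace.get(entry)
    if not callable(func):
        return f"Compile error: function '{entry}' is not defined"

    for args, expected in tests:
        try:
            actual = func(*args)
        except Exception as err:
            return f"Runtime error on input {args!r}: {type(err).__name__}: {err}"
        if actual != expected:
            return f"Wrong answer on input {args!r}: expected {expected!r}, got {actual!r}"
    return None


def refine(
    generate: Callable[[str], str],  # hypothetical LLM call: prompt -> code
    problem: str,
    tests: List[TestCase],
    max_iters: int = MAX_ITERS,
) -> Tuple[Optional[str], int]:
    """Return (passing_code, iterations_used), or (None, max_iters) if unsolved."""
    prompt = problem
    for iteration in range(1, max_iters + 1):
        code = generate(prompt)
        feedback = run_candidate(code, tests)
        if feedback is None:
            return code, iteration
        # Build the next prompt from the failed attempt and its failure message.
        prompt = (
            f"{problem}\n\nYour previous attempt:\n{code}\n\n"
            f"It failed with:\n{feedback}\n\nPlease fix the code."
        )
    return None, max_iters
```

Stopping at the first failing test keeps the feedback message short. Whether the study reports one failure or all of them per iteration is not stated in this summary.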
In practice
- Implement multi-turn feedback for LLM code generation.
- Target syntactic and runtime errors first, since they are the most fixable.
- Use ISR@k and MIS for robust LLM evaluation.
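This summary names ISR@k and MIS without giving formulas, so the sketch below assumes one plausible reading: ISR@k is the fraction of problems solved within the first k iterations, and MIS is the median iteration count over the problems that were eventually solved. The function names and the input format (a mapping from problem ID to the iteration at which it was solved, or `None` if it was never solved) are illustrative choices, not the paper's.

```python
from statistics import median
from typing import Dict, Optional


def isr_at_k(solved_at: Dict[str, Optional[int]], k: int) -> float:
    """Assumed ISR@k: share of all problems solved in at most k iterations."""
    if not solved_at:
        raise ValueError("no problems to evaluate")
    solved = sum(1 for it in solved_at.values() if it is not None and it <= k)
    return solved / len(solved_at)


def median_iterations_to_solve(solved_at: Dict[str, Optional[int]]) -> Optional[float]:
    """Assumed MIS: median iteration count over solved problems only."""
    iterations = [it for it in solved_at.values() if it is not None]
    return median(iterations) if iterations else None


# Example: four problems, one never solved within the iteration budget.
results = {"p1": 1, "p2": 3, "p3": None, "p4": 2}
print(isr_at_k(results, 1))                 # 0.25
print(isr_at_k(results, 3))                 # 0.75
print(median_iterations_to_solve(results))  # 2
```

Under this reading, ISR@1 is the single-attempt success rate, so the gap between ISR@1 and ISR@10 measures how much a model gains from feedback.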
Topics
- Large Language Models
- Code Generation
- Iterative Refinement
- Feedback Loops
- Model Evaluation
- Algorithmic Optimization
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.