Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
Summary
PyRAG is a framework that recasts multi-hop Retrieval-Augmented Generation (RAG) as a program synthesis and execution task. Instead of reasoning in free-form natural language as traditional RAG systems do, PyRAG generates an executable Python program for each multi-hop question, so intermediate states are exposed as variables and execution provides deterministic feedback. This enables compiler-grounded self-repair and execution-driven adaptive retrieval without additional training. In experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle), PyRAG consistently outperforms strong baselines, with notable gains on the compositional multi-hop datasets. The framework uses a Qwen2.5-7B-Instruct backbone and an E5-base dense retriever over Wikipedia 2018, achieving an average Exact Match (EM) of 30.8 in training-free settings and 39.2 with RL-trained variants.
Key takeaway
For AI architects and research scientists designing RAG systems, PyRAG makes the case for replacing implicit natural-language reasoning with a program-guided approach. Planning each question as an executable program can improve the accuracy and interpretability of multi-hop question answering, especially for compositional queries. It also brings explicit state management, deterministic error detection, and easier debugging, which makes the resulting system more verifiable and controllable.
Key insights
Reformulating multi-hop RAG as program synthesis and execution enhances verifiability and performance.
Principles
- Multi-hop QA is a step-by-step computation.
- Code-specialized models excel with program synthesis interfaces.
- Execution provides deterministic feedback for self-repair.
Method
PyRAG decomposes questions into sub-queries, plans an executable Python program using `retrieve(query)` and `answer(query, docs)` APIs, and executes it with compiler-grounded self-repair and adaptive retrieval.
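To make the method concrete, here is a minimal sketch of what a planned program for a two-hop question might look like. Only the names `retrieve(query)` and `answer(query, docs)` come from the description above. The toy corpus, the word-overlap retriever, the regex-based `answer` stub, and the parameter `k` are assumptions made purely so the example runs without a model or an index; they are not PyRAG's implementation.

```python
import re

# Hypothetical three-document corpus standing in for the Wikipedia index.
CORPUS = [
    "Inception is a 2010 film directed by Christopher Nolan.",
    "Christopher Nolan was born in London.",
    "Titanic is a 1997 film directed by James Cameron.",
]


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, k: int = 1) -> list:
    """Toy retriever: rank documents by word overlap with the query."""
    q = _tokens(query)
    ranked = sorted(CORPUS, key=lambda d: len(q & _tokens(d)), reverse=True)
    return ranked[:k]


def answer(query: str, docs: list) -> str:
    """Toy reader: pull a span out of the documents with a fixed pattern.
    Assumption: a real system would call an LLM here."""
    pattern = r"directed by ([^.]+)" if "directed" in query else r"born in ([^.]+)"
    for doc in docs:
        match = re.search(pattern, doc)
        if match:
            return match.group(1)
    return ""


# --- The kind of program a planner might emit for:
# --- "Where was the director of Inception born?"
q1 = "Who directed Inception?"
director = answer(q1, retrieve(q1))          # hop 1, bound to a variable

q2 = f"Where was {director} born?"           # hop 2 depends on hop 1
birthplace = answer(q2, retrieve(q2))

final_answer = birthplace
print(director, "->", final_answer)
```

Because each hop is bound to a named variable, an intermediate result such as `director` can be inspected or logged on its own, which is the explicit state the summary refers to.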
In practice
- Use Python programs for complex RAG workflows.
- Implement explicit variable binding for intermediate results.
- Incorporate runtime error handling for program self-correction.
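The practices above can be sketched as a small execute-and-repair loop. This is an illustration under stated assumptions, not the paper's algorithm: `run_with_repair`, `toy_repair`, the `FACTS` lookup, and the stubbed `retrieve` and `answer` functions are all hypothetical. In a real system the repair step would presumably prompt a model with the failing program and its error message; here it is a hard-coded string fix so the example is runnable.

```python
import traceback

# Hypothetical lookup table standing in for a retriever plus reader.
FACTS = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}


def retrieve(query):
    """Toy retriever: return one 'document' if the query is known."""
    return [f"{query} -> {FACTS[query]}"] if query in FACTS else []


def answer(query, docs):
    """Toy reader: read the answer off the first document, if any."""
    return docs[0].split(" -> ")[1] if docs else ""


def run_with_repair(program, repair, env, max_attempts=3):
    """Run program text; on any error, hand the source and traceback to `repair`."""
    for attempt in range(1, max_attempts + 1):
        scope = dict(env)  # fresh namespace for every attempt
        try:
            exec(compile(program, "<plan>", "exec"), scope)
            return scope, attempt
        except Exception:
            # The traceback is the deterministic feedback signal.
            program = repair(program, traceback.format_exc())
    raise RuntimeError("program could not be repaired")


def toy_repair(program, error):
    """Stand-in for a model-driven fix: correct one known misspelled name."""
    if "NameError" in error and "directr" in error:
        return program.replace("directr", "director")
    return program


# A plan with a deliberate bug: `directr` is never defined.
BROKEN_PLAN = '''
q1 = "Who directed Inception?"
director = answer(q1, retrieve(q1))
q2 = f"Where was {directr} born?"
final_answer = answer(q2, retrieve(q2))
'''

scope, attempts = run_with_repair(
    BROKEN_PLAN, toy_repair, {"retrieve": retrieve, "answer": answer}
)
print(scope["final_answer"], "after", attempts, "attempts")
```

The first attempt fails with a `NameError`, the repair step rewrites the program, and the second attempt runs to completion. Running each attempt in a fresh namespace keeps state from a failed run from leaking into the next one.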
Topics
- Retrieval-Augmented Generation
- Multi-hop Question Answering
- Program Synthesis
- Executable Reasoning
- Code-specialized LLMs
Best for: AI Architect, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.