Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

2026-05-15 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

PyRAG is a framework that recasts multi-hop Retrieval-Augmented Generation (RAG) as a program synthesis and execution task. Where traditional RAG systems rely on free-form natural language reasoning, PyRAG generates an executable Python program for each multi-hop question, exposing intermediate states as explicit variables and obtaining deterministic feedback by running the program. This enables compiler-grounded self-repair and execution-driven adaptive retrieval without additional training. In experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle), PyRAG consistently outperforms strong baselines, with significant gains on the compositional multi-hop datasets. The framework uses a Qwen2.5-7B-Instruct backbone and an E5-base dense retriever over Wikipedia 2018, and reaches an average Exact Match (EM) of 30.8 in the training-free setting and 39.2 with RL-trained variants.

Key takeaway

For AI architects and research scientists designing RAG systems, PyRAG shows that a program-guided approach can improve both the accuracy and the interpretability of multi-hop question answering, especially for compositional queries. Consider adding executable planning to gain explicit state management, deterministic error detection, and easier debugging, replacing implicit natural language reasoning with steps that can be inspected and verified.

Key insights

Reformulating multi-hop RAG as program synthesis and execution enhances verifiability and performance.

Principles

Multi-hop QA is a step-by-step computation.
Code-specialized models excel with program synthesis interfaces.
Execution provides deterministic feedback for self-repair.
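
One way to picture execution feedback steering retrieval is the sketch below, in which an empty intermediate variable triggers a retry with a larger retrieval budget. This is an illustrative assumption about what execution-driven adaptive retrieval could look like, not the paper's mechanism; `adaptive_hop`, `k_schedule`, and the toy `retrieve` and `answer` stand-ins are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

def adaptive_hop(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    answer: Callable[[str, List[str]], str],
    k_schedule: Sequence[int] = (1, 3, 5),
) -> Tuple[str, int]:
    """Retry one hop with more documents while the bound answer is empty."""
    for k in k_schedule:
        docs = retrieve(query, k)
        value = answer(query, docs)
        if value:  # deterministic check on the intermediate variable
            return value, k
    return "", k_schedule[-1]

# Toy stand-ins (hypothetical): the relevant document is ranked third.
DOCS = ["Noise one.", "Noise two.", "The Eiffel Tower is in Paris."]

def toy_retrieve(query: str, k: int) -> List[str]:
    return DOCS[:k]

def toy_answer(query: str, docs: List[str]) -> str:
    return "Paris" if any("Eiffel" in d for d in docs) else ""

value, used_k = adaptive_hop("Where is the Eiffel Tower?", toy_retrieve, toy_answer)
print(value, used_k)
```

Because the check is on a program variable rather than on free-form text, the decision to retrieve again is reproducible from one run to the next.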

Method

PyRAG decomposes questions into sub-queries, plans an executable Python program using `retrieve(query)` and `answer(query, docs)` APIs, and executes it with compiler-grounded self-repair and adaptive retrieval.
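
The kind of program described above might look like the following minimal sketch. Only the `retrieve(query)` and `answer(query, docs)` signatures come from the summary; the toy corpus, the word-overlap ranking, the rule-based reader, and the optional `k` parameter are assumptions standing in for the dense retriever and the LLM.

```python
from typing import List

# Toy corpus standing in for a Wikipedia index (hypothetical data).
CORPUS = [
    "Inception is a 2010 film directed by Christopher Nolan.",
    "Christopher Nolan was born in London, England.",
    "Paris is the capital of France.",
]

def retrieve(query: str, k: int = 2) -> List[str]:
    """Rank documents by word overlap with the query (stand-in for a dense retriever)."""
    q_words = set(query.lower().replace("?", "").split())
    ranked = sorted(
        CORPUS,
        key=lambda d: -len(q_words & set(d.lower().rstrip(".").split())),
    )
    return ranked[:k]

def answer(query: str, docs: List[str]) -> str:
    """Stand-in reader; a real system would prompt an LLM with the query and docs."""
    for doc in docs:
        if "directed" in query and " directed by " in doc:
            return doc.split(" directed by ")[1].rstrip(".")
        if "born" in query and " was born in " in doc:
            return doc.split(" was born in ")[1].rstrip(".")
    return ""

def program() -> str:
    # Hop 1: bind the director to an explicit variable.
    docs_1 = retrieve("Who directed Inception?")
    director = answer("Who directed Inception?", docs_1)
    # Hop 2: the second sub-query is built from the first hop's result.
    docs_2 = retrieve(f"Where was {director} born?")
    birthplace = answer(f"Where was {director} born?", docs_2)
    return birthplace

print(program())
```

Each hop's result lives in a named variable (`director`, `birthplace`), so a wrong final answer can be traced to the hop that produced it.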

In practice

Use Python programs for complex RAG workflows.
Implement explicit variable binding for intermediate results.
Incorporate runtime error handling for program self-correction.
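
The last point can be sketched as a small execute-and-repair loop: run the generated program, and on failure pass the source and traceback to a repair step. This is a minimal sketch under stated assumptions, not PyRAG's implementation; `run_with_repair` and `toy_repair` are hypothetical names, and `toy_repair` stands in for an LLM call that would rewrite the program. Running model-generated code with `exec` should only be done inside a sandbox.

```python
import traceback
from typing import Callable, Dict, Tuple

def run_with_repair(
    source: str,
    env: Dict[str, object],
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Tuple[object, int]:
    """Execute `source`; on failure, pass code and traceback to `repair` and retry."""
    for attempt in range(1, max_attempts + 1):
        scope = dict(env)  # fresh namespace for each attempt
        try:
            exec(source, scope)  # sandbox this in any real deployment
            return scope["result"], attempt
        except Exception:
            error = traceback.format_exc()
            source = repair(source, error)
    raise RuntimeError("program still failing after repair attempts")

def toy_repair(source: str, error: str) -> str:
    # Hypothetical stand-in for an LLM repair call: fixes one known typo.
    if "NameError" in error and "retreive" in error:
        return source.replace("retreive", "retrieve")
    return source

# Generated program with a deliberate typo in the API name.
BROKEN = '''
docs = retreive("capital of France")
result = answer("capital of France", docs)
'''

env = {
    "retrieve": lambda q: ["Paris is the capital of France."],
    "answer": lambda q, docs: docs[0].split(" is ")[0],
}

result, attempts = run_with_repair(BROKEN, env, toy_repair)
print(result, attempts)
```

The interpreter's error message gives the repair step a concrete, repeatable signal, which is the advantage over asking a model to critique its own free-form reasoning.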

Topics

Retrieval-Augmented Generation
Multi-hop Question Answering
Program Synthesis
Executable Reasoning
Code-specialized LLMs

Code references

GasolSun36/PyRAG

Best for: AI Architect, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.