FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

2026-06-19 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cybersecurity & Data Privacy · Depth: Advanced, extended

Summary

FAPO (Fully Autonomous Prompt Optimization) is a framework that uses Claude Code to optimize multi-step LLM pipelines: it evaluates performance, inspects intermediate steps, diagnoses failures, and proposes iterative changes. It tries prompt edits first and escalates to structural modifications only when failure attribution points to a deeper bottleneck. Across six benchmarks and three task models, FAPO outperformed the GEPA baseline in 15 of 18 comparisons, with a mean gain of +14.1 pp. On HoVer and IFBench, where structural changes were applied, the mean gain was +33.8 pp. For security tasks such as CTIBench-RCM, prompt-only FAPO improved test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning.

Key takeaway

For AI engineers building multi-step LLM pipelines, FAPO offers a framework for autonomous optimization that goes beyond prompt tuning. Its evidence-grounded, prompt-first approach diagnoses pipeline bottlenecks and resolves them with the smallest change that works, including structural changes when prompts alone fall short. The reported gains cover complex tasks such as multi-hop QA and security classification.

Key insights

FAPO autonomously optimizes multi-step LLM pipelines by diagnosing failures and iteratively applying prompt or structural changes.

Principles

Separate shared tester from task logic.
Ground decisions in recorded evidence.
Prefer smallest useful change (prompt-first).
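
The first two principles can be illustrated with a small sketch. It is not FAPO's implementation: the names `Trace` and `run_tester`, the toy two-step pipeline, and the three-example dataset are all hypothetical, chosen only to show a tester that knows nothing about the task and records step-level evidence for later diagnosis.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

# Hypothetical types: a pipeline is an ordered list of (step name, step function).
Step = Tuple[str, Callable[[Any], Any]]


@dataclass
class Trace:
    """Recorded evidence for one example: every intermediate output plus the verdict."""
    example_id: int
    steps: List[Tuple[str, Any]] = field(default_factory=list)
    correct: bool = False


def run_tester(pipeline: List[Step], dataset, metric) -> Tuple[float, List[Trace]]:
    """Shared tester: runs any pipeline on any dataset and contains no task logic."""
    traces = []
    for i, (x, gold) in enumerate(dataset):
        trace = Trace(example_id=i)
        value = x
        for name, fn in pipeline:
            value = fn(value)
            trace.steps.append((name, value))  # keep the intermediate output
        trace.correct = metric(value, gold)
        traces.append(trace)
    accuracy = sum(t.correct for t in traces) / len(traces)
    return accuracy, traces


# Task logic lives outside the tester (toy stand-ins for LLM calls).
def normalize(text: str) -> str:
    return text.strip()


def classify(text: str) -> str:
    return "CWE-79" if "xss" in text.lower() else "CWE-89"


pipeline = [("normalize", normalize), ("classify", classify)]
dataset = [
    (" XSS in search form ", "CWE-79"),
    ("SQL injection in login", "CWE-89"),
    ("buffer overflow in parser", "CWE-787"),
]
accuracy, traces = run_tester(pipeline, dataset, lambda pred, gold: pred == gold)
failures = [t for t in traces if not t.correct]
```

Because the tester only sees step names and outputs, the same harness can score the baseline and every proposed variant, and the failing traces give an optimizer concrete evidence to inspect rather than a single aggregate number.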

Method

FAPO evaluates a pipeline, records step-level evidence, attributes failures, proposes a scoped variant (prompt or structural), and reviews it. It keeps iterating while performance improves and escalates to structural changes when prompt edits are insufficient.
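
The loop described above might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: `Variant`, `Evidence`, `optimize`, the `patience` threshold that triggers escalation, and all `toy_*` functions are hypothetical, and the integer scores are made up for the demo rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Variant:
    prompts: Dict[str, str]  # step name -> prompt text (hypothetical representation)
    kind: str = "baseline"   # "baseline", "prompt", or "structural"


@dataclass
class Evidence:
    score: float
    failing_step: str        # step blamed by attribution


def optimize(baseline, evaluate, propose_prompt, propose_structural, review,
             max_iters=6, patience=2):
    """Prompt-first loop: escalate to structural edits after `patience` stalled rounds."""
    best, best_ev = baseline, evaluate(baseline)
    history = [(best.kind, best_ev.score)]
    stalled = 0
    for _ in range(max_iters):
        proposer = propose_structural if stalled >= patience else propose_prompt
        candidate = proposer(best, best_ev)
        if not review(best, candidate):      # reject out-of-scope variants
            stalled += 1
            continue
        ev = evaluate(candidate)
        history.append((candidate.kind, ev.score))
        if ev.score > best_ev.score:         # keep only strict improvements
            best, best_ev, stalled = candidate, ev, 0
        else:
            stalled += 1
    return best, best_ev, history


# Toy stand-ins so the sketch runs without any model calls.
def toy_evaluate(v):
    score = 30
    if "cite" in v.prompts["answer"]:
        score += 20
    if "verify" in v.prompts:
        score += 40
    return Evidence(score=score, failing_step="answer")


def toy_propose_prompt(v, ev):
    prompts = dict(v.prompts)
    prompts[ev.failing_step] += " Always cite sources."
    return Variant(prompts, kind="prompt")


def toy_propose_structural(v, ev):
    prompts = dict(v.prompts)
    prompts["verify"] = "Check the answer against the retrieved passages."
    return Variant(prompts, kind="structural")


def toy_review(current, candidate):
    # Scoped change: at most one step may be edited or added per variant.
    changed = [k for k in candidate.prompts
               if candidate.prompts[k] != current.prompts.get(k)]
    return len(changed) <= 1


def demo():
    baseline = Variant({"retrieve": "Find relevant passages.",
                        "answer": "Answer the question."})
    return optimize(baseline, toy_evaluate, toy_propose_prompt,
                    toy_propose_structural, toy_review)
```

In the toy run, the first prompt edit helps, two further prompt edits stall, and only then does the loop try a structural variant that adds a verification step. How the real system decides when to escalate is not specified in this summary, so the stall counter here is purely an assumption.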

In practice

Optimize multi-hop QA pipelines.
Improve security CVE-to-CWE classification.
Enhance fact-verification and instruction-following.

Topics

Multi-step LLM Pipelines
Prompt Optimization
Autonomous Agents
Claude Code
Pipeline Optimization
Failure Attribution
Security Classification

Best for: Research Scientist, AI Architect, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.