ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue

2026-03-04 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Adaptive Tree Policy Optimization (ATPO) is an algorithm for improving information seeking in multi-turn medical dialogues. It targets two known weaknesses of existing methods: poor long-horizon credit assignment in GRPO and unstable value estimation in PPO. ATPO formulates the interaction as a Hierarchical Markov Decision Process (H-MDP) and adaptively allocates rollout budget to states with high uncertainty, quantified by Bellman error and action-value variance. This improves value estimation and encourages efficient exploration. To contain computational cost, ATPO adds uncertainty-guided pruning and an asynchronous search architecture that reuses the KV cache. In experiments on the MedQA, MedMCQA, and MedicalExam datasets, a Qwen3-8B model trained with ATPO surpasses GPT-4o by 0.92% accuracy on MedQA. ATPO also shows higher sample efficiency and deeper, more balanced exploration than baselines such as TreePO and GRPO.

Key takeaway

For NLP engineers building conversational AI for healthcare, ATPO offers a way to improve diagnostic accuracy in multi-turn medical dialogues. Consider combining adaptive tree search with uncertainty-aware budget allocation and KV cache reuse to improve model performance and sample efficiency. In the reported experiments, this approach lets a smaller model (Qwen3-8B) outperform a larger general-purpose LLM (GPT-4o) on MedQA.

Key insights

ATPO uses uncertainty-aware adaptive tree search and efficient execution to optimize multi-turn medical dialogue LLMs.

Principles

Adaptive rollout budget allocation improves value estimation and exploration.
Hierarchical MDPs are effective for multi-turn dialogue modeling.
Uncertainty metrics (Bellman error, action-value variance) guide efficient tree search.
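
The two uncertainty signals named above can be sketched for a single tree node as follows. This is a minimal illustration, not the paper's formulation: the one-step backup over the mean child value, the population variance, the weights alpha and beta, and all function names are assumptions introduced here.

```python
from statistics import pvariance


def bellman_error(value, reward, child_values, gamma=1.0):
    """Absolute one-step Bellman residual |V(s) - (r + gamma * mean V(s'))|.

    Assumption: the backup target averages the children's value estimates.
    """
    if not child_values:
        return 0.0  # leaf node: nothing to back up from
    backup = reward + gamma * sum(child_values) / len(child_values)
    return abs(value - backup)


def action_value_variance(q_values):
    """Population variance of the action-value estimates at a node."""
    if len(q_values) < 2:
        return 0.0  # variance is undefined for fewer than two estimates
    return pvariance(q_values)


def composite_uncertainty(value, reward, child_values, q_values,
                          gamma=1.0, alpha=1.0, beta=1.0):
    """Weighted sum of both signals; alpha and beta are hypothetical weights."""
    return (alpha * bellman_error(value, reward, child_values, gamma)
            + beta * action_value_variance(q_values))
```

A node whose own value estimate disagrees with its children, or whose actions lead to very different outcomes, scores high and is a natural candidate for more rollouts.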

Method

ATPO runs an adaptive tree search. For each node it computes a composite uncertainty metric (Bellman error plus action-value variance) that determines where the tree is expanded. Uncertainty-guided pruning and asynchronous execution with KV cache reuse keep the search efficient.
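
One way to turn per-node uncertainty scores into a rollout budget is proportional allocation with a guaranteed minimum per node. The sketch below assumes that scheme; the source does not state the paper's allocation rule, and the function name, the min_per_node floor, and the largest-remainder rounding are all choices made for this example.

```python
def allocate_rollouts(uncertainties, total_budget, min_per_node=1):
    """Split an integer rollout budget across nodes in proportion to uncertainty.

    Hypothetical scheme: every node receives min_per_node rollouts, and the
    remainder is shared proportionally, rounded by the largest-remainder method.
    """
    n = len(uncertainties)
    if n == 0:
        return []
    if total_budget < n * min_per_node:
        raise ValueError("budget too small for the per-node minimum")

    remaining = total_budget - n * min_per_node
    total_u = sum(uncertainties)
    if total_u <= 0:
        # No uncertainty signal: fall back to a uniform split.
        shares = [remaining / n] * n
    else:
        shares = [remaining * u / total_u for u in uncertainties]

    alloc = [min_per_node + int(s) for s in shares]
    # Hand out rollouts lost to rounding, largest fractional part first.
    leftover = total_budget - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                   reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc
```

With uncertainties of 1.0, 3.0, and 6.0 and a budget of 13, the three nodes receive 2, 4, and 7 rollouts. The per-node minimum keeps low-uncertainty states from being starved entirely, which is a design choice of this sketch rather than something the source confirms.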

In practice

Implement uncertainty-guided pruning to reduce computational costs in tree-based RL.
Utilize KV cache reuse and asynchronous execution for improved inference throughput.
Apply hierarchical MDPs for complex, long-horizon multi-turn interactions.
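
The first practice above, uncertainty-guided pruning, might look like the following sketch. It assumes that nodes with low uncertainty are the ones dropped, since the rollout budget is directed at high-uncertainty states; the source does not spell out the pruning criterion. The threshold, the keep_at_least guard, and the (node_id, uncertainty) tuple layout are hypothetical.

```python
def prune_frontier(nodes, threshold, keep_at_least=1):
    """Keep frontier nodes whose uncertainty meets the threshold.

    nodes: list of (node_id, uncertainty) pairs (hypothetical layout).
    Returns the kept pairs sorted from most to least uncertain. If too few
    nodes pass the threshold, the top keep_at_least nodes are kept so the
    search never stalls on an empty frontier.
    """
    ranked = sorted(nodes, key=lambda pair: pair[1], reverse=True)
    kept = [pair for pair in ranked if pair[1] >= threshold]
    if len(kept) < keep_at_least:
        kept = ranked[:keep_at_least]
    return kept


frontier = [("a", 0.05), ("b", 0.40), ("c", 0.20)]
survivors = prune_frontier(frontier, threshold=0.1)
```

Here node "a" is pruned and the search continues from "b" and "c". Every pruned node is a subtree that needs no further rollouts, which is where the cost saving comes from.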

Topics

Adaptive Tree Policy Optimization
Multi-Turn Medical Dialogue
Reinforcement Learning
Large Language Models
Hierarchical MDP

Code references

Quark-Medical/ATPO

Best for: NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.