ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue
Summary
The Adaptive Tree Policy Optimization (ATPO) algorithm is introduced to enhance information seeking in multi-turn medical dialogues. It targets challenges such as long-horizon credit assignment in GRPO and unstable value estimation in PPO. ATPO formulates these interactions as a Hierarchical Markov Decision Process (H-MDP) and adaptively allocates rollout budgets to states with high uncertainty, quantified by Bellman error and action-value variance. This strategy improves value estimation and fosters efficient exploration. To mitigate computational costs, ATPO incorporates uncertainty-guided pruning and an asynchronous search architecture that reuses the KV cache. In experiments on the MedQA, MedMCQA, and MedicalExam datasets, ATPO outperforms baselines such as TreePO and GRPO, with higher sample efficiency and deeper, more balanced exploration. A Qwen3-8B model trained with ATPO surpasses GPT-4o by 0.92% accuracy on MedQA.
Key takeaway
For NLP engineers developing conversational AI for healthcare, ATPO offers a method for improving diagnostic accuracy in multi-turn medical dialogues. Consider combining adaptive tree search with uncertainty-aware budget allocation and KV cache reuse to improve model performance and sample efficiency. In the reported experiments, this approach enabled the smaller Qwen3-8B to outperform GPT-4o, a larger general-purpose LLM, on MedQA.
Key insights
ATPO uses uncertainty-aware adaptive tree search and efficient execution to optimize multi-turn medical dialogue LLMs.
Principles
- Adaptive rollout budget allocation improves value estimation and exploration.
- Hierarchical MDPs are effective for multi-turn dialogue modeling.
- Uncertainty metrics (Bellman error, action-value variance) guide efficient tree search.
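To make the first and third principles concrete, here is a minimal Python sketch of scoring nodes by uncertainty and splitting a rollout budget accordingly. It is an illustration under stated assumptions, not the paper's implementation: the summary does not give exact formulas, so the one-step Bellman error, the unweighted sum with the variance, the proportional split, and all function names and parameters (`gamma`, `min_per_node`) are hypothetical choices.

```python
from statistics import pvariance


def bellman_error(value, reward, child_values, gamma=1.0):
    """Absolute gap between a node's value estimate and its one-step backup.

    Assumption: the backup averages the child value estimates.
    """
    if not child_values:
        return 0.0
    backup = reward + gamma * sum(child_values) / len(child_values)
    return abs(value - backup)


def composite_uncertainty(value, reward, child_values, gamma=1.0):
    """Bellman error plus the variance of the child action-value estimates.

    Assumption: the two terms are added without weighting.
    """
    variance = pvariance(child_values) if len(child_values) > 1 else 0.0
    return bellman_error(value, reward, child_values, gamma) + variance


def allocate_budget(uncertainties, total_budget, min_per_node=1):
    """Split an integer rollout budget across nodes in proportion to uncertainty."""
    n = len(uncertainties)
    if total_budget < n * min_per_node:
        raise ValueError("budget too small for the per-node minimum")
    budget = [min_per_node] * n
    remaining = total_budget - n * min_per_node
    total_u = sum(uncertainties)
    if total_u == 0:
        # No signal: fall back to an even split.
        shares = [remaining / n] * n
    else:
        shares = [remaining * u / total_u for u in uncertainties]
    floors = [int(s) for s in shares]
    for i, f in enumerate(floors):
        budget[i] += f
    # Hand out rollouts lost to rounding, largest fractional part first.
    leftover = max(0, remaining - sum(floors))
    order = sorted(range(n), key=lambda i: shares[i] - floors[i], reverse=True)
    for i in order[:leftover]:
        budget[i] += 1
    return budget
```

For example, `allocate_budget([3.0, 1.0, 0.0], total_budget=11)` returns `[7, 3, 1]`: every node keeps one rollout, and the remaining eight are split 6 to 2 in favor of the most uncertain node.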
Method
ATPO employs an adaptive tree search. For each node it calculates a composite uncertainty metric (Bellman error plus action-value variance) and uses that score to decide which nodes to expand. For efficiency, it adds uncertainty-guided pruning and asynchronous execution with KV cache reuse.
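The pruning step can be pictured with a short Python sketch. The summary does not specify the pruning rule, so this assumes one plausible reading: frontier nodes whose uncertainty falls below a threshold are dropped, and at most a fixed number of the most uncertain survivors are expanded. The `Node` class, `select_for_expansion`, `threshold`, and `max_keep` are all hypothetical names, and the uncertainty scores are taken as given.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A dialogue state in the search tree (hypothetical structure)."""
    name: str
    uncertainty: float
    children: list = field(default_factory=list)


def select_for_expansion(frontier, threshold, max_keep):
    """Return (kept, pruned): the most uncertain nodes and everything else.

    Assumption: low-uncertainty nodes are pruned because further rollouts
    there are unlikely to change the value estimate.
    """
    candidates = [n for n in frontier if n.uncertainty >= threshold]
    candidates.sort(key=lambda n: n.uncertainty, reverse=True)
    kept = candidates[:max_keep]
    kept_ids = {id(n) for n in kept}
    pruned = [n for n in frontier if id(n) not in kept_ids]
    return kept, pruned


# Usage with made-up dialogue states and scores.
frontier = [
    Node("ask_symptom_duration", 0.90),
    Node("ask_allergies", 0.05),
    Node("ask_medication_history", 0.40),
    Node("ask_family_history", 0.60),
]
kept, pruned = select_for_expansion(frontier, threshold=0.1, max_keep=2)
```

Here `kept` holds the two highest-scoring states and `pruned` holds the other two, so the rollout budget is spent only where the value estimate is least settled.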
In practice
- Implement uncertainty-guided pruning to reduce computational costs in tree-based RL.
- Utilize KV cache reuse and asynchronous execution for improved inference throughput.
- Apply hierarchical MDPs for complex, long-horizon multi-turn interactions.
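The benefit of KV cache reuse in tree-structured rollouts comes from shared prefixes: sibling branches repeat the same dialogue history, so its keys and values only need to be computed once. The toy Python sketch below counts token positions that need a forward pass with and without prefix sharing. It is an accounting illustration only, not an inference engine, and `tokens_to_compute` is a hypothetical helper rather than part of any library.

```python
def tokens_to_compute(sequences):
    """Count token positions processed without and with prefix sharing.

    Each distinct prefix corresponds to one token position whose KV entries
    must be computed once; repeated prefixes are treated as cache hits.
    """
    without_reuse = sum(len(seq) for seq in sequences)
    seen_prefixes = set()
    with_reuse = 0
    for seq in sequences:
        for end in range(1, len(seq) + 1):
            prefix = tuple(seq[:end])
            if prefix not in seen_prefixes:
                seen_prefixes.add(prefix)
                with_reuse += 1
    return without_reuse, with_reuse


# Three rollouts branching from a shared history (token IDs are made up).
rollouts = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [1, 2, 6],
]
naive, shared = tokens_to_compute(rollouts)
```

In this example `naive` is 11 and `shared` is 6, because the common prefix `[1, 2, 3]` is processed once instead of three times. Real serving stacks implement this with prefix caching at the attention level; the sketch only shows why the savings grow with tree depth and branching.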
Topics
- Adaptive Tree Policy Optimization
- Multi-Turn Medical Dialogue
- Reinforcement Learning
- Large Language Models
- Hierarchical MDP
Best for: NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.