ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue

· Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Adaptive Tree Policy Optimization (ATPO) is an algorithm for improving information seeking in multi-turn medical dialogues. It targets two known weaknesses of existing methods: long-horizon credit assignment in GRPO and unstable value estimation in PPO. ATPO formulates the interaction as a Hierarchical Markov Decision Process (H-MDP) and adaptively allocates rollout budget to states with high uncertainty, quantified by Bellman error and action-value variance. This improves value estimation and encourages efficient exploration. To contain computational cost, ATPO adds uncertainty-guided pruning and an asynchronous search architecture that reuses the KV cache. In experiments on the MedQA, MedMCQA, and MedicalExam datasets, a Qwen3-8B model trained with ATPO surpasses GPT-4o by 0.92% in accuracy on MedQA, and ATPO shows higher sample efficiency and deeper, more balanced exploration than baselines such as TreePO and GRPO.
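The budget allocation idea in the summary can be illustrated with a minimal sketch. It assumes each candidate state already has a non-negative uncertainty score and splits a fixed rollout budget in proportion to those scores. The function name `allocate_rollouts`, the `min_per_node` floor, and the largest-remainder rounding are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict


def allocate_rollouts(
    uncertainty: Dict[str, float],
    total_budget: int,
    min_per_node: int = 1,
) -> Dict[str, int]:
    """Split `total_budget` rollouts across nodes in proportion to uncertainty.

    Hypothetical helper: scores are assumed non-negative, and every node
    receives at least `min_per_node` rollouts (an illustrative choice).
    """
    if not uncertainty:
        return {}
    n = len(uncertainty)
    if total_budget < min_per_node * n:
        raise ValueError("total_budget is too small for the per-node minimum")

    budget = {node: min_per_node for node in uncertainty}
    remaining = total_budget - min_per_node * n
    total_u = sum(uncertainty.values())

    if total_u <= 0:
        # No uncertainty signal: fall back to a uniform split.
        shares = {node: remaining / n for node in uncertainty}
    else:
        shares = {node: remaining * u / total_u for node, u in uncertainty.items()}

    # Largest-remainder rounding so the integer budgets sum to total_budget.
    floors = {node: int(share) for node, share in shares.items()}
    for node, floor in floors.items():
        budget[node] += floor
    leftover = remaining - sum(floors.values())
    by_fraction = sorted(shares, key=lambda k: shares[k] - floors[k], reverse=True)
    for node in by_fraction[:leftover]:
        budget[node] += 1
    return budget


example = allocate_rollouts({"a": 3.0, "b": 1.0, "c": 0.0}, total_budget=11)
print(example)
```

In this toy example the most uncertain state `a` receives most of the rollouts, while the zero-uncertainty state `c` keeps only its minimum.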

Key takeaway

For NLP engineers building conversational AI for healthcare, ATPO offers a way to improve diagnostic accuracy in multi-turn medical dialogues. Consider combining adaptive tree search with uncertainty-aware budget allocation and KV cache reuse to improve model performance and sample efficiency. In the reported experiments, this approach let a smaller model, Qwen3-8B, outperform the larger, general-purpose GPT-4o on MedQA.

Key insights

ATPO uses uncertainty-aware adaptive tree search and efficient execution to optimize multi-turn medical dialogue LLMs.

Method

ATPO runs an adaptive tree search in which a composite uncertainty metric (Bellman error plus action-value variance) determines which nodes to expand. Uncertainty-guided pruning and asynchronous execution with KV cache reuse keep the search efficient.
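A minimal sketch of how such a composite score might be computed and used for pruning follows. The source states only that the metric combines Bellman error and action-value variance. The weighted sum with `alpha` and `beta`, the one-step target built from the mean child value, the `Node` structure, and the threshold rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from statistics import fmean, pvariance
from typing import List


@dataclass
class Node:
    """Hypothetical tree node holding the quantities the metric needs."""
    value: float                      # current estimate V(s)
    reward: float = 0.0               # immediate reward observed at s
    child_q: List[float] = field(default_factory=list)  # sampled Q(s, a)


def composite_uncertainty(
    node: Node, gamma: float = 1.0, alpha: float = 1.0, beta: float = 1.0
) -> float:
    """Weighted sum of Bellman error and action-value variance (assumed form)."""
    if not node.child_q:
        # Assumption: an unexplored node is treated as maximally uncertain.
        return float("inf")
    # One-step target from the mean child value (an illustrative choice).
    target = node.reward + gamma * fmean(node.child_q)
    bellman_error = abs(node.value - target)
    q_variance = pvariance(node.child_q)
    return alpha * bellman_error + beta * q_variance


def select_for_expansion(nodes: List[Node], threshold: float) -> List[Node]:
    """Keep nodes whose uncertainty meets the threshold; the rest are pruned."""
    return [n for n in nodes if composite_uncertainty(n) >= threshold]


settled = Node(value=0.5, child_q=[0.5, 0.5])    # consistent estimates
disputed = Node(value=0.5, child_q=[0.0, 1.0])   # children disagree
biased = Node(value=1.0, child_q=[0.0, 0.0])     # value far from its target
kept = select_for_expansion([settled, disputed, biased], threshold=0.1)
print(len(kept))
```

Here the `settled` node is pruned because its value estimate matches its children and they agree with one another, while the other two remain candidates for further rollouts.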

Best for: NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.