Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models
Summary
Manifold Bandits introduces Bayesian Manifold Curriculum (BMC), a framework for improving reinforcement learning (RL) efficiency in large language models (LLMs) by treating problem sampling as a manifold-structured bandit problem. Traditional adaptive curriculum methods often prioritize problems of intermediate difficulty and neglect the inherent structure and changing nature of the task space. BMC constructs "Latent Task Trees" directly from LLM embeddings, giving a hierarchical representation of how problems relate to one another. It then applies Bayesian learning to guide sampling, balancing learning signal (productivity) against coverage of the task manifold (diversity). Empirical evaluations on the DAPO-Math-17K dataset, using the Qwen3-4B-Base and Qwen3-8B-Base models, show that BMC learns about as fast as "Difficulty Only" methods while significantly improving diversity and performing strongly on out-of-domain benchmarks such as GPQA-Diamond. The framework also exposes tradeoffs between productivity, diversity, and evaluation relevance (utility), which motivate BMC-T, a utility-aware extension that biases sampling toward target-relevant regions.
Key takeaway
For AI Scientists and ML Engineers optimizing LLM reinforcement learning, relying solely on prompt difficulty for curriculum design risks narrow skill development. Implement structure-aware methods such as Bayesian Manifold Curriculum (BMC) to explicitly balance learning signal, task diversity, and evaluation relevance. Consider BMC-T when you need to bias training toward specific target distributions, so that the model is more likely to generalize across heterogeneous problem types and benchmarks.
Key insights
LLM RL training benefits from structured problem sampling that balances productivity, diversity, and evaluation utility.
Principles
- LLM training efficiency requires informative problem sampling.
- Task spaces have latent geometric structure.
- Curriculum design balances productivity, diversity, and utility.
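To make the idea of latent structure concrete, here is a minimal sketch of deriving a hierarchy from problem embeddings. The summary does not say how Latent Task Trees are built, so this substitutes a simple kd-tree-style median split as a stand-in. The function name `build_tree`, the `min_leaf` parameter, and the nested-dict layout are all hypothetical, and the toy 2-D vectors stand in for real policy embeddings.

```python
import statistics

def build_tree(ids, embeddings, min_leaf=2):
    """Recursively bisect problems along the highest-variance embedding axis.

    Stand-in construction (kd-tree-style), not the paper's procedure.
    Internal nodes are {"children": [...]}; leaves are {"problems": [...]}.
    """
    if len(ids) <= max(min_leaf, 1):
        return {"problems": list(ids)}
    dims = range(len(embeddings[ids[0]]))
    # Choose the coordinate along which these problems are most spread out.
    axis = max(
        dims,
        key=lambda d: statistics.pvariance([embeddings[i][d] for i in ids]),
    )
    # Split at the median so both halves are non-empty and roughly balanced.
    ordered = sorted(ids, key=lambda i: embeddings[i][axis])
    mid = len(ordered) // 2
    return {
        "children": [
            build_tree(ordered[:mid], embeddings, min_leaf),
            build_tree(ordered[mid:], embeddings, min_leaf),
        ]
    }

# Toy example: two tight groups of problems in a 2-D embedding space.
toy_embeddings = {
    "a": [0.0, 0.0],
    "b": [0.1, 0.0],
    "c": [5.0, 0.1],
    "d": [5.1, 0.0],
}
toy_tree = build_tree(list(toy_embeddings), toy_embeddings, min_leaf=2)
```

In the toy example the first split separates the two groups, so nearby problems end up under the same subtree. A real pipeline would likely use a proper clustering method on high-dimensional embeddings.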
Method
BMC builds Latent Task Trees from policy embeddings, then uses hierarchical Thompson sampling with Bayesian belief updates and empirical Bayes propagation for problem selection.
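As a rough illustration of hierarchical Thompson sampling, the sketch below descends a task tree by drawing from a Beta posterior at each child and following the largest draw, then updates the beliefs along the chosen path. It is a simplified stand-in rather than the paper's algorithm. The `Node` class, the binary `productive` signal, and the uniform Beta(1, 1) priors are assumptions, and the empirical Bayes propagation step is not reproduced here.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical task-tree node; leaves carry problem ids."""
    name: str
    children: list["Node"] = field(default_factory=list)
    problems: list[str] = field(default_factory=list)
    alpha: float = 1.0  # Beta pseudo-count of productive outcomes (assumed prior)
    beta: float = 1.0   # Beta pseudo-count of unproductive outcomes (assumed prior)

def select_leaf(root, rng):
    """Walk from root to a leaf, Thompson-sampling among children at each level."""
    path = [root]
    node = root
    while node.children:
        # One posterior draw per child; follow the child with the largest draw.
        node = max(node.children, key=lambda c: rng.betavariate(c.alpha, c.beta))
        path.append(node)
    return path

def update(path, productive):
    """Update every node on the path with a binary outcome (a simplification)."""
    for node in path:
        if productive:
            node.alpha += 1.0
        else:
            node.beta += 1.0

# Toy run: subtree "a" always yields learning signal, subtree "b" never does.
rng = random.Random(0)
root = Node("root", children=[Node("a", problems=["p1"]), Node("b", problems=["p2"])])
for _ in range(200):
    path = select_leaf(root, rng)
    update(path, productive=(path[-1].name == "a"))
```

After a few hundred rounds the sampler concentrates on the productive subtree while still occasionally revisiting the other one, which is the exploration behaviour Thompson sampling is generally used for.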
In practice
- Organize training data using policy embeddings.
- Balance learning signal with task manifold coverage.
- Bias sampling towards target evaluation distributions.
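One possible way to bias sampling toward a target evaluation distribution is to weight each region of the task space by its embedding similarity to the target set. The sketch below does this with a softmax over cosine similarities. The helper names, the `temperature` parameter, and the use of centroids are assumptions made for illustration, not details taken from BMC-T.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def utility_weights(cluster_centroids, target_embeddings, temperature=0.5):
    """Softmax over each cluster's cosine similarity to the target centroid."""
    target = centroid(target_embeddings)
    sims = [cosine(c, target) for c in cluster_centroids]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def sample_cluster(weights, rng):
    """Draw one cluster index in proportion to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Toy example: the target set sits close to the first cluster.
weights = utility_weights(
    cluster_centroids=[[1.0, 0.0], [0.0, 1.0]],
    target_embeddings=[[0.9, 0.1], [1.0, 0.0]],
)
chosen = sample_cluster(weights, random.Random(0))
```

A lower `temperature` concentrates sampling on the most target-like clusters, while a higher one keeps more of the manifold in play, which mirrors the utility versus diversity tradeoff described above.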
Topics
- LLM Reinforcement Learning
- Curriculum Learning
- Manifold Bandits
- Latent Task Trees
- Bayesian Sampling
- Training Efficiency
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.