Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

2026-05-04 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

Manifold Bandits introduces Bayesian Manifold Curriculum (BMC), a framework for improving the efficiency of reinforcement learning (RL) in large language models (LLMs) by treating problem sampling as a manifold-structured bandit problem. Traditional adaptive curriculum methods often prioritize problems of intermediate difficulty and neglect the inherent structure and dynamic nature of the task space. BMC constructs "Latent Task Trees" directly from LLM embeddings, giving a hierarchical representation of how problems relate to one another. It then applies Bayesian learning to guide sampling, balancing learning signal (productivity) against coverage of the task manifold (diversity). Empirical evaluations on the DAPO-Math-17K dataset, using the Qwen3-4B-Base and Qwen3-8B-Base models, show that BMC learns about as fast as "Difficulty Only" methods while significantly improving diversity and performing strongly on out-of-domain benchmarks such as GPQA-Diamond. The framework also exposes tradeoffs among productivity, diversity, and evaluation relevance (utility), which motivate BMC-T, a utility-aware extension that biases sampling toward target-relevant regions.

Key takeaway

For AI scientists and ML engineers optimizing LLM reinforcement learning, designing a curriculum around prompt difficulty alone risks narrow skill development. Consider structure-aware methods such as Bayesian Manifold Curriculum (BMC), which explicitly balance learning signal, task diversity, and evaluation relevance. Where a specific target distribution matters, BMC-T biases training toward it, which helps the model generalize across heterogeneous problem types and benchmarks.

Key insights

LLM RL training benefits from structured problem sampling that balances productivity, diversity, and evaluation utility.

Principles

LLM training efficiency requires informative problem sampling.
Task spaces have latent geometric structure.
Curriculum design balances productivity, diversity, and utility.

Method

BMC builds Latent Task Trees from policy embeddings, then uses hierarchical Thompson sampling with Bayesian belief updates and empirical Bayes propagation for problem selection.
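
The selection loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the `Node` class, the Beta-Bernoulli belief over a binary "informative rollout" signal, and the discounted ancestor update, which is only a simple stand-in for empirical Bayes propagation.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical task tree; leaves hold problem ids."""
    name: str
    children: list["Node"] = field(default_factory=list)
    problems: list[str] = field(default_factory=list)
    alpha: float = 1.0  # Beta pseudo-count for informative rollouts (assumed model)
    beta: float = 1.0   # Beta pseudo-count for uninformative rollouts

    def is_leaf(self) -> bool:
        return not self.children


def select_problem(root: Node, rng: random.Random) -> tuple[list[Node], str]:
    """Descend from the root, Thompson-sampling one child at each level."""
    path = [root]
    node = root
    while not node.is_leaf():
        # Draw one sample from each child's posterior and follow the largest draw.
        node = max(node.children, key=lambda c: rng.betavariate(c.alpha, c.beta))
        path.append(node)
    # Within the chosen leaf, pick a problem uniformly at random.
    return path, rng.choice(node.problems)


def update(path: list[Node], informative: bool, discount: float = 0.5) -> None:
    """Update the leaf with full weight and each ancestor with a smaller weight.

    The geometric discount is an illustrative substitute for empirical Bayes
    propagation, not the scheme used in the paper.
    """
    weight = 1.0
    for node in reversed(path):
        if informative:
            node.alpha += weight
        else:
            node.beta += weight
        weight *= discount
```

In a training loop one would call `select_problem`, run rollouts on the returned problem, decide whether they carried learning signal (for example, mixed successes and failures), and pass that flag to `update` along with the returned path. Sampling from the posterior rather than taking its mean is what keeps under-explored branches in play.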

In practice

Organize training data using policy embeddings.
Balance learning signal with task manifold coverage.
Bias sampling towards target evaluation distributions.
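
As a rough sketch of the first and third points, the code below groups problems by recursively bisecting their embeddings and then weights the resulting groups by similarity to a target set. The bisecting heuristic, the cosine-softmax weighting, the `temperature` parameter, and all function names are assumptions chosen for brevity; the source does not specify how Latent Task Trees are built or how BMC-T computes its bias.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def build_tree(ids, embeddings, min_size=2):
    """Recursively bisect problems into a nested dict (a toy stand-in for a task tree)."""
    vectors = [embeddings[i] for i in ids]
    node = {"ids": list(ids), "centroid": centroid(vectors), "children": []}
    if len(ids) <= min_size:
        return node
    # Seed the split with the point least similar to the centroid, then with
    # the point least similar to that first seed.
    seed_a = min(ids, key=lambda i: cosine(embeddings[i], node["centroid"]))
    seed_b = min(ids, key=lambda i: cosine(embeddings[i], embeddings[seed_a]))
    left = [i for i in ids
            if cosine(embeddings[i], embeddings[seed_a])
            >= cosine(embeddings[i], embeddings[seed_b])]
    left_set = set(left)
    right = [i for i in ids if i not in left_set]
    # Only recurse when the split actually separates the points.
    if left and right:
        node["children"] = [build_tree(left, embeddings, min_size),
                            build_tree(right, embeddings, min_size)]
    return node


def target_weights(nodes, target_embeddings, temperature=0.1):
    """Softmax over cosine similarity between node centroids and the target centroid."""
    target = centroid(target_embeddings)
    scores = [cosine(n["centroid"], target) / temperature for n in nodes]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Here `embeddings` is a hypothetical mapping from problem id to embedding vector, and `target_embeddings` is a list of vectors for problems drawn from the evaluation distribution of interest. A lower `temperature` concentrates sampling on the regions closest to the target, which trades away diversity for utility.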

Topics

LLM Reinforcement Learning
Curriculum Learning
Manifold Bandits
Latent Task Trees
Bayesian Sampling
Training Efficiency

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.