Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk
Summary
The paper introduces Nested Contextual Causal Bandits (NCCBs), a problem class for critical sequential decisions in which strategic choices causally shape subsequent tactical ones within a hierarchical Structural Causal Model (SCM). To solve it, the authors propose Nested Causal Thompson Sampling (NCTS), which draws one mechanism-factorised belief per episode and then acts recursively. The key theoretical contribution is a causal PAC-Bayesian excess-risk bound that certifies any candidate deployment policy off-policy, and at any time, from historical data. In the reported experiments, NCTS's factorised SCM-mechanism posterior achieves significantly better zero-shot transfer under exogenous distribution shift than RFF-GP joint regression, the recursive meta-to-inner commit dominates joint-commit alternatives, and the certificate contracts as offline data accumulates. These findings support "progressive certified handover", a safe-deployment method in which each timescale independently switches from a legacy controller to NCTS once its gain is certified.
Key takeaway
For AI Scientists designing sequential decision systems with hierarchical causal dependencies, Nested Causal Thompson Sampling (NCTS) offers certified policy optimization and improved zero-shot transfer under exogenous distribution shift. Consider its progressive certified handover method for moving off legacy controllers safely: each timescale switches independently, and only once its performance gain has been certified. This reduces deployment risk in complex, multi-timescale environments.
Key insights
Nested Causal Thompson Sampling (NCTS) offers certified policy optimization for hierarchical causal decision-making under distribution shifts.
Principles
- Causal coupling between timescales requires specialized bandit theory.
- Factorised SCM-mechanism posteriors improve zero-shot transfer.
- PAC-Bayesian bounds enable off-policy certification of deployment policies.
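For orientation, the classical (non-causal) McAllester-style PAC-Bayes bound is written out below. It is not the paper's causal excess-risk bound, which this summary does not reproduce; it only illustrates the general shape of such certificates. The symbols are assumptions introduced here: a prior \pi fixed before seeing data, any posterior \rho, a loss bounded in [0, 1], n i.i.d. samples, and a confidence level 1 - \delta.

```latex
% With probability at least 1 - \delta over the sample, simultaneously for all posteriors \rho:
\mathbb{E}_{h \sim \rho}\bigl[L(h)\bigr]
  \;\le\;
\mathbb{E}_{h \sim \rho}\bigl[\hat{L}_n(h)\bigr]
  + \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

The slack term shrinks roughly as 1/\sqrt{n}, which is the generic reason a certificate of this kind tightens as more offline data accumulates.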
Method
NCTS draws one mechanism-factorised belief per episode and acts recursively. Progressive certified handover allows independent timescale transitions from legacy controllers to NCTS when gains are certifiable.
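The "one draw per episode, then act recursively" idea can be sketched on a toy problem. The example below is a minimal illustration under strong assumptions, not the paper's algorithm: two levels only, a binary intermediate variable Z caused by the meta (strategic) action, Bernoulli rewards, and independent Beta posteriors standing in for the mechanism-factorised belief. All names (`NestedTS`, `run`, the true probabilities) are hypothetical.

```python
import numpy as np


class NestedTS:
    """Toy two-level Thompson sampler with one Beta posterior per mechanism."""

    def __init__(self, n_meta, n_inner, seed=0):
        self.rng = np.random.default_rng(seed)
        # Beta(1, 1) priors; last axis holds (alpha, beta) pseudo-counts.
        self.z_post = np.ones((n_meta, 2))      # mechanism P(Z=1 | meta action a)
        self.r_post = np.ones((2, n_inner, 2))  # mechanism P(R=1 | z, inner action b)

    def draw(self):
        """Sample every mechanism once; the same draw is used for the whole episode."""
        pz = self.rng.beta(self.z_post[:, 0], self.z_post[:, 1])
        pr = self.rng.beta(self.r_post[..., 0], self.r_post[..., 1])
        return pz, pr

    @staticmethod
    def meta_action(pz, pr):
        """Pick the meta action assuming the inner level will act greedily afterwards."""
        best_inner = pr.max(axis=1)  # best sampled reward for z=0 and z=1
        value = (1 - pz) * best_inner[0] + pz * best_inner[1]
        return int(np.argmax(value))

    @staticmethod
    def inner_action(pr, z):
        """Pick the inner action given the observed z, under the same draw."""
        return int(np.argmax(pr[z]))

    def update(self, a, z, b, r):
        """Each mechanism is updated only from the variables it governs."""
        self.z_post[a, 0 if z == 1 else 1] += 1
        self.r_post[z, b, 0 if r == 1 else 1] += 1


def run(n_episodes=2000, seed=0):
    """Simulate the agent on a made-up environment; returns (agent, mean reward)."""
    env_rng = np.random.default_rng(seed + 1)
    true_pz = np.array([0.2, 0.8])                # hypothetical P(Z=1 | a)
    true_pr = np.array([[0.3, 0.4], [0.5, 0.9]])  # hypothetical P(R=1 | z, b)
    agent = NestedTS(n_meta=2, n_inner=2, seed=seed)
    total = 0
    for _ in range(n_episodes):
        pz, pr = agent.draw()
        a = agent.meta_action(pz, pr)
        z = int(env_rng.random() < true_pz[a])
        b = agent.inner_action(pr, z)
        r = int(env_rng.random() < true_pr[z, b])
        agent.update(a, z, b, r)
        total += r
    return agent, total / n_episodes
```

Calling `run()` returns the trained agent and its average reward. Because each mechanism keeps its own posterior, a change in one mechanism (for example, how the meta action influences Z) would leave the posterior over the reward mechanism reusable, which is the intuition behind factorised beliefs in this toy setting.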
In practice
- Apply NCTS to sequential decisions with hierarchical causal dependencies.
- Use certified handover for safe, incremental system transitions.
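A per-timescale handover rule might look like the sketch below. It is a simplified illustration, not the paper's procedure: it substitutes the classical McAllester-style PAC-Bayes bound for the paper's causal, anytime certificate, assumes i.i.d. losses already scaled to [0, 1], and treats the legacy controller's risk as a known baseline. The function names, the dictionary layout, and all numbers are hypothetical.

```python
import math


def pac_bayes_upper_bound(emp_risk, kl, n, delta=0.05):
    """Classical McAllester-style upper bound on true risk for losses in [0, 1].

    emp_risk: empirical risk of the candidate policy on n samples
    kl:       KL divergence between posterior and prior (assumed given)
    """
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0 and 0 < delta < 1")
    slack = math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    return min(1.0, emp_risk + slack)


def handover(timescales, legacy_risk, delta=0.05):
    """Decide, independently per timescale, whether the candidate may take over.

    A timescale switches only if its certified risk bound beats the legacy risk.
    delta is split across timescales (union bound) so all certificates hold jointly.
    """
    per_delta = delta / len(timescales)
    decisions = {}
    for name, stats in timescales.items():
        bound = pac_bayes_upper_bound(
            stats["emp_risk"], stats["kl"], stats["n"], per_delta
        )
        decisions[name] = "NCTS" if bound < legacy_risk[name] else "legacy"
    return decisions


# Hypothetical data: the tactical level has far more logged decisions.
stats = {
    "strategic": {"emp_risk": 0.2, "kl": 1.0, "n": 100},
    "tactical": {"emp_risk": 0.2, "kl": 1.0, "n": 10_000},
}
decisions = handover(stats, legacy_risk={"strategic": 0.3, "tactical": 0.3})
```

With these made-up numbers, both levels have the same empirical risk, but only the tactical level has enough data for its bound to fall below the legacy baseline. The strategic level stays on the legacy controller until more data tightens its certificate.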
Topics
- Nested Causal Bandits
- Thompson Sampling
- PAC-Bayes Risk
- Causal Inference
- Policy Optimization
- Distribution Shift
- Sequential Decision Making
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.