Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models
Summary
Reward Weighted Classifier-Free Guidance (RCFG) is a test-time guidance technique for autoregressive models that generate outputs such as answers or molecules. Each output y is characterized by an attribute vector (for example helpfulness and harmlessness, or bio-availability and lipophilicity), and an arbitrary reward function r(y) encodes the tradeoff among those attributes. Standard reinforcement learning must re-train the model whenever the reward function changes. RCFG instead acts as a policy improvement operator at test time: it approximates tilting the sampling distribution by the Q function. Applied to molecular generation, the method optimized novel reward functions without re-training. Using RCFG as a teacher and distilling it into the base policy also significantly accelerates convergence of standard reinforcement learning.
Key takeaway
For research scientists working with autoregressive models, RCFG is an alternative to re-running reinforcement learning each time the reward function changes. It can steer model outputs toward a new reward at test time, for example in molecular design, without re-training. RCFG can also serve as a teacher: distilling it into the base policy gives standard RL training a warm start and significantly accelerates convergence.
Key insights
RCFG enables autoregressive models to optimize new reward functions at test time without re-training.
Principles
- Policy improvement can occur at test time.
- Distillation from RCFG speeds up RL convergence.
Method
RCFG approximates tilting an autoregressive model's sampling distribution by the Q function to optimize arbitrary reward functions r(y) at test time, avoiding re-training.
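To make "tilting by the Q function" concrete, here is a minimal sketch of the exact tilt for a single decoding step: the new distribution is proportional to the base probability times exp(Q / beta). This illustrates the target that RCFG is described as approximating, not the RCFG procedure itself. The function names, the temperature parameter beta, and the assumption that per-token Q values are available are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tilt_by_q(base_logits, q_values, beta=1.0):
    """Return next-token probabilities proportional to pi(a|s) * exp(Q(s,a) / beta).

    base_logits: logits of the base policy for one decoding step.
    q_values:    assumed per-token Q estimates (hypothetical input).
    beta:        assumed temperature; larger values stay closer to the base policy.
    """
    if len(base_logits) != len(q_values):
        raise ValueError("base_logits and q_values must have the same length")
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Adding Q / beta in logit space multiplies each probability by exp(Q / beta).
    tilted_logits = [l + q / beta for l, q in zip(base_logits, q_values)]
    return softmax(tilted_logits)


if __name__ == "__main__":
    # A uniform base policy over three tokens, with Q favoring the last token.
    print(tilt_by_q([0.0, 0.0, 0.0], [0.0, 1.0, 2.0], beta=1.0))
```

With all Q values equal, the tilt leaves the base distribution unchanged, so the operator only shifts probability mass where the reward signal distinguishes between tokens.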
In practice
- Optimize molecular generation for new properties.
- Warm-start RL training with RCFG distillation.
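The warm-start idea can be sketched as a per-token distillation loss between a guided teacher distribution and the base policy's logits. The summary does not state which distillation objective is used, so the KL divergence below is an assumption, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def distillation_kl(teacher_probs, student_logits):
    """KL(teacher || student) for one decoding step.

    teacher_probs:  probabilities from the guided (teacher) sampler.
    student_logits: logits of the base policy being trained.
    Assumed objective: the source does not specify the loss.
    """
    student_log_probs = log_softmax(student_logits)
    # Tokens with zero teacher probability contribute nothing to the sum.
    return sum(
        t * (math.log(t) - lp)
        for t, lp in zip(teacher_probs, student_log_probs)
        if t > 0.0
    )


if __name__ == "__main__":
    teacher = [0.7, 0.2, 0.1]
    print(distillation_kl(teacher, [0.0, 0.0, 0.0]))
```

Minimizing this loss over sampled prefixes pulls the base policy toward the guided distribution before standard RL training begins.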
Topics
- Autoregressive Models
- Reward Weighted Classifier-Free Guidance
- Policy Improvement
- Reinforcement Learning
- Molecular Generation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.