Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models
Summary
Reward Weighted Classifier-Free Guidance (RCFG) is a policy improvement operator for autoregressive models that optimizes arbitrary reward functions at test time, without retraining. It generalizes classifier-free guidance to non-binary, multi-attribute reward functions, approximating a tilt of the sampling distribution by the Q-function. Applied to molecular generation, RCFG optimizes novel reward functions at inference time and achieves reward gains comparable to extensive reinforcement learning (RL) while preserving generation diversity. RCFG can also serve as a "privileged information" teacher for self-distillation; used as a warm start, the distilled policy significantly accelerates the convergence of standard RL training. The approach was validated with a Qwen3-0.6B-Base model fine-tuned on SMILES strings and 25 molecular properties, optimizing 24 distinct reward functions.
Key takeaway
For machine learning engineers building generative models that must track evolving reward functions, RCFG offers an inference-time alternative to retraining. Applying RCFG lets you optimize novel reward functions on the fly, which is particularly useful in domains like drug discovery where property tradeoffs change frequently. Consider using RCFG-distilled policies as warm starts to accelerate your reinforcement learning pipelines and reduce computational overhead.
Key insights
RCFG enables test-time optimization of autoregressive models for arbitrary reward functions without retraining.
Principles
- RCFG approximates Q-function-based policy improvement.
- Reward functions can change without requiring model retraining.
- Distillation from RCFG can warm-start RL training.
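The first principle can be made concrete with a short derivation. This is a sketch under assumptions, not the paper's exact formulation: the notation (state s_t for the prefix, action a for the next token, attributes y of the finished sequence, reward r(y), temperature β, guidance set Y_S drawn from p(y | s_t)) is introduced here for illustration, and the paper's normalization and sampling scheme may differ. It only shows why a reward-weighted sum of conditional-to-unconditional ratios can stand in for the Q-function.

```latex
% Q-function as expected reward of the final attributes, under the base model
Q(s_t, a) = \mathbb{E}\left[ r(y) \mid s_t, a \right] = \sum_{y} r(y)\, p(y \mid s_t, a)

% Bayes' rule rewrites the posterior over attributes as an importance ratio
p(y \mid s_t, a) = p(y \mid s_t)\, \frac{p(a \mid s_t, y)}{p(a \mid s_t)}

% Monte Carlo estimate over a sampled guidance set Y_S
Q(s_t, a) = \mathbb{E}_{y \sim p(y \mid s_t)}\left[ r(y)\, \frac{p(a \mid s_t, y)}{p(a \mid s_t)} \right]
\approx \frac{1}{|Y_S|} \sum_{y_i \in Y_S} r(y_i)\, \frac{p(a \mid s_t, y_i)}{p(a \mid s_t)}

% Policy improvement by tilting the base policy with the estimated Q-function
\pi'(a \mid s_t) \propto p(a \mid s_t)\, \exp\left( Q(s_t, a) / \beta \right)
```

Because r(y) enters only as a weight in the sum, changing the reward function changes the weights but not the model, which is consistent with the second principle.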
Method
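Train a conditional autoregressive model on (x, y) pairs, where x is a sequence and y its attributes (in the molecular setting, a SMILES string and its properties). At inference, sample a guidance set Y_S and modify the next-token logits using a reward-weighted sum of importance ratios, which approximates Q-function tilting.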
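The logit modification described above can be sketched for a single decoding step. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the `beta` temperature, the averaging over the guidance set, and the choice to take each importance ratio as p(token | prefix, y_i) / p(token | prefix) are all hypothetical. The inputs are the model's unconditional next-token logits, one row of conditional logits per guidance attribute y_i in Y_S, and the reward of each y_i.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def rcfg_tilted_logits(uncond_logits, cond_logits, rewards, beta=1.0):
    """Return tilted next-token logits for one decoding step (illustrative).

    uncond_logits: logits of p(token | prefix), length V
    cond_logits:   one row per guidance attribute y_i, logits of p(token | prefix, y_i)
    rewards:       reward r(y_i) for each guidance attribute
    beta:          assumed temperature; smaller values tilt more aggressively
    """
    log_p = log_softmax(uncond_logits)
    log_p_cond = [log_softmax(row) for row in cond_logits]
    k = len(rewards)
    tilted = []
    for token, lp in enumerate(log_p):
        # Reward-weighted average of importance ratios: a Q-value estimate.
        q_hat = sum(
            r * math.exp(row[token] - lp) for r, row in zip(rewards, log_p_cond)
        ) / k
        # Tilt the base log-probability by the estimated Q-value.
        tilted.append(lp + q_hat / beta)
    return tilted


# Toy step with a 3-token vocabulary and two guidance attributes.
uncond = [0.0, 0.0, 0.0]                    # base model is indifferent
cond = [[0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]   # y_0 favors token 2, y_1 favors token 0
rewards = [1.0, 0.0]                        # only y_0 is rewarded
probs = [math.exp(v) for v in log_softmax(rcfg_tilted_logits(uncond, cond, rewards))]
print(probs)  # token 2 gains probability mass relative to the uniform base
```

Swapping in a new reward function only changes the `rewards` list; the conditional and unconditional logits come from the same frozen model. In a real decoder the cost is one extra forward pass per guidance attribute at each step, so the size of Y_S trades estimate quality against inference cost.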
In practice
- Optimize molecular properties at inference time.
- Speed up RL convergence with RCFG-distilled warm starts.
- Handle complex, multi-objective reward tradeoffs.
Topics
- Reward Weighted Classifier-Free Guidance
- Autoregressive Models
- Policy Improvement
- Molecular Generation
- Reinforcement Learning
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.