Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning), Life Sciences & Biology · Depth: Expert, extended

Summary

The paper introduces Reward Weighted Classifier-Free Guidance (RCFG) as a policy improvement operator for autoregressive models. RCFG optimizes arbitrary reward functions at test time, with no retraining. It generalizes classifier-free guidance from single binary conditions to non-binary, multi-attribute reward functions, and it approximates tilting the sampling distribution by the Q-function. The researchers applied RCFG to molecular generation, where it optimized novel reward functions at inference time and achieved reward gains comparable to extensive reinforcement learning (RL) while preserving generation diversity. RCFG can also act as a "privileged information" teacher for self-distillation: a policy distilled from RCFG, used as a warm start, accelerates the convergence of standard RL training. The approach was validated with a Qwen3-0.6B-Base model fine-tuned on SMILES strings and 25 molecular properties, optimizing 24 distinct reward functions.

Key takeaway

For machine learning engineers whose generative models must stay aligned to changing reward functions, RCFG offers an inference-time alternative to retraining. You can apply it to optimize a new reward function on the fly, which is particularly useful in domains like drug discovery where property tradeoffs change frequently. Consider using RCFG-distilled policies as warm starts for your reinforcement learning pipelines: the paper reports that they accelerate the convergence of standard RL training, which reduces computational cost.

Key insights

RCFG enables test-time optimization of autoregressive models for arbitrary reward functions without retraining.

Method

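Train a conditional autoregressive model on (x, y) pairs, where x is a sequence (here, a SMILES string) and y is its attribute values (here, molecular properties). At inference, sample a guidance set Y_S of attribute values and adjust the next-token logits with a reward-weighted sum of importance ratios, which approximates tilting the sampling distribution by the Q-function.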
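
The logit adjustment described above can be sketched for a single decoding step as follows. This is one plausible reading of the summary, not the paper's exact formulation. It assumes the importance ratio is the conditional next-token probability divided by the unconditional one, that the reward-weighted average of those ratios serves as a Q-value estimate, and that the tilt is exponential with a temperature beta. The names rcfg_logits, log_softmax, and beta are hypothetical.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def rcfg_logits(uncond_logits, cond_logits, rewards, beta=1.0):
    """Sketch of a reward-weighted guidance step (assumed form, see lead-in).

    uncond_logits: shape (V,), next-token logits without conditioning.
    cond_logits:   shape (K, V), next-token logits under each of the K
                   attribute vectors in the guidance set Y_S.
    rewards:       shape (K,), reward r(y) of each guidance attribute vector.
    beta:          assumed tilt temperature; smaller means stronger guidance.
    """
    rewards = np.asarray(rewards, dtype=float)
    log_p = log_softmax(np.asarray(uncond_logits, dtype=float))   # (V,)
    log_p_y = log_softmax(np.asarray(cond_logits, dtype=float))   # (K, V)

    # Importance ratio p(token | prefix, y) / p(token | prefix) for each y.
    ratios = np.exp(log_p_y - log_p)                              # (K, V)

    # Reward-weighted average of the ratios: a per-token Q-value estimate.
    q_hat = (rewards[:, None] * ratios).mean(axis=0)              # (V,)

    # Exponential tilt of the base distribution, expressed in logit space.
    return log_p + q_hat / beta

# Toy step: 3-token vocabulary, 2 guidance attribute vectors.
uncond = np.zeros(3)                                   # uniform base policy
cond = np.array([[2.0, 0.0, 0.0],                      # y_0 favors token 0
                 [0.0, 0.0, 2.0]])                     # y_1 favors token 2
rewards = np.array([1.0, 0.0])                         # only y_0 is rewarded
probs = np.exp(log_softmax(rcfg_logits(uncond, cond, rewards)))
print(probs)
```

In the toy step, token 0 ends up with more than its base probability of 1/3, because the only rewarded attribute vector favors it. In a real decoder the two sets of logits would come from forward passes of the same model with and without the attribute prompt, and the step would repeat for every generated token.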

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.