OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing
Summary
OmniOPSD is a Rationale-Privileged On-Policy Self-Distillation framework designed to overcome reward sparsity in reinforcement learning for multimodal large language models (MLLMs), particularly in complex human-centered affective computing tasks. It also addresses the high cost and difficulty of obtaining high-quality chain-of-thought (CoT) annotations. During training, OmniOPSD gives frontier-generated rationales to a local teacher as privileged evidence, rather than using them as direct imitation targets for the student model. The student generates its own trajectory from the multimodal input, and the teacher provides dense, token-level supervision on those same tokens, so the student learns on its own trajectory distribution. Inference with OmniOPSD does not require labels, rationales, CoT annotations, or access to closed-source models. The framework achieved state-of-the-art performance on MER-UniBench with an average score of 84.19.
Key takeaway
For Machine Learning Engineers developing multimodal large language models for affective computing, OmniOPSD targets two problems: reward sparsity and annotation cost. Consider this self-distillation framework when you want models that learn from their own trajectories, with rationales used as privileged evidence for the teacher rather than as direct imitation targets. At inference, the trained model needs no chain-of-thought annotations, rationales, or closed-source model access, which simplifies deployment.
Key insights
OmniOPSD uses rationale-privileged self-distillation to enable MLLMs to learn complex affective tasks without direct imitation or inference-time rationales.
Principles
- Rationales serve as privileged evidence, not imitation targets.
- Student models learn from their own trajectory distributions.
- Dense token-level supervision enhances learning efficiency.
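The first two principles can be illustrated with a minimal sketch of how the student and teacher contexts might differ. Everything here is an assumption made for illustration: the `Example` type, the function names, and the use of plain strings as stand-ins for encoded multimodal features do not come from the original work. The point it shows is that the rationale appears only in the teacher's context, while both models score the tokens the student sampled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Example:
    # Hypothetical container; strings stand in for encoded audio/video/text.
    multimodal_input: str
    rationale: Optional[str]  # frontier-generated rationale, training only


def student_context(ex: Example) -> List[str]:
    # The student never sees the rationale, in training or at inference.
    return [ex.multimodal_input]


def teacher_context(ex: Example) -> List[str]:
    # The teacher sees the same input plus the rationale as privileged evidence.
    if ex.rationale is None:
        raise ValueError("teacher scoring needs a rationale during training")
    return [ex.multimodal_input, ex.rationale]


def scoring_pairs(
    ex: Example, student_tokens: List[str]
) -> List[Tuple[List[str], List[str], str]]:
    """For each student-sampled token, return the student prefix, the teacher
    prefix, and the token that both models are asked to score."""
    pairs = []
    for t, token in enumerate(student_tokens):
        prefix = student_tokens[:t]  # tokens the student already generated
        pairs.append(
            (student_context(ex) + prefix, teacher_context(ex) + prefix, token)
        )
    return pairs
```

Because `student_context` never reads `rationale`, the same function can build the inference prompt when no rationale exists.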
Method
OmniOPSD employs a local teacher that receives frontier-generated rationales as privileged evidence. The student samples its own rollout from the multimodal input, and the teacher provides dense token-level supervision on those same tokens.
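One way to picture dense token-level supervision is a per-token divergence between the student's and the teacher's next-token distributions, evaluated at every position of the student's rollout. The sketch below is an assumption, not the objective from the original work: the summary does not state which divergence is used, so this example picks the reverse KL, KL(student || teacher), which is a common choice in on-policy distillation. The function names are hypothetical, and the logits are plain Python lists.

```python
import math
from typing import List, Tuple


def log_softmax(logits: List[float]) -> List[float]:
    # Numerically stable log-softmax over one vocabulary-sized logit vector.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(student_logits: List[float], teacher_logits: List[float]) -> float:
    # Reverse KL at one position: sum_v p_student(v) * log(p_student(v) / p_teacher(v)).
    # Assumption: the divergence choice is illustrative, not taken from the source.
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def dense_distillation_loss(
    student_steps: List[List[float]], teacher_steps: List[List[float]]
) -> Tuple[float, List[float]]:
    """Mean per-token divergence over one student rollout.

    student_steps[i] and teacher_steps[i] are logits for position i of the
    same student-sampled sequence; the teacher's logits are assumed to be
    computed with the rationale in its context.
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("both models must score the same token positions")
    per_token = [token_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token), per_token
```

Unlike a single sequence-level reward, every position contributes its own term, which is what makes the signal dense.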
In practice
- Train MLLMs for human-centered affective computing.
- Reduce reliance on expensive CoT annotations.
- Deploy models without external rationale access.
Topics
- OmniOPSD
- Self-Distillation
- Affective Computing
- Multimodal LLMs
- Reinforcement Learning
- Reward Sparsity
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.