OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing

2026-06-14 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

OmniOPSD is a Rationale-Privileged On-Policy Self-Distillation framework that targets reward sparsity in reinforcement learning for multimodal large language models (MLLMs), particularly in human-centered affective computing tasks. It also addresses the cost and difficulty of obtaining high-quality chain-of-thought (CoT) annotations. Frontier-generated rationales are given to a local teacher as privileged evidence during training rather than used as direct imitation targets for the student. The student generates its own trajectory from the multimodal input, and the teacher provides dense, token-level supervision on those same tokens, so the student learns on its own trajectory distribution. At inference, OmniOPSD requires no labels, rationales, CoT annotations, or access to closed-source models. It achieves state-of-the-art performance on MER-UniBench, with an average score of 84.19.

Key takeaway

For machine learning engineers building multimodal large language models for affective computing, OmniOPSD targets two problems: reward sparsity and annotation cost. Consider this self-distillation framework when you want a model to learn from its own trajectories, with rationales used as privileged evidence for the teacher rather than as direct imitation targets. Because rationales are used only during training, inference requires neither CoT annotations nor closed-source model access, which simplifies deployment.

Key insights

OmniOPSD uses rationale-privileged self-distillation to enable MLLMs to learn complex affective tasks without direct imitation or inference-time rationales.

Principles

Rationales serve as privileged evidence, not imitation targets.
Student models learn from their own trajectory distributions.
Dense token-level supervision from the teacher counters reward sparsity.
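
One way to write these three principles as a single training objective is sketched below. The notation is introduced here for illustration and is not taken from the original article: x is the multimodal input, r the frontier-generated rationale, y the student's own rollout, \pi_\theta the student, \pi_T the local teacher, and D an unspecified per-token divergence (for example a KL divergence, which is an assumption).

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[
      \frac{1}{|y|} \sum_{t=1}^{|y|}
      D\!\left(
        \pi_\theta(\cdot \mid x, y_{<t})
        \,\middle\|\,
        \pi_T(\cdot \mid x, r, y_{<t})
      \right)
    \right]
```

The rationale r appears only on the teacher side, the expectation is taken over the student's own samples, and the sum yields one supervision term per generated token rather than one reward per rollout.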

Method

OmniOPSD employs a local teacher that receives frontier-generated rationales as privileged evidence. The student samples its own rollout from the multimodal input and receives dense token-level supervision from the teacher on those same tokens.
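
The method described above can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the tiny vocabulary, the `toy_logits` stand-in for a model, and the choice of a per-token KL divergence are all assumptions made for illustration. The sketch only shows the data flow: the student samples a rollout without the rationale, and the same model, additionally conditioned on the rationale, scores every position of that rollout.

```python
import math
import random

VOCAB_SIZE = 4  # hypothetical toy vocabulary, not from the article


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(context, prefix):
    """Hypothetical stand-in for a model: a token's logit is how often it was seen."""
    seen = list(context) + list(prefix)
    return [float(seen.count(tok)) for tok in range(VOCAB_SIZE)]


def sample_rollout(prompt, length, rng):
    """The student samples its own trajectory; it never sees the rationale."""
    tokens = []
    for _ in range(length):
        probs = softmax(toy_logits(prompt, tokens))
        tokens.append(rng.choices(range(VOCAB_SIZE), weights=probs)[0])
    return tokens


def token_level_losses(prompt, rationale, tokens):
    """One KL(student || teacher) term per position of the student's rollout.

    The teacher is the same toy model with the rationale appended to its
    context (the privileged evidence). Using KL here is an assumption.
    """
    losses = []
    for t in range(len(tokens)):
        prefix = tokens[:t]
        p_student = softmax(toy_logits(prompt, prefix))
        p_teacher = softmax(toy_logits(list(prompt) + list(rationale), prefix))
        kl = sum(
            ps * (math.log(ps) - math.log(pt))
            for ps, pt in zip(p_student, p_teacher)
        )
        losses.append(kl)
    return losses


if __name__ == "__main__":
    rng = random.Random(0)
    prompt = [0, 1]        # stands in for the multimodal input
    rationale = [2, 2, 2]  # stands in for a frontier-generated rationale
    rollout = sample_rollout(prompt, 5, rng)
    print(rollout, token_level_losses(prompt, rationale, rollout))
```

Every position in the rollout yields its own loss term, in contrast to a single sparse reward for the whole sequence. With an empty rationale the teacher and student contexts coincide and every term is zero, which makes the role of the privileged evidence easy to see.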

In practice

Train MLLMs for human-centered affective computing.
Reduce reliance on expensive CoT annotations.
Deploy models without external rationale access.

Topics

OmniOPSD
Self-Distillation
Affective Computing
Multimodal LLMs
Reinforcement Learning
Reward Sparsity

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.