CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

CMI-RewardBench introduces an ecosystem for evaluating music reward models under Compositional Multimodal Instruction (CMI), addressing a gap in current evaluation mechanisms for advanced music generation. The work presents CMI-Pref-Pseudo, a large-scale preference dataset with 110k pseudo-labeled samples, and CMI-Pref, a high-quality human-annotated corpus of 4,027 pairs. CMI-RewardBench unifies evaluation across musicality, text-music alignment, and compositional instruction alignment. The CMI reward models (CMI-RMs), a parameter-efficient family with ~30M parameters, process heterogeneous inputs and correlate strongly with human judgments, outperforming general-purpose MLLMs such as Gemini 3 Pro (65.80% accuracy) and Qwen3-Omni (60.40% accuracy) on CMI-Pref. The training data, benchmarks, and models are publicly available.

Key takeaway

If you are a Machine Learning Engineer developing music generation systems, integrate CMI-RewardBench and CMI-RM into your evaluation and post-training pipelines. Doing so lets you assess generated music against complex compositional instructions and check its alignment with human preferences for musicality and instruction-following. The provided datasets and models can also improve your system's output quality through inference-time scaling via top-k filtering: generate several candidates and keep the k that the reward model scores highest.
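The top-k filtering idea can be sketched in a few lines of Python. This is a minimal illustration, not the released CMI-RM interface: `top_k_filter`, `score_fn`, and the toy scores are hypothetical stand-ins, and in practice `score_fn` would wrap a reward model that scores a generated clip against its instruction.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def top_k_filter(
    candidates: Sequence[T],
    score_fn: Callable[[T], float],
    k: int = 1,
) -> List[Tuple[T, float]]:
    """Return the k highest-scoring candidates with their scores, best first.

    score_fn is a hypothetical stand-in for a reward model call.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    scored = [(candidate, score_fn(candidate)) for candidate in candidates]
    # Sort by reward, highest first; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy usage: made-up reward scores for four generated clips.
toy_scores = {"clip_a": 0.12, "clip_b": 0.87, "clip_c": 0.45, "clip_d": 0.66}
best_two = top_k_filter(list(toy_scores), toy_scores.__getitem__, k=2)
```

Setting k=1 gives best-of-N selection; larger k keeps a shortlist for a later stage such as human review.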

Key insights

Music reward models require compositional multimodal instruction evaluation to align with complex human preferences.

Principles

High-quality human preference data drives cross-benchmark generalization.
Pseudo-label pre-training establishes a robust prior, mitigating overfitting.
General-purpose MLLMs struggle with fine-grained music assessment.

Method

CMI-RM employs a two-tower multimodal architecture with frozen MuQ-MuLan encoders. It is trained in two stages: pseudo-label pre-training with label smoothing, followed by expert fine-tuning on human annotations using Bradley–Terry and MSE losses.
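The loss terms named above can be illustrated with a small, dependency-free sketch. The summary does not say how the terms are weighted or what the MSE targets are, so everything here is an assumption: the function names, the `smoothing` and `mse_weight` parameters, and the choice of per-clip human ratings as regression targets are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(r_preferred: float, r_rejected: float, smoothing: float = 0.0) -> float:
    """Pairwise Bradley-Terry loss with optional label smoothing.

    With smoothing=0 this is -log sigmoid(r_preferred - r_rejected).
    A positive smoothing value moves part of the target mass to the
    opposite label, which softens the penalty from noisy pseudo-labels.
    """
    margin = r_preferred - r_rejected
    return -((1.0 - smoothing) * log_sigmoid(margin) + smoothing * log_sigmoid(-margin))


def mse_loss(predicted: float, target: float) -> float:
    """Squared error between a predicted score and a target rating."""
    return (predicted - target) ** 2


def fine_tune_loss(
    r_preferred: float,
    r_rejected: float,
    target_preferred: float,
    target_rejected: float,
    mse_weight: float = 1.0,  # assumed weighting, not taken from the paper
) -> float:
    """Assumed combination: Bradley-Terry on the pair plus MSE on each clip's rating."""
    pairwise = bradley_terry_loss(r_preferred, r_rejected)
    regression = 0.5 * (
        mse_loss(r_preferred, target_preferred) + mse_loss(r_rejected, target_rejected)
    )
    return pairwise + mse_weight * regression
```

In this sketch the pre-training stage would call `bradley_terry_loss` with a nonzero `smoothing`, and the fine-tuning stage would call `fine_tune_loss` on human-annotated pairs.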

In practice

Use CMI-RM for "best-of-N" filtering to improve music generation quality.
Apply a positional-consistency strategy for robust pseudo-label generation.
Incorporate prompt conditions to enhance musicality prediction accuracy.
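One common reading of a positional-consistency strategy is to query the judge on both orderings of a pair and keep the pseudo-label only when the two verdicts agree. The sketch below assumes that reading; the summary does not spell out the exact procedure, and `positionally_consistent_label`, `judge`, and the toy judges are hypothetical.

```python
from typing import Callable, Optional


def positionally_consistent_label(
    item_a: str,
    item_b: str,
    judge: Callable[[str, str], int],
) -> Optional[str]:
    """Return "a" or "b" if the judge agrees across both orderings, else None.

    judge(first, second) is a hypothetical callable returning 0 when it
    prefers the first argument and 1 when it prefers the second.
    """
    forward = judge(item_a, item_b)   # 0 means a is preferred
    backward = judge(item_b, item_a)  # 1 means a is preferred
    prefers_a_forward = forward == 0
    prefers_a_backward = backward == 1
    if prefers_a_forward and prefers_a_backward:
        return "a"
    if not prefers_a_forward and not prefers_a_backward:
        return "b"
    return None  # verdict flipped with position, so discard the pair


def prefers_longer(first: str, second: str) -> int:
    """Toy content-based judge: prefers the longer string."""
    return 0 if len(first) >= len(second) else 1


def always_first(first: str, second: str) -> int:
    """Toy position-biased judge: always prefers whatever is shown first."""
    return 0


consistent = positionally_consistent_label("short", "much longer clip id", prefers_longer)
discarded = positionally_consistent_label("short", "much longer clip id", always_first)
```

The position-biased toy judge shows why the check matters: its verdict depends only on ordering, so the pair is dropped rather than given an unreliable label.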

Topics

Music Generation
Reward Models
Multimodal AI
Human Feedback
Evaluation Benchmarks
CMI-RewardBench
CMI-RM

Code references

Haiwen-Xia/CMI-RewardBench

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.