CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
Summary
CMI-RewardBench introduces an ecosystem for evaluating music reward models under Compositional Multimodal Instruction (CMI), addressing a gap in how advanced music generation is currently evaluated. The work presents CMI-Pref-Pseudo, a large-scale preference dataset with 110k pseudo-labeled samples, and CMI-Pref, a human-annotated corpus of 4,027 pairs. The CMI-RewardBench benchmark unifies evaluation across musicality, text-music alignment, and compositional instruction alignment. The accompanying CMI reward models (CMI-RMs), a parameter-efficient family with roughly 30M parameters, process heterogeneous inputs and correlate strongly with human judgments. On CMI-Pref they outperform general-purpose MLLMs such as Gemini 3 Pro (65.80% accuracy) and Qwen3-Omni (60.40% accuracy). The training data, benchmarks, and models are publicly available.
Key takeaway
If you are a machine learning engineer developing music generation systems, consider integrating CMI-RewardBench and CMI-RM into your evaluation and post-training pipelines. Doing so lets you assess generated music against complex compositional instructions and check how well it aligns with human preferences for musicality and instruction-following. The released datasets and models also support inference-time scaling: generate several candidates and keep the top-$k$ as ranked by the reward model.
Key insights
Music reward models require compositional multimodal instruction evaluation to align with complex human preferences.
Principles
- High-quality human preference data drives cross-benchmark generalization.
- Pseudo-label pre-training establishes a robust prior, mitigating overfitting.
- General-purpose MLLMs struggle with fine-grained music assessment.
Method
CMI-RM employs a two-tower multimodal architecture with frozen MuQ-MuLan encoders. It's trained in two stages: pseudo-label pre-training with label smoothing, then expert fine-tuning on human annotations using Bradley–Terry and MSE loss.
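The two training objectives named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the form of label smoothing, the `mse_weight` parameter, and the choice to regress each scalar score toward a human rating are all hypothetical and chosen only to make the idea concrete.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bradley_terry_loss(score_a: float, score_b: float, label: float,
                       smoothing: float = 0.0) -> float:
    """Cross-entropy on P(a preferred over b) = sigmoid(score_a - score_b).

    label is 1.0 if clip a is preferred and 0.0 if clip b is preferred.
    smoothing (assumed form) pulls the hard label toward the other class,
    which softens noisy pseudo-labels during pre-training.
    """
    target = label * (1.0 - smoothing) + (1.0 - label) * smoothing
    diff = score_a - score_b
    log_p_a = -softplus(-diff)  # log sigmoid(diff)
    log_p_b = -softplus(diff)   # log (1 - sigmoid(diff))
    return -(target * log_p_a + (1.0 - target) * log_p_b)


def mse_loss(predicted: float, target: float) -> float:
    """Squared error between a predicted score and a reference rating."""
    return (predicted - target) ** 2


def finetune_loss(score_a: float, score_b: float, label: float,
                  rating_a: float, rating_b: float,
                  mse_weight: float = 1.0) -> float:
    """Hypothetical stage-two objective: pairwise ranking plus score regression."""
    ranking = bradley_terry_loss(score_a, score_b, label)
    regression = 0.5 * (mse_loss(score_a, rating_a) + mse_loss(score_b, rating_b))
    return ranking + mse_weight * regression
```

In this sketch, stage one would call `bradley_terry_loss` with a non-zero `smoothing` on pseudo-labeled pairs, and stage two would call `finetune_loss` on human-annotated pairs. The scores themselves would come from the reward head on top of the frozen encoders, which is outside the scope of this example.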
In practice
- Use CMI-RM for "best-of-N" filtering to improve music generation quality.
- Apply a positional-consistency strategy for robust pseudo-label generation.
- Incorporate prompt conditions to enhance musicality prediction accuracy.
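The first two practices above can be sketched as follows. This is a rough illustration rather than the released tooling: `score_fn` stands in for a call to a reward model such as CMI-RM, `judge` stands in for whatever model produces pseudo-labels, and the rule of keeping a label only when the judge agrees with itself under both presentation orders is an assumed reading of "positional consistency".

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], score_fn: Callable[[T], float],
              k: int = 1) -> List[T]:
    """Return the k highest-scoring candidates, best first.

    score_fn is a placeholder for a reward model that maps a generated
    clip (plus its prompt, if any) to a scalar score.
    """
    return sorted(candidates, key=score_fn, reverse=True)[:k]


def positionally_consistent_label(judge: Callable[[T, T], int],
                                  a: T, b: T) -> Optional[str]:
    """Query the judge in both orders and keep the label only if they agree.

    judge(x, y) returns 0 if it prefers the first item shown and 1 if it
    prefers the second. Returns "a", "b", or None when the verdict flips
    with presentation order (a sign of positional bias).
    """
    a_wins_forward = judge(a, b) == 0   # a shown first
    a_wins_backward = judge(b, a) == 1  # a shown second
    if a_wins_forward != a_wins_backward:
        return None  # inconsistent verdict: discard the pair
    return "a" if a_wins_forward else "b"
```

For example, `best_of_n(clips, reward_model_score, k=4)` would keep the four highest-scoring clips from a batch, where `clips` and `reward_model_score` are names invented for this illustration. Pairs for which `positionally_consistent_label` returns `None` would simply be left out of the pseudo-labeled set.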
Topics
- Music Generation
- Reward Models
- Multimodal AI
- Human Feedback
- Evaluation Benchmarks
- CMI-RewardBench
- CMI-RM
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.