PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization
Summary
PRISMR (Parameterized Representation Internalization for Semantic Multimodal Ranking) is a framework that addresses "parse collapse" in generative listwise ranking with Large Multimodal Models (LMMs). In this failure mode, an LMM silently omits candidates or terminates early in long-context multimodal scenarios, which significantly degrades ranking effectiveness. PRISMR replaces transient in-context list processing with parametric structural conditioning: a lightweight hypernetwork encodes the multimodal candidates in parallel and generates item-specific LoRA weights, which are then synthesized into an instance-specific adapter for the LMM. Evaluated on a large-scale multimodal review-ranking benchmark derived from Amazon Reviews 2023, PRISMR substantially reduces parse collapse and achieves near-perfect parse rates. It also sets a new state of the art in NDCG@10, generalizes beyond its training list lengths (up to N=100), and transfers to other domains, such as Baby_Products and Amazon_Fashion, without additional training. PRISMR is also faster at inference than constrained decoding, with the speed-up growing from 1.1× at N=10 to 1.7× at N=50 on a single NVIDIA B200 GPU.
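For readers less familiar with the headline metric, the following is a minimal sketch of NDCG@10. It assumes the common linear-gain form with a log2 rank discount. The summary does not say which NDCG variant the paper uses, so the gain function and the helper names `dcg_at_k` and `ndcg_at_k` are illustrative assumptions.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items (linear gain, assumed)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=10):
    """DCG of the predicted order divided by the DCG of the ideal order.

    `relevances` holds the graded relevance of each candidate, listed in
    the order the ranker produced. Returns 0.0 when no item is relevant.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that already lists candidates in descending relevance scores 1.0; placing the only relevant item second instead of first lowers the score to 1/log2(3).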
Key takeaway
Machine Learning Engineers deploying Large Multimodal Models for listwise ranking, particularly with long candidate lists, should consider PRISMR to mitigate "parse collapse" and improve ranking quality. Its hypernetwork-based parametric conditioning is an alternative to placing the full candidate list in the prompt, and it yields higher parse rates and better NDCG@10. Use its adaptive synthesis, α-mode for N ≤ 50 and β-mode for N > 50, to maintain performance across list lengths while also gaining inference efficiency.
Key insights
Parse collapse in LMM listwise ranking is overcome by parametric structural conditioning via hypernetwork-generated LoRA adapters.
Principles
- Generative LMMs suffer "parse collapse" in long multimodal contexts.
- Parametric conditioning via hypernetworks enhances LMM listwise ranking.
- Adaptive LoRA synthesis modes balance capacity and length robustness.
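To make "parse collapse" concrete, here is a minimal sketch of a parse check: an output counts as parsed only if it names every candidate exactly once. The bracketed output format ("[3] > [1] > [2]") and the function names `parse_ranking` and `parse_rate` are assumptions for illustration; the summary does not specify the output format the paper uses.

```python
import re


def parse_ranking(text, num_candidates):
    """Return the ranked candidate ids, or None if the output has collapsed.

    Assumes 1-based ids written in square brackets. The output is rejected
    when any id is missing (omission or early termination), duplicated,
    or out of range.
    """
    ids = [int(m) for m in re.findall(r"\[(\d+)\]", text)]
    if sorted(ids) != list(range(1, num_candidates + 1)):
        return None
    return ids


def parse_rate(outputs, num_candidates):
    """Fraction of generated outputs that form a complete permutation."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    parsed = sum(parse_ranking(o, num_candidates) is not None for o in outputs)
    return parsed / len(outputs)
```

Under this definition, an output that stops after ranking 40 of 50 candidates is a parse failure even if those 40 are ordered correctly.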
Method
PRISMR uses a hypernetwork to encode each multimodal candidate into an item-specific LoRA adapter. The N adapters are then synthesized into a single composite weight increment (α-mode for N ≤ 50, β-mode for N > 50), which is applied to a frozen LMM for decoding.
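The synthesis step can be sketched as follows, under strong assumptions. Each adapter is taken to be a standard LoRA pair (B, A) whose product B·A is a low-rank weight increment. The summary does not say how α-mode and β-mode actually combine the adapters, so the sum (for α) and the mean (for β) below are placeholders chosen only to make the length-based dispatch concrete; they are not the paper's definitions. The function names are hypothetical, and plain Python lists stand in for tensors to keep the sketch dependency-free.

```python
def matmul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def select_mode(num_candidates, threshold=50):
    """Pick the synthesis mode from the list length (threshold from the summary)."""
    return "alpha" if num_candidates <= threshold else "beta"


def synthesize_delta(adapters, threshold=50):
    """Combine N item-specific LoRA pairs into one composite increment.

    `adapters` is a list of (B, A) pairs with B of shape d_out x r and
    A of shape r x d_in. ASSUMPTION: alpha-mode sums the products and
    beta-mode averages them; the real rules are not given in the summary.
    """
    if not adapters:
        raise ValueError("at least one adapter is required")
    n = len(adapters)
    mode = select_mode(n, threshold)
    scale = 1.0 if mode == "alpha" else 1.0 / n
    d_out, d_in = len(adapters[0][0]), len(adapters[0][1][0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for b, a in adapters:
        product = matmul(b, a)  # low-rank increment for one candidate
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += scale * product[i][j]
    return mode, delta
```

In a real system the composite increment would be added to the frozen LMM's weights (W + ΔW) for the affected layers before decoding. The placeholder mean keeps the increment's magnitude independent of N, which is one plausible reading of "length robustness", but that reading is a guess.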
In practice
- Use hypernetwork-generated LoRA for robust multimodal listwise ranking.
- Employ α-mode for in-distribution list lengths (N ≤ 50).
- Switch to β-mode for length extrapolation (N > 50) for stability.
Topics
- Large Multimodal Models
- Listwise Ranking
- Parse Collapse
- Hypernetworks
- LoRA Adapters
- Multimodal Benchmarks
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.