Surprise-Guided MergeSort: Budget-Efficient Human-in-the-Loop Ranking via Adaptive Comparison Scheduling
Summary
The Surprise-Guided MergeSort (SGS) framework is a budget-efficient human-in-the-loop ranking method that uses Vision-Language Models (VLMs) as question prioritizers. Rather than replacing human annotators, SGS identifies the comparisons that genuinely require human judgment. It combines three components: a bottom-up MergeSort scheduler, a composite Surprise Scorer that merges position-bias-cancelled VLM confidence, Elo gap, and vote entropy, and an adaptive budget allocator. On six benchmarks spanning text similarity and image quality assessment, SGS identified and skipped up to 535 non-informative comparisons per session. Under the same total budget, this yielded Kendall's τ×100 improvements of +6 to +12 over Active Elo, a gain that held consistently across domains.
Key takeaway
Machine Learning Engineers designing human-in-the-loop ranking systems should consider the Surprise-Guided MergeSort (SGS) framework. By prioritizing which comparisons are sent to humans, it can reduce annotation cost: in the reported experiments it skipped up to 535 non-informative comparisons per session and improved Kendall's τ×100 by +6 to +12 over Active Elo under the same budget. Evaluate SGS on your own subjective ranking tasks to check whether it improves budget efficiency without sacrificing accuracy.
Key insights
Surprise-Guided MergeSort uses VLMs to decide which pairwise comparisons are sent to human annotators, which improves ranking accuracy for a fixed annotation budget.
Principles
- Pairwise comparison is the gold standard for subjective ranking tasks.
- Sorting-based methods reduce the comparison burden from O(n²) for exhaustive pairing to O(n log n).
- VLMs can prioritize human judgment, not just replace it.
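To make the O(n log n) principle concrete, the short sketch below compares the number of comparisons needed for exhaustive pairing against a standard worst-case upper bound for merge sort. The function names are illustrative and do not come from the article; the bound n·⌈log2 n⌉ is a generic textbook bound, not a figure reported for SGS.

```python
import math


def exhaustive_pairs(n: int) -> int:
    """Number of comparisons if every pair of n items is judged once."""
    return math.comb(n, 2)


def merge_sort_bound(n: int) -> int:
    """Upper bound on merge sort comparisons: n * ceil(log2(n)).

    Each of the ceil(log2(n)) merge passes needs fewer than n comparisons.
    """
    if n < 2:
        return 0
    passes = (n - 1).bit_length()  # exact integer ceil(log2(n)) for n >= 2
    return n * passes


if __name__ == "__main__":
    n = 100
    print(exhaustive_pairs(n), merge_sort_bound(n))
```

For 100 items this is 4950 exhaustive pairs against at most 700 sorting comparisons, which is the gap a sorting-based scheduler exploits before any VLM-based skipping is applied.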
Method
SGS integrates a bottom-up MergeSort scheduler, a composite Surprise Scorer (VLM confidence, Elo gap, vote entropy), and an adaptive budget allocator to route high-surprise pairs to humans.
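The summary names the three signals in the Surprise Scorer but not how they are combined, so the following is only a minimal sketch under stated assumptions. It assumes position bias is cancelled by averaging the VLM's preference over both presentation orders, that the Elo gap is converted to closeness through the standard Elo expected-score formula, and that the three terms are mixed with equal weights. All function names, the weights, and the combination rule are hypothetical and are not taken from the article.

```python
import math
from typing import Tuple


def debiased_prob(p_a_shown_first: float, p_a_shown_second: float) -> float:
    """Average P(A beats B) over both presentation orders (assumed debiasing)."""
    return 0.5 * (p_a_shown_first + p_a_shown_second)


def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-outcome vote split; 0 when unanimous."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def elo_expected(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def surprise_score(
    p_a_shown_first: float,
    p_a_shown_second: float,
    elo_a: float,
    elo_b: float,
    votes_a: int,
    votes_b: int,
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),  # hypothetical
) -> float:
    """Hypothetical ambiguity score in [0, 1]; higher means more ambiguous."""
    p = debiased_prob(p_a_shown_first, p_a_shown_second)
    vlm_uncertainty = 1.0 - abs(2.0 * p - 1.0)
    elo_closeness = 1.0 - abs(2.0 * elo_expected(elo_a, elo_b) - 1.0)
    total = votes_a + votes_b
    # With no votes yet, treat the pair as maximally uncertain (assumption).
    vote_entropy = binary_entropy(votes_a / total) if total else 1.0
    w_vlm, w_elo, w_vote = weights
    return w_vlm * vlm_uncertainty + w_elo * elo_closeness + w_vote * vote_entropy
```

Under these assumptions, a pair with an undecided VLM, equal Elo ratings, and a split vote scores close to 1, while a pair the VLM is certain about, with a large Elo gap and unanimous votes, scores close to 0. Only the former would be routed to a human.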
In practice
- Identify non-informative comparisons using VLM-guided surprise metrics.
- Combine VLM confidence, Elo gap, and vote entropy for ambiguity scoring.
- Apply MergeSort scheduling to exploit transitivity in ranking tasks.
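The steps above can be sketched as a bottom-up merge sort whose comparator is routed by a surprise threshold and a human budget. This is an illustrative sketch, not the article's implementation: `make_router`, `bottom_up_merge_sort`, the threshold rule, and the choice to fall back to a model answer for low-surprise pairs are all assumptions, and the toy `surprise` and oracle functions stand in for a real scorer and real annotators.

```python
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_router(
    surprise: Callable[[T, T], float],
    ask_human: Callable[[T, T], bool],
    ask_model: Callable[[T, T], bool],
    threshold: float,
    human_budget: int,
) -> Tuple[Callable[[T, T], bool], Dict[str, int]]:
    """Build a comparator that sends high-surprise pairs to a human."""
    state = {"human": 0, "model": 0}

    def prefers_first(a: T, b: T) -> bool:
        if surprise(a, b) >= threshold and state["human"] < human_budget:
            state["human"] += 1
            return ask_human(a, b)
        state["model"] += 1  # human comparison skipped
        return ask_model(a, b)

    return prefers_first, state


def bottom_up_merge_sort(
    items: Sequence[T], prefers_first: Callable[[T, T], bool]
) -> Tuple[List[T], int]:
    """Rank items best-first by merging runs of size 1, 2, 4, ..."""
    runs: List[List[T]] = [[x] for x in items]
    comparisons = 0
    while len(runs) > 1:
        next_runs: List[List[T]] = []
        for k in range(0, len(runs), 2):
            if k + 1 == len(runs):  # odd run out, carry it forward
                next_runs.append(runs[k])
                continue
            left, right, merged = runs[k], runs[k + 1], []
            i = j = 0
            while i < len(left) and j < len(right):
                comparisons += 1
                if prefers_first(left[i], right[j]):
                    merged.append(left[i])
                    i += 1
                else:
                    merged.append(right[j])
                    j += 1
            # Transitivity: the remaining tail needs no further comparisons.
            merged.extend(left[i:])
            merged.extend(right[j:])
            next_runs.append(merged)
        runs = next_runs
    return (runs[0] if runs else []), comparisons


# Toy example: integers as items, larger is better, close values are "surprising".
items = [3, 1, 4, 1, 5, 9, 2, 6]
truth = lambda a, b: a >= b
toy_surprise = lambda a, b: 1.0 if abs(a - b) <= 1 else 0.0
compare, state = make_router(toy_surprise, truth, truth, threshold=0.5, human_budget=5)
ranking, n_comparisons = bottom_up_merge_sort(items, compare)
```

In this sketch the merge step is where transitivity pays off: once one run is exhausted, the rest of the other run is appended without asking anyone. The `state` dictionary records how many comparisons went to the human and how many were answered by the model instead.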
Topics
- Human-in-the-Loop
- Pairwise Comparison
- Ranking Algorithms
- Vision-Language Models
- MergeSort
- Annotation Efficiency
Best for: NLP Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.