AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Large Language Models · Depth: Expert, extended

Summary

AdaJudge is a framework for improving the reward models used to align large language models with human preferences. It targets two limitations of current architectures: static pooling strategies that are misaligned with task-dependent preference signals, and a representational mismatch that arises because backbones are optimized for generation. AdaJudge works in two stages. First, gated refinement blocks map backbone representations into a discrimination-oriented space. Second, an adaptive multi-view pooling module replaces the static readout, dynamically routing and combining evidence from last-token, mean, and attention pooling experts. On RM-Bench and JudgeBench, AdaJudge with Qwen3-8B scores 71.1 and 66.0, outperforming Skywork-Reward-Llama-3.1-8B by 1.0 and 3.7 points, respectively. It also stabilizes smaller models such as Phi-3.5-mini-instruct, whose RM-Bench score recovers from 34.4 to 61.5.

Key takeaway

For ML engineers and AI scientists developing reward models, relying on static pooling and generation-optimized backbones can limit performance and stability, especially on diverse or complex evaluation tasks. Consider adaptive frameworks like AdaJudge, which jointly refine representations and dynamically aggregate evidence. This approach can improve accuracy and prevent representation collapse in smaller models, making human preference learning more robust and task-aligned.

Key insights

Adaptive representation refinement and multi-perspective aggregation are crucial for robust reward modeling across diverse LLM evaluation tasks.

Principles

Fixed pooling strategies create an inductive bias mismatch across heterogeneous evaluation tasks.
Generative LLM backbones are suboptimal for fine-grained preference discrimination.
Dynamic routing of pooling experts improves performance on complex, reasoning-intensive tasks.

Method

AdaJudge refines backbone hidden states via K depth-gated attention blocks, then dynamically aggregates evidence using a prompt-conditioned gating network that combines last-token, mean, and attention pooling experts.
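
To make the aggregation stage concrete, here is a minimal pure-Python sketch of a prompt-conditioned mixture over the three pooling experts. It is an illustration under stated assumptions, not the authors' implementation: the function names, the linear gate (`gate_weights`), the attention `query` vector, and the use of a single prompt summary vector are all hypothetical, and the depth-gated refinement stage is not modeled.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def last_token_pool(hidden):
    # Expert 1: hidden state of the final token.
    return list(hidden[-1])

def mean_pool(hidden):
    # Expert 2: average over all token positions.
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]

def attention_pool(hidden, query):
    # Expert 3: softmax-weighted sum of tokens, scored against a query vector.
    weights = softmax([dot(h, query) for h in hidden])
    dim = len(hidden[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]

def adaptive_pool(hidden, prompt_vec, gate_weights, query):
    # Assumed gate: one linear score per expert, computed from a prompt
    # summary vector and normalized with softmax.
    views = [
        last_token_pool(hidden),
        mean_pool(hidden),
        attention_pool(hidden, query),
    ]
    gate = softmax([dot(row, prompt_vec) for row in gate_weights])
    dim = len(views[0])
    pooled = [sum(g * v[d] for g, v in zip(gate, views)) for d in range(dim)]
    return pooled, gate

# Toy input: 3 tokens, hidden size 2; the first two tokens stand in for the prompt.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompt_vec = mean_pool(hidden[:2])
gate_weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # untrained gate: uniform routing
pooled, gate = adaptive_pool(hidden, prompt_vec, gate_weights, query=[0.0, 0.0])
```

With an all-zero gate the three experts are weighted equally. In a trained model the gate scores would vary with the prompt, which is what lets the readout shift between last-token, mean, and attention evidence from task to task.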

In practice

Implement depth-gated refinement blocks to transform backbone representations for discrimination.
Use a mixture of pooling experts with dynamic, prompt-conditioned routing for aggregation.
Train with Focal Bradley–Terry loss and entropy regularization for stable convergence.
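
The training objective in the last item can be sketched as follows. The summary does not give the exact formulation, so this assumes the usual focal modulation applied to the Bradley–Terry log-likelihood, plus an entropy bonus on the pooling gate that discourages routing collapse. The function names, `gamma`, `ent_weight`, and the sign of the entropy term are all assumptions.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def focal_bradley_terry(r_chosen, r_rejected, gamma=2.0):
    # Bradley-Terry: p = sigmoid(r_chosen - r_rejected), loss = -log p.
    # Assumed focal variant: scale by (1 - p)^gamma so that pairs the model
    # already ranks correctly contribute less to the loss.
    p = sigmoid(r_chosen - r_rejected)
    p = max(p, 1e-12)  # guard against log(0)
    return -((1.0 - p) ** gamma) * math.log(p)

def gate_entropy(gate):
    # Shannon entropy of the routing distribution over pooling experts.
    return -sum(g * math.log(g) for g in gate if g > 0.0)

def total_loss(r_chosen, r_rejected, gate, gamma=2.0, ent_weight=0.01):
    # Assumed regularizer: subtracting entropy rewards spread-out routing.
    pair_loss = focal_bradley_terry(r_chosen, r_rejected, gamma)
    return pair_loss - ent_weight * gate_entropy(gate)

easy = focal_bradley_terry(5.0, 0.0)    # chosen response clearly preferred
tie = focal_bradley_terry(0.0, 0.0)     # model is undecided
wrong = focal_bradley_terry(-5.0, 0.0)  # model prefers the rejected response
```

Setting `gamma=0` recovers the plain Bradley–Terry loss. Larger values shift the gradient toward pairs the model still misranks.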

Topics

Reward Modeling
LLM Alignment
Adaptive Pooling
Representation Learning
Multi-Perspective Aggregation
Transformer Architectures

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.