References Improve LLM Alignment in Non-Verifiable Domains

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

A study by researchers from Yale University, Meta, Scale AI, and Salesforce Research demonstrates that incorporating high-quality reference outputs significantly enhances Large Language Model (LLM) alignment in non-verifiable domains. The research introduces the "RefEval" and "RefMatch" evaluation protocols, which guide LLM-based evaluators using reference outputs. Less capable LLM-judges show substantial accuracy improvements when given references from frontier models like GPT-4o, and stronger LLM-judges, such as GPT-4o, also benefit from human-written references. Building on these improved judges, the study applies reference-guided self-improvement in alignment tuning, where LLMs supervise their own training. This method yields clear gains over both direct Supervised Fine-Tuning (SFT) on reference outputs and self-improvement with reference-free judges, and it achieves performance comparable to training with a strong fine-tuned reward model like ArmoRM. Specifically, the method achieved 73.1% on AlpacaEval and 58.7% on Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% respectively with Qwen2.5-7B. On AlpacaEval / Arena-Hard, these results represent average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement.

Key takeaway

For AI engineers and research scientists working on LLM alignment in domains without clear ground-truth verifiers, the study suggests integrating high-quality reference outputs into both evaluation and training pipelines. For LLM-as-a-Judge evaluations, adopt an explicit reference-guided prompting strategy such as RefEval to boost accuracy and consistency. For training, consider a two-stage approach: initial SFT on high-quality references, followed by DPO with reference-guided self-judges. The study reports that this reaches alignment performance comparable to training with a strong fine-tuned reward model such as ArmoRM, without extensive additional human or AI feedback.
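As a rough illustration of reference-guided prompting for a pairwise LLM judge, the sketch below builds a prompt that shows the judge a reference response before the two candidates, then parses a one-letter verdict. The template wording and the names `REF_JUDGE_TEMPLATE`, `build_ref_judge_prompt`, and `parse_verdict` are assumptions made for this example; the summary does not give the actual RefEval prompt.

```python
from typing import Optional

# Hypothetical template: the real RefEval prompt wording is not given in the summary.
REF_JUDGE_TEMPLATE = """You are comparing two responses to an instruction.
A high-quality reference response is provided. Use it as a guide to what a
good answer looks like, then decide which candidate is better.

Instruction:
{instruction}

Reference response:
{reference}

Response A:
{output_a}

Response B:
{output_b}

Answer with a single letter, A or B."""


def build_ref_judge_prompt(instruction: str, reference: str,
                           output_a: str, output_b: str) -> str:
    """Fill the template so the reference appears before both candidates."""
    return REF_JUDGE_TEMPLATE.format(
        instruction=instruction,
        reference=reference,
        output_a=output_a,
        output_b=output_b,
    )


def parse_verdict(judge_text: str) -> Optional[str]:
    """Return 'A' or 'B' if the first token of the reply is that letter, else None."""
    tokens = judge_text.strip().upper().split()
    if not tokens:
        return None
    first = tokens[0].strip(".:)")  # tolerate replies like "B." or "A)"
    return first if first in ("A", "B") else None
```

The prompt string would be sent to whichever judge model is in use. Returning `None` for unparseable replies lets the caller discard or retry a judgment rather than guess.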

Key insights

Reference-guided LLM evaluators significantly improve model alignment and self-improvement in non-verifiable domains.

Principles

Method

The method has two stages: first, Supervised Fine-Tuning (SFT) on high-quality reference outputs; second, Direct Preference Optimization (DPO), in which reference-guided LLMs act as self-judges to construct the preference annotations.
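The second stage can be sketched as a loop that turns judge verdicts into preference pairs. This is a minimal sketch under stated assumptions, not the study's implementation: `build_preference_pairs`, the `sample_two` callable (standing in for sampling two candidates from the SFT model), and `toy_overlap_judge` are all hypothetical names. The toy judge uses word overlap with the reference purely as a runnable stand-in for a reference-guided LLM judge. The `prompt` / `chosen` / `rejected` record layout is a common convention for DPO datasets.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A judge takes (instruction, reference, candidate_a, candidate_b)
# and returns "A", "B", or None when it cannot decide.
Judge = Callable[[str, str, str, str], Optional[str]]


def build_preference_pairs(
    examples: Iterable[Dict[str, str]],
    sample_two: Callable[[str], Tuple[str, str]],
    judge: Judge,
) -> List[Dict[str, str]]:
    """Label two sampled candidates per instruction using a reference-guided judge."""
    pairs = []
    for ex in examples:
        cand_a, cand_b = sample_two(ex["instruction"])
        verdict = judge(ex["instruction"], ex["reference"], cand_a, cand_b)
        if verdict not in ("A", "B"):
            continue  # skip ties and unparseable judgments
        chosen, rejected = (cand_a, cand_b) if verdict == "A" else (cand_b, cand_a)
        pairs.append(
            {"prompt": ex["instruction"], "chosen": chosen, "rejected": rejected}
        )
    return pairs


def toy_overlap_judge(instruction: str, reference: str,
                      cand_a: str, cand_b: str) -> Optional[str]:
    """Toy stand-in for an LLM judge: prefer the candidate sharing more words
    with the reference. NOT the study's judge; for demonstration only."""
    ref_words = set(reference.lower().split())
    score_a = len(ref_words & set(cand_a.lower().split()))
    score_b = len(ref_words & set(cand_b.lower().split()))
    if score_a == score_b:
        return None
    return "A" if score_a > score_b else "B"
```

In a real pipeline the judge callable would wrap a call to the model itself with a reference-guided prompt, and the resulting records would be passed to a DPO training routine.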

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.