The Necessity of Setting Temperature in LLM-as-a-Judge

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A systematic investigation shows that temperature has a critical effect on LLM-as-a-Judge performance, challenging the convention of choosing a fixed temperature by habit. Researchers from the University of Luxembourg and ETH Zürich ran over 180,000 inference calls per model across six temperature settings (0.01, 0.5, 1.0, 1.5, 2.0, 3.0), using models such as Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 and google/gemma-3-27b-it. At low temperature (T=0.01), judges are almost perfectly consistent (≈ 1.00) with negligible error rates (≈ 0.00). At high temperature (T=3.0), consistency drops sharply (e.g., to 0.57 for Qwen3-Next-80B) and error rates rise to as much as 0.49. A causal analysis confirms that temperature structurally influences judge behavior: its main effect is to undermine stability and reproducibility and to increase format-parsing errors, rather than to change accuracy directly. High temperatures can encourage deeper reasoning, but they also lead to instruction non-compliance and indecisiveness.
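
To make the role of temperature concrete, here is a minimal sketch of standard temperature-scaled softmax sampling probabilities, evaluated on the six settings listed above. The logits are hypothetical values chosen for illustration, not numbers from the paper, and the function name is an assumption.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for `logits` divided by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate verdict tokens (illustrative only).
logits = [2.0, 1.0, 0.5]

for t in [0.01, 0.5, 1.0, 1.5, 2.0, 3.0]:
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At T=0.01 nearly all probability mass sits on the top-scoring token, so repeated runs tend to return the same verdict. At T=3.0 the distribution is much flatter, which is consistent with the drop in consistency and the rise in malformed outputs described above.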

Key takeaway

If you are a Machine Learning Engineer configuring an LLM-as-a-Judge system, the temperature setting is not a trivial detail but a core determinant of evaluation reliability. Rather than defaulting to a fixed low temperature, choose it per task. For high-stakes evaluations that must be reproducible, prefer low temperatures to keep verdicts consistent and minimize errors. If the goal is deeper, more human-like reasoning, consider higher temperatures, possibly pairing a high-temperature agent for the initial judgment with a low-temperature agent that corrects the output.
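
The two-agent idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `call_model` is a hypothetical callable taking a prompt and a temperature, `fake_model` is a stand-in so the example runs without any API, and the prompts, the `VERDICT:` output format, and the default temperatures are all invented for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical model interface: (prompt, temperature) -> generated text.
ModelFn = Callable[[str, float], str]

VERDICT_PATTERN = re.compile(r"\bVERDICT:\s*(A|B)\b")

def parse_verdict(text: str) -> Optional[str]:
    """Extract 'A' or 'B' from the output, or None if the format is violated."""
    match = VERDICT_PATTERN.search(text)
    return match.group(1) if match else None

def two_stage_judge(
    question: str,
    call_model: ModelFn,
    reason_temperature: float = 1.5,   # illustrative high setting
    format_temperature: float = 0.01,  # illustrative low setting
) -> Optional[str]:
    """High-temperature free-form judgment, then low-temperature formatting."""
    reasoning = call_model(
        f"Compare answers A and B for: {question}\nExplain your reasoning.",
        reason_temperature,
    )
    formatted = call_model(
        "Read the reasoning below and reply with exactly "
        "'VERDICT: A' or 'VERDICT: B'.\n\n" + reasoning,
        format_temperature,
    )
    return parse_verdict(formatted)

def fake_model(prompt: str, temperature: float) -> str:
    """Stand-in for a real LLM call so the sketch is runnable."""
    if temperature > 1.0:
        return "A cites sources while B does not, so I lean towards A, maybe."
    return "VERDICT: A"

print(two_stage_judge("Which answer is better supported?", fake_model))
```

Keeping the parsing step strict means a format violation surfaces as `None` rather than being silently coerced into a verdict, which makes parsing errors easy to count.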

Key insights

LLM-as-a-Judge performance is highly sensitive to temperature, which affects consistency, error rates, and reasoning depth.
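
As a rough sketch of how two of these quantities could be measured from repeated judge runs on the same item, assume that consistency is the share of parsed verdicts agreeing with the most common one and that the error rate is the share of outputs that failed to parse. The paper's exact definitions may differ, and the sample data is made up.

```python
from collections import Counter

def consistency(verdicts):
    """Share of parsed verdicts that agree with the most common verdict."""
    parsed = [v for v in verdicts if v is not None]
    if not parsed:
        return 0.0
    top_count = Counter(parsed).most_common(1)[0][1]
    return top_count / len(parsed)

def error_rate(verdicts):
    """Share of runs whose output could not be parsed (recorded as None)."""
    if not verdicts:
        return 0.0
    return sum(v is None for v in verdicts) / len(verdicts)

# Made-up results of five repeated runs on one item; None marks a parse failure.
runs = ["A", "A", "B", None, "A"]
print(consistency(runs))  # 0.75
print(error_rate(runs))   # 0.2
```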

Method

The authors use a causal inference framework, cross-fitted augmented inverse probability weighting (AIPW) with LightGBM regressors, to estimate temperature's direct effect on LLM judge behavior.
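
For reference, the textbook cross-fitted AIPW estimator for a binary treatment is shown below. How the paper encodes temperature as a treatment is not stated in this summary, so the binary indicator D_i (for example, a higher versus a lower temperature setting) is an assumption made here for illustration. Y_i is the outcome of interest (such as consistency), X_i are covariates, the μ̂ terms are outcome regressions, ê is the propensity model, and the superscript (-k_i) means each nuisance model is fitted on the folds that exclude observation i's fold.

```latex
\hat{\tau}_{\mathrm{AIPW}}
= \frac{1}{n}\sum_{i=1}^{n}\left[
    \hat{\mu}_1^{(-k_i)}(X_i) - \hat{\mu}_0^{(-k_i)}(X_i)
    + \frac{D_i\,\bigl(Y_i - \hat{\mu}_1^{(-k_i)}(X_i)\bigr)}{\hat{e}^{(-k_i)}(X_i)}
    - \frac{(1 - D_i)\,\bigl(Y_i - \hat{\mu}_0^{(-k_i)}(X_i)\bigr)}{1 - \hat{e}^{(-k_i)}(X_i)}
\right]
```

The estimator is doubly robust: it remains consistent if either the outcome regressions or the propensity model is correctly specified. Cross-fitting limits the overfitting bias that flexible learners such as gradient-boosted trees would otherwise introduce.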

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.