Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations

2026-06-24 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study of LLM-as-judge safety evaluations finds that setting the sampling temperature to 0 does not guarantee deterministic grading, challenging a widespread assumption. Tests with Japan AISI's open-source aisev harness showed that default provider settings (temperature 1.0) produce significant per-item disagreement: up to roughly 50% variability across 20 identical runs for items near decision boundaries. Even with temperature explicitly pinned to 0, 1 to 2 of 7 borderline items remained non-reproducible. The study covered 690 API calls spanning two providers, three model tiers, and five sampling configurations. Notably, temperature control is deprecated for Claude Opus 4.7/4.8, so this mitigation is unavailable for newer models. The findings point to a structural gap: evaluation harnesses that report single-run verdicts without variance can misrepresent noise as a safety property. The authors recommend treating grader disagreement as a first-class health metric.

Key takeaway

MLOps engineers deploying LLM-as-judge systems for safety evaluations must manage sampling parameters explicitly. Even with temperature set to 0, evaluations may yield non-reproducible results and so misrepresent safety properties. Report variance alongside verdicts and treat grader disagreement as a primary health metric. Some newer models, such as Claude Opus 4.7/4.8, no longer support temperature control and require alternative reproducibility strategies.

Key insights

LLM-as-judge evaluations are not deterministic even at temperature 0, impacting safety assessment reproducibility.

Principles

Provider defaults (such as temperature 1.0) apply whenever the evaluation harness does not set sampling parameters explicitly.
Reproducibility issues persist even with forced greedy decoding.
Grader disagreement is a critical metric for evaluation harness health.
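
One way to make grader disagreement concrete is to repeat each grading call and measure how often the verdict departs from the most common one. The sketch below is illustrative only: the metric definition, the function names, and the sample verdicts are assumptions, not the study's method or part of aisev.

```python
from collections import Counter


def disagreement_rate(verdicts):
    """Fraction of runs whose verdict differs from the most common verdict.

    Assumed definition for illustration; the original study may measure
    variability differently.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    modal_count = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - modal_count / len(verdicts)


def summarize(runs_by_item):
    """Return per-item disagreement rates and the items that were not reproducible."""
    rates = {item: disagreement_rate(v) for item, v in runs_by_item.items()}
    unstable = sorted(item for item, rate in rates.items() if rate > 0.0)
    return rates, unstable


# Hypothetical verdicts from 20 identical runs per item (made-up data).
runs = {
    "item-1": ["safe"] * 20,                    # stable item
    "item-2": ["safe"] * 10 + ["unsafe"] * 10,  # item near a decision boundary
}
rates, unstable = summarize(runs)
print(rates)
print(unstable)
```

Reporting the per-item rates and the list of unstable items next to the headline score keeps a noisy verdict from being read as a safety property.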

Method

Testing involved 690 API calls across two providers, three model tiers, and five sampling configurations to assess reproducibility.

In practice

Implement explicit temperature and seed settings in LLM-as-judge harnesses.
Monitor and report grader disagreement metrics alongside evaluation scores.
Account for provider-specific LLM behavior, like deprecated temperature controls.
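
The first and third points can be sketched as a small helper that always states sampling parameters explicitly and records when a model cannot honor them. Everything here is hypothetical: the parameter names `temperature` and `seed`, the model identifiers, and the set of models without temperature support are placeholders to be checked against each provider's documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingConfig:
    """Sampling settings the harness wants to pin (assumed field names)."""
    temperature: Optional[float] = 0.0
    seed: Optional[int] = 0


def build_request_params(model, config, no_temperature_models=frozenset()):
    """Build explicit request parameters and notes about what could not be pinned.

    `no_temperature_models` is a caller-maintained, hypothetical set of model
    identifiers that do not accept a temperature parameter.
    """
    params = {"model": model}
    notes = []
    if model in no_temperature_models:
        notes.append("temperature not supported; rely on repeated runs and variance reporting")
    elif config.temperature is not None:
        params["temperature"] = config.temperature
    else:
        notes.append("temperature unset; provider default applies")
    if config.seed is not None:
        # Not every provider accepts a seed; treat this key as an assumption.
        params["seed"] = config.seed
    return params, notes


params, notes = build_request_params("judge-a", SamplingConfig())
params_b, notes_b = build_request_params(
    "judge-b", SamplingConfig(), no_temperature_models=frozenset({"judge-b"})
)
print(params, notes)
print(params_b, notes_b)
```

Logging the returned notes with each run records which verdicts were produced without pinned sampling, which is where variance reporting matters most.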

Topics

LLM-as-Judge
Safety Evaluation
Reproducibility
Sampling Temperature
Claude Opus
AISEV

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.