Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations
Summary
A study of LLM-as-judge safety evaluations finds that setting the sampling temperature to 0 does not guarantee deterministic grading, contrary to a widespread assumption. In tests with Japan AISI's open-source aisev harness, default provider settings (temperature 1.0) produced significant per-item disagreement: up to ~50% variability across 20 identical runs for items near decision boundaries. Even with temperature explicitly pinned to 0, 1-2 of 7 borderline items remained non-reproducible. The experiments comprised 690 API calls spanning two providers, three model tiers, and five sampling configurations. Notably, Claude Opus 4.7/4.8 has deprecated temperature control, so this mitigation is unavailable for those newer models. The findings point to a structural gap: an evaluation harness that reports single-run verdicts without variance can present noise as a safety property. The authors recommend treating grader disagreement as a first-class health metric.
Key takeaway
MLOps engineers deploying LLM-as-judge systems for safety evaluations should set sampling parameters explicitly rather than rely on provider defaults. Even with temperature set to 0, verdicts on borderline items may not reproduce, so a single-run result can misrepresent noise as a safety property. Report variance across repeated runs and treat grader disagreement as a primary health metric. Some newer models, such as Claude Opus 4.7/4.8, no longer support temperature control and require alternative reproducibility strategies.
Key insights
LLM-as-judge evaluations are not deterministic even at temperature 0, which undermines the reproducibility of safety assessments.
Principles
- Default LLM provider settings can override evaluation harness parameters.
- Reproducibility issues persist even with forced greedy decoding.
- Grader disagreement is a critical metric for evaluation harness health.
Method
Testing involved 690 API calls across two providers, three model tiers, and five sampling configurations to assess reproducibility.
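As an illustration of how repeated identical runs can be turned into a per-item reproducibility number, here is a minimal Python sketch. It assumes one plausible definition of disagreement (the fraction of runs whose verdict differs from the majority verdict); the study's exact metric is not given in this summary. The function names, item IDs, and verdict labels are hypothetical.

```python
from collections import Counter


def disagreement_rate(verdicts):
    """Fraction of runs whose verdict differs from the majority verdict.

    Assumed definition for illustration only; 0.0 means every run agreed.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    majority_count = counts.most_common(1)[0][1]
    return 1.0 - majority_count / len(verdicts)


def unstable_items(runs_by_item, threshold=0.0):
    """Return {item_id: rate} for items whose rate exceeds the threshold."""
    rates = {item: disagreement_rate(v) for item, v in runs_by_item.items()}
    return {item: rate for item, rate in rates.items() if rate > threshold}


# Hypothetical data: 20 repeated gradings per item with an identical prompt.
runs = {
    "item-clear": ["safe"] * 20,
    "item-borderline": ["safe"] * 10 + ["unsafe"] * 10,
}

print(unstable_items(runs))  # {'item-borderline': 0.5}
```

Under this definition, a binary verdict that splits evenly across runs reaches the maximum rate of 0.5, which is one way a borderline item could show variability on the order of 50%.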
In practice
- Implement explicit temperature and seed settings in LLM-as-judge harnesses.
- Monitor and report grader disagreement metrics alongside evaluation scores.
- Account for provider-specific LLM behavior, like deprecated temperature controls.
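The first and third points above can be sketched together as a small helper that always states sampling parameters explicitly and records when a model does not accept one. This is a hedged sketch, not any provider's API: `build_sampling_params`, the capability flags, and the model name are hypothetical, and which parameters a given model actually accepts has to be checked against that provider's documentation.

```python
def build_sampling_params(model, supports_temperature, supports_seed, seed=0):
    """Build explicit request parameters for a judge call.

    Returns (params, notes). Parameters the model does not accept are
    omitted, and a note records the gap so it can be logged with the run.
    The capability flags are supplied by the caller (an assumption here).
    """
    params = {"model": model}
    notes = []

    if supports_temperature:
        # Pin the temperature instead of inheriting the provider default.
        params["temperature"] = 0.0
    else:
        notes.append("temperature not accepted; rely on repeated runs")

    if supports_seed:
        params["seed"] = seed
    else:
        notes.append("seed not accepted")

    return params, notes


# Hypothetical judge model that accepts temperature but not a seed.
params, notes = build_sampling_params(
    "judge-model-a", supports_temperature=True, supports_seed=False
)
print(params)
print(notes)
```

Storing the returned notes next to each evaluation result makes it visible which verdicts were produced without pinned sampling settings, and therefore which ones depend most on repeated runs and disagreement reporting.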
Topics
- LLM-as-Judge
- Safety Evaluation
- Reproducibility
- Sampling Temperature
- Claude Opus
- AISEV
Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.