Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

2026-05-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

A new evaluation protocol measures the accuracy and stability of Large Language Models (LLMs) on deterministic programming tasks and finds that standard run-level pass rates can overstate retry-free coverage by up to 17.8 percentage points. The discrepancy is most pronounced in mid-performing systems and can reverse the ranking of closely matched LLMs. The study evaluated 16 models from five provider families on 100 LeetCode-style problems, using two prompt templates and five repeated runs per problem, for 16,000 evaluation instances (16 × 100 × 2 × 5). Run-level pass rate and perfect stability rate are strongly correlated (r = 0.985), yet the gap between them is consistent, so pass rate alone does not capture stability. Prompt effects were model-dependent rather than universally beneficial. The findings indicate that repeated-run stability analysis is needed for comprehensive LLM evaluation in deterministic text-conditioned generation.

Key takeaway

For Machine Learning Engineers deploying LLMs on deterministic programming tasks, relying solely on single-run pass rates risks misrepresenting a model's performance and stability. Integrate a repeated-run evaluation protocol that tracks retry-free coverage and per-problem variability to get a more accurate picture of an LLM's consistency. Doing so helps avoid misranking closely matched systems and makes it more likely that the chosen models behave consistently in production.

Key insights

LLM evaluation for deterministic tasks requires repeated-run stability analysis to complement conventional single-run accuracy reporting.

Principles

Run-level pass rate overstates retry-free coverage.
Consistent outcomes require stability metrics.
Prompt effects are model-dependent.
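
The first principle, and the ranking reversal it can cause, can be illustrated with a toy example. The sketch below uses synthetic pass counts (not data from the study) and assumes that retry-free coverage means the fraction of problems solved on every one of the repeated runs; the source does not spell out the definition, and the model names are placeholders.

```python
# Synthetic illustration only: pass counts out of 5 runs for 10 problems.
# Assumption: "retry-free coverage" = share of problems passed on ALL runs.
RUNS_PER_PROBLEM = 5

passes = {
    # model_a is almost always right, but flaky on four problems (4/5 each).
    "model_a": [5] * 6 + [4] * 4,
    # model_b fails one problem outright, but is fully stable on the rest.
    "model_b": [5] * 9 + [0],
}


def run_level_pass_rate(counts):
    """Fraction of all individual runs that passed."""
    return sum(counts) / (len(counts) * RUNS_PER_PROBLEM)


def retry_free_coverage(counts):
    """Fraction of problems that passed on every run."""
    return sum(c == RUNS_PER_PROBLEM for c in counts) / len(counts)


# Rank the models (best first) under each metric.
by_pass_rate = sorted(
    passes, key=lambda m: run_level_pass_rate(passes[m]), reverse=True
)
by_coverage = sorted(
    passes, key=lambda m: retry_free_coverage(passes[m]), reverse=True
)

for model, counts in passes.items():
    print(model, run_level_pass_rate(counts), retry_free_coverage(counts))
print("ranked by pass rate:", by_pass_rate)
print("ranked by coverage: ", by_coverage)
```

In this toy case model_a leads on run-level pass rate (0.92 against 0.90) while model_b leads clearly on retry-free coverage (0.9 against 0.6), which is the kind of reversal the summary describes for closely matched systems.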

Method

A repeated-run evaluation protocol measures run-level accuracy, retry-free coverage, and per-problem variability across multiple invocations for deterministic text-conditioned generation.
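
A minimal sketch of how the three quantities named above might be computed from raw run outcomes follows. It is not the study's implementation: the function name, the input shape, and the exact definitions are assumptions. In particular, retry-free coverage is taken to be the share of problems that pass on every run, and per-problem variability is approximated by the share of problems with mixed outcomes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def stability_metrics(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Summarize repeated-run outcomes for one model and one prompt template.

    `results` holds one (problem_id, passed) pair per run. The metric
    definitions here are assumptions, not taken from the original article.
    """
    runs_by_problem: Dict[str, List[bool]] = defaultdict(list)
    for problem_id, passed in results:
        runs_by_problem[problem_id].append(passed)
    if not runs_by_problem:
        raise ValueError("no results supplied")

    outcomes = list(runs_by_problem.values())
    total_runs = sum(len(runs) for runs in outcomes)
    total_passes = sum(sum(runs) for runs in outcomes)
    n_problems = len(outcomes)

    # Problems solved on every run need no retry.
    always_pass = sum(all(runs) for runs in outcomes)
    # Problems with both passing and failing runs are unstable.
    mixed = sum(any(runs) and not all(runs) for runs in outcomes)

    return {
        "run_level_pass_rate": total_passes / total_runs,
        "retry_free_coverage": always_pass / n_problems,
        "unstable_problem_rate": mixed / n_problems,
    }


# Synthetic example: four problems, five runs each.
example = (
    [("p1", True)] * 5
    + [("p2", True)] * 4 + [("p2", False)]
    + [("p3", True)] * 3 + [("p3", False)] * 2
    + [("p4", False)] * 5
)
metrics = stability_metrics(example)
print(metrics)
```

On this synthetic input the run-level pass rate is 12/20 = 0.6 while retry-free coverage is only 1/4 = 0.25, and half of the problems show mixed outcomes. Reporting the three numbers side by side makes the gap visible instead of folding it into a single average.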

In practice

Implement repeated-run LLM evaluations.
Track retry-free coverage alongside pass rate.
Test prompt templates per model.
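
One way to act on the last item is to pick the prompt template separately for each model and check whether any single template wins everywhere. The sketch below is illustrative: the scores are synthetic, and the template names "baseline" and "structured" are hypothetical placeholders, not the two templates used in the study.

```python
from typing import Dict


def best_template_per_model(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return, for each model, the template with the highest score.

    `scores[model][template]` is any stability-aware metric, for example
    retry-free coverage measured over repeated runs.
    """
    return {
        model: max(by_template, key=by_template.get)
        for model, by_template in scores.items()
    }


# Synthetic retry-free coverage values; template names are hypothetical.
scores = {
    "model_a": {"baseline": 0.62, "structured": 0.70},
    "model_b": {"baseline": 0.81, "structured": 0.74},
}

best = best_template_per_model(scores)
# If every model prefers the same template, the effect is universal;
# otherwise the choice has to be made per model.
universally_best = len(set(best.values())) == 1

print(best)
print("single template wins for all models:", universally_best)
```

Here the two models prefer different templates, so a template chosen on one model would not carry over to the other.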

Topics

Large Language Models
LLM Evaluation
Code Generation
Model Stability
Deterministic Tasks
Performance Metrics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.