Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks
Summary
A repeated-run evaluation protocol for the accuracy and stability of Large Language Models (LLMs) on deterministic programming tasks shows that standard run-level pass rates can overstate retry-free coverage by up to 17.8 percentage points. The discrepancy is most pronounced in mid-performing systems and can reverse model rankings among closely matched LLMs. The study evaluated 16 models from five provider families on 100 LeetCode-style problems, using two prompt templates and five repeated runs per problem, for a total of 16,000 evaluation instances (16 × 100 × 2 × 5). Run-level pass rate and perfect stability rate are strongly correlated (r = 0.985), but the gap between them is consistent, which is why stability metrics need to be reported alongside accuracy. Prompt effects were model-dependent rather than universally beneficial. Together, these findings indicate that repeated-run stability analysis is a necessary part of LLM evaluation in deterministic text-conditioned generation.
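As a quick check on the study's scale, the evaluation grid can be enumerated directly. This is only an illustrative sketch: the constant and function names are hypothetical, and the integer indices stand in for the actual models, problems, templates, and runs, which the summary does not list.

```python
from itertools import product

# Grid dimensions as reported in the summary.
N_MODELS = 16
N_PROBLEMS = 100
N_TEMPLATES = 2
N_RUNS = 5


def evaluation_grid(n_models=N_MODELS, n_problems=N_PROBLEMS,
                    n_templates=N_TEMPLATES, n_runs=N_RUNS):
    """Yield one (model, problem, template, run) index tuple per instance."""
    return product(range(n_models), range(n_problems),
                   range(n_templates), range(n_runs))


# Count the instances by walking the full grid.
total_instances = sum(1 for _ in evaluation_grid())
print(total_instances)  # 16000
```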
Key takeaway
For Machine Learning Engineers deploying LLMs on deterministic programming tasks, relying solely on single-run pass rates risks misrepresenting both model performance and model stability. Integrate a repeated-run evaluation protocol that tracks retry-free coverage and per-problem variability to get a more accurate picture of an LLM's consistency. Doing so helps avoid misranking closely matched systems and gives better evidence that the chosen model will behave consistently in production.
Key insights
LLM evaluation for deterministic tasks requires repeated-run stability analysis to complement conventional single-run accuracy reporting.
Principles
- Run-level pass rate overstates retry-free coverage.
- Consistent outcomes require stability metrics.
- Prompt effects are model-dependent.
Method
A repeated-run evaluation protocol measures run-level accuracy, retry-free coverage, and per-problem variability across multiple invocations for deterministic text-conditioned generation.
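The three quantities named above can be sketched as follows. The summary does not give formal definitions, so this sketch assumes that retry-free coverage is the fraction of problems solved on every run and that per-problem variability is the population standard deviation of the pass/fail outcomes; the function names and the toy data are hypothetical and are not results from the study.

```python
from statistics import pstdev


def run_level_pass_rate(results):
    """Fraction of all individual runs that passed."""
    total = sum(len(runs) for runs in results.values())
    passed = sum(sum(runs) for runs in results.values())
    return passed / total


def retry_free_coverage(results):
    """Fraction of problems passed on every run (assumed definition)."""
    return sum(all(runs) for runs in results.values()) / len(results)


def per_problem_variability(results):
    """Population std. dev. of 0/1 outcomes per problem (assumed definition)."""
    return {pid: pstdev([1.0 if ok else 0.0 for ok in runs])
            for pid, runs in results.items()}


# Toy data: problem id -> pass/fail outcome for each of five runs.
toy_results = {
    "p1": [True, True, True, True, True],
    "p2": [True, True, True, True, False],
    "p3": [True, False, True, False, True],
    "p4": [False, False, False, False, False],
}

pass_rate = run_level_pass_rate(toy_results)
coverage = retry_free_coverage(toy_results)
variability = per_problem_variability(toy_results)

print(f"run-level pass rate: {pass_rate:.2f}")   # 0.60
print(f"retry-free coverage: {coverage:.2f}")    # 0.25
```

In this toy case the pass rate credits the partial successes on p2 and p3, while retry-free coverage counts only p1, which is the kind of gap the protocol is designed to expose.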
In practice
- Implement repeated-run LLM evaluations.
- Track retry-free coverage alongside pass rate.
- Test prompt templates per model.
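One way to test prompt templates per model is to compute the change in pass rate between two templates separately for each model instead of pooling across models. The sketch below is a minimal illustration: the model names, template names, and outcomes are invented placeholders, and `template_deltas` is a hypothetical helper.

```python
def pass_rate(runs):
    """Fraction of runs that passed."""
    return sum(runs) / len(runs)


def template_deltas(outcomes, baseline, variant):
    """Per-model pass-rate change when switching from baseline to variant."""
    return {
        model: pass_rate(by_template[variant]) - pass_rate(by_template[baseline])
        for model, by_template in outcomes.items()
    }


# Toy data: model -> template -> pass/fail outcomes (placeholders only).
toy_outcomes = {
    "model_a": {
        "template_1": [True, True, False, False],
        "template_2": [True, True, True, False],
    },
    "model_b": {
        "template_1": [True, True, True, True],
        "template_2": [True, True, False, True],
    },
}

deltas = template_deltas(toy_outcomes, "template_1", "template_2")
for model, delta in deltas.items():
    print(f"{model}: {delta:+.2f}")
```

Here the same template change helps one placeholder model and hurts the other, so a pooled average would hide both effects.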
Topics
- Large Language Models
- LLM Evaluation
- Code Generation
- Model Stability
- Deterministic Tasks
- Performance Metrics
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.