Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
Summary
A study compared the consistency of exercise prescriptions generated by GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash. Each model generated 20 prescriptions for each of six clinical scenarios at temperature=0, giving 360 outputs in total. The analysis covered semantic similarity, output reproducibility, FITT (frequency, intensity, time, type) classification, and safety expression. GPT-4.1 achieved the highest mean semantic similarity at 0.955, followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences (H = 458.41, p < .001). Notably, every GPT-4.1 output was unique yet the model remained semantically stable, whereas only 27.5% of Gemini 2.5 Flash outputs were unique, indicating that its high similarity stemmed from text duplication. Safety expression was consistently high across all models, which limits its use as a differentiator.
Key takeaway
AI engineers developing LLM-based clinical support systems should evaluate model consistency with repeated generation studies. Relying solely on single-output evaluations or on semantic similarity scores can obscure critical differences in generative behavior, such as text duplication, that affect reliability. For dependable clinical deployment, prioritize models that show high semantic stability across unique outputs.
Key insights
LLM consistency on clinical tasks varies significantly across models, so evaluation needs to go beyond single-output metrics.
Principles
- Semantic similarity can mask text duplication.
- Reproducibility reveals generative behavior.
- Model selection is a clinical decision.
Method
Each model generated 20 prescriptions for each of six clinical scenarios at temperature=0. The outputs were then analyzed for semantic similarity, output reproducibility, FITT classification, and safety expression.
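A minimal sketch of how two of these measurements could be computed for one model and one scenario. It is not the study's pipeline. The summary does not say which embedding model was used, so `bow_cosine` is a hypothetical lexical stand-in for semantic similarity, and `consistency_report` is an illustrative name.

```python
import math
from collections import Counter
from itertools import combinations


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors.

    Assumption: a lexical stand-in for the embedding-based semantic
    similarity a real evaluation would use.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    sq_a = sum(c * c for c in va.values())
    sq_b = sum(c * c for c in vb.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0


def consistency_report(outputs: list[str]) -> dict[str, float]:
    """Summarise repeated generations for one model and one scenario."""
    if len(outputs) < 2:
        raise ValueError("need at least two generations to compare")
    pairs = list(combinations(outputs, 2))
    mean_similarity = sum(bow_cosine(a, b) for a, b in pairs) / len(pairs)
    # Share of generations whose text is distinct from all others.
    unique_rate = len(set(outputs)) / len(outputs)
    return {
        "mean_pairwise_similarity": mean_similarity,
        "unique_output_rate": unique_rate,
    }
```

Reporting both numbers side by side is the point of the sketch: a high mean similarity paired with a low unique-output rate suggests the similarity comes from repeated text rather than from varied but stable wording.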
In practice
- Evaluate LLMs with repeated generation.
- Check for text duplication in outputs.
- Prioritize reproducibility for clinical use.
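The duplication check in the list above can be sketched as follows. The function name and the whitespace normalization rule are assumptions for illustration, not something the study specifies. The sketch reports how many outputs are distinct and what share of all output pairs are exact duplicates, since each such pair contributes a similarity of 1.0 to a pairwise mean.

```python
from collections import Counter


def duplication_profile(outputs: list[str]) -> dict[str, float]:
    """Quantify exact-text duplication among repeated generations."""
    # Assumption: texts that differ only in whitespace count as duplicates.
    counts = Counter(" ".join(text.split()) for text in outputs)
    n = len(outputs)
    total_pairs = n * (n - 1) // 2
    # Each group of c identical texts yields c * (c - 1) / 2 duplicate pairs.
    duplicate_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return {
        "unique_outputs": len(counts),
        "unique_rate": len(counts) / n if n else 0.0,
        "duplicate_pair_share": duplicate_pairs / total_pairs if total_pairs else 0.0,
    }
```

As a hypothetical illustration, 20 outputs made up of five distinct texts repeated four times each would have 30 of 190 pairs as exact duplicates, which raises the mean pairwise similarity without reflecting any stability across varied wording.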
Topics
- Large Language Models
- Exercise Prescription
- Cross-Model Consistency
- Semantic Similarity
- Output Reproducibility
Best for: AI Engineer, NLP Engineer, AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.