Study finds ChatGPT gets science wrong more often than you think
Summary
A new study by Washington State University researchers, led by Professor Mesut Cicek, found that ChatGPT exhibits significant limitations in scientific reasoning and consistency. The study, published in the *Rutgers Business Review*, tested ChatGPT-3.5 (in 2024) and ChatGPT-5 mini (in 2025) against over 700 scientific hypotheses, asking the AI to judge their truthfulness ten times each. While initial accuracy appeared to be around 80%, adjusting for random guessing revealed the AI performed only about 60% better than chance, akin to a low D grade. Critically, ChatGPT frequently contradicted itself, providing inconsistent answers to identical prompts approximately 27% of the time, sometimes flipping between "true" and "false" for the same question. The AI struggled most with identifying false statements, correctly labeling only 16.4% of them.
Key takeaway
For CTOs and VPs of Engineering evaluating AI tools for critical decision support, this study underscores the need for extreme caution. Your teams should implement robust verification processes for any AI-generated insights, especially in nuanced or complex domains. Do not mistake AI's fluent output for genuine understanding or consistent reasoning; instead, approach AI recommendations with skepticism and ensure human oversight to mitigate risks of inaccuracy and contradiction.
Key insights
ChatGPT's scientific reasoning and consistency are limited: it often guesses and contradicts itself despite producing fluent output.
Principles
- AI fluency does not equate to conceptual understanding.
- Current LLMs lack a human-like "brain" for understanding the world.
Method
Researchers tested ChatGPT-3.5 and ChatGPT-5 mini on 719 scientific hypotheses, prompting 10 times per hypothesis for a true-or-false judgment, then measured accuracy and consistency across the repeated answers.
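The metrics named above can be sketched in a few lines of Python. This is an illustration only, not the researchers' code: the function names, the data layout, and the 50% chance baseline for a true/false task are assumptions. The chance adjustment shown is the standard rescaling (accuracy - chance) / (1 - chance), which maps the reported 80% raw accuracy to the reported 60%. The study's exact definition of consistency is not given in this summary, so the "all repeated answers agree" rule below is also an assumption.

```python
def chance_adjusted(accuracy, chance=0.5):
    """Rescale accuracy so that pure guessing scores 0 and perfect scores 1.

    Assumes a two-option (true/false) task, hence chance=0.5 by default.
    """
    return (accuracy - chance) / (1 - chance)


def score_runs(runs, truth, chance=0.5):
    """Score repeated true/false judgments.

    runs:  dict mapping a hypothesis id to the list of bool answers it received
    truth: dict mapping the same ids to the correct bool label
    Returns (raw accuracy, chance-adjusted accuracy, consistency), where
    consistency is the share of hypotheses whose answers all agree
    (an assumed definition, chosen for illustration).
    """
    total = 0
    correct = 0
    unanimous = 0
    for hyp_id, answers in runs.items():
        total += len(answers)
        correct += sum(answer == truth[hyp_id] for answer in answers)
        if len(set(answers)) == 1:
            unanimous += 1
    accuracy = correct / total
    return accuracy, chance_adjusted(accuracy, chance), unanimous / len(runs)
```

With made-up data, a hypothesis answered "true" ten times and another answered "true" eight times and "false" twice (both actually true) give 90% raw accuracy, 80% chance-adjusted accuracy, and 50% consistency.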
In practice
- Verify AI-generated information for critical decisions.
- Train teams on AI system capabilities and limitations.
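One way a team might put the verification advice into practice is to ask the same question several times and route any disagreement to a human reviewer. The sketch below assumes a caller-supplied `ask` function that wraps whatever model API is in use and returns a bool; `repeated_verdict`, `min_agreement`, and the result keys are hypothetical names, not part of any library or of the study.

```python
from collections import Counter
from typing import Callable


def repeated_verdict(ask: Callable[[str], bool], claim: str,
                     n: int = 10, min_agreement: float = 1.0) -> dict:
    """Ask the same true/false question n times and flag unstable answers.

    ask is a hypothetical wrapper around a model call; it is an assumption
    of this sketch, not a real API.
    """
    answers = [ask(claim) for _ in range(n)]
    # most_common(1) returns the majority answer and its vote count.
    verdict, votes = Counter(answers).most_common(1)[0]
    agreement = votes / n
    return {
        "verdict": verdict,
        "agreement": agreement,
        # Anything short of the required agreement goes to a human reviewer.
        "needs_review": agreement < min_agreement,
    }
```

A unanimous answer only shows the model is stable on that prompt, not that it is right: the study reports that only 16.4% of false statements were labeled correctly, so a check like this complements human verification rather than replacing it.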
Topics
- ChatGPT
- AI Accuracy
- AI Consistency
- Large Language Models
- Scientific Reasoning
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Researcher, AI Product Manager, Executive
Editorial summary, takeaway, and curation by AIssential. Original article published by Robotics Research News -- ScienceDaily.