Study finds ChatGPT gets science wrong more often than you think

2026-03-18 · Source: Robotics Research News -- ScienceDaily · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Novice, short

Summary

A new study by Washington State University researchers, led by Professor Mesut Cicek, found that ChatGPT exhibits significant limitations in scientific reasoning and consistency. The study, published in the *Rutgers Business Review*, tested ChatGPT-3.5 (in 2024) and ChatGPT-5 mini (in 2025) against over 700 scientific hypotheses, asking the AI to judge their truthfulness ten times each. While initial accuracy appeared to be around 80%, adjusting for random guessing revealed the AI performed only about 60% better than chance, akin to a low D grade. Critically, ChatGPT frequently contradicted itself, providing inconsistent answers to identical prompts approximately 27% of the time, sometimes flipping between "true" and "false" for the same question. The AI struggled most with identifying false statements, correctly labeling only 16.4% of them.
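The summary does not say how the researchers adjusted for guessing. As a rough illustration only, one common correction for a two-choice task rescales raw accuracy against a 50% guessing baseline; the 50% baseline and the function name `chance_adjusted` are assumptions, not details from the study. Under that assumption, 80% raw accuracy works out to the roughly 60% figure quoted above.

```python
def chance_adjusted(accuracy: float, chance: float = 0.5) -> float:
    """Fraction of the gap between guessing and a perfect score that was closed.

    Assumes a fixed guessing baseline (0.5 for a true/false task).
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (accuracy - chance) / (1.0 - chance)


# 80% raw accuracy on a true/false task, assumed 50% guessing baseline.
print(f"{chance_adjusted(0.80):.2f}")  # prints 0.60
```

A model that guessed every answer would score 0 on this scale, and a perfect one would score 1.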

Key takeaway

For CTOs and VPs of Engineering evaluating AI tools for critical decision support, this study underscores the need for extreme caution. Your teams should implement robust verification processes for any AI-generated insights, especially in nuanced or complex domains. Do not mistake AI's fluent output for genuine understanding or consistent reasoning; instead, approach AI recommendations with skepticism and ensure human oversight to mitigate risks of inaccuracy and contradiction.

Key insights

ChatGPT's scientific reasoning and consistency are limited: despite fluent output, it often guesses and contradicts itself.

Principles

AI fluency does not equate to conceptual understanding.
Current LLMs lack a human-like "brain" for understanding the world.

Method

Researchers posed each of 719 scientific hypotheses to ChatGPT-3.5 and ChatGPT-5 mini ten times, asking whether the claim was true or false, and measured both accuracy and consistency across the repeated answers.
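A minimal sketch of how such a repeated-prompt evaluation could be scored, assuming the answers have already been collected. The data below is made up, and defining "consistent" as all repeats agreeing is an assumption; the study's exact metric is not described here.

```python
from typing import Dict, List, Tuple


def accuracy_and_consistency(
    runs: Dict[str, List[bool]], truth: Dict[str, bool]
) -> Tuple[float, float]:
    """Return (share of correct answers, share of claims answered identically every time)."""
    total = correct = consistent = 0
    for claim, answers in runs.items():
        correct += sum(a == truth[claim] for a in answers)
        total += len(answers)
        consistent += len(set(answers)) == 1  # every repeat gave the same verdict
    return correct / total, consistent / len(runs)


# Hypothetical toy data: four claims, ten repeated true/false answers each.
runs = {
    "H1": [True] * 10,
    "H2": [True] * 7 + [False] * 3,  # the model flips on the same prompt
    "H3": [True] * 10,               # consistent but wrong (claim is false)
    "H4": [False] * 10,
}
truth = {"H1": True, "H2": True, "H3": False, "H4": False}

acc, cons = accuracy_and_consistency(runs, truth)
print(f"accuracy={acc:.3f} consistency={cons:.2f}")  # prints accuracy=0.675 consistency=0.75
```

Note that H3 counts as consistent even though every answer is wrong, which is why the two measures are reported separately.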

In practice

Verify AI-generated information for critical decisions.
Train teams on AI system capabilities and limitations.
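One possible verification step, sketched under assumptions: ask the same question several times and route anything short of full agreement to a person. The `ask` callable, the `flaky_stub` stand-in, and the thresholds are hypothetical placeholders for whatever model client and policy a team actually uses.

```python
import itertools
from collections import Counter
from typing import Callable, Tuple


def ask_with_agreement(
    ask: Callable[[str], str], prompt: str, n: int = 5, min_agreement: float = 1.0
) -> Tuple[str, float, bool]:
    """Ask n times; return (most common answer, agreement rate, needs_human_review)."""
    answers = [ask(prompt) for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / n
    return top, agreement, agreement < min_agreement


# Hypothetical stand-in for a model call that flips its answer every third time.
_replies = itertools.cycle(["true", "true", "false"])


def flaky_stub(prompt: str) -> str:
    return next(_replies)


answer, agreement, needs_review = ask_with_agreement(flaky_stub, "Claim X is supported.", n=6)
print(answer, round(agreement, 2), needs_review)  # prints true 0.67 True
```

Agreement alone is not proof of correctness: the study reports that ChatGPT correctly labeled only 16.4% of false statements, so a unanimous answer on a critical decision still needs an independent check.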

Topics

ChatGPT
AI Accuracy
AI Consistency
Large Language Models
Scientific Reasoning

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Researcher, AI Product Manager, Executive

Editorial summary, takeaway, and curation by AIssential. Original article published by Robotics Research News -- ScienceDaily.