
ChatGPT Contradicts on Repeated Science Questions


💡 WSU study: ChatGPT contradicts itself on repeated science questions, a key reliability warning for apps built on LLMs.

⚡ 30-Second TL;DR

What Changed

Washington State University published an evaluation of ChatGPT's reliability on scientific hypotheses.

Why It Matters

Exposes the reliability limits of LLMs on science tasks, pushing AI builders toward hybrid verification systems and better prompt engineering.

What To Do Next

Run 10 identical science prompts on ChatGPT to benchmark consistency in your RAG pipeline.
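A minimal sketch of that benchmark, assuming your pipeline exposes some callable that returns a normalized verdict (e.g. "true"/"false") for a prompt; the `ask` parameter and the scoring helper below are illustrative, not from the study:

```python
from collections import Counter

def consistency_rate(answers):
    """Fraction of runs agreeing with the most common answer.
    The WSU study reports ~73% consistency over 10 repeats of identical prompts."""
    if not answers:
        return 0.0
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

def benchmark(ask, prompt, runs=10):
    """`ask` is your model call (e.g. a thin wrapper around your RAG pipeline)
    returning a normalized verdict string for the given prompt."""
    return consistency_rate([ask(prompt) for _ in range(runs)])
```

Wire `ask` to your own LLM client and normalize its output (strip, lowercase, map to a fixed label set) before counting; otherwise trivial wording differences inflate the inconsistency measurement.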

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • The study tested ChatGPT across two versions (GPT-3.5 in 2024 and GPT-5 mini in 2025) using 719 hypotheses from business journal papers published since 2021, revealing that the accuracy improvement from 76.5% to 80% remains marginal once the random-guessing baseline (a 50% chance on true/false questions) is taken into account.
  • ChatGPT's inconsistency rate is severe: when identical prompts were repeated 10 times, the model achieved only 73% consistency in statement evaluation, meaning roughly 1 in 4 repeated queries produced different answers on the same scientific claim.
  • The model exhibits asymmetric failure modes, correctly identifying false hypotheses only 16.4% of the time—substantially worse than true hypothesis identification—suggesting ChatGPT has a systematic bias toward confirming statements rather than detecting falsehoods.
  • Researcher Mesut Cicek characterized current AI tools as memorization systems without genuine comprehension, stating they 'don't understand the world the way we do' and 'don't have a brain,' framing the accuracy ceiling as a fundamental architectural limitation rather than a training data problem.
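To see why the takeaways above call the 76.5% → 80% improvement marginal, the accuracies can be rescaled against the 50% random-guessing baseline for true/false questions; this kappa-style normalization is an illustrative calculation, not one published by the study:

```python
def chance_adjusted(accuracy, baseline=0.5):
    """Fraction of the above-chance gap actually achieved
    (a Cohen's-kappa-style rescaling against random guessing)."""
    return (accuracy - baseline) / (1.0 - baseline)

gpt35 = chance_adjusted(0.765)    # 2024 run: roughly 0.53 above chance
gpt5_mini = chance_adjusted(0.80) # 2025 run: roughly 0.60 above chance
```

On this scale both versions sit near the middle of the range between guessing (0.0) and perfect reliability (1.0), which is why the raw-accuracy gain reads as modest.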

🔮 Future Implications

AI analysis grounded in cited sources.

ChatGPT reliability for scientific fact-checking remains below institutional standards for decision-making.
Performance only 60% better than random chance on true/false scientific questions suggests the model cannot be trusted as a primary source for validating research claims in academic or professional contexts.
Inconsistency in repeated queries poses risks for reproducibility in AI-assisted research workflows.
The 73% consistency rate means researchers using ChatGPT for literature review or hypothesis evaluation may receive contradictory guidance across sessions, undermining scientific rigor.
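The "60% better than random chance" figure above is a relative improvement, distinct from the absolute accuracy numbers; a one-line check of the arithmetic:

```python
# Relative improvement of 80% accuracy over a 50% random-guess baseline
accuracy, chance = 0.80, 0.50
relative_gain = (accuracy - chance) / chance  # (0.80 - 0.50) / 0.50
```

Note this reads generously: the same 80% accuracy is only 60% of the way from guessing to perfect performance on the normalized scale.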

Timeline

2024
WSU researchers test ChatGPT-3.5 against 719 scientific hypotheses from business journals; achieve 76.5% accuracy
2025
Study repeated with ChatGPT-5 mini; accuracy improves marginally to 80%, but consistency remains problematic at 73%
2026-03
WSU publishes findings showing ChatGPT receives 'D' grade for reliability on scientific hypothesis evaluation; study receives widespread media coverage

📎 Sources (8)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.

  1. sciencesprings.wordpress.com — From Washington State University: AI Gets a D: Study Shows Inaccuracies, Inconsistency in ChatGPT Answers
  2. katv.com — Study Finds ChatGPT Answers Inaccurate and Inconsistent, Washington State University Says
  3. wcti12.com — Study Finds ChatGPT Answers Inaccurate and Inconsistent, Washington State University Says
  4. kutv.com — Study Finds ChatGPT Answers Inaccurate and Inconsistent, Washington State University Says
  5. ktul.com — Study Finds ChatGPT Answers Inaccurate and Inconsistent, Washington State University Says
  6. komonews.com — Study Finds ChatGPT Answers Inaccurate and Inconsistent, Washington State University Says
  7. katu.com — Study Finds ChatGPT Answers Inaccurate and Inconsistent, Washington State University Says
  8. news.wsu.edu — AI Gets a D: Study Shows Inaccuracies, Inconsistency in ChatGPT Answers


AI-curated news aggregator. All content rights belong to original publishers.
Original source: cnBeta (Full RSS)