Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Research Methodology & Innovation) · Depth: Expert, extended

Summary

A study by Mohammadi, Gaur, and Ferraro of the University of Maryland, Baltimore County, investigates how well large language models (LLMs) perform scientific feasibility assessment: determining whether a scientific claim aligns with established knowledge and whether experiments can support or refute it. The researchers developed a controlled-knowledge framework and evaluated gpt-5.1, gpt-4o, Gemini-2.5-Pro, Gemini-2.5-Flash, and Grok-4.1-fast under four conditions: hypothesis only, hypothesis with experiment descriptions, hypothesis with outcome summaries, and hypothesis with both. On the Matter-of-Fact and Reasons datasets, outcome evidence generally improved accuracy more reliably than experiment descriptions did. Partial experimental context could also degrade performance, sometimes to below the hypothesis-only baseline, which points to brittle reasoning rather than robust handling of uncertainty.

Key takeaway

Research scientists deploying LLMs in scientific workflows should recognize that these models tend to treat feasibility as surface-level classification. Prioritize explicit outcome evidence over experiment descriptions, since outcomes offer more reliable grounding. Be cautious with partial experimental context: it can actively mislead models and push accuracy below the hypothesis-only baseline. Diagnostic evaluations that compare each context condition against that baseline are needed to expose this kind of brittle reasoning.
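A diagnostic of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names (`accuracy`, `flag_degradations`), the condition labels, and the 0/1 label encoding are all assumptions made for the example.

```python
from typing import Dict, List


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def flag_degradations(
    predictions_by_condition: Dict[str, List[int]],
    labels: List[int],
    baseline: str = "hypothesis_only",  # hypothetical condition name
) -> Dict[str, dict]:
    """Compare every condition against the baseline and flag any that score lower."""
    base_acc = accuracy(predictions_by_condition[baseline], labels)
    report = {}
    for condition, preds in predictions_by_condition.items():
        if condition == baseline:
            continue
        delta = accuracy(preds, labels) - base_acc
        report[condition] = {"delta": delta, "below_baseline": delta < 0}
    return report
```

Given model predictions collected separately for each context condition, any entry with `below_baseline` set to `True` marks a condition where the added context hurt rather than helped and deserves a closer look.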

Key insights

Outcome evidence is more reliable for LLM scientific feasibility assessment than experiment descriptions alone.

Principles

Method

The controlled-knowledge framework pairs each hypothesis with one of four context conditions (hypothesis only, experiments, outcomes, or both) and measures how LLM feasibility judgments shift across them. It also tests robustness by progressively removing evidence from the context.
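The four conditions and the evidence-removal step can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's implementation: the `Example` fields, the prompt wording, the 'feasible'/'infeasible' answer format, and the sentence-level truncation in `truncate_evidence` are all hypothetical choices, and the paper's actual removal scheme may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    hypothesis: str
    experiment: Optional[str] = None  # description of the experimental setup
    outcome: Optional[str] = None     # summary of the observed results


# Each condition toggles (include experiment, include outcome).
CONDITIONS = {
    "hypothesis_only": (False, False),
    "with_experiment": (True, False),
    "with_outcome": (False, True),
    "with_both": (True, True),
}


def build_prompt(example: Example, condition: str) -> str:
    """Assemble the prompt for one context condition (wording is an assumption)."""
    use_experiment, use_outcome = CONDITIONS[condition]
    parts = [f"Hypothesis: {example.hypothesis}"]
    if use_experiment and example.experiment:
        parts.append(f"Experiment: {example.experiment}")
    if use_outcome and example.outcome:
        parts.append(f"Outcome: {example.outcome}")
    parts.append("Is this hypothesis feasible? Answer 'feasible' or 'infeasible'.")
    return "\n".join(parts)


def truncate_evidence(text: str, keep_fraction: float) -> str:
    """Keep only the leading fraction of sentences: a simple stand-in for
    progressive evidence removal."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    keep = int(len(sentences) * keep_fraction)
    return ". ".join(sentences[:keep]) + ("." if keep else "")
```

Running the same hypothesis through all four keys of `CONDITIONS`, and then again with `truncate_evidence` applied at decreasing values of `keep_fraction`, yields the set of prompts whose answers can be compared to see how judgments move as context is added or withdrawn.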

Best for: Research Scientist, AI Scientist, NLP Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.