Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models
Summary
A study by Mohammadi, Gaur, and Ferraro from the University of Maryland, Baltimore County, investigates how well large language models (LLMs) perform scientific feasibility assessment: deciding whether a scientific claim is consistent with established knowledge and whether experiments can support or refute it. The researchers developed a controlled-knowledge framework and evaluated gpt-5.1, gpt-4o, Gemini-2.5-Pro, Gemini-2.5-Flash, and Grok-4.1-fast under four conditions: hypothesis only, hypothesis with experiment descriptions, hypothesis with outcome summaries, and hypothesis with both. On the Matter-of-Fact and Reasons datasets, they found that outcome evidence generally improves LLM accuracy more reliably than experiment descriptions do. Partial experimental context can also degrade performance, sometimes to below the hypothesis-only baseline, which points to brittle reasoning rather than robust handling of uncertainty.
Key takeaway
Research scientists deploying LLMs in scientific workflows should recognize that, according to this study, LLMs tend to treat feasibility assessment as surface-level classification. Prioritize explicit outcome evidence over experiment descriptions, since outcomes offer more reliable grounding. Be cautious with partial experimental context: it can actively mislead models and push performance below the hypothesis-only baseline, so use diagnostic evaluations to uncover this kind of brittle reasoning.
Key insights
Outcome evidence is more reliable for LLM scientific feasibility assessment than experiment descriptions alone.
Principles
- Outcome evidence stabilizes feasibility judgments.
- Partial evidence can actively mislead LLMs.
- More evidence is not strictly better for LLMs.
Method
A controlled-knowledge framework pairs each hypothesis with one of four contexts (no context, experiment descriptions, outcome summaries, or both) and measures how LLM feasibility judgments shift across them. It also tests robustness by progressively removing evidence.
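The four context conditions can be illustrated with a small prompt builder. This is a minimal sketch, not the authors' implementation: the condition names, the `build_prompt` and `all_prompts` helpers, and the prompt wording are all assumptions made for illustration.

```python
# Hypothetical condition names; each maps to (include_experiment, include_outcome).
# They mirror the four settings described above, not the paper's actual code.
CONDITIONS = {
    "hypothesis_only": (False, False),
    "experiments": (True, False),
    "outcomes": (False, True),
    "both": (True, True),
}


def build_prompt(hypothesis, experiment, outcome, condition):
    """Assemble one prompt for a single context condition."""
    use_experiment, use_outcome = CONDITIONS[condition]
    parts = [f"Hypothesis: {hypothesis}"]
    if use_experiment:
        parts.append(f"Experiment: {experiment}")
    if use_outcome:
        parts.append(f"Outcome: {outcome}")
    # Assumed question wording; the study's real prompt may differ.
    parts.append("Is this hypothesis feasible? Answer 'feasible' or 'infeasible'.")
    return "\n".join(parts)


def all_prompts(hypothesis, experiment, outcome):
    """Return one prompt per condition so the same hypothesis is judged four ways."""
    return {
        name: build_prompt(hypothesis, experiment, outcome, name)
        for name in CONDITIONS
    }
```

Holding the hypothesis fixed while varying only the attached context is what lets a shift in the model's answer be attributed to the evidence rather than to the claim itself.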
In practice
- Prioritize outcome summaries over experiment details for LLM inputs.
- Watch for non-monotonic performance: adding partial evidence can lower accuracy.
- Design diagnostic evaluations for LLMs in scientific workflows.
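One simple diagnostic in the spirit of the points above is to compare accuracy under each context condition against the hypothesis-only baseline and flag any condition that falls below it. The sketch below assumes hypothetical helper names (`accuracy`, `flag_degradations`) and condition labels; the data in the usage note is invented for illustration and is not taken from the paper.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def flag_degradations(accuracy_by_condition, baseline="hypothesis_only", tolerance=0.0):
    """Return the conditions whose accuracy is below the baseline by more than `tolerance`.

    `accuracy_by_condition` maps a condition name to an accuracy in [0, 1].
    The condition names are assumptions for this sketch.
    """
    base = accuracy_by_condition[baseline]
    return sorted(
        name
        for name, acc in accuracy_by_condition.items()
        if name != baseline and acc < base - tolerance
    )
```

For example, with made-up predictions on four items, `flag_degradations` picks out the condition where added context hurt:

```python
labels = ["feasible", "infeasible", "feasible", "feasible"]
scores = {
    "hypothesis_only": accuracy(["feasible", "infeasible", "infeasible", "feasible"], labels),
    "experiments": accuracy(["infeasible", "infeasible", "infeasible", "feasible"], labels),
    "outcomes": accuracy(["feasible", "infeasible", "feasible", "feasible"], labels),
}
print(flag_degradations(scores))  # ['experiments']
```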
Topics
- Large Language Models
- Scientific Feasibility Assessment
- Controlled Knowledge Framework
- Outcome Evidence
- Experimental Evidence
Best for: Research Scientist, AI Scientist, NLP Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.