Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models

2026-04-22 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Research Methodology & Innovation · Depth: Expert, extended

Summary

A study by Mohammadi, Gaur, and Ferraro from the University of Maryland, Baltimore County, investigates how well large language models (LLMs) perform scientific feasibility assessment: deciding whether a scientific claim is consistent with established knowledge and whether experiments can support or refute it. The researchers developed a controlled-knowledge framework and evaluated gpt-5.1, gpt-4o, Gemini-2.5-Pro, Gemini-2.5-Flash, and Grok-4.1-fast under four conditions: hypothesis only, hypothesis with experiment descriptions, hypothesis with outcome summaries, and hypothesis with both. On the Matter-of-Fact and Reasons datasets, outcome evidence generally improved accuracy more reliably than experiment descriptions did. Partial experimental context can also degrade performance, sometimes below the hypothesis-only baseline, which points to brittle reasoning rather than robust handling of uncertainty.

Key takeaway

Research scientists deploying LLMs in scientific workflows should recognize that, according to this study, LLMs tend to treat feasibility as surface-level classification. Prioritize explicit outcome evidence over experiment descriptions, since outcomes provide more reliable grounding. Treat partial experimental context with caution: it can mislead models and push accuracy below the hypothesis-only baseline, so use diagnostic evaluations to expose this brittle reasoning.

Key insights

Outcome evidence is more reliable for LLM scientific feasibility assessment than experiment descriptions alone.

Principles

Outcome evidence stabilizes feasibility judgments.
Partial evidence can actively mislead LLMs.
More evidence is not strictly better for LLMs.

Method

A controlled-knowledge framework systematically varies context (hypothesis-only, experiments, outcomes, or both) alongside a hypothesis to measure shifts in LLM feasibility judgments and robustness under progressive evidence removal.
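
The four context conditions and the progressive removal of evidence can be sketched as a small prompt builder. This is a minimal illustration, not the authors' implementation: the condition names, the `build_prompt` and `keep_prefix` helpers, the prompt wording, and the "keep the first fraction of items" removal scheme are all assumptions made for the example.

```python
import math

# Hypothetical condition names mapped to (include experiments, include outcomes).
CONDITIONS = {
    "hypothesis_only": (False, False),
    "experiments": (True, False),
    "outcomes": (False, True),
    "both": (True, True),
}


def keep_prefix(items, keep_fraction):
    """Keep the first ceil(len * keep_fraction) items (an assumed removal scheme)."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    return items[: math.ceil(len(items) * keep_fraction)]


def build_prompt(hypothesis, experiments, outcomes, condition, keep_fraction=1.0):
    """Assemble the model input for one condition and one evidence level."""
    use_experiments, use_outcomes = CONDITIONS[condition]
    sections = [f"Hypothesis: {hypothesis}"]
    if use_experiments:
        kept = keep_prefix(experiments, keep_fraction)
        if kept:  # skip the header entirely when no evidence survives removal
            sections.append("Experiments:\n" + "\n".join(f"- {e}" for e in kept))
    if use_outcomes:
        kept = keep_prefix(outcomes, keep_fraction)
        if kept:
            sections.append("Outcomes:\n" + "\n".join(f"- {o}" for o in kept))
    # Illustrative question wording; the study's actual prompt may differ.
    sections.append("Is this hypothesis feasible? Answer 'feasible' or 'infeasible'.")
    return "\n\n".join(sections)
```

Calling `build_prompt` for every condition at several `keep_fraction` values yields the grid of inputs over which shifts in a model's feasibility judgments could be compared.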

In practice

Prioritize outcome summaries over experiment details for LLM inputs.
Beware of non-monotonic performance degradation with partial evidence.
Design diagnostic evaluations for LLMs in scientific workflows.
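
One simple diagnostic in this spirit compares per-condition accuracy against the hypothesis-only baseline and flags any condition that falls below it. The sketch below is an assumption-laden illustration: the `accuracy` and `flag_degradation` helpers and the condition names are hypothetical, and the study may use different metrics.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def flag_degradation(predictions_by_condition, labels, baseline="hypothesis_only"):
    """Return per-condition accuracy and the conditions scoring below the baseline.

    `predictions_by_condition` maps a condition name (hypothetical naming) to the
    model's predictions on the same examples, in the same order as `labels`.
    """
    scores = {
        condition: accuracy(preds, labels)
        for condition, preds in predictions_by_condition.items()
    }
    base = scores[baseline]
    below = sorted(c for c, s in scores.items() if c != baseline and s < base)
    return scores, below
```

A non-empty second return value signals that added context made the model worse than it was with the hypothesis alone, which is the non-monotonic pattern worth investigating before deployment.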

Topics

Large Language Models
Scientific Feasibility Assessment
Controlled Knowledge Framework
Outcome Evidence
Experimental Evidence

Code references

mohammadi-ali/scify

Best for: Research Scientist, AI Scientist, NLP Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.