Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models
Summary
This study investigates how well large language models (LLMs) perform scientific feasibility assessment, a diagnostic reasoning task that asks whether a hypothesis aligns with established knowledge and can be experimentally supported or refuted. LLMs are evaluated under four controlled knowledge conditions: the hypothesis alone, the hypothesis with an experiment description, the hypothesis with experimental outcomes, and the hypothesis with both. To probe robustness, the study progressively removes portions of the experiment and outcome context. Across multiple LLMs and two distinct datasets, outcome evidence generally yields more reliable assessments than experiment descriptions. Outcomes consistently improve accuracy beyond what the LLM achieves from its internal knowledge alone, whereas experimental text can introduce fragility and degrade performance when the context is incomplete.
Key takeaway
Research scientists evaluating LLMs on scientific reasoning should prioritize concrete experimental outcomes over detailed experiment descriptions. Relying on experimental text alone, especially when it is incomplete, can introduce fragility and reduce the accuracy of feasibility assessments. Focusing on results makes LLM-based diagnostic reasoning more reliable.
Key insights
Outcome evidence is more reliable than experiment descriptions for LLM scientific feasibility assessment.
Principles
- Outcome evidence improves LLM accuracy.
- Incomplete experimental text degrades performance.
Method
Feasibility assessment is framed as a diagnostic reasoning task where LLMs predict "feasible" or "infeasible" and justify their decision, evaluated under controlled knowledge conditions.
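The four knowledge conditions can be sketched as a small prompt builder. This is an illustrative sketch only, not the study's actual prompts: the `Example` class, the condition names, the `build_prompt` function, and the prompt wording are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    """One feasibility item (hypothetical structure, not from the paper)."""
    hypothesis: str
    experiment: Optional[str] = None  # description of the experimental setup
    outcome: Optional[str] = None     # reported experimental result


# Hypothetical names for the four controlled knowledge conditions.
CONDITIONS = ("hypothesis_only", "with_experiment", "with_outcome", "with_both")


def build_prompt(example: Example, condition: str) -> str:
    """Assemble a prompt that exposes only the context allowed by `condition`."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")

    parts = [f"Hypothesis: {example.hypothesis}"]
    if condition in ("with_experiment", "with_both") and example.experiment:
        parts.append(f"Experiment: {example.experiment}")
    if condition in ("with_outcome", "with_both") and example.outcome:
        parts.append(f"Outcome: {example.outcome}")

    # The model is asked for a binary label plus a justification.
    parts.append('Answer "feasible" or "infeasible", then justify your decision.')
    return "\n".join(parts)
```

Calling `build_prompt(example, "with_outcome")` on the same item as `build_prompt(example, "hypothesis_only")` changes only the evidence shown, which is what lets the conditions be compared against each other.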
In practice
- Prioritize outcome data for LLM assessment.
- Be wary of incomplete experimental context.
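One way to check how sensitive an assessment is to incomplete context is to drop part of the text before prompting and compare the predictions. The sketch below is an assumption-laden illustration: the summary does not say how the study removes context, so random sentence-level dropping, the `ablate_context` name, and the `keep_fraction` parameter are all hypothetical choices.

```python
import random
import re


def ablate_context(text: str, keep_fraction: float, seed: int = 0) -> str:
    """Keep a random subset of sentences, preserving their original order.

    Assumption: context is removed at sentence granularity; the study's
    actual removal procedure may differ.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")

    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    n_keep = int(len(sentences) * keep_fraction)

    rng = random.Random(seed)  # fixed seed keeps the ablation reproducible
    kept = sorted(rng.sample(range(len(sentences)), n_keep))
    return " ".join(sentences[i] for i in kept)
```

Running the same model on an experiment description ablated at, say, `keep_fraction` values of 1.0, 0.5, and 0.0 gives a rough picture of how quickly its feasibility judgments degrade as context goes missing.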
Topics
- Large Language Models
- Scientific Feasibility Assessment
- Diagnostic Reasoning
- Experimental Evidence
- Outcome Evidence
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.