Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models

2026-04-20 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study investigates whether large language models (LLMs) can perform scientific feasibility assessment, a diagnostic reasoning task that determines whether a hypothesis is consistent with established knowledge and whether it can be experimentally supported or refuted. The research evaluates LLMs under four controlled knowledge conditions: the hypothesis alone, the hypothesis with experiment descriptions, the hypothesis with experimental outcomes, and the hypothesis with both. By progressively removing portions of the experimental and outcome context, the study probes how robust the models' assessments are to incomplete information. Across multiple LLMs and two distinct datasets, the findings indicate that providing outcome evidence generally yields more reliable assessments than providing experiment descriptions. Outcomes consistently improve accuracy beyond what the LLM achieves from its internal knowledge alone, while experimental text can introduce fragility and degrade performance when the context is incomplete.

Key takeaway

If you are a research scientist evaluating LLM capabilities in scientific reasoning, prioritize providing concrete experimental outcomes over detailed experiment descriptions. Relying solely on experimental text, especially incomplete text, can introduce fragility and reduce the accuracy of feasibility assessments. Focus on the results to make your LLM-based diagnostic reasoning tasks more reliable.

Key insights

Outcome evidence is more reliable than experiment descriptions for LLM scientific feasibility assessment.

Principles

Outcome evidence improves LLM accuracy.
Incomplete experimental text degrades performance.

Method

Feasibility assessment is framed as a diagnostic reasoning task where LLMs predict "feasible" or "infeasible" and justify their decision, evaluated under controlled knowledge conditions.
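
The four knowledge conditions can be illustrated with a minimal Python sketch. This is not the study's code: the function names, condition labels, and prompt wording are all assumptions made for illustration.

```python
def build_prompt(hypothesis, experiment=None, outcome=None):
    """Assemble a feasibility prompt from whichever context pieces are given.

    The wording here is hypothetical; the study's actual prompts are not
    reproduced in this summary.
    """
    parts = [f"Hypothesis: {hypothesis}"]
    if experiment:
        parts.append(f"Experiment: {experiment}")
    if outcome:
        parts.append(f"Outcome: {outcome}")
    parts.append('Answer "feasible" or "infeasible" and justify your decision.')
    return "\n".join(parts)


def knowledge_conditions(hypothesis, experiment, outcome):
    """Return one prompt per knowledge condition (labels are illustrative)."""
    return {
        "hypothesis_only": build_prompt(hypothesis),
        "with_experiment": build_prompt(hypothesis, experiment=experiment),
        "with_outcome": build_prompt(hypothesis, outcome=outcome),
        "with_both": build_prompt(hypothesis, experiment=experiment, outcome=outcome),
    }
```

Each prompt would then be sent to the model under test, and the predicted label compared against the gold label separately for each condition.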

In practice

Prioritize outcome data for LLM assessment.
Be wary of incomplete experimental context.
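
One way to check how sensitive your own setup is to incomplete context is to truncate the context progressively and re-run the assessment at each level. The sketch below is a hypothetical illustration: sentence-level truncation, the naive sentence splitter, and the chosen fractions are assumptions, not the removal procedure used in the study.

```python
import re


def split_sentences(text):
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def truncate_context(text, keep_fraction):
    """Keep the leading fraction of sentences (count is rounded down)."""
    sentences = split_sentences(text)
    n_keep = int(len(sentences) * keep_fraction)
    return " ".join(sentences[:n_keep])


def ablation_series(text, fractions=(1.0, 0.75, 0.5, 0.25, 0.0)):
    """Map each keep-fraction to the correspondingly truncated context."""
    return {f: truncate_context(text, f) for f in fractions}
```

Applying `ablation_series` to the experiment description and to the outcome text separately gives a set of degraded contexts; a drop in accuracy as the fraction shrinks suggests the assessment depends on that context being complete.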

Topics

Large Language Models
Scientific Feasibility Assessment
Diagnostic Reasoning
Experimental Evidence
Outcome Evidence

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.