SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

Source: Artificial Intelligence · Field: Technology & Digital / Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

SciR is a benchmark for evaluating Large Language Models (LLMs) on three paradigms of scientific reasoning: deduction, induction, and causal abduction. To address limitations of existing benchmarks, SciR generates each task from a formal object (a deduction tree, an inductive rule hypothesis, or a causal graph), so the ground truth is verifiable by construction. The tasks are then rendered into multi-document scientific discourse using domain-tuned genres. This construction gives independent control over two difficulty axes: how hard it is to extract the key information from the rendered text, and how hard the underlying inference is. In initial tests on six models, raising difficulty along either axis significantly degraded performance, and the two effects compounded. Notably, the rendering step affected even neurosymbolic pipelines. SciR yields an "extraction-vs-inference profile" for each model; for example, reasoning models such as deepseek-r1 outperform instruct models on inference tasks. It is presented as the first multi-paradigm scientific-reasoning benchmark with parametric control over both extraction and inference difficulty.

Key takeaway

If you develop or evaluate LLMs for scientific applications, treat information extraction and core inference as distinct capabilities that both matter and whose difficulties compound. Work on both axes, since even neurosymbolic approaches are affected by rendering complexity. Use a benchmark like SciR to obtain a granular "extraction-vs-inference profile" of your models and target improvements accordingly.
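One way to picture an "extraction-vs-inference profile" is as a grid of accuracies, one cell per combination of extraction level and inference level. The sketch below is illustrative only: the record format, the `profile` function, and the level names are assumptions, and the sample records are made-up placeholders used to exercise the code, not results from SciR.

```python
from collections import defaultdict


def profile(records):
    """Aggregate accuracy per (extraction_level, inference_level) cell.

    `records` is an iterable of (extraction_level, inference_level, correct)
    tuples. This layout is a hypothetical one chosen for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # cell -> [n_correct, n_total]
    for extraction, inference, correct in records:
        cell = totals[(extraction, inference)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: n_correct / n_total for key, (n_correct, n_total) in totals.items()}


# Placeholder outcomes only (NOT measured data).
records = [
    ("easy", "easy", True), ("easy", "easy", True),
    ("easy", "hard", True), ("easy", "hard", False),
    ("hard", "easy", True), ("hard", "easy", False),
    ("hard", "hard", False), ("hard", "hard", False),
]
grid = profile(records)
for (extraction, inference), accuracy in sorted(grid.items()):
    print(f"extraction={extraction:4s} inference={inference:4s} accuracy={accuracy:.2f}")
```

Reading such a grid along a row or column shows how a model responds to one axis while the other is held fixed, which is the kind of comparison the profile is meant to support.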

Key insights

SciR is a new benchmark that controls the difficulty of scientific reasoning tasks for LLMs along two axes: extraction and inference.

Principles

Method

SciR generates tasks from formal objects (a deduction tree, an inductive rule hypothesis, or a causal graph) to guarantee verifiable answers, then renders them into multi-document scientific discourse via domain-tuned genres. Because the formal object and its rendering are separate steps, extraction difficulty and inference difficulty can be varied independently.
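The generate-then-render idea can be sketched with a toy example. This is not SciR's implementation: the `Task` class, `make_chain_task`, and its parameters are hypothetical, the "deduction tree" is reduced to a linear implication chain, and the rendering is plain template sentences rather than domain-tuned genres. Here `depth` stands in for inference difficulty and `distractors` for extraction difficulty.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    documents: list[str]  # rendered text a model would read
    question: str
    answer: bool          # ground truth computed from the formal object


def make_chain_task(depth: int, distractors: int,
                    entailed: bool = True, seed: int = 0) -> Task:
    """Toy generator (an assumption for illustration, not SciR's method)."""
    rng = random.Random(seed)

    # Formal object: fact P0 plus rules P0 -> P1 -> ... -> P{depth}.
    symbols = [f"P{i}" for i in range(depth + 1)]
    rules = [(symbols[i], symbols[i + 1]) for i in range(depth)]
    if not entailed and rules:
        # Break the chain by dropping one rule, so the goal is unreachable.
        rules.pop(rng.randrange(len(rules)))

    # Ground truth by forward chaining over the formal object, not the text.
    known = {symbols[0]}
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    answer = symbols[-1] in known

    # Rendering: one sentence per fact or rule, plus irrelevant distractor
    # rules whose premises never hold, shuffled across two documents.
    sentences = [f"We observe that {symbols[0]} holds."]
    sentences += [f"Whenever {p} holds, {c} follows." for p, c in rules]
    sentences += [f"Whenever D{i} holds, E{i} follows." for i in range(distractors)]
    rng.shuffle(sentences)
    mid = len(sentences) // 2
    documents = [" ".join(sentences[:mid]), " ".join(sentences[mid:])]
    return Task(documents, f"Does {symbols[-1]} hold?", answer)


task = make_chain_task(depth=3, distractors=5, seed=1)
print(task.question, task.answer)
```

Because the answer is computed from the rule set before any text is produced, adding distractors or shuffling sentences changes only what the reader has to extract, while changing `depth` changes only how many inference steps are needed.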

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.