Testing LLMs on superconductivity research questions

2026-03-16 · Source: The latest research from Google · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

Google Research, in collaboration with Cornell University, tested six large language models (LLMs) on their ability to answer expert-level questions in high-temperature superconductivity, a complex and evolving field of condensed matter physics. In the study, published in the Proceedings of the National Academy of Sciences, a panel of experts graded LLM responses to 67 challenging questions. The top performers were NotebookLM and a custom retrieval-augmented generation (RAG) system, both of which drew from a closed ecosystem of 1,726 curated, quality-controlled scientific sources, including 15 expert-selected review articles and their 3,300 cited references. In contrast, the four web-based models with full internet access (GPT-4o, Perplexity, Claude 3.5, and Gemini Advanced Pro 1.5) performed less effectively, often mixing established theories with speculative ones and showing weaknesses in temporal and contextual understanding. The research highlights the need for LLMs to improve visual reasoning and contextual understanding for scientific applications.

Key takeaway

If you are an AI scientist developing tools for specialized scientific research, prioritize retrieval-augmented generation (RAG) systems built on meticulously curated, quality-controlled data sources. In this study, that approach, exemplified by NotebookLM's performance on high-temperature superconductivity questions, produced more accurate and trustworthy answers than models relying on unfiltered web data. Focus development effort on two current limitations: interpreting scientific visuals and understanding temporal context.

Key insights

Curated data sources significantly enhance LLM accuracy and reliability in specialized scientific domains.

Principles

Closed, quality-controlled data outperforms open web sources for scientific accuracy.
LLMs require strong temporal and contextual understanding for complex scientific inquiry.

Method

Six LLMs were evaluated by a panel of experts on 67 high-temperature superconductivity questions. Responses were scored on balanced perspective, comprehensiveness, conciseness, evidence, and visual relevance.
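
To make the scoring setup concrete, here is a minimal sketch of how per-criterion expert grades could be averaged for each model. The five criterion names come from the study as summarized above; the numeric scale, the tuple layout, and the function name mean_scores are assumptions for illustration, not the authors' actual pipeline.

```python
from collections import defaultdict
from statistics import mean

# Criteria named in the summary; the identifiers are illustrative.
CRITERIA = (
    "balanced_perspective",
    "comprehensiveness",
    "conciseness",
    "evidence",
    "visual_relevance",
)


def mean_scores(gradings):
    """Average expert grades per model and criterion.

    gradings: iterable of (model, criterion, score) tuples.
    The numeric scale is hypothetical; the source does not state one.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for model, criterion, score in gradings:
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        buckets[model][criterion].append(score)
    return {
        model: {crit: mean(scores) for crit, scores in per_crit.items()}
        for model, per_crit in buckets.items()
    }
```

Each tuple would hold one expert's grade for one response, so a model's final profile is one mean per criterion rather than a single blended number.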

In practice

Prioritize curated datasets for scientific LLM applications.
Focus LLM development on visual reasoning and contextual understanding.
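
As a rough illustration of the first recommendation, the sketch below restricts retrieval to a fixed, curated list of sources and builds a prompt that asks the model to answer only from them. It uses simple word overlap instead of a real embedding index, and Source, retrieve, and build_prompt are hypothetical names; it does not describe how NotebookLM or the study's custom RAG system works.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """One curated document (hypothetical structure)."""
    title: str
    text: str


def tokenize(s):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def retrieve(question, corpus, k=2):
    """Return up to k curated sources ranked by word overlap with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(src.text)), src) for src in corpus]
    scored = [pair for pair in scored if pair[0] > 0]  # drop non-matching sources
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [src for _, src in scored[:k]]


def build_prompt(question, sources):
    """Assemble a prompt that limits the model to the retrieved sources."""
    context = "\n".join(
        f"[{i}] {s.title}: {s.text}" for i, s in enumerate(sources, 1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{context}\n\nQuestion: {question}"
    )
```

The design point is that the corpus is a closed list chosen by experts, so nothing outside it can reach the prompt, whatever retrieval method replaces the word-overlap scoring.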

Topics

High-temperature Superconductivity
Large Language Models
Retrieval-Augmented Generation
Scientific AI Evaluation
Condensed Matter Physics

Best for: AI Scientist, AI Researcher, Research Scientist, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by The latest research from Google.