SupraBench: A Benchmark for Supramolecular Chemistry

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Computational Chemistry · Depth: Advanced, extended

Summary

SupraBench is presented as the first benchmark designed to systematically evaluate large language models (LLMs) on supramolecular chemistry reasoning. Developed in collaboration with domain experts, it comprises four core tasks (binding affinity prediction, top-binder selection, solvent identification, and host–guest description) plus an auxiliary vision-based molecular identification task. The authors also released SupraPmc, a curated 16M-token corpus of supramolecular chemistry articles from Europe PMC, to support domain adaptation. Evaluations of open and proprietary LLMs, including Gemini-3-Flash and GPT-5.4, show significant performance gaps across all tasks. Domain-adaptive pretraining on SupraPmc improves in-distribution regression, but it can degrade performance on tasks that require strict letter-format outputs. The benchmark exposes distinct failure modes and substantial headroom for LLMs in complex supramolecular chemical reasoning.

Key takeaway

AI scientists developing LLMs for chemistry should recognize that current models, even frontier ones like Gemini-3-Flash, show significant limitations in supramolecular reasoning. Chain-of-Thought (CoT) prompting is not a universal fix, because it can lead a model to fabricate incorrect chemical knowledge. Targeted domain adaptation with resources like SupraPmc can help, but its transfer effects are uneven, especially on tasks with strict output formats. Prioritize improving precise bond-level reasoning over abstract visual comprehension.

Key insights

Current LLMs lack robust supramolecular chemistry reasoning, leaving significant headroom for improvement across all benchmark tasks.

Principles

Frontier LLMs lead but leave substantial performance headroom.
Prompting strategies are task- and model-dependent.
Chain-of-Thought (CoT) prompting can amplify reasoning gaps rather than fix them.

Method

SupraBench evaluates LLMs on five tasks: four core tasks (binding affinity prediction, top-binder selection, solvent identification, and host–guest description) and one auxiliary vision-based task (molecular identification). The accompanying 16M-token SupraPmc corpus is used for domain adaptation.
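
The tasks above mix numeric outputs (binding affinity) with multiple-choice outputs, so they need different scoring. The sketch below shows one plausible way to score each kind. It is not SupraBench's actual evaluation code: the metric choices (MAE and RMSE), the A-D option range, the strict-format rule, and all function names are assumptions made for illustration.

```python
import math
import re
from typing import Optional, Sequence

# Assumption: a strict letter-format reply is exactly one option letter (A-D),
# with only surrounding whitespace allowed. The benchmark's real rule may differ.
STRICT_LETTER = re.compile(r"\s*([A-D])\s*")


def parse_strict_letter(reply: str) -> Optional[str]:
    """Return the option letter if the whole reply is one letter, else None."""
    match = STRICT_LETTER.fullmatch(reply)
    return match.group(1) if match else None


def letter_accuracy(replies: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of replies that parse strictly AND match the gold letter."""
    if len(replies) != len(gold) or not gold:
        raise ValueError("replies and gold must be non-empty and equal in length")
    hits = sum(parse_strict_letter(r) == g for r, g in zip(replies, gold))
    return hits / len(gold)


def _check_lengths(pred: Sequence[float], gold: Sequence[float]) -> None:
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and equal in length")


def mae(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Mean absolute error for a regression task such as binding affinity."""
    _check_lengths(pred, gold)
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


def rmse(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Root mean squared error; penalizes large misses more than MAE does."""
    _check_lengths(pred, gold)
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))
```

Under this strict rule, a reply such as "The answer is C" is scored as wrong even when C is the correct option. Scoring format compliance separately from correctness is one way to tell a formatting regression apart from a reasoning regression after domain adaptation.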

In practice

Use SupraBench to evaluate LLM performance in supramolecular tasks.
Apply SupraPmc for domain-specific LLM pretraining.
Tailor prompting strategies to specific tasks and models.
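
Because prompting effects are task- and model-dependent, it may be worth measuring each strategy per task rather than assuming CoT helps. The following is a minimal sketch of such a comparison. The prompt wording, the `model` callable interface, and the "final letter on the last line" convention are all hypothetical and are not taken from SupraBench.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical prompt templates; real benchmark prompts will differ.
PROMPT_TEMPLATES: Dict[str, str] = {
    "direct": "{question}\nAnswer with a single option letter.",
    "cot": (
        "{question}\nThink step by step, then give the final option letter "
        "alone on the last line."
    ),
}


def build_prompt(strategy: str, question: str) -> str:
    """Fill the chosen template with the question text."""
    return PROMPT_TEMPLATES[strategy].format(question=question)


def final_line(reply: str) -> str:
    """Return the last non-empty line of a reply, stripped of whitespace."""
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def compare_strategies(
    model: Callable[[str], str],
    items: Sequence[Tuple[str, str]],
) -> Dict[str, float]:
    """Accuracy per prompting strategy on (question, gold_letter) pairs."""
    if not items:
        raise ValueError("items must be non-empty")
    scores: Dict[str, float] = {}
    for name in PROMPT_TEMPLATES:
        hits = sum(
            final_line(model(build_prompt(name, question))) == gold
            for question, gold in items
        )
        scores[name] = hits / len(items)
    return scores
```

Here `model` can be any function that maps a prompt string to a reply string, for example a thin wrapper around whichever inference client is in use. Running `compare_strategies` separately for each task and each model gives a per-task view of which strategy to keep.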

Topics

Supramolecular Chemistry
Large Language Models
Chemical Benchmarking
Host-Guest Systems
Domain Adaptation
Molecular Identification

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.