Rethinking Molecular Text Representations for LLMs: An Empirical Study

2026-06-02 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A systematic benchmark evaluated the molecular competence of Large Language Models (LLMs) across nine molecular representations and eight chemical tasks. The study covered 16 LLMs from five model families, including reasoning, non-reasoning, chemistry-specialized, and closed frontier models. The findings indicate that LLM performance depends heavily on the chosen representation, and that no single representation wins on every task. CML ranked best overall, followed by MolJSON, InChI, and canonical SMILES. Explicit structured text representations such as CML and MolJSON excelled in structural tasks, while IUPAC dominated semantic tasks: it won molecule retrieval for all 16 LLMs and produced the highest fraction of correct molecule generations. SMILES variants, despite their prevalence in pretraining data, were rarely optimal. Chemistry-specialized models performed strongly with SMILES but degraded significantly with structured text, which suggests a lack of generalization. The study advocates task-aware representation routing, rather than representation-invariant evaluation, for LLM-based chemistry.

Key takeaway

For machine learning engineers building LLM-based chemistry applications, the choice of molecular representation is critical and task-dependent. Do not rely solely on SMILES: it is rarely the optimal input, and the chemistry-specialized models that handle it well degrade on structured text. Instead, route representations by task: use CML or MolJSON for structural tasks, and IUPAC for semantic tasks such as molecule retrieval or generation.

Key insights

LLM molecular competence is strongly representation-dependent, with no single representation optimal across all tasks.

Principles

LLM molecular competence is representation-dependent.
Explicit structured text excels in structural tasks.
IUPAC representation dominates semantic tasks.

Method

Systematically benchmark LLM molecular competence across nine representations and eight tasks using 16 LLMs from five model families.
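
One way to read results from a grid like this is to count, for each task, how many models score best with each representation. The sketch below assumes a hypothetical score table keyed by (model, task, representation); the function name, model names, and every number are illustrative placeholders, not values or code from the study.

```python
from collections import Counter, defaultdict


def wins_per_task(scores):
    """Count, per task, how many models score best with each representation.

    `scores` maps (model, task, representation) -> metric, higher is better.
    Ties go to whichever representation was inserted first.
    """
    by_model_task = defaultdict(dict)
    for (model, task, rep), value in scores.items():
        by_model_task[(model, task)][rep] = value

    wins = defaultdict(Counter)
    for (model, task), rep_scores in by_model_task.items():
        best = max(rep_scores, key=rep_scores.get)
        wins[task][best] += 1
    return wins


# Placeholder numbers for illustration only; they are NOT results from the study.
toy_scores = {
    ("model_a", "retrieval", "IUPAC"): 0.9,
    ("model_a", "retrieval", "CML"): 0.6,
    ("model_b", "retrieval", "IUPAC"): 0.8,
    ("model_b", "retrieval", "CML"): 0.7,
    ("model_a", "structural", "IUPAC"): 0.4,
    ("model_a", "structural", "CML"): 0.7,
    ("model_b", "structural", "IUPAC"): 0.5,
    ("model_b", "structural", "CML"): 0.6,
}

toy_wins = wins_per_task(toy_scores)
for task, counter in toy_wins.items():
    print(task, dict(counter))
```

A tally of this shape is what a statement such as "won molecule retrieval for all 16 LLMs" summarizes: the winning representation's count for that task equals the number of models.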

In practice

Route representations based on task type.
Prioritize CML or MolJSON for structural tasks.
Use IUPAC for molecule retrieval and generation.
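
The routing advice above can be sketched as a small lookup table. This is a minimal illustration, assuming each molecule arrives with a dictionary of precomputed representation strings; `TaskKind`, `pick_representation`, `build_prompt`, and the prompt layout are hypothetical names chosen for this example, and the fallback order simply reuses the overall ranking reported in the summary.

```python
from enum import Enum


class TaskKind(Enum):
    STRUCTURAL = "structural"
    SEMANTIC = "semantic"


# Preferences taken from the summary's recommendations (assumed mapping).
PREFERRED = {
    TaskKind.STRUCTURAL: ("CML", "MolJSON"),
    TaskKind.SEMANTIC: ("IUPAC",),
}

# Fallback order mirrors the overall ranking reported in the summary.
FALLBACK = ("CML", "MolJSON", "InChI", "canonical SMILES")


def pick_representation(task, available):
    """Return the name of the first preferred representation present in `available`."""
    for name in PREFERRED[task] + FALLBACK:
        if name in available:
            return name
    raise LookupError(f"no supported representation among {sorted(available)}")


def build_prompt(task, question, available):
    """Embed the routed representation in a simple, hypothetical prompt layout."""
    name = pick_representation(task, available)
    return f"Molecule ({name}):\n{available[name]}\n\n{question}"


# Ethanol, with three representations assumed to be precomputed upstream.
ethanol = {
    "IUPAC": "ethanol",
    "canonical SMILES": "CCO",
    "InChI": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}

print(build_prompt(TaskKind.SEMANTIC, "Describe this molecule.", ethanol))
print(build_prompt(TaskKind.STRUCTURAL, "How many heavy atoms does it have?", ethanol))
```

Because no CML or MolJSON string is supplied for ethanol here, the structural prompt falls back to InChI. In a real pipeline the routing table would be tuned per model, since the summary notes that chemistry-specialized models behave differently from general-purpose ones.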

Topics

Large Language Models
Molecular Representations
Computational Chemistry
LLM Evaluation
SMILES
IUPAC

Best for: AI Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.