Rethinking Molecular Text Representations for LLMs: An Empirical Study
Summary
A systematic benchmark evaluated the molecular competence of Large Language Models (LLMs) across nine molecular representations and eight chemical tasks. The study covered 16 LLMs from five model families, including reasoning, non-reasoning, chemistry-specialized, and closed frontier models. Performance depends heavily on the chosen representation, and no single representation is best across all tasks. CML ranked best overall, followed by MolJSON, InChI, and canonical SMILES. Explicit structured text representations such as CML and MolJSON excelled at structural tasks, while IUPAC dominated semantic tasks: it won molecule retrieval for all 16 LLMs and produced the highest fraction of correct molecule generations. SMILES variants, despite their prevalence in pretraining data, were rarely optimal. Chemistry-specialized models performed strongly with SMILES but degraded significantly with structured text, which suggests limited generalization. The authors advocate task-aware representation routing rather than representation-invariant evaluation for LLM-based chemistry.
Key takeaway
If you are a Machine Learning Engineer building LLM-based chemistry applications, the choice of molecular representation is critical and task-dependent. Do not rely solely on SMILES: it is rarely optimal, and the chemistry-specialized models that perform well on it generalize poorly to structured text. Instead, route representations by task: use CML or MolJSON for structural tasks, and IUPAC for semantic tasks such as molecule retrieval or generation.
Key insights
LLM molecular competence is strongly representation-dependent, with no single representation optimal across all tasks.
Principles
- LLM molecular competence is representation-dependent.
- Explicit structured text excels in structural tasks.
- IUPAC representation dominates semantic tasks.
Method
Systematically benchmark LLM molecular competence across nine representations and eight tasks using 16 LLMs from five model families.
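To give a sense of the scale of this design, the following is a minimal sketch of the evaluation grid and of a per-task win count, assuming one score per (model, task, representation) cell. The summary gives only the counts, so every label below (`rep_1`, `task_1`, `model_1`) and both function names are hypothetical placeholders, not the study's actual names or code.

```python
from collections import Counter
from itertools import product

# Placeholder labels: the summary states only how many of each were used.
REPRESENTATIONS = [f"rep_{i}" for i in range(1, 10)]  # 9 representations
TASKS = [f"task_{i}" for i in range(1, 9)]            # 8 tasks
MODELS = [f"model_{i}" for i in range(1, 17)]         # 16 LLMs


def evaluation_grid():
    """Yield every (model, task, representation) cell of the benchmark."""
    return product(MODELS, TASKS, REPRESENTATIONS)


def wins_per_representation(scores, task):
    """Count how many models score highest with each representation on one task.

    `scores` maps (model, task, representation) -> float, higher is better.
    This scoring convention is an assumption made for illustration.
    """
    wins = Counter()
    for model in MODELS:
        best = max(REPRESENTATIONS, key=lambda rep: scores[(model, task, rep)])
        wins[best] += 1
    return wins


# 16 models x 8 tasks x 9 representations
grid_size = sum(1 for _ in evaluation_grid())
print(grid_size)
```

A statement such as "IUPAC won molecule retrieval for all 16 LLMs" corresponds to one representation collecting all 16 wins for that task in a tally of this kind.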
In practice
- Route representations based on task type.
- Prioritize CML or MolJSON for structural tasks.
- Use IUPAC for molecule retrieval and generation.
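The routing advice above can be sketched as a small lookup, assuming the caller already has each molecule in several text forms. The names `TaskType`, `PREFERRED`, and `route_representation` are hypothetical and introduced only for illustration; the preference order mirrors the recommendations in this summary, and the fallback behavior is an assumption.

```python
from enum import Enum


class TaskType(Enum):
    STRUCTURAL = "structural"
    SEMANTIC = "semantic"  # e.g. molecule retrieval or generation


# Preference order per task type, following the summary's recommendations.
PREFERRED = {
    TaskType.STRUCTURAL: ("CML", "MolJSON"),
    TaskType.SEMANTIC: ("IUPAC",),
}


def route_representation(task_type, available):
    """Pick which representation of a molecule to place in the prompt.

    `available` maps a representation name to the molecule encoded in it.
    Returns (name, text) for the first preferred representation present.
    If none is present, falls back to the first entry in `available`
    (an assumption of this sketch, not a recommendation from the study).
    """
    for name in PREFERRED[task_type]:
        if name in available:
            return name, available[name]
    if not available:
        raise ValueError("no representation available for this molecule")
    name = next(iter(available))
    return name, available[name]


# Ethanol in two text forms.
ethanol = {"SMILES": "CCO", "IUPAC": "ethanol"}
print(route_representation(TaskType.SEMANTIC, ethanol))
print(route_representation(TaskType.STRUCTURAL, ethanol))
```

Keeping the preferences in a plain table makes it easy to revise the routing per model, since the summary notes that chemistry-specialized models behave differently from general-purpose ones.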
Topics
- Large Language Models
- Molecular Representations
- Computational Chemistry
- LLM Evaluation
- SMILES
- IUPAC
Best for: AI Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.