A survey on large language models in biology and chemistry

Source: Machine learning : nature.com subject feeds · Field: Science & Research (Life Sciences & Biology, Physical Sciences & Chemistry, Health & Medical Research) · Depth: Expert, extended

Summary

A 2026 review article in "Experimental & Molecular Medicine" surveys the application of large language models (LLMs) in biology and chemistry, highlighting their evolution from molecular representation to generation and optimization. The review details key molecular representation strategies for biological macromolecules (e.g., protein/nucleotide sequences, single-cell data) and small organic compounds (e.g., SMILES strings, graph-based encodings, 3D point clouds). It covers core model architectures like BERT-like encoders, GPT-like decoders, and encoder-decoder transformers, along with pretraining strategies such as self-supervised learning, multitask learning, and retrieval-augmented generation. Key biomedical applications include protein structure prediction, de novo protein/molecular design, genomic analysis, and reaction prediction. The article also explores emerging agentic and interactive AI systems for automating scientific discovery, while addressing technical, ethical, and regulatory considerations.
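To make the idea of a string-based molecular representation concrete, here is a minimal sketch of how a SMILES string might be split into tokens before it reaches a language model. It is an illustration only, not the tokenizer used by any model covered in the review: the regular expression, the name `tokenize_smiles`, and the round-trip check are assumptions, and the pattern covers only common SMILES syntax.

```python
import re

# Hypothetical pattern covering common SMILES syntax (an assumption for
# illustration, not a complete grammar).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"                # two-letter atoms, tried before single letters
    r"|%\d{2}"               # two-digit ring-closure labels such as %12
    r"|[BCNOPSFI]"           # one-letter aliphatic atoms
    r"|[bcnops]"             # aromatic atoms
    r"|[()=#\-+\\/:.~@*$]"   # bonds, branches, and other punctuation
    r"|\d"                   # single-digit ring closures
)


def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Refuse input that the pattern cannot fully account for.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


aspirin = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
alanine = tokenize_smiles("C[C@H](N)C(=O)O")
```

Keeping bracket atoms and two-letter elements as single tokens means the model sees `Cl` as chlorine rather than as a carbon followed by an unknown symbol, one small instance of the tokenization choices the review discusses.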

Key takeaway

For AI Scientists and Research Scientists developing computational tools in biomedicine, understanding the nuanced interplay between molecular representation, model architecture, and training strategies is crucial. You should prioritize developing robust tokenization and multimodal integration techniques to effectively translate complex biological and chemical data into formats LLMs can process, thereby accelerating discovery and design cycles. Consider adopting agentic AI systems to automate experimental design and hypothesis generation.

Key insights

LLMs are transforming molecular sciences by treating biological and chemical systems as structured languages.

Method

LLMs are adapted to scientific domains by converting complex molecular information into machine-processable formats, then applying a range of architectures and pretraining strategies such as self-supervised and multitask learning.
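As a rough illustration of self-supervised pretraining, the sketch below prepares one training example for a masked-token objective of the kind used by BERT-like encoders: some tokens are hidden, and the hidden originals become the prediction targets. This is a simplified assumption for illustration (full BERT-style masking also substitutes random tokens or leaves some selected tokens unchanged); the function name `mask_tokens`, the 15% rate, and the residue string are hypothetical and not taken from the review.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Return (corrupted, labels) for a masked-token objective.

    labels holds the original token at masked positions and None elsewhere,
    so a training loss would be computed only where labels is not None.
    """
    if not tokens:
        return [], []
    rng = rng or random.Random()
    # Always mask at least one position so every example carries a target.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return corrupted, labels


# Made-up residue string used only as example input, not real data.
sequence = list("MKTAYIAKQRQISFVKSHFS")
corrupted, labels = mask_tokens(sequence, rng=random.Random(0))
```

No human annotation is required: the labels come from the sequence itself, which is what allows this style of pretraining to scale to large unlabeled sequence collections.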

Best for: AI Scientist, Research Scientist, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine learning : nature.com subject feeds.