Augmenting Molecular Language Models with Local $n$-gram Memory

2026-06-10 · Source: Artificial Intelligence · Field: Science & Research — Artificial Intelligence & Machine Learning, Life Sciences & Biology · Depth: Expert, quick

Summary

MolGram is a conditional $n$-gram memory module that addresses the "locality gap" in Transformer-based language models for SMILES strings. Character-level tokenization fragments chemically meaningful motifs, so these models spend capacity relearning local syntax instead of modeling long-range dependencies. MolGram maps local string patterns to learned embeddings through scalable hash lookups and dynamically injects this regional context into the model's hidden states. Across three tasks (unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis), MolGram consistently improved performance. Notably, it outperformed baselines with 3× more parameters, which suggests that explicit local pattern memory is a highly efficient inductive bias for molecular language models.

Key takeaway

If you are an AI scientist or machine learning engineer building molecular language models and character-level SMILES tokenization is limiting performance, consider adding an explicit local $n$-gram memory module such as MolGram. In the reported evaluation it improved results on molecule generation, forward reaction prediction, and single-step retrosynthesis, and it outperformed larger baselines, so it is worth testing on your own chemical domain tasks.

Key insights

Integrating local $n$-gram memory via MolGram addresses the locality gap in molecular language models and improves performance, even against baselines with 3× more parameters.

Principles

Character-level tokenization creates a "locality gap".
Explicit local pattern memory is an efficient inductive bias.
Offloading local syntax to $n$-gram memory lets the model focus on long-range dependencies.
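
To make the locality gap concrete, the short sketch below splits a SMILES string one character at a time and then lists its contiguous $n$-grams. The helper names `char_tokens` and `ngrams` and the example molecule are illustrative assumptions, not taken from the original article. It shows that the two-character chlorine symbol is broken apart by character-level tokenization and reappears only as a local $n$-gram.

```python
from __future__ import annotations


def char_tokens(smiles: str) -> list[str]:
    """Character-level tokenization: one token per character (hypothetical helper)."""
    return list(smiles)


def ngrams(tokens: list[str], n: int) -> list[str]:
    """All contiguous n-grams over the token sequence, joined back into strings."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


smiles = "CC(=O)Cl"  # acetyl chloride, chosen only as an example
tokens = char_tokens(smiles)

# The chlorine atom "Cl" is split into two unrelated-looking tokens.
print(tokens)             # ['C', 'C', '(', '=', 'O', ')', 'C', 'l']

# The motif is recoverable as a local pattern: "Cl" is a bigram,
# and the carbonyl branch "(=O)" is a 4-gram.
print(ngrams(tokens, 2))  # ['CC', 'C(', '(=', '=O', 'O)', ')C', 'Cl']
print("(=O)" in ngrams(tokens, 4))  # True
```

A model that sees only the single characters has to learn that `C` followed by `l` means chlorine rather than carbon; an explicit $n$-gram memory can store that pattern directly.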

Method

MolGram maps local string patterns to learned embeddings using scalable hash lookups, then dynamically injects this regional context into the hidden states of molecular language models.
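
The general idea of a hashed $n$-gram memory can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not MolGram's actual implementation: the class `HashedNgramMemory`, the function `inject`, the CRC32 hash, the bucket count, the $n$-gram orders, and the sigmoid gate are all hypothetical choices made for the example. In a real model the tables would be trainable parameters and the hidden states would come from a Transformer layer.

```python
from __future__ import annotations

import math
import random
import zlib


class HashedNgramMemory:
    """Toy n-gram memory: each n-gram is hashed to a row of an embedding table."""

    def __init__(self, num_buckets: int = 1024, dim: int = 8,
                 orders: tuple[int, ...] = (2, 3), seed: int = 0) -> None:
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        self.dim = dim
        self.orders = orders
        # One table per n-gram order; random values stand in for learned embeddings.
        self.tables = {
            n: [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_buckets)]
            for n in orders
        }

    def bucket(self, gram: str, n: int) -> int:
        """Deterministic hash of an n-gram to a bucket index (collisions are possible)."""
        return zlib.crc32(f"{n}:{gram}".encode("utf-8")) % self.num_buckets

    def lookup(self, tokens: list[str]) -> list[list[float]]:
        """For each position, sum the embeddings of the n-grams ending there."""
        out = [[0.0] * self.dim for _ in tokens]
        for t in range(len(tokens)):
            for n in self.orders:
                if t + 1 < n:
                    continue  # not enough left context for this order
                gram = "".join(tokens[t - n + 1:t + 1])
                row = self.tables[n][self.bucket(gram, n)]
                for d in range(self.dim):
                    out[t][d] += row[d]
        return out


def inject(hidden: list[list[float]], mem: list[list[float]]) -> list[list[float]]:
    """Add memory to hidden states, scaled by a gate computed from both (an assumption)."""
    fused = []
    for h, m in zip(hidden, mem):
        score = sum(hi * mi for hi, mi in zip(h, m))
        gate = 1.0 / (1.0 + math.exp(-score))  # sigmoid gate in (0, 1)
        fused.append([hi + gate * mi for hi, mi in zip(h, m)])
    return fused


tokens = list("CC(=O)Cl")
memory = HashedNgramMemory()
mem = memory.lookup(tokens)
hidden = [[0.0] * memory.dim for _ in tokens]  # placeholder hidden states
fused = inject(hidden, mem)
print(len(fused), len(fused[0]))
```

Because the table size is fixed by `num_buckets` rather than by the number of distinct $n$-grams, the memory scales to large pattern vocabularies at the cost of occasional hash collisions. Only $n$-grams ending at the current position are used here, which keeps the lookup compatible with left-to-right generation.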

In practice

Improve SMILES-based molecule generation.
Enhance forward reaction prediction accuracy.
Boost single-step retrosynthesis performance.

Topics

Molecular Language Models
SMILES Strings
$n$-gram Memory
Transformer Models
Chemical Informatics

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.