Augmenting Molecular Language Models with Local $n$-gram Memory
Summary
MolGram is a conditional $n$-gram memory module that addresses the "locality gap" in Transformer-based language models for SMILES strings. Character-level tokenization fragments chemically meaningful motifs, so these models must repeatedly relearn local syntax instead of focusing on long-range dependencies. MolGram maps local string patterns to learned embeddings through scalable hash lookups and dynamically injects this local context into the model's hidden states. It was evaluated on three tasks (unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis) and improved performance on all of them. Notably, it outperformed baselines with 3× more parameters, which suggests that explicit local pattern memory is a parameter-efficient inductive bias for molecular language models.
Key takeaway
If you develop molecular language models and character-level tokenization of SMILES strings is limiting performance, consider adding an explicit local $n$-gram memory module such as MolGram. In the reported experiments it improved molecule generation, forward reaction prediction, and single-step retrosynthesis, and it outperformed baselines with 3× more parameters. Whether these gains carry over to your own chemical domain tasks is worth testing.
Key insights
Adding local $n$-gram memory via MolGram addresses the locality gap in molecular language models, outperforming baselines that have 3× more parameters.
Principles
- Character-level tokenization creates a "locality gap" by fragmenting chemically meaningful motifs.
- Explicit local pattern memory is an efficient inductive bias.
- Offloading local syntax to $n$-gram memory lets the model focus on long-range dependencies.
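To make the locality gap concrete, here is a minimal sketch of what character-level tokenization does to a SMILES string. The aspirin SMILES and the helper names `char_tokens` and `ngrams` are illustrative assumptions, not taken from the original article.

```python
from collections import Counter

def char_tokens(smiles: str) -> list[str]:
    """Character-level tokenization: every character becomes its own token."""
    return list(smiles)

def ngrams(tokens: list[str], n: int) -> list[str]:
    """All contiguous windows of n tokens, joined back into strings."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Aspirin, chosen only as an illustration (not an example from the article).
aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = char_tokens(aspirin)

# The motif "C(=O)O" is a single chemical unit, but the model sees it
# as six unrelated tokens. A 6-gram window recovers it as one pattern.
counts = Counter(ngrams(tokens, 6))
print(len(tokens), counts["C(=O)O"])
```

The motif appears twice in the string, so a local pattern memory can recognize it as one recurring unit, whereas a character-level model has to reassemble it from six tokens each time.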
Method
MolGram hashes local string patterns ($n$-grams) into a table of learned embeddings, then dynamically injects the retrieved local context into the hidden states of the molecular language model.
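The lookup-and-inject idea can be sketched roughly as follows. This is a hedged illustration, not MolGram's actual implementation: the table size, embedding dimension, $n$-gram order, the CRC32 hash, and the sigmoid gate are all assumptions made for the example, and the embeddings are random rather than learned.

```python
import math
import random
import zlib

NUM_BUCKETS = 1024  # hypothetical hash-table size
DIM = 8             # hypothetical hidden size
N = 3               # hypothetical n-gram order

rng = random.Random(0)
# One vector per hash bucket; random here, learned during training in practice.
TABLE = [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]

def bucket(ngram: str) -> int:
    """Deterministically hash an n-gram into one of NUM_BUCKETS slots."""
    return zlib.crc32(ngram.encode("utf-8")) % NUM_BUCKETS

def local_memory(smiles: str) -> list[list[float]]:
    """For each position, fetch the embedding of the n-gram ending there."""
    memory = []
    for i in range(len(smiles)):
        gram = smiles[max(0, i - N + 1): i + 1]  # shorter near the string start
        memory.append(TABLE[bucket(gram)])
    return memory

def inject(hidden, memory, gate_weights):
    """Assumed gating rule: h + sigmoid(w . h) * m, applied per position."""
    mixed = []
    for h, m in zip(hidden, memory):
        score = sum(w * x for w, x in zip(gate_weights, h))
        gate = 1.0 / (1.0 + math.exp(-score))
        mixed.append([hi + gate * mi for hi, mi in zip(h, m)])
    return mixed

smiles = "c1ccccc1"                           # benzene, for illustration
hidden = [[0.0] * DIM for _ in smiles]        # stand-in for Transformer states
memory = local_memory(smiles)
out = inject(hidden, memory, [0.0] * DIM)     # zero gate weights give gate = 0.5
```

Hashing keeps the table size fixed regardless of how many distinct $n$-grams occur, at the cost of occasional collisions. Making the gate depend on the hidden state is one simple way to make the injection "dynamic"; the article does not specify the mechanism MolGram uses.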
In practice
- Improve SMILES-based molecule generation.
- Enhance forward reaction prediction accuracy.
- Boost single-step retrosynthesis performance.
Topics
- Molecular Language Models
- SMILES Strings
- $n$-gram Memory
- Transformer Models
- Chemical Informatics
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.