LinkBERT: Improving Language Model Training with Document Link
Summary
LinkBERT is a language model pretraining method that improves knowledge acquisition by incorporating document links, such as hyperlinks and citations, into training. Unlike conventional methods that process each document independently, LinkBERT treats the corpus as a document graph and creates "link-aware" training instances by concatenating segments from linked documents into the same input. It trains with two joint self-supervised tasks: masked language modeling (MLM), which encourages the model to learn multi-hop knowledge across the paired segments, and document relation prediction (DRP), which classifies the relationship between the two segments as contiguous, random, or linked. Pretrained on Wikipedia with hyperlinks and on PubMed with citation links, LinkBERT consistently outperforms baseline BERT models across general-domain and biomedical NLP tasks. The gains are most notable in multi-hop reasoning, robustness to distracting documents, and few-shot question answering, and BioLinkBERT achieves new state-of-the-art performance on the BLURB, MedQA, and MMLU benchmarks.
Key takeaway
For AI Scientists and Research Scientists developing or deploying language models, LinkBERT offers a direct path to improving model performance on knowledge-intensive and multi-hop reasoning tasks. You should consider integrating LinkBERT or BioLinkBERT from HuggingFace into your projects, especially for applications where information is distributed across multiple linked documents, such as question answering or knowledge discovery. This approach can lead to more robust and data-efficient models, even with limited finetuning data.
Key insights
Incorporating document links during pretraining significantly boosts language models' multi-hop reasoning and knowledge acquisition.
Principles
- Knowledge spans multiple documents.
- Document links provide high-quality relevance signals.
- Joint self-supervised tasks enhance learning.
Method
LinkBERT constructs a document graph, creates link-aware input sequences by concatenating linked document segments, and trains LMs using masked language modeling and document relation prediction tasks.
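The method described above can be illustrated with a minimal sketch, which is not the authors' implementation. The toy corpus, the `links` graph, the helper names, and the whitespace tokenization are all hypothetical. The masking step is also simplified: it replaces every selected token with `[MASK]`, whereas real BERT-style MLM uses a more elaborate replacement scheme.

```python
import random

# Hypothetical toy corpus: each document is a list of text segments.
corpus = {
    "tidal_basin": ["The Tidal Basin hosts a cherry blossom festival.",
                    "It is located in Washington, D.C."],
    "cherry_festival": ["The festival celebrates cherry trees.",
                        "The trees were a gift from Japan."],
    "pasta": ["Pasta is a staple of Italian cuisine.",
              "It is usually made from wheat."],
}

# Hypothetical document graph: an edge means "source links to target".
links = {"tidal_basin": ["cherry_festival"], "cherry_festival": [], "pasta": []}

# Class indices for document relation prediction (DRP).
DRP_LABELS = ("contiguous", "random", "linked")


def feasible_relations(doc_id, seg_idx):
    """Relations that can be sampled for this anchor segment."""
    rels = ["random"]
    if seg_idx + 1 < len(corpus[doc_id]):
        rels.append("contiguous")
    if links.get(doc_id):
        rels.append("linked")
    return rels


def make_instance(doc_id, seg_idx, relation, rng):
    """Pair anchor segment A with a segment B chosen according to `relation`."""
    seg_a = corpus[doc_id][seg_idx]
    if relation == "contiguous":
        seg_b = corpus[doc_id][seg_idx + 1]      # next segment, same document
    elif relation == "linked":
        target = rng.choice(links[doc_id])       # follow an edge in the graph
        seg_b = rng.choice(corpus[target])
    else:                                        # "random": unlinked document
        others = [d for d in corpus if d != doc_id and d not in links[doc_id]]
        seg_b = rng.choice(corpus[rng.choice(others)])
    text = f"[CLS] {seg_a} [SEP] {seg_b} [SEP]"
    return text, DRP_LABELS.index(relation)


def mask_tokens(text, rng, p=0.15):
    """Simplified MLM masking over whitespace tokens; returns text and targets."""
    specials = {"[CLS]", "[SEP]"}
    masked, targets = [], []
    for i, tok in enumerate(text.split()):
        if tok not in specials and rng.random() < p:
            masked.append("[MASK]")
            targets.append((i, tok))             # position and original token
        else:
            masked.append(tok)
    return " ".join(masked), targets
```

In this sketch each training instance carries two signals: the masked tokens to recover (MLM) and the relation class between the two segments (DRP). When segment B comes from a linked document, recovering a masked token in one segment can depend on information in the other, which is the intuition behind the multi-hop benefit.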
In practice
- Use LinkBERT as a drop-in replacement for BERT.
- Apply BioLinkBERT for biomedical NLP tasks.
- Finetune LinkBERT with limited data for QA.
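As a rough sketch of the drop-in usage suggested above, the snippet below loads a checkpoint through the HuggingFace `transformers` Auto classes and encodes a question and passage pair. The checkpoint ids in `CHECKPOINTS` are assumptions about the names published on the HuggingFace Hub and should be verified there; `pick_checkpoint` and `encode_pair` are hypothetical helper names introduced only for this illustration.

```python
# Assumed HuggingFace Hub ids; verify on the Hub before relying on them.
CHECKPOINTS = {
    "general": "michiyasunaga/LinkBERT-base",
    "biomedical": "michiyasunaga/BioLinkBERT-base",
}


def pick_checkpoint(domain: str) -> str:
    """Return the assumed checkpoint id for a domain, defaulting to general."""
    return CHECKPOINTS.get(domain, CHECKPOINTS["general"])


def encode_pair(question: str, passage: str, domain: str = "general"):
    """Encode a (question, passage) pair exactly as one would with BERT."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    name = pick_checkpoint(domain)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    model.eval()

    # Two-segment input, the same format a BERT checkpoint would receive.
    inputs = tokenizer(question, passage, return_tensors="pt", truncation=True)
    outputs = model(**inputs)
    return outputs.last_hidden_state  # shape: (batch, sequence, hidden)
```

Calling `encode_pair("Where are the cherry trees from?", "The trees were a gift from Japan.")` downloads the checkpoint on first use. For finetuning on a QA dataset, the same checkpoint id could be passed to a task-specific class such as `AutoModelForQuestionAnswering`, leaving the rest of an existing BERT pipeline unchanged.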
Topics
- LinkBERT
- Language Model Pretraining
- Multi-hop Reasoning
- Document Graph
- Few-shot Learning
Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Stanford AI Lab Blog.