TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale
Summary
TFGN is an architectural overlay for transformer language models that enables task-free, replay-free continual pre-training without catastrophic forgetting at LLM scale. It targets continual pre-training on diverse text domains without replay buffers, explicit task identifiers, or computationally expensive regularization penalties, and it produces input-conditioned, parameter-efficient updates while leaving the core transformer architecture intact.
TFGN was evaluated across six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) with 1B tokens per phase, at model scales up to ~9B parameters. In the LLaMA 3.1 8B Retrofit setting, it reports a backward transfer of -0.007 and HellaSwag retention of 0.506/0.504/0.510. Gradients from different domain pairs show L2-orthogonal separation of >=99.59%, and cross-domain forward transfer is positive: for example, Python training lowers held-out JavaScript PPL by 26.8% in the LLaMA 3.1 8B Retrofit setting.
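To make the backward-transfer and perplexity-drop figures easier to interpret, here is a minimal sketch of how such numbers are commonly computed. It assumes the standard continual-learning definition of backward transfer (the mean change on earlier domains between the end of their own phase and the end of the final phase); the summary does not state which definition or underlying score TFGN uses. The function names and all numbers below are hypothetical and are not results from the paper.

```python
def backward_transfer(scores):
    """Mean change on earlier domains after the final training phase.

    scores[i][j] is the evaluation score on domain j measured right after
    training phase i (higher is better). Negative values indicate forgetting.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two training phases")
    final = scores[-1]
    # Compare each earlier domain's final score with its score just after
    # the phase in which that domain was trained.
    deltas = [final[j] - scores[j][j] for j in range(n - 1)]
    return sum(deltas) / len(deltas)


def relative_ppl_drop(ppl_before, ppl_after):
    """Fractional perplexity reduction, e.g. 0.25 means a 25% drop."""
    return (ppl_before - ppl_after) / ppl_before


# Illustrative values only (three phases, three domains).
scores = [
    [0.50, 0.10, 0.10],
    [0.49, 0.60, 0.12],
    [0.48, 0.59, 0.70],
]
bwt = backward_transfer(scores)
drop = relative_ppl_drop(20.0, 15.0)
print(f"backward transfer: {bwt:.3f}, PPL drop: {drop:.1%}")
```

Under this definition, a value near zero such as -0.007 would mean earlier domains lost very little after later phases.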
Key takeaway
For research scientists developing large language models, TFGN offers a way to limit catastrophic forgetting during continual pre-training. If your team is constrained by the computational overhead of replay buffers or the complexity of task-specific regularization, TFGN's architectural overlay is worth evaluating. In the reported experiments it supports domain-diverse learning with little measured loss of previously acquired knowledge, which could simplify LLM development and deployment cycles.
Key insights
TFGN enables replay-free, task-free continual pre-training for LLMs by architecturally separating read and write operations.
Principles
- Separate read/write paths for updates
- Achieve orthogonal gradient separation
- Support autonomous meta-control
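The gradient-separation principle above can be illustrated by measuring how much two domains' gradients overlap. The sketch below uses cosine similarity between flattened gradient vectors and treats `1 - |cos|` as a separation score. This is an assumption for illustration: the summary does not define how the >=99.59% figure is computed, and the vectors and names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length gradient vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def separation(u, v):
    """Hypothetical separation score: 1.0 means fully orthogonal gradients."""
    return 1.0 - abs(cosine(u, v))


# Toy gradients that touch disjoint coordinates, so they are orthogonal.
g_python = [1.0, 0.0, 2.0, 0.0]
g_math = [0.0, 3.0, 0.0, 1.0]
sep = separation(g_python, g_math)
print(f"separation: {sep:.4f}")
```

When two domains' gradients are nearly orthogonal, a first-order update for one domain has almost no component along the other domain's gradient direction, which is one intuition for why high separation goes with low forgetting.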
Method
TFGN uses an architectural overlay to produce input-conditioned, parameter-efficient updates. The forward pass is dense, but cross-domain parameter updates are structured to avoid writing to prior-domain subspaces.
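As a rough intuition for "avoiding writes to prior-domain subspaces", the sketch below shows explicit orthogonal gradient projection: the component of a new gradient that lies in a protected subspace is removed before the update. This is a swapped-in, generic technique and not TFGN's mechanism, which the summary describes as an architectural overlay rather than a projection step. The function name, the orthonormal basis, and the numbers are hypothetical.

```python
def project_out(grad, basis):
    """Remove from grad its components along each orthonormal basis vector.

    basis is assumed to be a list of mutually orthogonal unit vectors that
    span the subspace to protect (an assumption of this sketch).
    """
    out = list(grad)
    for b in basis:
        coeff = sum(g * x for g, x in zip(grad, b))
        out = [o - coeff * x for o, x in zip(out, b)]
    return out


# Protect the first two coordinates, standing in for a prior-domain subspace.
prior_basis = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
new_grad = [0.5, -2.0, 3.0]
safe_grad = project_out(new_grad, prior_basis)

# The projected gradient has no component along any protected direction.
residual = [sum(g * x for g, x in zip(safe_grad, b)) for b in prior_basis]
print(safe_grad, residual)
```

An update that follows `safe_grad` leaves the protected directions unchanged, so whatever the model stores there is not overwritten by the new domain.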
In practice
- Apply TFGN to LLaMA 3.1 8B Retrofit
- Integrate closed-loop meta-control
- Use operator-level plan vectors
Topics
- TFGN Architecture
- Continual Pre-Training
- Catastrophic Forgetting
- Large Language Models
- Read/Write Decomposition
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.