TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

TFGN is an architectural overlay for transformer language models that supports continual pre-training across diverse text domains without replay buffers, explicit task identifiers, or computationally expensive regularization penalties. It produces input-conditioned, parameter-efficient updates while preserving the core transformer architecture. The method was evaluated on six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript), with 1B tokens per phase and at model scales up to ~9B parameters. TFGN showed a backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, along with HellaSwag retention of 0.506/0.504/0.510. L2-orthogonal gradient separation between domain pairs was at least 99.59%, and cross-domain forward transfer was positive: for example, training on Python lowered held-out JavaScript PPL by 26.8% at LLaMA-8B Retrofit.
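The two headline metrics above can be made concrete with a small sketch. The summary does not say how backward transfer is computed, so the code assumes one common definition: the average change in a domain's score between the end of its own training phase and the end of the final phase, with higher scores being better. The function names and all numbers below are hypothetical and purely illustrative; they are not results from the paper.

```python
def backward_transfer(scores):
    """Average change on earlier domains after all phases are finished.

    Assumed layout: scores[i][j] is the score on domain j measured after
    training phase i, with higher meaning better. A negative result means
    earlier domains got worse (forgetting).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two training phases")
    final = scores[-1]
    # Compare the final score on each earlier domain with the score it had
    # right after its own phase. The last domain has no later phase to compare.
    deltas = [final[j] - scores[j][j] for j in range(n - 1)]
    return sum(deltas) / len(deltas)


def relative_ppl_drop(ppl_before, ppl_after):
    """Fractional reduction in perplexity (0.268 corresponds to a 26.8% drop)."""
    return (ppl_before - ppl_after) / ppl_before


# Toy score matrix for three phases and three domains (made-up values).
toy_scores = [
    [0.60, 0.10, 0.10],
    [0.59, 0.70, 0.20],
    [0.58, 0.69, 0.80],
]
bwt = backward_transfer(toy_scores)

# Toy perplexities chosen so the relative drop matches the size of the
# reported forward-transfer figure; the actual perplexities are not given.
drop = relative_ppl_drop(10.0, 7.32)
```

Under this definition, a value close to zero such as the reported -0.007 would indicate that earlier domains were almost fully retained. If the paper measures backward transfer on loss rather than on a higher-is-better score, the sign convention flips.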

Key takeaway

For research scientists developing large language models, TFGN offers a way to avoid catastrophic forgetting during continual pre-training without the computational overhead of replay buffers or the complexity of task-specific regularization. Its architectural overlay supports learning across diverse domains while retaining previously acquired knowledge, which may simplify LLM development and deployment cycles.

Key insights

TFGN enables replay-free, task-free continual pre-training for LLMs by architecturally separating read and write operations.

Principles

Method

TFGN uses an architectural overlay to produce input-conditioned, parameter-efficient updates. The forward pass is dense, but cross-domain parameter updates are structured to avoid writing to prior-domain subspaces.
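To illustrate what "avoid writing to prior-domain subspaces" can mean, here is a minimal sketch of orthogonal gradient projection. This is a generic, well-known technique swapped in for illustration only. It is not TFGN's mechanism, which the summary describes as architectural rather than as a gradient post-processing step. The helper names, the three-dimensional toy vectors, and the choice of Gram-Schmidt are all assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(grad, basis):
    """Remove from `grad` every component lying in the span of `basis`."""
    g = list(grad)
    for b in basis:
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g


def cosine(u, v):
    """Cosine similarity; 0.0 means the two vectors are orthogonal."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical directions that earlier domains rely on (toy 3-D example).
prior_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(prior_directions)

# Hypothetical raw gradient from the new domain.
new_grad = [0.5, -2.0, 3.0]

# The projected update has no component along any prior direction.
safe_grad = project_out(new_grad, basis)
```

After projection, the cosine similarity between `safe_grad` and each prior direction is zero, so a step along it leaves the prior subspace untouched. How the paper computes its reported gradient separation figure is not stated in this summary, so the cosine here is only one plausible way to quantify overlap.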

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.