Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models

Source: Machine Learning ML & Generative AI News · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Advanced, quick

Summary

Nous Research has introduced Token Superposition Training (TST), a two-phase modification to the standard pre-training loop for Large Language Models (LLMs) that reduces pre-training compute without altering the tokenizer, architecture, or inference behavior. In Phase 1, TST averages s contiguous token embeddings into a single latent s-token and trains with a multi-hot cross-entropy loss against the next bag of tokens. Phase 2 resumes from the same checkpoint with standard next-token prediction and the TST code removed. The reported speedup is up to 2.5x: a 3B dense model reaches a loss of 2.676 in 247 B200-hours versus 443 B200-hours for the baseline (about 1.8x), and a 10B-A1B MoE model completes in 4,768 B200-hours versus 12,311 B200-hours (about 2.6x). Each TST step uses the same FLOPs as a baseline step because the data sequence length is increased s times, not the batch size. Optimal bag sizes s range from 3 to 8 for 270M models, 6 to 10 for 600M models, and 16 for 10B models, with a step ratio r between 0.2 and 0.4. Re-initializing the embedding or LM head at the phase boundary degrades performance.
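As a quick arithmetic check on the compute figures quoted in the summary, the short sketch below divides baseline B200-hours by TST B200-hours for the two reported runs. The `speedup` helper and the `runs` dictionary are illustrative names, not part of any released code.

```python
# B200-hour figures quoted in the summary: (baseline, TST).
runs = {
    "3B dense": (443, 247),
    "10B-A1B MoE": (12_311, 4_768),
}


def speedup(baseline_hours: float, tst_hours: float) -> float:
    """Ratio of baseline compute to TST compute for the same run."""
    return baseline_hours / tst_hours


for name, (baseline, tst) in runs.items():
    print(f"{name}: {speedup(baseline, tst):.2f}x")
# 3B dense: 1.79x
# 10B-A1B MoE: 2.58x
```

The 10B-A1B MoE run is the one that corresponds to the headline "up to 2.5x" figure; the 3B dense run lands closer to 1.8x.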

Key takeaway

For AI engineers and research scientists optimizing LLM pre-training, Token Superposition Training (TST) is worth evaluating: it reports speedups of up to 2.5x on models from 270M to 10B parameters without requiring architectural changes. Carry the embedding and LM head weights across the phase boundary unchanged, since re-initializing either one degrades performance.
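To make the phase boundary concrete, here is a minimal scheduling sketch. It assumes the step ratio r is the fraction of total optimizer steps spent in Phase 1; the summary does not define r precisely, so treat that reading as an assumption. The helper names and the `tokens_per_step` parameter (baseline batch size times sequence length) are hypothetical.

```python
def phase1_steps(total_steps: int, r: float) -> int:
    """Steps run in Phase 1, ASSUMING r = Phase 1 steps / total steps."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    return int(round(r * total_steps))


def phase_for_step(step: int, total_steps: int, r: float) -> int:
    """Return 1 while in the superposition phase, 2 afterwards.

    The same parameters (including embedding and LM head) are assumed to
    carry over at the boundary; nothing is re-initialized.
    """
    return 1 if step < phase1_steps(total_steps, r) else 2


def raw_tokens_read(total_steps: int, r: float, s: int, tokens_per_step: int) -> int:
    """Raw tokens consumed over the run.

    Each Phase 1 step reads s times the baseline tokens (longer data
    sequence, same batch size); each Phase 2 step reads the baseline amount.
    """
    p1 = phase1_steps(total_steps, r)
    return p1 * s * tokens_per_step + (total_steps - p1) * tokens_per_step


# Toy numbers, chosen only for illustration.
print(phase_for_step(299, 1000, 0.3), phase_for_step(300, 1000, 0.3))
print(raw_tokens_read(1000, 0.3, 8, 1000))
```

Under these toy settings the run reads 3.1 million raw tokens in the same number of equal-FLOP steps that a baseline would use to read 1 million.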

Key insights

Token Superposition Training accelerates LLM pre-training by modifying the training loop without changing model architecture or inference.

Principles

Method

TST uses a two-phase pre-training loop. Phase 1 averages s contiguous token embeddings into a single latent s-token and trains with a multi-hot cross-entropy loss against the next bag of tokens; Phase 2 reverts to standard next-token prediction from the same checkpoint.
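The Phase 1 mechanics can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the Nous Research implementation: the target is taken to be the next bag's token counts divided by s (the summary does not specify a normalization), and a single matrix multiply stands in for the transformer and LM head. All function names are hypothetical.

```python
import numpy as np


def superpose(token_ids, embedding, s):
    """Average each run of s contiguous token embeddings into one latent s-token.

    token_ids: (batch, T * s) ints; embedding: (vocab, dim).
    Returns latents of shape (batch, T, dim) and the bags (batch, T, s).
    """
    batch, length = token_ids.shape
    assert length % s == 0, "sequence length must be a multiple of s"
    bags = token_ids.reshape(batch, length // s, s)
    return embedding[bags].mean(axis=2), bags


def multi_hot_targets(bags, vocab_size):
    """Turn each bag into a (vocab,) vector with weight count / s per token.

    ASSUMPTION: normalizing by s is one plausible reading of "multi-hot".
    """
    batch, steps, s = bags.shape
    targets = np.zeros((batch, steps, vocab_size))
    b_idx, t_idx = np.indices((batch, steps))
    for k in range(s):  # one slot of the bag at a time, so repeats accumulate
        targets[b_idx, t_idx, bags[:, :, k]] += 1.0 / s
    return targets


def multi_hot_cross_entropy(logits, targets):
    """Mean over positions of -sum_v targets[v] * log_softmax(logits)[v]."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=-1).mean())


rng = np.random.default_rng(0)
vocab, dim, s = 50, 8, 4
embedding = rng.normal(size=(vocab, dim))
lm_head = rng.normal(size=(dim, vocab))
token_ids = rng.integers(0, vocab, size=(2, 6 * s))

latents, bags = superpose(token_ids, embedding, s)
logits = latents @ lm_head  # stand-in for the transformer + LM head
targets = multi_hot_targets(bags, vocab)
# Position t predicts the bag at position t + 1.
loss = multi_hot_cross_entropy(logits[:, :-1], targets[:, 1:])
print(latents.shape, loss)
```

With s = 1 each bag holds a single token, the target becomes one-hot, and the loss reduces to ordinary next-token cross-entropy, which is consistent with Phase 2 continuing from the same weights.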

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.