A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget
Summary
This study examines training dynamics in a small (4.26-million-parameter) Llama-style language model trained under a fixed, compute-constrained budget of approximately 20 million cumulative training tokens, using the TinyStories corpus and full-precision CPU training. Six independent training runs were conducted, with metrics collected at 21 intervals, yielding 126 observations. Repeated measures ANOVA revealed statistically significant interval effects for validation loss, validation perplexity, and rolling volatility. Mean validation loss decreased from 8.3552 at initialization to 2.7996 near 4 million tokens, then increased to 3.9010 by the final 20,000,768-token checkpoint. Validation perplexity followed a similar pattern. Derived telemetry indicated recurrent validation-loss backslides and no evidence of a stable phase. These results suggest that compute-aware language model evaluation should analyze training trajectories rather than endpoint metrics alone, because additional token exposure can increase cost without proportional generalization gains.
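The arithmetic behind the reported trajectory can be sketched as follows. This is an illustration only: it assumes the loss is a mean cross-entropy in natural-log units, so that perplexity is exp(loss), which the summary does not state. It also places the minimum at exactly 4,000,000 tokens, whereas the summary says only "near 4 million". The perplexities printed here are those implied by the mean losses and need not equal the study's mean perplexities, since averaging across runs and exponentiating do not commute.

```python
import math

# Mean validation losses reported in the summary, keyed by cumulative tokens.
# Only these three points appear in the text; 4_000_000 is an approximation
# of "near 4 million tokens".
reported_mean_loss = {
    0: 8.3552,           # initialization
    4_000_000: 2.7996,   # lowest reported mean loss (approximate position)
    20_000_768: 3.9010,  # final checkpoint
}

def perplexity(loss_nats: float) -> float:
    """Perplexity implied by a mean cross-entropy loss (assumes natural-log units)."""
    return math.exp(loss_nats)

best_tokens = min(reported_mean_loss, key=reported_mean_loss.get)
final_tokens = max(reported_mean_loss)
best_loss = reported_mean_loss[best_tokens]
final_loss = reported_mean_loss[final_tokens]

# Fraction of the token budget spent after the lowest reported mean loss.
tokens_after_best = (final_tokens - best_tokens) / final_tokens
# Relative increase in loss between the best reported point and the endpoint.
relative_regression = (final_loss - best_loss) / best_loss

for tokens, loss in reported_mean_loss.items():
    print(f"{tokens:>12,d} tokens  loss={loss:.4f}  implied ppl={perplexity(loss):.1f}")
print(f"budget spent after best point: {tokens_after_best:.0%}")
print(f"loss regression at endpoint:   {relative_regression:.1%}")
```

Under these assumptions, most of the token budget is consumed after the lowest reported loss, which is the cost-without-gain pattern the summary describes.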
Key takeaway
Machine learning engineers training LLMs in resource-constrained environments should prioritize continuous monitoring of validation loss and perplexity trajectories. Relying solely on final performance metrics can obscure critical non-monotonic behavior, such as early degradation or persistent instability. Interval-level telemetry helps identify good stopping points, avoiding both unnecessary compute cost and the regression in generalization performance that continued training can cause.
Key insights
Compute-constrained LLM training trajectories reveal non-monotonic degradation and instability, challenging endpoint-only evaluation.
Principles
- Training dynamics are non-linear and non-monotonic.
- Endpoint metrics alone obscure critical training behavior.
- Additional tokens can increase cost without generalization gains.
Method
A quantitative experimental repeated measures design analyzed validation loss, perplexity, and volatility across 21 token-based intervals for a 4.26M-parameter Llama-style model on TinyStories, using 6 CPU-based runs.
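As a rough sketch of the analysis named above, a one-way repeated measures ANOVA treats each training run as a subject and each interval as a within-subject condition. The function below is a minimal pure-Python version, not the study's actual analysis code. The data are synthetic placeholders, not the study's measurements. It returns only the F statistic and degrees of freedom, and it does not test or correct for sphericity. A p-value could be obtained from an F distribution, for example with `scipy.stats.f.sf`.

```python
import random

def rm_anova_f(data):
    """One-way repeated measures ANOVA.

    data[i][j] is the metric for run i (subject) at interval j (condition).
    Returns (F, df_interval, df_error).
    """
    n = len(data)      # number of runs
    k = len(data[0])   # number of intervals
    grand = sum(sum(row) for row in data) / (n * k)
    run_means = [sum(row) / k for row in data]
    interval_means = [sum(data[i][j] for i in range(n)) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_runs = k * sum((m - grand) ** 2 for m in run_means)
    ss_interval = n * sum((m - grand) ** 2 for m in interval_means)
    # Residual variation after removing run and interval effects.
    ss_error = ss_total - ss_runs - ss_interval

    df_interval = k - 1
    df_error = (n - 1) * (k - 1)
    f_stat = (ss_interval / df_interval) / (ss_error / df_error)
    return f_stat, df_interval, df_error

# Synthetic placeholder trajectories with the study's shape: 6 runs x 21 intervals.
random.seed(0)
n_runs, n_intervals = 6, 21
synthetic = [
    [8.0 / (1 + j) + 3.0 + random.gauss(0, 0.1) for j in range(n_intervals)]
    for _ in range(n_runs)
]
f_stat, df1, df2 = rm_anova_f(synthetic)
print(f"F({df1}, {df2}) = {f_stat:.1f}")
```

With 6 runs and 21 intervals, the interval effect has 20 degrees of freedom and the error term has 100.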
In practice
- Monitor validation loss trajectories, not just final scores.
- Implement interval-level telemetry for stability metrics.
- Consider early stopping when degradation or instability emerges.
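The practices listed above could be prototyped along these lines. This is a hedged sketch: the function names, the window size, and the patience rule are illustrative choices, and the study's own definitions of rolling volatility and backslides may differ. The trajectory values are made up for demonstration.

```python
from statistics import pstdev

def rolling_volatility(losses, window=3):
    """Population std. dev. of the last `window` losses (None until the window fills)."""
    out = []
    for i in range(len(losses)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(pstdev(losses[i + 1 - window : i + 1]))
    return out

def backslides(losses, tolerance=0.0):
    """Indices where loss rose versus the previous interval by more than `tolerance`."""
    return [i for i in range(1, len(losses)) if losses[i] - losses[i - 1] > tolerance]

def early_stop_index(losses, patience=3, min_delta=0.0):
    """Interval at which patience-based early stopping would halt, or None."""
    best = float("inf")
    stale = 0  # consecutive intervals without a new best loss
    for i, loss in enumerate(losses):
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None

# Made-up validation-loss trajectory, one value per evaluation interval.
val_loss = [8.36, 4.10, 3.20, 2.90, 2.80, 3.05, 2.95, 3.30, 3.60, 3.90]

print("backslide intervals:", backslides(val_loss))
print("early stop at interval:", early_stop_index(val_loss, patience=3))
print("rolling volatility:", rolling_volatility(val_loss))
```

In a real training loop these checks would run at each evaluation interval, with the checkpoint from the best interval retained rather than the final one.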
Topics
- LLM Training Dynamics
- Compute-Constrained Training
- Validation Loss
- Perplexity
- Repeated Measures ANOVA
- TinyStories Corpus
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.