A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

This study examines the training dynamics of a 4.26-million-parameter Llama-style language model trained on the TinyStories corpus under a fixed, compute-constrained budget of approximately 20 million cumulative training tokens, using CPU-based full-precision training. Six independent training runs were each measured at 21 intervals, yielding 126 observations. Repeated measures ANOVA revealed statistically significant interval effects for validation loss, validation perplexity, and rolling volatility. Mean validation loss decreased from 8.3552 at initialization to 2.7996 near 4 million tokens, then increased to 3.9010 by the final 20,000,768-token checkpoint. Validation perplexity followed a similar pattern. Derived telemetry indicated recurrent validation-loss backslides and no evidence of a stable phase. These results suggest that compute-aware language model evaluation should analyze training trajectories rather than endpoint metrics alone, because additional token exposure can increase cost without proportional generalization gains.
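To make the reported loss figures concrete, the following minimal sketch converts them to implied perplexities and computes how far the final checkpoint regressed from the best reported mean. It assumes the loss is a mean cross-entropy in nats, so that perplexity is the exponential of the loss; the summary does not state the units. The variable names are illustrative, and the study's own mean perplexity is averaged across runs, so it need not equal the exponential of the mean loss.

```python
import math

# Mean validation losses quoted in the summary (assumed to be in nats per token).
reported_loss = {
    "initialization": 8.3552,
    "near 4M tokens": 2.7996,
    "final (20,000,768 tokens)": 3.9010,
}


def perplexity(loss_nats: float) -> float:
    """Perplexity implied by a cross-entropy loss expressed in nats."""
    return math.exp(loss_nats)


implied_ppl = {label: perplexity(loss) for label, loss in reported_loss.items()}

best = reported_loss["near 4M tokens"]
final = reported_loss["final (20,000,768 tokens)"]
# Relative increase in loss between the best reported mean and the endpoint.
relative_regression = (final - best) / best

for label, loss in reported_loss.items():
    print(f"{label}: loss={loss:.4f}, implied perplexity={implied_ppl[label]:.2f}")
print(f"loss regression from best to final: {relative_regression:.1%}")
```

Under that assumption, the endpoint loss is about 39% higher than the best reported mean, a gap that an endpoint-only evaluation would not reveal.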

Key takeaway

Machine learning engineers optimizing LLM training in resource-constrained environments should prioritize continuous monitoring of validation loss and perplexity trajectories. Relying solely on final performance metrics can obscure critical non-monotonic behavior, such as early degradation or persistent instability. Interval-level telemetry helps identify stopping points before continued training adds computational cost and risks regression in generalization performance.

Key insights

Compute-constrained LLM training trajectories reveal non-monotonic degradation and instability, challenging endpoint-only evaluation.

Principles

Training dynamics are non-linear and non-monotonic.
Endpoint metrics alone obscure critical training behavior.
Additional tokens can increase cost without generalization gains.

Method

The study used a quantitative experimental repeated measures design, analyzing validation loss, perplexity, and rolling volatility across 21 token-based intervals for a 4.26M-parameter Llama-style model trained on TinyStories in 6 CPU-based runs.
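As an illustration of the design described above, here is a minimal sketch of a one-way repeated measures ANOVA with runs as subjects and intervals as the within-subject factor. It is not the authors' analysis code: the function name `rm_anova_f` is hypothetical, the data are synthetic, and no sphericity check or correction is applied. Only the 6 x 21 layout mirrors the study.

```python
import math
import random


def rm_anova_f(data):
    """One-way repeated measures ANOVA.

    data[r][t] is the metric for run r at interval t.
    Returns (F statistic, df for intervals, df for error).
    """
    n_runs = len(data)
    n_int = len(data[0])
    grand = sum(sum(row) for row in data) / (n_runs * n_int)
    run_means = [sum(row) / n_int for row in data]
    int_means = [sum(data[r][t] for r in range(n_runs)) / n_runs
                 for t in range(n_int)]

    # Partition total variation into interval, run (subject), and residual parts.
    ss_interval = n_runs * sum((m - grand) ** 2 for m in int_means)
    ss_runs = n_int * sum((m - grand) ** 2 for m in run_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_interval - ss_runs

    df_interval = n_int - 1
    df_error = (n_int - 1) * (n_runs - 1)
    f_stat = (ss_interval / df_interval) / (ss_error / df_error)
    return f_stat, df_interval, df_error


# Synthetic stand-in data: 6 runs x 21 intervals. The curve shape is made up.
random.seed(0)
N_RUNS, N_INTERVALS = 6, 21
synthetic = []
for r in range(N_RUNS):
    run_offset = random.gauss(0.0, 0.05)  # per-run shift
    row = []
    for t in range(N_INTERVALS):
        trend = 3.0 + 5.0 * math.exp(-t) + 0.05 * t  # fast drop, slow rise
        row.append(trend + run_offset + random.gauss(0.0, 0.1))
    synthetic.append(row)

f_stat, df_interval, df_error = rm_anova_f(synthetic)
n_obs = sum(len(row) for row in synthetic)
print(f"observations={n_obs}, F({df_interval}, {df_error}) = {f_stat:.1f}")
```

With 6 runs and 21 intervals the test has 20 and 100 degrees of freedom over 126 observations. A p-value would come from the upper tail of the F distribution, for example via `scipy.stats.f.sf`.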

In practice

Monitor validation loss trajectories, not just final scores.
Implement interval-level telemetry for stability metrics.
Consider early stopping when degradation or instability emerges.
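The three practices above can be sketched as a small set of interval-level telemetry helpers. This is an assumption-laden illustration rather than the study's telemetry: the function names, the window size, the patience value, and the example trajectory are all hypothetical.

```python
from statistics import pstdev


def count_backslides(losses, tol=0.0):
    """Number of intervals where validation loss rose versus the previous one."""
    return sum(1 for prev, cur in zip(losses, losses[1:]) if cur - prev > tol)


def rolling_volatility(losses, window=3):
    """Population std of loss over a trailing window; None until the window fills."""
    out = []
    for i in range(len(losses)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(pstdev(losses[i + 1 - window:i + 1]))
    return out


def early_stop_index(losses, patience=3, min_delta=0.0):
    """Index of the checkpoint where a patience rule would stop, else None."""
    best = float("inf")
    stale = 0
    for i, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None


# Hypothetical validation-loss trajectory, one value per checkpoint (not study data).
demo = [8.0, 4.0, 3.2, 2.9, 2.8, 3.0, 2.9, 3.1, 3.4, 3.9]

backslides = count_backslides(demo)
volatility = rolling_volatility(demo)
stop_at = early_stop_index(demo, patience=3)
best_at = demo.index(min(demo))

print(f"backslides={backslides}, best checkpoint={best_at}, stop at={stop_at}")
```

A patience rule stops a few checkpoints after the best one, so the best checkpoint index should be saved separately and restored when the rule triggers.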

Topics

LLM Training Dynamics
Compute-Constrained Training
Validation Loss
Perplexity
Repeated Measures ANOVA
TinyStories Corpus

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.