LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

2026-04-09 · Source: Apple Machine Learning Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

LaCy is a pretraining method for Small Language Models (SLMs) that addresses their limited capacity and their tendency to make factual errors. It targets one question: which tokens should an SLM learn to predict itself, and which should it delegate to an external source such as a larger model or a database? The researchers found that prediction loss alone does not settle this. Some tokens incur high loss yet are acceptable, truthful alternative continuations, so high loss by itself should not trigger delegation. LaCy augments the loss signal with a spaCy grammar parser to separate tokens that are safe to learn and predict from tokens that should be delegated to avoid factual inaccuracies. In the reported experiments, LaCy models learn this delegation behavior and achieve higher FactScores in cascaded generation with a larger model, outperforming SLMs trained with Rho or with an LLM judge while being simpler and cheaper to train.

Key takeaway

If you develop or deploy Small Language Models, consider adding LaCy-style token delegation to your pretraining pipeline. The reported results suggest it improves factual accuracy when the SLM runs in a cascade with a larger model, and that it is simpler and cheaper than Rho- or LLM-judge-based training. Prefer grammar-aware delegation over purely loss-based decisions when the goal is to reduce factual errors.

Key insights

SLMs can improve factual accuracy by learning to delegate specific tokens instead of trying to predict every token themselves.

Principles

SLM capacity is parameter-bound.
Loss alone is insufficient for token delegation.
Grammar parsing aids delegation decisions.

Method

LaCy uses a spaCy grammar parser to augment the loss signal, guiding SLMs to decide which tokens to predict directly and which to delegate to an external source for improved factual accuracy.
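
To make the idea concrete, here is a minimal sketch of how a grammar signal could be combined with per-token loss to label tokens for delegation. This is an illustration under stated assumptions, not LaCy's actual rule: the set of part-of-speech tags, the loss threshold, and the names `TokenInfo` and `delegation_mask` are all hypothetical. The `pos` field stands in for a coarse tag such as the one spaCy exposes as `token.pos_`; the tags are hard-coded here so the example runs without a parser installed.

```python
from dataclasses import dataclass
from typing import List, Set

# Assumptions for illustration only; neither value comes from the source.
FACT_BEARING_POS: Set[str] = {"PROPN", "NUM"}  # tags treated as fact-bearing
LOSS_THRESHOLD: float = 2.0                    # per-token loss cutoff


@dataclass
class TokenInfo:
    text: str
    pos: str     # coarse part-of-speech tag (e.g. spaCy's token.pos_)
    loss: float  # per-token cross-entropy of the small model


def delegation_mask(
    tokens: List[TokenInfo],
    threshold: float = LOSS_THRESHOLD,
    fact_pos: Set[str] = FACT_BEARING_POS,
) -> List[bool]:
    """Return True for tokens to delegate, False for tokens to learn.

    A token is delegated only if its loss is high AND its tag is in the
    fact-bearing set, so high loss alone never triggers delegation.
    """
    return [t.loss > threshold and t.pos in fact_pos for t in tokens]


example = [
    TokenInfo("Marie", "PROPN", 1.2),
    TokenInfo("Curie", "PROPN", 0.8),
    TokenInfo("was", "AUX", 0.3),
    TokenInfo("born", "VERB", 2.6),  # high loss, but not fact-bearing
    TokenInfo("in", "ADP", 0.2),
    TokenInfo("1867", "NUM", 5.1),   # high loss and fact-bearing
]
mask = delegation_mask(example)
print([t.text for t, d in zip(example, mask) if d])
```

In this toy input, "born" has a loss above the threshold but is kept as a learnable token, while "1867" is marked for delegation. That mirrors the point that loss alone is not a sufficient criterion.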

In practice

Integrate grammar parsing into SLM pretraining.
Consider token delegation for factual consistency.
Evaluate SLMs with FactScores in cascaded setups.
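
A cascaded setup can be sketched as a decoding loop in which the small model either emits a normal token or signals that the next token should come from the larger model. The sketch below is a simplified illustration, not the authors' implementation: the `<delegate>` marker, the function `cascaded_generate`, and the two toy models are hypothetical stand-ins, and real models would be called in place of the scripted functions.

```python
from typing import Callable, List, Tuple

DELEGATE = "<delegate>"  # hypothetical marker the small model emits to hand off
EOS = "<eos>"

NextToken = Callable[[List[str]], str]


def cascaded_generate(
    small_next: NextToken,
    large_next: NextToken,
    prompt: List[str],
    max_tokens: int = 32,
) -> Tuple[List[str], int]:
    """Generate with the small model, handing single tokens to the large one.

    Returns the full token sequence and the number of delegated tokens.
    """
    out = list(prompt)
    delegated = 0
    for _ in range(max_tokens):
        tok = small_next(out)
        if tok == DELEGATE:
            # The small model declined to predict; ask the large model.
            tok = large_next(out)
            delegated += 1
        if tok == EOS:
            break
        out.append(tok)
    return out, delegated


# Toy stand-ins so the loop can be run end to end.
def toy_small(context: List[str]) -> str:
    return {4: "in", 5: DELEGATE}.get(len(context), EOS)


def toy_large(context: List[str]) -> str:
    return "1867"


tokens, n_delegated = cascaded_generate(
    toy_small, toy_large, ["Marie", "Curie", "was", "born"]
)
print(" ".join(tokens), "| delegated:", n_delegated)
```

Counting delegated tokens alongside a factuality score makes the trade-off visible: a model that delegates everything would score well but save no compute, so both numbers are worth tracking together.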

Topics

Small Language Models
Token Delegation
Pretraining Methods
spaCy Grammar Parser
Factual Accuracy

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.