Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition
Summary
Dango is a 1.8B-parameter decoder-only large language model developed for controlled studies of Japanese-to-English (L1→L2) transfer in second language acquisition (SLA). Prior work relied on smaller or encoder-only models; Dango instead targets a problem those setups leave open, namely L2 contamination inside a nominally "monolingual" Japanese pretraining corpus. The authors apply a filtering method that minimizes premature English exposure while preserving the small amount of L2 signal that occurs naturally in Japanese text. The model is then fine-tuned on LLM-generated L2-learning lessons to simulate the acquisition process. In the reported evaluations, Dango exhibits human-like L2 production patterns, surpasses unfiltered and standard multilingual baselines, and aligns with prompted GPT-5.5. The model, data, and code are released to support reproducible computational SLA research.
Key takeaway
Research scientists and NLP engineers who investigate language transfer or build specialized language models should prioritize strict control over pretraining data to prevent L2 contamination. Dango demonstrates that carefully filtered L1 corpora, combined with structured L2 fine-tuning, can yield models that exhibit human-like acquisition patterns. Consider adopting similar data filtering and LLM-generated lesson strategies to create robust, interpretable L2 simulators for computational SLA research or learner-facing applications.
Key insights
Dango is a 1.8B-parameter LLM for controlled L1→L2 transfer studies, using filtered L1 data and LLM-generated L2 lessons.
Principles
- L2 contamination in a nominally monolingual corpus undermines L1-only pretraining.
- Strict data filtering enables controlled SLA studies.
- LLM-generated lessons simulate L2 acquisition.
Method
Pretrain a 1.8B Llama-2-style decoder-only LLM on a strictly filtered L1 corpus, then fine-tune on LLM-generated, progressively difficult L2 lessons.
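The "progressively difficult" ordering of the fine-tuning lessons can be sketched as a simple curriculum sort. This is an illustration under stated assumptions: the summary does not say how difficulty is defined, so the CEFR level labels, the `Lesson` class, and `order_lessons` below are hypothetical.

```python
from dataclasses import dataclass

# Assumed difficulty scale; the paper's actual scale is not given in this summary.
CEFR_ORDER = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}


@dataclass
class Lesson:
    level: str  # difficulty label, assumed to be a CEFR level
    text: str   # lesson content used as a fine-tuning example


def order_lessons(lessons):
    """Sort lessons from easiest to hardest; ties keep their original order."""
    return sorted(lessons, key=lambda lesson: CEFR_ORDER[lesson.level])


lessons = [
    Lesson("B1", "Past perfect in short narratives."),
    Lesson("A1", "Greetings and the verb 'to be'."),
    Lesson("A2", "Simple past tense."),
    Lesson("A1", "Numbers and plural nouns."),
]
curriculum = order_lessons(lessons)
```

The ordered list could then be fed to a fine-tuning run without shuffling, so the model sees easier material first.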
In practice
- Use the released Dango model for SLA research.
- Apply data filtering to create L1-only corpora.
- Employ LLM-as-a-judge for L2 output evaluation.
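The LLM-as-a-judge step above can be sketched as a small harness that builds a rubric prompt, sends it to a judge model, and parses a numeric score. The rubric wording, the 1-5 scale, and all names here are assumptions for illustration; the judge is passed in as a plain callable so that no specific model API is implied.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical rubric; the paper's actual judging criteria are not given here.
JUDGE_TEMPLATE = (
    "You are evaluating English written by a Japanese-speaking learner.\n"
    "Task prompt: {task}\n"
    "Learner output: {output}\n"
    "Rate how closely this resembles typical learner English on a 1-5 scale.\n"
    "Answer in the form 'Score: <number>'."
)


def build_judge_prompt(task: str, output: str) -> str:
    """Fill the rubric template with one task and one model output."""
    return JUDGE_TEMPLATE.format(task=task, output=output)


def parse_score(reply: str) -> Optional[int]:
    """Extract a 1-5 score from the judge's reply, or None if absent or invalid."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge_outputs(
    samples: List[Tuple[str, str]], judge: Callable[[str], str]
) -> List[Optional[int]]:
    """Score each (task, output) pair with the supplied judge callable."""
    return [parse_score(judge(build_judge_prompt(t, o))) for t, o in samples]


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the harness."""
    return "Score: 4"


scores = judge_outputs([("Describe your weekend.", "I go to park yesterday.")], fake_judge)
```

Returning `None` for unparseable replies keeps malformed judge output from being silently counted as a score; in practice `fake_judge` would be replaced by a call to whichever judge model is used.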
Topics
- Second Language Acquisition
- Language Transfer
- Large Language Models
- Data Filtering
- L1 Pretraining
- Japanese-English
Code references
- haykgrigo3/TimeCapsuleLLM
- DGoettlich/history-llms
- llm-jp/scripts
- openlanguageprofiles/olp-en-cefrj
- axolotl-ai-cloud/axolotl
Best for: AI Scientist, Research Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.