Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition
Summary
Dango is a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). It addresses a key challenge in scaling such models: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To mitigate this, the researchers propose a filtering method that reduces premature exposure to English while preserving realistic, minimal exposure. Dango is then fine-tuned on LLM-generated L2-learning lessons to simulate the L2 acquisition process. In the reported evaluations, Dango develops human-like L2 production patterns and outperforms both unfiltered and standard multilingual baselines. The model, data, and code are released to facilitate reproducible computational SLA research and learner-facing applications.
Key takeaway
For research scientists studying second language acquisition with large language models, Dango offers a controlled setting: a 1.8B-parameter model pretrained on Japanese (L1) data with English (L2) exposure filtered down to a realistic minimum. Consider using it when L2 contamination in the pretraining corpus would otherwise confound your simulations of L1-to-L2 transfer. The released model, data, and code support reproducible studies of how human-like L2 production patterns emerge.
Key insights
Dango, a 1.8B-parameter LLM, enables controlled study of L1-to-L2 transfer in SLA by filtering L2 contamination out of its L1 pretraining data and then simulating acquisition through fine-tuning on L2-learning lessons.
Principles
- L2 contamination is a key challenge in scaling L1-only LLMs.
- Controlled L2 exposure is crucial for realistic SLA simulation.
- LLM-generated lessons can simulate L2 acquisition processes.
Method
A filtering method reduces L2 contamination in L1 pretraining corpora, followed by fine-tuning on LLM-generated L2-learning lessons to simulate second language acquisition.
In practice
- Use Dango for reproducible computational SLA research.
- Develop learner-facing L2 acquisition applications.
- Apply filtering to create L1-only pretraining datasets.
Topics
- Large Language Models
- Second Language Acquisition
- L1-to-L2 Transfer
- Data Contamination
- Computational Linguistics
- Japanese-to-English
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.