Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition
Summary
Dango is a new 1.8B-parameter large language model for controlled studies of Japanese-to-English (L1-to-L2) transfer in second language acquisition (SLA). Unlike prior models, which were smaller or not decoder-based, Dango is designed for open-ended text generation, making it a more practical L2 simulator. A central challenge is L2 contamination: the nominally "monolingual" corpus used for L1 pretraining still contains English. The researchers developed a filtering method that minimizes premature English exposure while preserving realistic, minimal contact. Dango is then fine-tuned on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Evaluations indicate that Dango develops human-like L2 production patterns and surpasses both unfiltered and standard multilingual baseline models. The model, data, and code are released to support reproducible computational SLA research and learner-facing applications.
Key takeaway
For AI scientists and research scientists modeling second language acquisition, Dango offers a controlled setting in which L2 exposure during L1 pretraining is deliberately limited. If you are developing L2 simulators, consider filtering the L1 pretraining data to prevent L2 contamination. According to the reported evaluations, this approach yields more human-like L2 production patterns than standard multilingual baselines. Dango's released model, data, and code are a starting point for reproducible computational SLA research and new learner-facing applications.
Key insights
Dango is a 1.8B-parameter L1-only LLM designed for controlled L1-to-L2 transfer studies in second language acquisition.
Principles
- L2 contamination in L1 pretraining data is a key challenge for large SLA models.
- Filtering pretraining data can control L2 exposure for L1-only models.
- LLM-generated lessons can simulate L2 acquisition effectively.
Method
A filtering method reduces premature L2 exposure in L1 pretraining data. The model is then fine-tuned on LLM-generated L2-learning lessons to simulate L2 acquisition.
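The summary does not describe how the filter works, so the following is only a minimal sketch of one plausible heuristic, not the authors' method. It assumes that the share of Latin letters in a document is a usable proxy for English content, and that a small nonzero threshold stands in for "realistic, minimal contact". The names `latin_letter_ratio` and `filter_l1_corpus` and the default threshold of 0.05 are hypothetical.

```python
import unicodedata


def latin_letter_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are ASCII letters.

    Assumption: Latin-script share approximates L2 (English) content.
    """
    # NFKC folds full-width Latin letters (e.g. "Ａ") into plain ASCII,
    # so they are counted as Latin rather than slipping through.
    normalized = unicodedata.normalize("NFKC", text)
    letters = [ch for ch in normalized if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if ch.isascii())
    return latin / len(letters)


def filter_l1_corpus(docs, max_latin_ratio=0.05):
    """Keep documents whose Latin-letter share is at or below the threshold.

    The threshold is a hypothetical value; a nonzero setting lets through
    documents with only incidental Latin-script tokens.
    """
    return [doc for doc in docs if latin_letter_ratio(doc) <= max_latin_ratio]


if __name__ == "__main__":
    corpus = [
        "今日は天気がいいですね。",
        "This is an English sentence.",
        "日本語" * 20 + "OK",
    ]
    for doc in filter_l1_corpus(corpus):
        print(doc[:20])
```

A character-ratio rule like this cannot tell English apart from other Latin-script text such as romaji or URLs, so a real pipeline would likely need a language-identification step on top of it.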
In practice
- Use data filtering to create L1-only pretraining corpora for SLA models.
- Employ LLM-generated lessons for targeted L2 acquisition fine-tuning.
- Leverage Dango's resources for computational SLA research.
Topics
- Second Language Acquisition
- Large Language Models
- L1-to-L2 Transfer
- Data Filtering
- Japanese-to-English
- Computational Linguistics
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.