Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

2025-12-05 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Social Sciences & Behavioral Studies, Research Methodology & Innovation · Depth: Expert, extended

Summary

Dango is a 1.8B-parameter decoder-only large language model built for controlled studies of Japanese-to-English (L1→L2) transfer in second language acquisition (SLA). It is meant to address the limitations of the smaller or encoder-only models used in prior work, and it targets a specific problem: L2 contamination, that is, English text present in a nominally "monolingual" Japanese pretraining corpus. The researchers apply a filtering method that minimizes premature English exposure while preserving natural, minimal L2 signals. The model is then fine-tuned on LLM-generated L2-learning lessons to simulate the acquisition process. In the reported evaluations, Dango exhibits human-like L2 production patterns, surpasses unfiltered and standard multilingual baselines, and aligns with prompted GPT-5.5. The model, data, and code are released to support reproducible computational SLA research.

Key takeaway

Research scientists and NLP engineers who investigate language transfer or build specialized language models should prioritize strict control over pretraining data to prevent L2 contamination. Dango demonstrates that a carefully filtered L1 corpus, combined with structured L2 fine-tuning, can yield a model that exhibits human-like acquisition patterns. Consider adopting similar data filtering and LLM-generated lesson strategies to build robust, interpretable L2 simulators for computational SLA research or learner-facing applications.

Key insights

Dango is a 1.8B-parameter LLM for controlled L1→L2 transfer studies, using filtered L1 data and LLM-generated L2 lessons.

Principles

L2 contamination in a nominally monolingual corpus undermines L1-only pretraining.
Strict data filtering enables controlled SLA studies.
LLM-generated lessons simulate L2 acquisition.

Method

Pretrain a 1.8B Llama-2-style decoder-only LLM on a strictly filtered L1 corpus, then fine-tune on LLM-generated, progressively difficult L2 lessons.
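
The "progressively difficult" ordering of lessons can be pictured with a minimal sketch. The summary does not describe how Dango's lessons are labeled or scheduled, so the `Lesson` type, its integer `level` field, and the `curriculum_stages` helper below are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lesson:
    level: int  # hypothetical difficulty label, 1 = easiest
    text: str   # lesson content (LLM-generated in the paper's setup)


def curriculum_stages(lessons):
    """Group lessons by level and return the groups in ascending difficulty."""
    stages = {}
    for lesson in lessons:
        stages.setdefault(lesson.level, []).append(lesson)
    # Sorting the keys yields easiest-first stages; order within a stage is kept.
    return [stages[level] for level in sorted(stages)]
```

Fine-tuning would then iterate over the returned stages in order, so the model sees easier lessons before harder ones. Whether the actual training mixes or revisits earlier stages is not stated in this summary.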

In practice

Utilize the released Dango model for SLA research.
Apply data filtering to create L1-only corpora.
Employ LLM-as-a-judge for L2 output evaluation.
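
As a rough illustration of the data-filtering step, the sketch below drops documents whose share of Latin-script letters exceeds a threshold. This is an assumed heuristic, not the filtering method used for Dango, which this summary does not specify; `latin_ratio`, `filter_l1_corpus`, and the default threshold of 0.05 are hypothetical.

```python
import unicodedata


def latin_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Latin-script letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode names of Latin letters (including full-width forms) contain "LATIN".
    latin = sum(1 for ch in letters if "LATIN" in unicodedata.name(ch, ""))
    return latin / len(letters)


def filter_l1_corpus(docs, max_latin_ratio=0.05):
    """Keep documents whose Latin-letter share is at or below the threshold."""
    return [doc for doc in docs if latin_ratio(doc) <= max_latin_ratio]
```

A nonzero threshold is one way to reflect the idea of preserving natural, minimal L2 signals: a Japanese sentence containing an occasional acronym can pass, while a mostly English document is removed. A character-level ratio will miss romanized Japanese and English written in katakana, so a real pipeline would likely need additional checks.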

Topics

Second Language Acquisition
Language Transfer
Large Language Models
Data Filtering
L1 Pretraining
Japanese-English

Best for: AI Scientist, Research Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.