Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Social Sciences & Behavioral Studies · Depth: Expert, quick

Summary

Dango is a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). It addresses a key challenge in scaling such models: L2 contamination in the nominally monolingual corpus used to pretrain the model on its L1. To mitigate this, the researchers propose a filtering method that reduces premature exposure to English while preserving realistic, minimal exposure. Dango is then fine-tuned on LLM-generated L2-learning lessons to simulate the L2 acquisition process. According to the reported evaluations, Dango develops human-like L2 production patterns and outperforms both unfiltered and standard multilingual baselines. The model, data, and code are released to facilitate reproducible computational SLA research and learner-facing applications.

Key takeaway

For research scientists who use large language models to study second language acquisition, Dango offers a controlled setting: a 1.8B-parameter model whose L1 pretraining data has been filtered to limit English exposure. Consider using it where L2 contamination would otherwise confound your simulations; the source reports more human-like L2 production patterns than unfiltered and standard multilingual baselines. The released model, data, and code support reproducible studies of L1-to-L2 transfer dynamics.

Key insights

Dango, a 1.8B-parameter LLM, enables controlled study of L1-to-L2 transfer in SLA by mitigating L2 contamination and simulating acquisition.

Method

A filtering method reduces L2 contamination in L1 pretraining corpora, followed by fine-tuning on LLM-generated L2-learning lessons to simulate second language acquisition.

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.