Why My Coding Assistant Started Replying in Korean When I Typed Chinese
Summary
A coding assistant, when prompted in Chinese mixed with English engineering terms, consistently responded in Korean, even though the user does not speak Korean. This led to the hypothesis that embedding spaces are structured by task register (e.g., engineering/code) rather than by natural language, with engineering English dominating technical corpora. Controlled language-drift experiments tested this by gradually introducing English engineering terms into Chinese sentences. Cosine similarity to English rose non-linearly, suggesting phase-transition-like behavior as embeddings moved between attractor basins. A real-world prompt and its Korean response were then analyzed: translating the response back to Chinese restored the language form, but its embedding remained closer to the English clusters. This reinforces the idea that embedding spaces are organized by the nature of the task rather than by natural language, with engineering English as a dominant attractor.
Key takeaway
AI engineers developing or deploying multilingual coding assistants should be aware that embedding spaces may prioritize technical registers over natural-language boundaries. A model may shift languages unexpectedly, for example replying in Korean to a mixed Chinese-English prompt, because the underlying embedding space is drawn to a dominant "engineering English" attractor. To account for this, consider fine-tuning models on diverse, task-specific multilingual datasets so that output language stays consistent and expected.
Key insights
Embedding spaces are structured by task registers, not language, with engineering English dominating technical contexts.
Principles
- Embedding spaces exhibit attractor basins.
- Task registers can override language boundaries.
Method
Controlled language drift experiments were performed by gradually replacing Chinese words with English engineering terms in sentences, then measuring cosine similarity to English and Korean embedding clusters.
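The measurement step described above can be sketched roughly as follows. This is a minimal illustration, not the original article's code: it assumes each drift step (a sentence with progressively more English terms) has already been embedded by some model, and that each language cluster is summarized by the mean of its reference embeddings. The function names (`drift_profile`, `largest_jump`) are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def centroid(vectors):
    """Mean vector of a cluster (an assumed way to summarize it)."""
    return np.mean(np.asarray(vectors, dtype=float), axis=0)

def drift_profile(step_embeddings, clusters):
    """For each drift step, similarity to every cluster centroid.

    step_embeddings: one embedding per sentence variant, ordered from
                     fewest to most English engineering terms.
    clusters:        dict mapping a cluster name to its reference embeddings.
    """
    centroids = {name: centroid(vecs) for name, vecs in clusters.items()}
    return [
        {name: cosine_similarity(vec, c) for name, c in centroids.items()}
        for vec in step_embeddings
    ]

def largest_jump(profile, cluster_name):
    """Step index and size of the biggest increase in similarity.

    A single large jump among otherwise small changes is the kind of
    non-linear, phase-transition-like pattern the summary describes.
    """
    sims = [row[cluster_name] for row in profile]
    deltas = np.diff(sims)
    idx = int(np.argmax(deltas))
    return idx + 1, float(deltas[idx])
```

In use, `step_embeddings` would come from whatever embedding model is under study; comparing the per-step similarity to the English and Korean centroids shows whether the change is gradual or concentrated in one step.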
In practice
- Monitor model output for unexpected language shifts.
- Consider task-specific embedding space biases.
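One lightweight way to monitor for unexpected language shifts is to compare the writing systems in a response against those in the prompt. The sketch below is an assumption-laden illustration, not something described in the original article: it classifies characters by Unicode code point range, and the helper names and the `min_share` threshold are hypothetical. It detects scripts, not languages, so it cannot tell Chinese from Japanese kanji or English from other Latin-script languages.

```python
from collections import Counter

def script_of(ch):
    """Coarse script label for one character, or None if not classified."""
    cp = ord(ch)
    # Hangul syllables, Hangul Jamo, Hangul Compatibility Jamo
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
        return "hangul"
    # CJK Unified Ideographs and Extension A
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "han"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None

def script_counts(text):
    """Count classified characters per script."""
    return Counter(s for s in map(script_of, text) if s)

def unexpected_scripts(prompt, response, min_share=0.2):
    """Scripts that make up at least min_share of the response's
    classified characters but never appear in the prompt."""
    prompt_scripts = set(script_counts(prompt))
    counts = script_counts(response)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(
        s for s, n in counts.items()
        if s not in prompt_scripts and n / total >= min_share
    )
```

A mixed Chinese-English prompt answered largely in Hangul would be flagged, while a Chinese answer that keeps the English engineering terms would not.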
Topics
- Coding Assistant
- Embedding Space
- Language Drift
- Multilingual NLP
- Engineering Register
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.