Why My Coding Assistant Started Replying in Korean When I Typed Chinese

Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, short

Summary

A coding assistant, prompted in Chinese mixed with English engineering terms, consistently replied in Korean, even though the user does not speak Korean. This led to the hypothesis that embedding spaces are structured by task register (e.g., engineering/code) rather than by natural language, with engineering English dominating technical corpora. To test it, the author ran controlled language-drift experiments, gradually introducing English engineering terms into Chinese sentences. Cosine similarity to English rose non-linearly, suggesting phase-transition-like behavior as embeddings moved between attractor basins. A further analysis took a real-world prompt and its Korean response and translated the response back to Chinese: the language form was restored, but the embedding stayed closer to the English clusters. This reinforces the idea that embedding spaces are organized by the nature of the task rather than by language, with engineering English as a dominant attractor.

Key takeaway

For AI engineers developing or deploying multilingual coding assistants: embedding spaces may be organized by technical register more than by natural-language boundaries. A model can shift languages unexpectedly, for example replying in Korean to a mixed Chinese-English prompt, because the prompt's embedding is pulled toward a dominant "engineering English" attractor. To keep the output language consistent and predictable, explicitly fine-tune on diverse, task-specific multilingual datasets.

Key insights

Embedding spaces appear to be structured by task register rather than by language, with engineering English dominating technical contexts.

Principles

Method

Controlled language drift experiments were performed by gradually replacing Chinese words with English engineering terms in sentences, then measuring cosine similarity to English and Korean embedding clusters.
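
The drift procedure described above can be sketched roughly as follows. This is a minimal illustration, not the original author's code: the sentences, the replacement table, and every function name are hypothetical, and `toy_embed` is a character-bigram stand-in used only so the sketch runs without downloading a model. It will not reproduce the reported non-linear jump; for a real measurement, pass a multilingual embedding model's encoder as `embed`.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Dict[str, float]

def toy_embed(text: str) -> Vector:
    """Stand-in embedder (assumption): character-bigram counts, not a real model."""
    s = text.lower()
    return dict(Counter(s[i:i + 2] for i in range(len(s) - 1)))

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors: Sequence[Vector]) -> Vector:
    """Mean vector of a reference cluster."""
    total: Vector = {}
    for vec in vectors:
        for k, x in vec.items():
            total[k] = total.get(k, 0.0) + x
    return {k: x / len(vectors) for k, x in total.items()}

def drift_series(tokens: Sequence[str], replacements: Dict[str, str]) -> List[str]:
    """Sentences with 0, 1, 2, ... Chinese tokens swapped for English terms."""
    current = list(tokens)
    series = [" ".join(current)]
    for i, tok in enumerate(tokens):
        if tok in replacements:
            current[i] = replacements[tok]
            series.append(" ".join(current))
    return series

def run_drift(tokens, replacements, en_refs, ko_refs,
              embed: Callable[[str], Vector] = toy_embed
              ) -> List[Tuple[str, float, float]]:
    """Similarity of each drift step to the English and Korean cluster centroids."""
    en_c = centroid([embed(s) for s in en_refs])
    ko_c = centroid([embed(s) for s in ko_refs])
    rows = []
    for sentence in drift_series(tokens, replacements):
        vec = embed(sentence)
        rows.append((sentence, cosine(vec, en_c), cosine(vec, ko_c)))
    return rows

def largest_jump(values: Sequence[float]) -> int:
    """1-based drift step with the biggest increase over the previous step."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return max(range(len(deltas)), key=deltas.__getitem__) + 1

# Hypothetical example data, chosen only for illustration.
TOKENS = ["请", "优化", "这个", "函数", "的", "缓存", "逻辑"]
REPLACEMENTS = {"优化": "optimize", "函数": "function", "缓存": "cache"}
EN_REFS = ["optimize the cache function", "refactor the function logic"]
KO_REFS = ["이 함수의 캐시 로직을 최적화해 주세요", "함수를 리팩터링해 주세요"]

results = run_drift(TOKENS, REPLACEMENTS, EN_REFS, KO_REFS)
for sentence, sim_en, sim_ko in results:
    print(f"{sim_en:.3f}  {sim_ko:.3f}  {sentence}")
print("largest English-similarity jump at step", largest_jump([r[1] for r in results]))
```

Plotting the English-similarity column against the replacement step is one way to look for the phase-transition-like behavior: a roughly linear curve would argue against it, while a sharp step concentrated at one or two replacements would be consistent with it.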

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.