How Roblox Uses AI to Translate 16 Languages in 100 Milliseconds
Summary
Roblox developed a single, unified transformer-based translation model to handle real-time chat translation across 16 languages, covering all 256 language pairs (16 × 16) for its 70 million daily users. The system processes over 5,000 chats per second with a latency of approximately 100 milliseconds. The core architecture uses a Mixture of Experts (MoE) approach: a routing mechanism activates specialized subnetworks for specific language pairs, so the model holds a broad range of expertise while each request passes through only a fraction of its parameters. To reach the required speed, Roblox used knowledge distillation to shrink the model from 1 billion to under 650 million parameters, and applied quantization and model compilation to speed up inference further. Additional latency optimizations include a translation cache, dynamic batching, and an embedding cache between the encoder and decoder, which cuts redundant computation when one message is translated into multiple target languages.
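The routing idea behind Mixture of Experts can be illustrated with a toy sketch. This is not Roblox's implementation: the names `EXPERTS`, `GATE_WEIGHTS`, and `route` are hypothetical, the gate weights are hard-coded rather than learned, and a real MoE layer sits inside the transformer and routes hidden states rather than raw numbers. The point is only that a cheap gate scores all experts and just the top-scoring few are executed.

```python
import math

# Toy experts: each function stands in for a subnetwork.
# In a real model these are feed-forward layers with learned weights.
EXPERTS = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
    lambda x: [v * v for v in x],
]

# Hypothetical gate weights, one row per expert (learned in a real model).
GATE_WEIGHTS = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.5, 0.5],
]


def route(x, top_k=2):
    """Return (output, chosen_expert_indices) using top-k gating."""
    # Score every expert with a cheap dot product.
    scores = [sum(w * v for w, v in zip(row, x)) for row in GATE_WEIGHTS]
    # Keep only the top_k highest-scoring experts.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the chosen scores only.
    top = max(scores[i] for i in chosen)
    exps = {i: math.exp(scores[i] - top) for i in chosen}
    total = sum(exps.values())
    # Run only the chosen experts and mix their outputs by gate weight.
    output = [0.0] * len(x)
    for i in chosen:
        weight = exps[i] / total
        for j, v in enumerate(EXPERTS[i](x)):
            output[j] += weight * v
    return output, chosen
```

Calling `route([3.0, 1.0])` runs two of the four experts; the other two are never evaluated, which is where the compute saving comes from.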
Key takeaway
For NLP engineers building real-time, high-scale translation systems, Roblox's approach is worth considering: a single Mixture of Experts model combined with aggressive optimization techniques. Building a custom solution is complex, but it can yield better domain-specific accuracy and meet stringent latency requirements that commercial APIs might not meet. Evaluate the trade-offs between custom development and off-the-shelf solutions based on your specific scale, latency, and domain-accuracy needs.
Key insights
A single Mixture of Experts model can efficiently handle real-time translation across many language pairs at massive scale.
Principles
- A single unified model scales better than one model per language pair (N × N models).
- Distillation reduces model size for faster inference.
- Caching and batching are critical for low-latency serving.
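The caching and batching principle can be sketched in a few lines. The source does not describe Roblox's serving code, so everything here is an assumption for illustration: the `TranslationCache` class, the `(source_lang, target_lang, text)` cache key, the LRU eviction policy, and the `form_batches` helper with its size and wait thresholds are all hypothetical.

```python
from collections import OrderedDict


class TranslationCache:
    """Small LRU cache keyed by (source_lang, target_lang, text)."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def form_batches(requests, max_batch_size=4, max_wait_ms=10):
    """Group (arrival_ms, payload) requests, sorted by arrival, into batches.

    A batch closes when it is full, or when the next request arrives more
    than max_wait_ms after the batch's first request.
    """
    batches, current, opened_at = [], [], None
    for arrival_ms, payload in requests:
        if current and (len(current) >= max_batch_size
                        or arrival_ms - opened_at > max_wait_ms):
            batches.append(current)
            current = []
        if not current:
            opened_at = arrival_ms
        current.append(payload)
    if current:
        batches.append(current)
    return batches
```

In a serving loop, a request would first be looked up in the cache, and only cache misses would be queued for batching. The two thresholds trade throughput against latency: larger batches use the accelerator more efficiently, while a short wait bound keeps any single chat message from being held back.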
Method
Roblox built a unified MoE transformer model, then applied knowledge distillation, quantization, and compilation. They integrated caching and dynamic batching into the serving pipeline for real-time performance and developed a custom reference-free quality estimation model, meaning one that scores a translation without needing a ground-truth reference translation to compare against.
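To make the distillation step concrete, here is a minimal sketch of a commonly used distillation loss: the student is trained to match the teacher's temperature-softened output distribution. The source does not say which loss Roblox used, so this formulation, the function names, and the temperature value are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature ** 2 factor is the usual scaling that keeps gradient
    magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

The loss is zero when the student's logits reproduce the teacher's distribution and grows as the two diverge. In practice this term is typically combined with the ordinary training loss on the ground-truth targets.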
In practice
- Consider MoE for multi-task or multi-language models.
- Implement distillation for production model compression.
- Utilize embedding caches for multi-target inference.
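The embedding-cache suggestion above can be sketched as "encode once, decode many times". The `ToyTranslator` class below is a hypothetical stand-in, not a real model or Roblox's code: its encoder and decoder are placeholders, and the cache is an unbounded dictionary keyed by text alone, whereas a production cache would need eviction and likely a richer key.

```python
class ToyTranslator:
    """Stand-in for an encoder-decoder model; counts encoder calls."""

    def __init__(self):
        self.encoder_calls = 0
        self._embedding_cache = {}

    def _encode(self, text):
        self.encoder_calls += 1
        # Placeholder "embedding": a real encoder returns hidden states.
        return tuple(ord(ch) for ch in text)

    def _decode(self, embedding, target_lang):
        # Placeholder decoder: echoes the text tagged with the target language.
        return f"[{target_lang}] " + "".join(chr(c) for c in embedding)

    def translate_many(self, text, target_langs):
        """Encode the source text once, then decode per target language."""
        if text not in self._embedding_cache:
            self._embedding_cache[text] = self._encode(text)
        embedding = self._embedding_cache[text]
        return {lang: self._decode(embedding, lang) for lang in target_langs}
```

Translating one message into three target languages with `translate_many("hi", ["es", "fr", "de"])` calls the encoder once instead of three times, and a later request for the same text in another language reuses the cached embedding.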
Topics
- AI Translation
- Mixture-of-Experts
- Knowledge Distillation
- Real-time Translation
- Multilingual Models
Best for: Machine Learning Engineer, NLP Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.