How Roblox Uses AI to Translate 16 Languages in 100 Milliseconds

2025-12-15 · Source: ByteByteGo Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

Roblox developed a single, unified transformer-based translation model to handle real-time chat translation across 16 languages, covering all 16 × 16 = 256 language pairs for its 70 million daily users. The system processes over 5,000 chats per second with a latency of approximately 100 milliseconds. The core architecture uses a Mixture of Experts (MoE) approach, in which a routing mechanism activates specialized subnetworks for specific language pairs, so the model holds a broad range of expertise while no single request has to pass through every parameter. To achieve the required speed, Roblox employed knowledge distillation, quantization, and model compilation to reduce the model from 1 billion to under 650 million parameters. Further latency optimizations include a translation cache, dynamic batching, and an embedding cache between the encoder and decoder, which significantly reduces redundant computation when one message is translated into multiple target languages.

Key takeaway

For NLP engineers building real-time, high-scale translation systems, consider Roblox's approach of a single Mixture of Experts model combined with aggressive optimization techniques. While building a custom solution is complex, it can yield superior domain-specific accuracy and meet stringent latency requirements that commercial APIs might not meet. Evaluate the trade-offs between custom development and off-the-shelf solutions based on your specific scale, latency, and domain-accuracy needs.

Key insights

A single Mixture of Experts model can efficiently handle real-time translation across many language pairs at massive scale.
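
To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. It is not Roblox's implementation: the dot-product router, the `gate_weights` values, and the toy scaling experts are all assumptions chosen for illustration. The point it shows is that only the selected experts are ever evaluated.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_weights, experts, top_k=1):
    """Route input vector x to the top_k experts and mix their outputs.

    gate_weights: one weight vector per expert (hypothetical learned router).
    experts: list of callables, each mapping a vector to a same-length vector.
    Returns (output_vector, indices_of_experts_that_ran).
    """
    # Router: one score per expert, here a plain dot product with the input.
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_weights]
    probs = softmax(scores)

    # Keep only the top_k experts; the others are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)

    output = [0.0] * len(x)
    for i in chosen:
        expert_out = experts[i](x)
        weight = probs[i] / norm  # renormalize over the chosen experts
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen

# Toy experts (assumption): each scales its input by a different factor.
experts = [lambda v, k=k: [k * a for a in v] for k in (1.0, 2.0, 3.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

out, used = moe_forward([0.2, 0.9], gate_weights, experts, top_k=1)
```

With this input the router scores the second expert highest, so `used` contains only index 1 and the other two experts do no work.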

Principles

A single unified model scales better than maintaining a separate model for each of the N × N language pairs.
Distillation reduces model size for faster inference.
Caching and batching are critical for low-latency serving.
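
The caching and batching principle can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: `TranslationCache`, `translate_pending`, and `fake_model` are hypothetical names, the cache is a basic LRU, and "batching" here means grouping whatever requests are already pending rather than waiting on a timed window as a production server might.

```python
from collections import OrderedDict

class TranslationCache:
    """Small LRU cache keyed by (source_lang, target_lang, text)."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def translate_pending(requests, cache, model_fn, max_batch_size=8):
    """Serve pending requests from the cache and batch the misses.

    requests: list of (source_lang, target_lang, text) tuples.
    model_fn: hypothetical stand-in for the model; takes a list of requests
              and returns one translated string per request.
    """
    results = {}
    misses = []
    for req in requests:
        cached = cache.get(req)
        if cached is not None:
            results[req] = cached
        elif req not in misses:
            misses.append(req)  # deduplicate identical pending requests

    # Send misses to the model in groups instead of one call per request.
    for start in range(0, len(misses), max_batch_size):
        batch = misses[start:start + max_batch_size]
        for req, translation in zip(batch, model_fn(batch)):
            cache.put(req, translation)
            results[req] = translation

    return [results[req] for req in requests]


# Fake model (assumption) that records the size of each batch it receives.
calls = []
def fake_model(batch):
    calls.append(len(batch))
    return [f"[{tgt}] {text}" for _, tgt, text in batch]

cache = TranslationCache()
pending = [("en", "es", "hello"), ("en", "fr", "hello"), ("en", "es", "hello")]
first = translate_pending(pending, cache, fake_model)
second = translate_pending(pending, cache, fake_model)
```

The first call sends the two distinct requests to the model as one batch; the second call is answered entirely from the cache, so `calls` records a single model invocation.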

Method

Roblox built a unified MoE transformer model, then applied knowledge distillation, quantization, and compilation. They integrated caching and dynamic batching into the serving pipeline for real-time performance and developed a custom reference-free quality estimation model.
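
The source does not say which distillation objective Roblox used. As one common formulation, the sketch below computes a temperature-softened KL divergence between teacher and student output distributions; the logits and the temperature value are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The result is scaled by temperature**2, a common convention that keeps
    gradient magnitudes comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]
matching = distillation_loss([4.0, 1.0, 0.5], teacher)   # student agrees
diverging = distillation_loss([0.5, 1.0, 4.0], teacher)  # student disagrees
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal a smaller student model would be trained to minimize.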

In practice

Consider MoE for multi-task or multi-language models.
Implement distillation for production model compression.
Utilize embedding caches for multi-target inference.
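
The embedding-cache idea for multi-target inference can be sketched like this: encode a source message once, then reuse that encoding for every target language. `MultiTargetTranslator`, `toy_encode`, and `toy_decode` are hypothetical stand-ins, and keying the cache on the raw text alone is an assumption made for brevity.

```python
class MultiTargetTranslator:
    """Encode each source message once, then decode once per target language."""

    def __init__(self, encode_fn, decode_fn):
        self.encode_fn = encode_fn  # hypothetical: text -> embedding
        self.decode_fn = decode_fn  # hypothetical: (embedding, lang) -> text
        self._embeddings = {}       # embedding cache keyed by source text

    def _embedding_for(self, text):
        # Run the encoder only on a cache miss.
        if text not in self._embeddings:
            self._embeddings[text] = self.encode_fn(text)
        return self._embeddings[text]

    def translate(self, text, target_langs):
        embedding = self._embedding_for(text)
        return {lang: self.decode_fn(embedding, lang) for lang in target_langs}


encode_log = []  # records every text the toy encoder actually processes

def toy_encode(text):
    encode_log.append(text)
    return tuple(ord(ch) for ch in text)  # stand-in for a real embedding

def toy_decode(embedding, lang):
    return f"{lang}:" + "".join(chr(c) for c in embedding)

translator = MultiTargetTranslator(toy_encode, toy_decode)
out = translator.translate("gg", ["es", "fr", "de"])
again = translator.translate("gg", ["ja"])
```

Four translations are produced across the two calls, but the encoder runs only once. A production cache would also need a size bound and an eviction policy, which this sketch omits.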

Topics

AI Translation
Mixture-of-Experts
Knowledge Distillation
Real-time Translation
Multilingual Models

Best for: Machine Learning Engineer, NLP Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.