Transformers.js v4: State-of-the-art machine learning for the web
Summary
Transformers.js version 4 introduces a completely rewritten C++ WebGPU backend that significantly improves performance and lets larger models run in JavaScript across browsers, Node.js, Bun, and Deno. The backend, developed in collaboration with the ONNX Runtime team, offers broader operator coverage and better accuracy, and allows WebGPU operations to be used from multiple languages. The update also expands model architecture support to over 200, including v4 exclusives such as TranslateGemma for multilingual translation, LFM2-VL for video captioning, and Voxtral Realtime for local speech recognition. Models with more than 8B parameters can now run in the browser: the 20B-parameter GPT-OSS reaches 40 tokens per second using custom Mixture of Experts (MoE) and QMoE operations. New features include a ModelRegistry for granular control over model files, and environment settings for WASM caching and custom fetch functions. Tooling improvements include a modular codebase and faster builds.
Key takeaway
For AI Architects and NLP Engineers building web-based or server-side JavaScript applications, Transformers.js v4's new C++ WebGPU backend and expanded model support mean you can now deploy much larger, more performant models like GPT-OSS 20B directly in the browser or Node.js environments. Evaluate migrating to v4 to leverage improved inference speeds, broader model compatibility, and enhanced control over model loading and caching, potentially simplifying your deployment pipeline for advanced AI features.
Key insights
Transformers.js v4 significantly boosts performance and model scale via a new C++ WebGPU backend and expanded architecture support.
Principles
- WebGPU enables cross-platform AI acceleration.
- Fused kernels maximize LLM performance.
- MoE architectures facilitate large model inference.
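To make the cross-platform principle concrete, here is a minimal sketch of how an application might decide between GPU and CPU execution at startup. It uses the standard WebGPU entry point (`navigator.gpu.requestAdapter()`). The helper name `pickDevice` is hypothetical, and the returned strings `"webgpu"` and `"wasm"` are an assumption about how you would label the two paths in your own code, not something this summary specifies.

```javascript
// Hypothetical helper: returns "webgpu" when a WebGPU adapter is available,
// otherwise "wasm" as a CPU fallback label (both labels are assumptions).
async function pickDevice(nav = globalThis.navigator) {
  // Runtimes without WebGPU do not expose `gpu` on the navigator object.
  if (!nav || !nav.gpu) return "wasm";
  try {
    // requestAdapter() resolves to null when no suitable GPU is found.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Treat any adapter failure as "no GPU" rather than crashing startup.
    return "wasm";
  }
}
```

Passing the navigator object as a parameter keeps the check easy to exercise in environments such as Node.js, where a browser-style `navigator.gpu` may not exist.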
Method
The new WebGPU runtime was rewritten in C++ and tested with the ONNX Runtime team. Large language model architectures were reimplemented operation by operation, using fused kernels and custom MoE/QMoE ops for efficiency.
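As an illustration of why MoE ops help with large models, the sketch below shows generic top-k expert routing: a router scores every expert, but only the k highest-scoring experts actually run for a given token. This is a plain JavaScript teaching example, not the fused C++ kernel described above. The function names, and the assumption that each expert returns a vector of the same length as its input, are hypothetical.

```javascript
// Numerically stable softmax over an array of numbers.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Select the k experts with the highest router logits and
// renormalize their weights so they sum to 1.
function routeTopK(routerLogits, k) {
  const ranked = routerLogits
    .map((logit, index) => ({ logit, index }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, k);
  const weights = softmax(ranked.map((r) => r.logit));
  return ranked.map((r, i) => ({ expert: r.index, weight: weights[i] }));
}

// Run only the selected experts and sum their weighted outputs.
// `experts` is an array of functions mapping a vector to a same-length vector.
function moeForward(x, routerLogits, experts, k) {
  const out = new Array(x.length).fill(0);
  for (const { expert, weight } of routeTopK(routerLogits, k)) {
    const y = experts[expert](x);
    for (let i = 0; i < out.length; i++) out[i] += weight * y[i];
  }
  return out;
}
```

With k much smaller than the number of experts, most expert weights are never touched for a given token, which is the property that keeps per-token compute far below what the total parameter count suggests.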
In practice
- Run 20B-parameter models like GPT-OSS in the browser.
- Use ModelRegistry for model file visibility and caching.
- Integrate custom fetch functions for authenticated access.
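For the authenticated-access item, a custom fetch function can be as small as a wrapper that attaches a token to every request. The sketch below is one possible shape, assuming bearer-token authentication and URL-string inputs. `createAuthenticatedFetch` is a hypothetical helper, and this summary does not name the setting used to register it with Transformers.js, so the v4 documentation is the place to check for that.

```javascript
// Hypothetical helper: wraps a fetch implementation so that every
// request carries an Authorization header with the given bearer token.
function createAuthenticatedFetch(token, baseFetch = globalThis.fetch) {
  return function authenticatedFetch(input, init = {}) {
    // Copy caller-supplied headers so the original init object is not mutated.
    const headers = new Headers(init.headers);
    headers.set("Authorization", `Bearer ${token}`);
    // Delegate to the underlying fetch with the merged headers.
    return baseFetch(input, { ...init, headers });
  };
}
```

Taking `baseFetch` as a parameter lets the same wrapper sit on top of the platform fetch in browsers, Node.js, Bun, or Deno, and makes it straightforward to substitute a stub when testing.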
Topics
- Transformers.js v4
- WebGPU Backend
- On-device AI Inference
- Large Language Models
- Model Architectures
Best for: AI Architect, NLP Engineer, AI Product Manager, AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by HuggingFace.