Transformers.js v4: State-of-the-art machine learning for the web

2026-03-30 · Source: HuggingFace · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, short

Summary

Transformers.js version 4 introduces a WebGPU backend completely rewritten in C++, significantly improving performance and enabling larger models such as GPT-OSS 20B to run in JavaScript across browsers, Node.js, Bun, and Deno. The new backend, developed in collaboration with the ONNX Runtime team, offers improved operator coverage and better accuracy, and makes WebGPU acceleration available from multiple programming languages. The update also expands model architecture support to over 200, including v4 exclusives such as TranslateGemma for multilingual translation, LFM2VL for video captioning, and Voxtral Realtime for local speech recognition. Models with 8B+ parameters, such as the 20B-parameter GPT-OSS, can run in the browser at 40 tokens per second using custom Mixture of Experts (MoE) and QMoE operations. New features include a ModelRegistry for granular control over model files and environment settings for WASM caching and custom fetch functions, alongside tooling improvements such as a modular codebase and faster builds.

Key takeaway

For AI Architects and NLP Engineers building web-based or server-side JavaScript applications, Transformers.js v4's new C++ WebGPU backend and expanded model support mean you can now deploy much larger, more performant models like GPT-OSS 20B directly in the browser or Node.js environments. Evaluate migrating to v4 to leverage improved inference speeds, broader model compatibility, and enhanced control over model loading and caching, potentially simplifying your deployment pipeline for advanced AI features.

Key insights

Transformers.js v4 significantly boosts performance and model scale via a new C++ WebGPU backend and expanded architecture support.

Principles

WebGPU enables cross-platform AI acceleration.
Fused kernels maximize LLM performance.
MoE architectures facilitate large model inference.
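
The first principle can be made concrete with a small feature check. The following is a minimal sketch, not part of Transformers.js: `pickDevice` and `GpuLike` are hypothetical names introduced here, and the sketch assumes the runtime exposes the standard `navigator.gpu.requestAdapter()` entry point when WebGPU is available. It falls back to WASM otherwise.

```typescript
// Minimal shape of the object we probe; WebGPU-capable runtimes expose it as navigator.gpu.
interface GpuLike {
  requestAdapter(): Promise<unknown | null>;
}

type Device = "webgpu" | "wasm";

// Returns "webgpu" only when an adapter can actually be obtained, otherwise "wasm".
async function pickDevice(gpu: GpuLike | undefined): Promise<Device> {
  if (!gpu) return "wasm";
  try {
    const adapter = await gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Treat any failure while probing as "no WebGPU".
    return "wasm";
  }
}

// Usage: read navigator.gpu without assuming that navigator or gpu exists.
const runtimeGpu = (globalThis as unknown as { navigator?: { gpu?: GpuLike } })
  .navigator?.gpu;
pickDevice(runtimeGpu).then((device) => console.log(`selected device: ${device}`));
```

The returned string could then be passed as the device option when creating a pipeline; how v4 exposes that option should be checked against the library documentation.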

Method

The new WebGPU runtime was rewritten in C++ and tested with the ONNX Runtime team. Large language model architectures were reimplemented operation by operation, using fused kernels and custom MoE/QMoE ops for efficiency.
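
To illustrate why MoE helps with large models, here is a generic sketch of top-k expert routing for a single token. It is a textbook-style illustration, not the custom MoE/QMoE kernels described above; `moeForward`, `Expert`, and the toy experts are hypothetical names. The point it shows is that only `k` of the experts are evaluated per token, so the compute per token is a fraction of the total parameter count.

```typescript
// Numerically stable softmax over a small array of scores.
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// An expert maps a hidden vector to a vector of the same length.
type Expert = (x: number[]) => number[];

// Route one token: pick the k highest-scoring experts, renormalize their
// gate weights, and return the weighted sum of only those experts' outputs.
function moeForward(
  x: number[],
  routerLogits: number[],
  experts: Expert[],
  k: number,
): number[] {
  const topK = routerLogits
    .map((logit, index) => ({ logit, index }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, k);
  const gates = softmax(topK.map((e) => e.logit));
  const out = new Array<number>(x.length).fill(0);
  topK.forEach(({ index }, i) => {
    const y = experts[index](x); // only k experts are ever evaluated
    for (let d = 0; d < out.length; d++) out[d] += gates[i] * y[d];
  });
  return out;
}

// Toy usage: four "experts" that scale the input by 1..4; only two run per token.
const toyExperts: Expert[] = [1, 2, 3, 4].map((s) => (x) => x.map((v) => v * s));
const toyOutput = moeForward([1, 1], [0.0, 5.0, 5.0, -1.0], toyExperts, 2);
console.log(toyOutput); // [2.5, 2.5]: experts 2x and 3x, each with gate weight 0.5
```

A fused kernel would perform the routing, expert evaluation, and weighted sum in fewer GPU passes; this sketch keeps them as separate steps for readability.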

In practice

Run 20B-parameter models like GPT-OSS in the browser.
Use ModelRegistry for model file visibility and caching.
Integrate custom fetch functions for authenticated access.
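
The last item above can be sketched as a small wrapper that adds a bearer token to every request. This is a minimal sketch under stated assumptions: `withBearerToken` and `FetchLike` are hypothetical names, and the wiring into the library (shown only in a comment) assumes the custom fetch is set through an `env.fetch` setting, which should be confirmed against the v4 documentation.

```typescript
// A fetch-compatible function restricted to string or URL inputs for simplicity.
type FetchLike = (input: string | URL, init?: RequestInit) => Promise<Response>;

// Wrap a fetch implementation so every request carries a bearer token.
// Headers passed by the caller in `init` are preserved.
function withBearerToken(baseFetch: FetchLike, token: string): FetchLike {
  return (input, init = {}) => {
    const headers = new Headers(init.headers);
    headers.set("Authorization", `Bearer ${token}`);
    return baseFetch(input, { ...init, headers });
  };
}

// Hypothetical wiring (assumption: the library reads a custom fetch from `env.fetch`):
//   import { env } from "@huggingface/transformers";
//   env.fetch = withBearerToken(fetch, myAccessToken);
```

Keeping the token inside a wrapper means model-loading code does not need to know about authentication, and the same wrapper can be reused for private or gated model files.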

Topics

Transformers.js v4
WebGPU Backend
On-device AI Inference
Large Language Models
Model Architectures

Best for: AI Architect, NLP Engineer, AI Product Manager, AI Engineer, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by HuggingFace.