Transformers.js v4 Preview: Now Available on NPM!
Summary
Hugging Face has released a preview of Transformers.js v4 on NPM, nearly a year after development began in March 2025. The release introduces a new WebGPU Runtime, rewritten in C++ in collaboration with the ONNX Runtime team, so that the same WebGPU-accelerated models can run in browsers, in server-side runtimes such as Node, Bun, and Deno, and in desktop applications. Performance work includes specialized ONNX Runtime Contrib Operators, which delivered a ~4x speedup for BERT-based embedding models. Caching the WASM files locally also makes full offline use possible. The repository is now a monorepo built on pnpm workspaces, with a modular class structure for models and a dedicated examples repository. The build system moved from Webpack to esbuild, which cut build times roughly 10x, to about 200 milliseconds, and shrank bundles by 10% on average; the default export is 53% smaller. Transformers.js v4 also adds support for new models such as GPT-OSS, Chatterbox, and FalconH1, and moves tokenization logic into a standalone, lightweight `@huggingface/tokenizers` library.
Key takeaway
For NLP engineers building JavaScript-based AI applications, the Transformers.js v4 preview offers faster inference and more deployment targets. Try the new WebGPU runtime for accelerated inference in browser, server, or desktop environments, and consider the standalone `@huggingface/tokenizers` library when you only need tokenization. Together with offline support, these changes make it practical to run the same models efficiently in more places, including without a network connection.
Key insights
Transformers.js v4 enhances performance and expands runtime compatibility through a new WebGPU runtime and optimized ONNX exports.
Principles
- Modular design improves maintainability and extensibility.
- Specialized operators accelerate model inference.
- Offline capabilities enhance user experience.
Method
The new WebGPU runtime, rewritten in C++ with the ONNX Runtime team, lets the same WebGPU-accelerated models run in browsers, server-side runtimes (Node, Bun, Deno), and desktop applications. It uses specialized ONNX Runtime Contrib Operators to speed up inference.
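As a rough illustration of cross-environment use, the sketch below requests the WebGPU device and falls back to the library's default device if that fails. It assumes the v4 preview keeps the `pipeline(task, model, { device })` call shape known from earlier Transformers.js releases. The helper `createPipelineWithFallback` and the model id `Xenova/all-MiniLM-L6-v2` are illustrative choices, not taken from the original article.

```javascript
// Hypothetical helper (not part of Transformers.js): try WebGPU first, then
// fall back to whatever device the library picks by default.
// `createPipeline` is injected so the helper can be exercised without the library.
async function createPipelineWithFallback(createPipeline, task, model) {
  try {
    const pipe = await createPipeline(task, model, { device: "webgpu" });
    return { pipe, device: "webgpu" };
  } catch (err) {
    console.warn(`WebGPU unavailable, using the default device: ${err.message}`);
    const pipe = await createPipeline(task, model);
    return { pipe, device: "default" };
  }
}

// Example wiring. Assumes `npm i @huggingface/transformers@next` has been run
// and that the preview still exports `pipeline` (an assumption, see above).
async function main() {
  const { pipeline } = await import("@huggingface/transformers");
  const { pipe, device } = await createPipelineWithFallback(
    pipeline,
    "feature-extraction",
    "Xenova/all-MiniLM-L6-v2" // illustrative model id
  );
  // Mean-pool and normalize token embeddings into one vector per input.
  const embeddings = await pipe(["Hello world"], { pooling: "mean", normalize: true });
  console.log(`device: ${device}, output dims: ${embeddings.dims}`);
}
```

Calling `main()` downloads the model on first use, so it is left uninvoked here. Injecting the pipeline factory keeps the fallback logic independent of the library and easy to check in isolation.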
In practice
- Use `npm i @huggingface/transformers@next` to install.
- Integrate `@huggingface/tokenizers` for standalone tokenization.
- Run WebGPU-accelerated models in Node, Bun, or Deno.
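For the standalone tokenization item above, a minimal sketch might look like the following. It assumes that `@huggingface/tokenizers` exports a `Tokenizer` class constructed from a model's parsed `tokenizer.json` and `tokenizer_config.json`, and that it exposes an `encode` method; verify both against the package's README. The helpers `tokenizerFileUrls` and `loadTokenizer` are hypothetical names introduced here.

```javascript
// Hypothetical helper: build download URLs for a model's tokenizer files,
// using the Hugging Face Hub "resolve/main" file route.
function tokenizerFileUrls(modelId) {
  const base = `https://huggingface.co/${modelId}/resolve/main`;
  return {
    tokenizer: `${base}/tokenizer.json`,
    config: `${base}/tokenizer_config.json`,
  };
}

// Hypothetical helper: fetch both JSON files and construct a tokenizer.
// Assumption: `new Tokenizer(tokenizerJson, tokenizerConfig)` is the
// constructor shape; this is not confirmed by the summary above.
async function loadTokenizer(modelId) {
  const { Tokenizer } = await import("@huggingface/tokenizers");
  const urls = tokenizerFileUrls(modelId);
  const [tokenizerJson, tokenizerConfig] = await Promise.all(
    [urls.tokenizer, urls.config].map(async (url) => {
      const res = await fetch(url);
      if (!res.ok) throw new Error(`Failed to fetch ${url}: ${res.status}`);
      return res.json();
    })
  );
  return new Tokenizer(tokenizerJson, tokenizerConfig);
}
```

Under those assumptions, usage would be along the lines of `const tokenizer = await loadTokenizer("<org>/<model>")` followed by `tokenizer.encode("Hello world")`. Because only the two JSON files are fetched, no model weights or inference runtime are pulled in.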
Topics
- Transformers.js
- WebGPU Runtime
- ONNX Runtime
- Large Language Models
- Tokenization Libraries
Best for: NLP Engineer, Machine Learning Engineer, Software Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.