Transformers.js v4 Preview: Now Available on NPM!

2024-10-22 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

Hugging Face has released a preview of Transformers.js v4 on NPM, nearly a year after development began in March 2025. The release introduces a new WebGPU runtime, rewritten in C++ in collaboration with the ONNX Runtime team, that lets WebGPU-accelerated models run in browsers, in server-side runtimes such as Node, Bun, and Deno, and in desktop applications. Performance work includes specialized ONNX Runtime Contrib Operators, which delivered a roughly 4x speedup for BERT-based embedding models. Full offline support is available through local caching of the WASM files. The repository has been restructured into a monorepo using pnpm workspaces, with a modular class structure for models and a dedicated examples repository. The build system moved from Webpack to esbuild, cutting build times about tenfold, to 200 milliseconds, and shrinking bundles by 10% on average; the default export is 53% smaller. Transformers.js v4 also adds support for new models such as GPT-OSS, Chatterbox, and FalconH1, and moves tokenization logic into a standalone, lightweight `@huggingface/tokenizers` library.

Key takeaway

For NLP engineers building JavaScript-based AI applications, the Transformers.js v4 preview offers faster inference and more deployment targets. Try the new WebGPU runtime for accelerated inference in browser, server, or desktop environments, and consider the standalone `@huggingface/tokenizers` library when you only need tokenization. Because models can run on the user's device and work offline, the update may also reduce operational costs and improve user experience.

Key insights

Transformers.js v4 enhances performance and expands runtime compatibility through a new WebGPU runtime and optimized ONNX exports.

Principles

Modular design improves maintainability and extensibility.
Specialized operators accelerate model inference.
Offline capabilities enhance user experience.

Method

The new WebGPU runtime, rewritten in C++ and integrated with ONNX Runtime, enables cross-environment execution and leverages Contrib Operators for performance optimization.
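
To make the cross-environment idea concrete, here is a minimal sketch of requesting the WebGPU backend and falling back to the library's default device when it is unavailable. It assumes `@huggingface/transformers` is installed and uses its `pipeline` function with the `device` option. The helper names `loadWithFallback` and `loadEmbedder` and the default model ID are illustrative choices, not part of the original article.

```typescript
// Hypothetical helper: try each device in order and return the first
// loader call that succeeds. `undefined` means "let the library choose".
type Loader<T> = (device: string | undefined) => Promise<T>;

async function loadWithFallback<T>(
  load: Loader<T>,
  devices: (string | undefined)[],
): Promise<{ device: string | undefined; value: T }> {
  let lastError: unknown;
  for (const device of devices) {
    try {
      return { device, value: await load(device) };
    } catch (err) {
      lastError = err; // remember why this device failed, then try the next
    }
  }
  throw lastError ?? new Error("no devices were provided");
}

// Sketch of loading an embedding pipeline, preferring WebGPU.
// ASSUMPTION: the model ID is only an example of an ONNX-ready embedding model.
async function loadEmbedder(modelId = "Xenova/all-MiniLM-L6-v2") {
  // Non-literal specifier so the package is resolved at run time, not compile time.
  const pkg: string = "@huggingface/transformers";
  const { pipeline } = await import(pkg);
  return loadWithFallback(
    (device) => pipeline("feature-extraction", modelId, device ? { device } : {}),
    ["webgpu", undefined],
  );
}
```

A caller would typically do `const { device, value: extractor } = await loadEmbedder();` and then `await extractor(["some text"], { pooling: "mean", normalize: true })`. Trying the device and catching the failure, rather than probing `navigator.gpu`, keeps the same code usable in server-side runtimes where that browser global may be absent.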

In practice

Use `npm i @huggingface/transformers@next` to install.
Integrate `@huggingface/tokenizers` for standalone tokenization.
Run WebGPU-accelerated models in Node, Bun, or Deno.
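
As a rough illustration of standalone tokenization, the sketch below fetches a model's tokenizer files from the Hugging Face Hub and builds a tokenizer from them. It assumes `@huggingface/tokenizers` is installed and exposes a `Tokenizer` class constructed from the parsed `tokenizer.json` and `tokenizer_config.json`; check that constructor shape against the package README before relying on it. The helpers `hubFileUrl`, `fetchJson`, and `loadTokenizer` are hypothetical names introduced here.

```typescript
// Build the public Hub URL for a file on a model repo's main branch.
function hubFileUrl(modelId: string, file: string): string {
  return `https://huggingface.co/${modelId}/resolve/main/${file}`;
}

// Fetch and parse a JSON file, failing loudly on HTTP errors.
async function fetchJson(url: string): Promise<unknown> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`failed to fetch ${url}: ${res.status}`);
  return res.json();
}

// ASSUMPTION: `new Tokenizer(tokenizerJson, tokenizerConfig)` is the
// constructor shape of @huggingface/tokenizers; verify before use.
async function loadTokenizer(modelId: string) {
  // Non-literal specifier so the package is resolved at run time, not compile time.
  const pkg: string = "@huggingface/tokenizers";
  const { Tokenizer } = await import(pkg);
  const [tokenizerJson, tokenizerConfig] = await Promise.all([
    fetchJson(hubFileUrl(modelId, "tokenizer.json")),
    fetchJson(hubFileUrl(modelId, "tokenizer_config.json")),
  ]);
  return new Tokenizer(tokenizerJson, tokenizerConfig);
}
```

With a tokenizer loaded this way, a call such as `tokenizer.encode("Hello World")` would be the natural next step, again assuming the method name matches the published package. Loading only the tokenizer avoids pulling in the inference runtime when an application just needs token counts or IDs.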

Topics

Transformers.js
WebGPU Runtime
ONNX Runtime
Large Language Models
Tokenization Libraries

Best for: NLP Engineer, Machine Learning Engineer, Software Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.