Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment

2026-06-23 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Internet of Things (IoT) & Connected Devices · Depth: Intermediate, quick

Summary

A benchmark study evaluates lightweight Transformer models (DistilBERT, TinyBERT-6L, TinyBERT-4L, MobileBERT) against traditional machine learning methods (Random Forest, XGBoost, SVM, Logistic Regression) for on-device binary fault detection. The research focuses on resource-constrained deployment and assesses performance on the NASA C-MAPSS, SECOM, and UCI AI4I 2020 datasets. Key metrics are F1-score, AUC, model size, and CPU inference latency. On well-separated sensor data such as C-MAPSS, lightweight Transformers achieved an 87.8% F1-score, comparable to traditional ML, but with 100x larger model sizes and 9000x higher latency. TinyBERT-4L was identified as the most deployment-friendly Transformer at 55 MB and 18 ms CPU latency. INT8 dynamic quantization reduced model size by 25% while maintaining an 86.9% F1-score. An adaptive inference pipeline that routed 97.9% of predictions through a quantized triage model achieved 87.6% F1 at 19.5 ms average latency. Both Transformers and traditional ML struggled significantly on severely imbalanced datasets.

Key takeaway

AI engineers deploying fault detection models on resource-constrained edge devices should weigh the significant resource overhead of lightweight Transformers against their performance. TinyBERT-4L is a viable option at 55 MB and 18 ms CPU latency, INT8 dynamic quantization reduces model size by 25%, and an adaptive inference pipeline helps balance latency against accuracy. Current methods struggle severely on highly imbalanced datasets, so plan for robust data preprocessing or alternative strategies.

Key insights

Lightweight Transformers can match traditional ML for on-device fault detection on well-separated sensor data, but at a much higher resource cost, and both approaches struggle on severely imbalanced data.

Principles

Resource constraints demand careful model tradeoffs.
Quantization reduces size with minimal accuracy loss.
Extreme class imbalance challenges all fault detection.
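
To illustrate why the study reports F1-score rather than plain accuracy, and why extreme imbalance is hard, here is a minimal sketch. The data is hypothetical (1000 samples with a 2% fault rate) and is not drawn from any of the benchmark datasets. A trivial classifier that always predicts "healthy" looks accurate but never detects a fault.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for the positive (fault) class, labeled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical imbalanced data: 20 faults among 1000 samples.
y_true = [1] * 20 + [0] * 980
always_healthy = [0] * 1000  # trivial classifier that never flags a fault

acc = accuracy(y_true, always_healthy)
f1 = f1_score(y_true, always_healthy)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

The gap between the two numbers is the reason F1 on the fault class is the more informative metric when faults are rare.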

Method

The study proposes a two-stage adaptive inference pipeline: a quantized triage model handles most predictions (97.9% in the study), and the remaining 2.1% are escalated to a larger expert model for complex cases.
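
A two-stage pipeline of this kind could be sketched as follows. The summary does not say how the study decides which cases are complex, so the confidence-threshold rule, the threshold value of 0.9, and every name here (AdaptivePipeline, toy_triage, toy_expert) are assumptions made for illustration. The two toy models are stand-ins, not TinyBERT or any model from the study.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# A model is any callable that returns an estimated fault probability in [0, 1].
Model = Callable[[Sequence[float]], float]


@dataclass
class AdaptivePipeline:
    triage: Model                      # small, fast model (e.g. quantized)
    expert: Model                      # larger, slower model
    confidence_threshold: float = 0.9  # hypothetical value, tune on validation data

    def predict(self, x: Sequence[float]) -> Tuple[int, str]:
        """Return (label, route). Escalate only when triage is unsure."""
        p = self.triage(x)
        confidence = max(p, 1.0 - p)
        if confidence >= self.confidence_threshold:
            return int(p >= 0.5), "triage"
        p = self.expert(x)
        return int(p >= 0.5), "expert"


def escalation_rate(pipeline: AdaptivePipeline, inputs) -> float:
    """Fraction of inputs that were routed to the expert model."""
    routes = [pipeline.predict(x)[1] for x in inputs]
    return routes.count("expert") / len(routes)


# Toy stand-in models for demonstration only.
def toy_triage(x: Sequence[float]) -> float:
    return min(1.0, max(0.0, x[0]))


def toy_expert(x: Sequence[float]) -> float:
    return 1.0 if x[0] > 0.4 else 0.0


pipeline = AdaptivePipeline(triage=toy_triage, expert=toy_expert)
samples = [[0.95], [0.05], [0.45], [0.99]]
print([pipeline.predict(x) for x in samples])
print(escalation_rate(pipeline, samples))
```

In a design like this, the threshold is the knob that trades latency for accuracy: raising it sends more inputs to the expert model, which increases average latency.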

In practice

Deploy TinyBERT-4L for on-device fault detection.
Use INT8 quantization to shrink model size.
Implement adaptive inference for latency control.
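
As a rough illustration of what INT8 quantization does to a weight tensor, the sketch below applies symmetric per-tensor quantization to a handful of hypothetical weights using only the Python standard library. It is not the study's procedure: framework-level dynamic quantization converts the weights of selected layers ahead of time and quantizes activations at runtime, and the function names here are invented for the example.

```python
import struct


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 0.8, 1.0]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage cost: 4 bytes per float32 weight versus 1 byte per int8 weight.
fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
int8_bytes = len(struct.pack(f"{len(q)}b", *q))

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, fp32_bytes, int8_bytes, max_error)
```

A quantized tensor takes a quarter of its FP32 storage, yet the study reports only a 25% reduction for the whole model. That gap presumably means not every parameter was quantized, though the summary does not say which layers were converted.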

Topics

Lightweight Transformers
On-device AI
Fault Detection
Model Quantization
Adaptive Inference
Predictive Maintenance

Best for: AI Scientist, Research Scientist, Machine Learning Engineer, AI Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.