Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment
Summary
A benchmark study evaluates lightweight Transformer models (DistilBERT, TinyBERT-6L, TinyBERT-4L, MobileBERT) against traditional machine learning methods (Random Forest, XGBoost, SVM, Logistic Regression) for on-device binary fault detection. The research focuses on resource-constrained deployment, assessing performance across NASA C-MAPSS, SECOM, and UCI AI4I 2020 datasets. Key metrics include F1-score, AUC, model size, and CPU inference latency. On well-separated sensor data like C-MAPSS, lightweight transformers achieved an 87.8% F1-score, comparable to traditional ML, but with 100x larger model sizes and 9000x higher latency. TinyBERT-4L was identified as the most deployment-friendly transformer at 55 MB and 18 ms CPU latency. The study also found that INT8 dynamic quantization reduced model size by 25% while maintaining an 86.9% F1-score. An adaptive inference pipeline, routing 97.9% of predictions through a quantized triage model, achieved 87.6% F1 at 19.5 ms average latency. Both approaches struggled significantly on severely imbalanced datasets.
Key takeaway
AI engineers deploying fault detection models on resource-constrained edge devices should weigh the substantial size and latency overhead of lightweight Transformers against accuracy that, on well-separated sensor data, is only comparable to traditional ML. TinyBERT-4L is the most deployment-friendly Transformer at 55 MB and 18 ms CPU latency. INT8 dynamic quantization cuts model size by 25%, and an adaptive inference pipeline helps balance latency against accuracy. Current methods struggle severely on highly imbalanced datasets, so plan for robust data preprocessing or alternative strategies.
Key insights
Lightweight Transformers can match traditional ML for on-device fault detection on well-separated sensor data, but at a far higher cost in model size and latency, and performance degrades sharply on severely imbalanced data.
Principles
- Resource constraints demand careful model tradeoffs.
- Quantization reduces size with minimal accuracy loss.
- Extreme class imbalance challenges all fault detection.
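To make the class-imbalance point concrete, the sketch below computes accuracy and F1 for a trivial classifier that always predicts "healthy". The dataset (1000 samples, 10 faults) and the function name `binary_metrics` are hypothetical illustrations, not figures or code from the study. It shows why F1, one of the study's key metrics, is more informative than accuracy when faults are rare.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary labels, 1 = fault."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when no positives are predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Hypothetical imbalanced dataset: 10 faults among 1000 samples (1% positives).
y_true = [1] * 10 + [0] * 990
always_healthy = [0] * 1000  # a "model" that never flags a fault

acc, prec, rec, f1 = binary_metrics(y_true, always_healthy)
print(f"accuracy={acc:.3f} f1={f1:.3f}")  # accuracy=0.990 f1=0.000
```

A classifier that detects nothing still scores 99% accuracy here, which is why imbalanced benchmarks are better judged on F1 and AUC.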
Method
The study proposes a two-stage adaptive inference pipeline: a quantized triage model handles most inputs (97.9% of predictions in the benchmark), and the remaining 2.1% are escalated to a larger expert model for complex cases.
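A two-stage pipeline of this kind can be sketched as a confidence-gated cascade. This summary does not state the study's routing criterion, so the confidence threshold below is an assumption, and `cascade_predict`, `toy_triage`, and `toy_expert` are hypothetical names with stand-in models rather than the study's implementation.

```python
from typing import Callable, Sequence, Tuple

# A predictor maps a feature vector to P(fault) in [0, 1].
Predictor = Callable[[Sequence[float]], float]


def cascade_predict(
    x: Sequence[float],
    triage: Predictor,
    expert: Predictor,
    threshold: float = 0.9,  # assumed routing rule, not taken from the study
) -> Tuple[int, str]:
    """Return (label, stage). Escalate to the expert only when triage is unsure."""
    p = triage(x)
    confidence = max(p, 1.0 - p)  # confidence in the triage model's own decision
    if confidence >= threshold:
        return int(p >= 0.5), "triage"
    return int(expert(x) >= 0.5), "expert"


# Stand-in models for illustration only: each reads one precomputed score.
def toy_triage(x: Sequence[float]) -> float:
    return x[0]


def toy_expert(x: Sequence[float]) -> float:
    return x[1]


samples = [(0.02, 0.10), (0.97, 0.90), (0.55, 0.20), (0.45, 0.80)]
results = [cascade_predict(s, toy_triage, toy_expert) for s in samples]
escalated = sum(1 for _, stage in results if stage == "expert")
print(results, f"escalation rate={escalated / len(samples):.2f}")
```

Under this assumed rule, raising `threshold` sends more inputs to the expert, trading average latency for accuracy. An escalation rate as low as the study's 2.1% would keep average latency close to that of the triage model.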
In practice
- Prefer TinyBERT-4L (55 MB, 18 ms CPU latency) when deploying a Transformer on-device.
- Use INT8 dynamic quantization to shrink model size by about 25%.
- Implement adaptive inference to keep average latency low.
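The idea behind INT8 weight quantization can be illustrated with a minimal pure-Python sketch of symmetric per-tensor quantization. This is a simplified stand-in for what a framework does internally, not the study's procedure. The weight values and the helper names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # one float scale per tensor
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the INT8 codes."""
    return [v * scale for v in q]


# Hypothetical weights for illustration.
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"scale={scale:.4f} max_err={max_err:.4f}")
```

Each quantized weight occupies 1 byte instead of the 4 bytes of a float32, but that saving applies only to the tensors that are actually quantized. In PyTorch, dynamic quantization is available through `torch.quantization.quantize_dynamic`, which stores the weights of the selected layer types as INT8 and quantizes activations at run time.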
Topics
- Lightweight Transformers
- On-device AI
- Fault Detection
- Model Quantization
- Adaptive Inference
- Predictive Maintenance
Best for: AI Scientist, Research Scientist, Machine Learning Engineer, AI Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.