Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment

Source: Artificial Intelligence · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Internet of Things (IoT) & Connected Devices · Depth: Intermediate, quick

Summary

A benchmark study evaluates lightweight Transformer models (DistilBERT, TinyBERT-6L, TinyBERT-4L, MobileBERT) against traditional machine learning methods (Random Forest, XGBoost, SVM, Logistic Regression) for on-device binary fault detection. The research focuses on resource-constrained deployment and assesses performance on the NASA C-MAPSS, SECOM, and UCI AI4I 2020 datasets. Key metrics are F1-score, AUC, model size, and CPU inference latency. On well-separated sensor data such as C-MAPSS, lightweight transformers achieved an 87.8% F1-score, comparable to traditional ML, but with 100x larger models and 9000x higher latency. TinyBERT-4L was identified as the most deployment-friendly transformer at 55 MB and 18 ms CPU latency. INT8 dynamic quantization reduced model size by 25% while maintaining an 86.9% F1-score. An adaptive inference pipeline that routes 97.9% of predictions through a quantized triage model achieved 87.6% F1 at 19.5 ms average latency. Both the transformers and the traditional ML baselines struggled significantly on severely imbalanced datasets.
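
The study reports F1-score rather than plain accuracy, which matters most on imbalanced data. A minimal sketch of why, using a hypothetical 95:5 healthy-to-fault test split that is not taken from the study: a model that never flags a fault still scores 95% accuracy, while F1 on the fault class drops to zero.

```python
# Minimal sketch: accuracy vs. F1 on an imbalanced binary fault-detection set.
# Label 1 = fault (positive class), 0 = healthy. The data below is hypothetical.

def f1_score(y_true, y_pred):
    """F1 for the fault class; defined as 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# 95 healthy samples followed by 5 faults.
y_true = [0] * 95 + [1] * 5

# Degenerate model: always predicts "healthy".
always_healthy = [0] * 100
print(accuracy(y_true, always_healthy), f1_score(y_true, always_healthy))  # 0.95 0.0

# Model that raises 2 false alarms, catches 3 faults, and misses 2.
some_detection = [0] * 93 + [1] * 2 + [1] * 3 + [0] * 2
print(accuracy(y_true, some_detection), round(f1_score(y_true, some_detection), 3))  # 0.96 0.6
```

Accuracy barely moves between the two models, while F1 separates them clearly; on severely imbalanced data like SECOM this is the kind of gap the F1 metric is meant to expose.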

Key takeaway

If you deploy fault detection models on resource-constrained edge devices, weigh the substantial resource overhead of lightweight Transformers against their accuracy, which on well-separated data only matches traditional ML. TinyBERT-4L is the most practical transformer option at 55 MB and 18 ms CPU latency, and INT8 dynamic quantization can reduce model size by a further 25% while keeping F1 at 86.9%. An adaptive inference pipeline that sends most inputs through a quantized triage model gives a good latency-accuracy balance (87.6% F1 at 19.5 ms average latency in the study). Expect both transformers and traditional ML to struggle severely on highly imbalanced datasets, and plan for robust data preprocessing or alternative strategies there.
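
In PyTorch, dynamic quantization is exposed as `torch.ao.quantization.quantize_dynamic`; the summary does not say which toolchain the study used. To show what the technique does without a framework dependency, here is a minimal NumPy sketch of one dynamically quantized linear layer. Weights are quantized to INT8 once, and activations are quantized on every call from the current input's range. The helper names `quantize_int8` and `DynamicQuantLinear` and the 312x312 layer shape are hypothetical, chosen only for illustration.

```python
import numpy as np


def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: x ~= scale * q with q in [-127, 127]."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


class DynamicQuantLinear:
    """y = x @ W.T + b with INT8 weights (quantized once) and INT8 activations
    (quantized per call from the current input's range, the "dynamic" part)."""

    def __init__(self, weight: np.ndarray, bias: np.ndarray):
        self.w_q, self.w_scale = quantize_int8(weight)  # int8 storage: 1 byte per weight
        self.bias = bias.astype(np.float32)              # bias stays in float32

    def __call__(self, x: np.ndarray) -> np.ndarray:
        x_q, x_scale = quantize_int8(x)  # activation scale computed at run time
        # Accumulate int8 products in int32 to avoid overflow, then rescale to float.
        acc = x_q.astype(np.int32) @ self.w_q.astype(np.int32).T
        return acc.astype(np.float32) * np.float32(x_scale * self.w_scale) + self.bias


rng = np.random.default_rng(0)
weight = rng.standard_normal((312, 312)).astype(np.float32)  # hypothetical layer size
bias = rng.standard_normal(312).astype(np.float32)
x = rng.standard_normal((4, 312)).astype(np.float32)         # hypothetical batch of 4

layer = DynamicQuantLinear(weight, bias)
ref = x @ weight.T + bias  # float32 reference output
out = layer(x)

print(weight.nbytes, layer.w_q.nbytes)                          # 389376 97344
print(float(np.max(np.abs(out - ref)) / np.max(np.abs(ref))))  # small relative error
```

Each quantized layer's weights shrink 4x, but whole-model savings are smaller when some parameters, such as embedding tables, are left in floating point, so a per-layer 4x reduction does not translate directly into the model-level figure.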

Key insights

Lightweight Transformers can match traditional ML for on-device fault detection on well-separated sensor data, but at a large cost in model size and latency, and neither approach handles severely imbalanced data well.

Method

The study proposes a two-stage adaptive inference pipeline: a quantized triage model handles most predictions (97.9% in the reported results), and the remaining 2.1% are escalated to a larger expert model for harder cases.
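
The summary does not state how the triage model decides which cases to escalate. A common pattern, assumed here, is a confidence threshold on the triage model's fault probability. The sketch below is a minimal illustration under that assumption; `cascade_predict`, `expected_latency_ms`, the 0.9 threshold, and the toy models are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CascadeResult:
    predictions: List[int]
    escalated_fraction: float


def cascade_predict(
    samples: Sequence[float],
    triage: Callable[[float], float],  # hypothetical: P(fault) from the small quantized model
    expert: Callable[[float], float],  # hypothetical: P(fault) from the larger expert model
    confidence: float = 0.9,           # assumed routing threshold, not from the study
    decision: float = 0.5,             # probability cutoff for predicting "fault"
) -> CascadeResult:
    preds: List[int] = []
    escalated = 0
    for s in samples:
        p = triage(s)
        # The triage model counts as "confident" when its probability is far from 0.5.
        if max(p, 1.0 - p) >= confidence:
            preds.append(int(p >= decision))
        else:
            escalated += 1
            preds.append(int(expert(s) >= decision))
    frac = escalated / len(samples) if samples else 0.0
    return CascadeResult(preds, frac)


def expected_latency_ms(t_triage: float, t_expert: float, escalated_fraction: float) -> float:
    # Every sample pays the triage cost; escalated samples also pay the expert cost.
    return t_triage + escalated_fraction * t_expert


# Toy usage: each "sample" is already a fault score in [0, 1].
result = cascade_predict(
    [0.02, 0.97, 0.40, 0.55, 0.99, 0.01],
    triage=lambda s: s,
    expert=lambda s: 1.0 if s > 0.5 else 0.0,
)
print(result.predictions)                   # [0, 1, 0, 1, 1, 0]
print(round(result.escalated_fraction, 3))  # 0.333
```

Under this assumption, average latency is t_triage + f x t_expert, where f is the escalated fraction. Keeping f small (2.1% in the study) is what keeps the pipeline's average latency close to that of the triage model alone.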

Best for: AI Scientist, Research Scientist, Machine Learning Engineer, AI Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.