Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Internet of Things (IoT) & Connected Devices · Depth: Expert, medium

Summary

The paper introduces a cascaded multi-granularity pruning framework for deploying large language models (LLMs) on Industrial Internet of Things (IIoT) edge devices, where extreme compression is required. The method removes layers, then attention heads, then feed-forward channels in a coarse-to-fine sequence, with lightweight low-rank recovery between stages so that component importance is re-estimated before each finer stage. An information-theoretic analysis motivates this ordering, and a formalized Structural Independence Assumption (SIA) is used to predict how reliably pruning transfers across architectures. On architectures that combine multi-head attention with GELU (MHA+GELU), the framework reaches 13.8× compression while maintaining 83.82% accuracy on bearing fault diagnosis (+3.70 percentage points over baselines), for models ranging from 88M to 6.25B parameters. On designs that combine grouped-query attention with SwiGLU (GQA+SwiGLU), which violate the SIA, accuracy collapses by roughly 74 percentage points. Deployment on an industrial slewing bearing fault diagnosis platform with NVIDIA DGX Spark showed up to a 67.2% reduction in inference latency and a 62.5% reduction in peak memory.
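
To put the reported percentages on a common scale, the short calculation below converts them into multiplicative factors. It is a sketch under two stated assumptions that the summary does not confirm: that the 13.8× ratio is measured in parameter count, and that the +3.70 point gain is relative to the strongest baseline at the same compression level.

```python
# Figures quoted in the summary.
compression_ratio = 13.8     # reported compression on MHA+GELU
latency_reduction = 0.672    # up to 67.2% lower inference latency
memory_reduction = 0.625     # 62.5% lower peak memory
pruned_accuracy = 83.82      # percent
gain_over_baselines = 3.70   # percentage points


def reduction_to_factor(reduction: float) -> float:
    """Convert a fractional reduction into an 'N times smaller' factor."""
    return 1.0 / (1.0 - reduction)


speedup = round(reduction_to_factor(latency_reduction), 2)
memory_factor = round(reduction_to_factor(memory_reduction), 2)

# Assumption: the compression ratio counts parameters.
params_kept_pct = round(100.0 / compression_ratio, 2)

# Assumption: the gain is measured against the best baseline.
implied_baseline_accuracy = round(pruned_accuracy - gain_over_baselines, 2)

print(f"latency: about {speedup}x faster")
print(f"peak memory: about {memory_factor}x smaller")
print(f"parameters kept: about {params_kept_pct}%")
print(f"implied baseline accuracy: {implied_baseline_accuracy}%")
```

Under those assumptions, a 67.2% latency reduction corresponds to roughly a 3× speedup, and a 13.8× ratio leaves a little over 7% of the original parameters.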

Key takeaway

Machine learning engineers deploying LLMs on resource-constrained IIoT edge devices should consider cascaded multi-granularity pruning when extreme model compression is required. On MHA+GELU architectures, the method reduced inference latency by up to 67.2% and peak memory by 62.5%. GQA+SwiGLU designs violate the Structural Independence Assumption and can suffer a substantial accuracy collapse, which makes them poor candidates for this pruning approach.
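
The summary does not define the SIA. One plausible reading, assumed here and not taken from the paper, is that a structural unit can be pruned without disturbing its neighbors. The sketch below shows why grouped-query attention strains that reading: several query heads read from the same key/value head, so the units are coupled. The function names are hypothetical, and the consecutive grouping rule is the common convention rather than something the source specifies.

```python
from collections import defaultdict


def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Index of the K/V head a query head reads from (consecutive grouping)."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def sharing_map(num_q_heads: int, num_kv_heads: int) -> dict[int, list[int]]:
    """Map each K/V head to the query heads that depend on it."""
    groups: dict[int, list[int]] = defaultdict(list)
    for q in range(num_q_heads):
        kv = kv_head_for_query_head(q, num_q_heads, num_kv_heads)
        groups[kv].append(q)
    return dict(groups)


# MHA: one K/V head per query head, so heads can be removed independently.
mha = sharing_map(num_q_heads=8, num_kv_heads=8)

# GQA: each K/V head is shared, so removing one affects a whole group.
gqa = sharing_map(num_q_heads=8, num_kv_heads=2)

print("MHA:", mha)
print("GQA:", gqa)
```

In the GQA case, removing a single key/value head changes the inputs of four query heads at once, so a per-head importance score no longer describes an independent unit.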

Key insights

Cascaded multi-granularity pruning with low-rank recovery enables extreme LLM compression for IIoT, but its reliability depends on whether the target architecture satisfies the Structural Independence Assumption.

Method

The framework prunes in a cascade: it removes layers first, then attention heads, then feed-forward channels, and runs low-rank recovery between stages so that component importance is re-estimated before each finer-grained stage.
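
The control flow of such a cascade can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Layer` structure, the scoring callbacks, the keep ratios, and the `recover` hook are all hypothetical stand-ins, and the toy scores are arbitrary arithmetic rather than a real importance metric.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Layer:
    index: int
    heads: list[int]      # ids of surviving attention heads
    channels: list[int]   # ids of surviving feed-forward channels


def keep_top(items: Sequence, score: Callable, keep_ratio: float) -> list:
    """Keep the highest-scoring fraction of items, preserving their order."""
    n_keep = max(1, int(len(items) * keep_ratio))
    ranked = sorted(range(len(items)), key=lambda i: score(items[i]), reverse=True)
    return [items[i] for i in sorted(ranked[:n_keep])]


def cascade_prune(model, score_layer, score_head, score_channel, recover,
                  keep=(0.5, 0.5, 0.25)):
    # Stage 1 (coarse): drop whole layers, then recover.
    model = keep_top(model, score_layer, keep[0])
    recover(model)
    # Stage 2: drop heads, scored on the recovered model, then recover.
    for layer in model:
        layer.heads = keep_top(layer.heads, lambda h: score_head(layer, h), keep[1])
    recover(model)
    # Stage 3 (fine): drop feed-forward channels.
    for layer in model:
        layer.channels = keep_top(
            layer.channels, lambda c: score_channel(layer, c), keep[2])
    return model


# Toy setup: 4 layers, 4 heads and 8 channels each (hypothetical sizes).
toy = [Layer(i, list(range(4)), list(range(8))) for i in range(4)]
recovery_calls = []

pruned = cascade_prune(
    toy,
    score_layer=lambda layer: (layer.index * 7) % 5,         # placeholder score
    score_head=lambda layer, h: (h * 3 + layer.index) % 4,   # placeholder score
    score_channel=lambda layer, c: (c * 5 + layer.index) % 8,
    recover=lambda m: recovery_calls.append(len(m)),         # stands in for low-rank recovery
)

for layer in pruned:
    print(layer)
```

In a real pipeline the scoring callbacks would read the recovered weights, which is what makes the re-estimation between stages meaningful. Here `recover` only records that it was called.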

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.