Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Internet of Things (IoT) & Connected Devices) · Depth: Expert, extended

Summary

FLAME (Frequency-aware Latency Analysis for Mobile Edge) is a framework for estimating AI model inference latency on mobile edge devices, such as the NVIDIA Jetson series, under Dynamic Voltage and Frequency Scaling (DVFS). DVFS invalidates static profiling, because latency measured at one frequency setting does not hold at another, and exhaustive profiling is prohibitively expensive: it can take days for Small Language Models (SLMs) with variable context lengths. FLAME instead models latency layer by layer, quantifying asynchronous CPU-GPU coupling, and aggregates the resulting dynamic pipeline bubbles across the full model. This bottom-up design generalizes across models ranging from Deep Neural Networks (DNNs) to SLMs and cuts profiling overhead from hours or days to minutes, while keeping estimation error below 8.14%. The framework also enables a deadline-aware DVFS strategy that outperforms state-of-the-art learning-based approaches such as zTT by 23.48% in power efficiency and by 4.35% in latency guarantees.

Key takeaway

For MLOps engineers deploying AI models on DVFS-enabled mobile edge devices, FLAME avoids both the inaccuracy of static profiling and the cost of exhaustive profiling. Consider using its layer-wise and model-wise estimation to obtain frequency-aware latency predictions. Those predictions support tighter resource management and deadline-aware DVFS strategies that improve power efficiency while meeting latency targets for time-critical applications.
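
To make the idea of deadline-aware DVFS concrete, here is a minimal Python sketch. It is not FLAME's strategy, which the summary does not describe in detail. It assumes a hypothetical `latency_fn(f_cpu, f_gpu)` estimator and a hypothetical `power_fn(f_cpu, f_gpu)` model, and picks the lowest-power frequency pair whose estimated latency meets the deadline. The function name `pick_frequencies` and the toy models below are illustrative assumptions.

```python
from itertools import product


def pick_frequencies(cpu_freqs, gpu_freqs, latency_fn, power_fn, deadline_s):
    """Return the (f_cpu, f_gpu) pair with the lowest modeled power that meets the deadline.

    latency_fn and power_fn are hypothetical callables supplied by the caller.
    If no pair meets the deadline, fall back to the highest frequencies.
    """
    feasible = [
        (power_fn(fc, fg), fc, fg)
        for fc, fg in product(cpu_freqs, gpu_freqs)
        if latency_fn(fc, fg) <= deadline_s
    ]
    if not feasible:
        return max(cpu_freqs), max(gpu_freqs)
    _, fc, fg = min(feasible)  # tuples sort by modeled power first
    return fc, fg


# Toy models (assumptions for illustration only, not measured on any device).
def toy_latency(fc, fg):
    # Fixed CPU and GPU work in cycles, executed back to back.
    return 2e6 / fc + 8e6 / fg


def toy_power(fc, fg):
    # Common rule of thumb: dynamic power grows roughly with the cube of frequency.
    return (fc / 1e9) ** 3 + 2 * (fg / 1e9) ** 3


CPU_FREQS = [0.5e9, 1e9, 2e9]
GPU_FREQS = [0.5e9, 1e9]

chosen = pick_frequencies(CPU_FREQS, GPU_FREQS, toy_latency, toy_power, deadline_s=0.011)
print(chosen)
```

The exhaustive search is only practical because the estimator is cheap to evaluate; with measured latencies, each candidate pair would require a profiling run.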

Key insights

Precise latency estimation on mobile edge devices requires accounting for dynamic CPU-GPU asynchronous coupling under DVFS.

Method

FLAME estimates latency layer by layer using a piecewise model of CPU-GPU interaction, aggregates the per-layer estimates into a full-model timeline, and applies online adaptation for real-time calibration.
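
The three steps can be illustrated with a simplified Python sketch. This is not the paper's model. It assumes each layer has a CPU launch cost and a GPU execution cost expressed in cycles (hypothetical fields `cpu_cycles` and `gpu_cycles`), that a kernel starts only once it has been launched and the previous kernel has finished, and that online adaptation is a simple exponential moving average of the measured-to-predicted ratio. The `max` in the timeline is what makes the per-layer behavior piecewise: a layer is either CPU-bound (the GPU waits, creating a bubble) or GPU-bound (the launch is hidden).

```python
from dataclasses import dataclass


@dataclass
class Layer:
    cpu_cycles: float  # hypothetical: CPU work to prepare and launch the layer
    gpu_cycles: float  # hypothetical: GPU work to execute the layer


def estimate_latency(layers, f_cpu_hz, f_gpu_hz):
    """Return (end-to-end latency, total GPU idle 'bubble' time), both in seconds."""
    cpu_t = 0.0   # time at which the CPU has finished launching the current layer
    gpu_t = 0.0   # time at which the GPU has finished the previous layer
    bubble = 0.0  # accumulated time the GPU spends waiting for a launch
    for layer in layers:
        cpu_t += layer.cpu_cycles / f_cpu_hz
        start = max(cpu_t, gpu_t)  # piecewise: CPU-bound or GPU-bound
        bubble += start - gpu_t
        gpu_t = start + layer.gpu_cycles / f_gpu_hz
    return gpu_t, bubble


class OnlineCalibrator:
    """Scale predictions by a running average of measured/predicted (an assumed scheme)."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.scale = 1.0

    def update(self, predicted_s, measured_s):
        ratio = measured_s / predicted_s
        self.scale = (1 - self.alpha) * self.scale + self.alpha * ratio

    def apply(self, predicted_s):
        return predicted_s * self.scale


layers = [Layer(cpu_cycles=1e6, gpu_cycles=4e6), Layer(cpu_cycles=1e6, gpu_cycles=1e6)]
fast_cpu = estimate_latency(layers, f_cpu_hz=1e9, f_gpu_hz=1e9)
slow_cpu = estimate_latency(layers, f_cpu_hz=0.25e9, f_gpu_hz=1e9)
print(fast_cpu, slow_cpu)
```

In this toy case, lowering only the CPU frequency lengthens the end-to-end latency even though the GPU work is unchanged, because the slower launches delay the first kernel. That interaction is the kind of coupling a GPU-only latency model would miss.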

Best for: MLOps Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.