MLCommons Releases MLPerf Mobile v6.0 with New Generative AI Benchmarks for On-Device LLMs
Summary
MLCommons has released MLPerf Mobile v6.0, which adds generative AI benchmarks for running large language models (LLMs) on Android devices. The new tests extend the MLPerf Mobile app's existing suite, which already includes benchmarks for image generation, object detection, and super resolution. The LLM benchmarks use the Llama 3.2 1B Instruct, Llama 3.2 3B Instruct, and Llama 3.1 8B Instruct models and measure both performance and accuracy on requests drawn from the TinyMMLU and IFEval datasets. The LLM tests run on the CPU of any device with sufficient memory, and the release also supports NPU-accelerated execution of Llama 3.1 8B Instruct on Qualcomm Snapdragon 8 Elite Gen 5 SoCs.

Beyond the LLM tests, v6.0 adds support for MediaTek Dimensity 9500 Series devices and updates support for Qualcomm Snapdragon 8 Elite Gen 5 and Samsung Exynos 2600 chips. The MLPerf Mobile app is openly available on Google Play, the Apple App Store, and GitHub under the Apache 2.0 license.
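Whether a device has "sufficient memory" for the CPU path depends largely on the size of the model weights. Below is a back-of-the-envelope estimate, assuming the nominal parameter counts in the model names and a uniform bit width per weight. The helper name `weight_memory_gib` is hypothetical, and the 16/8/4-bit widths are illustrative, not the quantization MLPerf Mobile actually uses.

```python
# Hypothetical helper: estimates only the memory needed to hold model weights.
# Real on-device usage is higher (KV cache, activations, runtime overhead).
# Parameter counts are the nominal sizes from the model names, not exact counts.

MODELS = {
    "Llama 3.2 1B Instruct": 1e9,
    "Llama 3.2 3B Instruct": 3e9,
    "Llama 3.1 8B Instruct": 8e9,
}


def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Weight bytes = params * bits / 8, reported in GiB (2**30 bytes)."""
    return num_params * bits_per_weight / 8 / 2**30


for name, params in MODELS.items():
    for bits in (16, 8, 4):
        print(f"{name}: {bits}-bit weights need about {weight_memory_gib(params, bits):.2f} GiB")
```

The gap between the 1B and 8B models (roughly 8x in weight memory at the same bit width) is why the larger model is the one that benefits most from dedicated NPU support.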
Key takeaway
For machine learning engineers evaluating on-device LLM deployment, MLPerf Mobile v6.0 offers a standardized way to measure it. You can now benchmark Llama 3.x Instruct models directly on Android devices, on the CPU and, for Llama 3.1 8B Instruct, on the NPU of Qualcomm Snapdragon 8 Elite Gen 5 SoCs. The results can inform model selection and hardware choices for mobile generative AI applications. Consider adding these benchmarks to your development pipeline to track on-device inference performance and accuracy across releases.
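One way to act on the pipeline suggestion is a regression gate in CI that compares a new benchmark run against a stored baseline. This is a minimal sketch under stated assumptions: the metric names, the dictionary format, and the 5% relative tolerance are hypothetical choices for illustration, and the results are assumed to have already been exported from the benchmark app. MLPerf Mobile does not necessarily report results in this form.

```python
from typing import Dict, List

# Hypothetical metric names; only these are treated as "lower is better".
LOWER_IS_BETTER = {"ttft_s"}


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """List metrics that are worse than the baseline by more than `tolerance` (relative)."""
    failures = []
    for metric, base in baseline.items():
        if metric not in current:
            failures.append(f"{metric}: missing from current run")
            continue
        cur = current[metric]
        if metric in LOWER_IS_BETTER:
            worse = cur > base * (1 + tolerance)   # e.g. latency went up
        else:
            worse = cur < base * (1 - tolerance)   # e.g. throughput or accuracy went down
        if worse:
            failures.append(f"{metric}: baseline {base:g}, current {cur:g}")
    return failures


# Illustrative numbers, not measured results.
baseline = {"decode_tokens_per_s": 20.0, "ttft_s": 0.40, "tinymmlu_accuracy": 0.60}
current = {"decode_tokens_per_s": 18.0, "ttft_s": 0.41, "tinymmlu_accuracy": 0.60}
print(find_regressions(baseline, current))
```

In CI, the step would fail (for example, by exiting non-zero) whenever the returned list is non-empty. Declaring the direction of each metric explicitly avoids treating a latency increase as an improvement.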
Key insights
MLPerf Mobile v6.0 introduces on-device LLM benchmarks, standardizing mobile generative AI performance measurement.
Principles
- Standardized benchmarks drive mobile AI development.
- On-device LLM performance can now be measured with a standard benchmark.
- Open-source tools foster broader adoption.
Method
MLPerf Mobile v6.0 benchmarks LLMs by running Llama 3.x Instruct models on requests from the TinyMMLU and IFEval datasets and measuring performance and accuracy on the CPU and, where supported, the NPU.
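To make the method concrete, here is a minimal sketch of how responses to multiple-choice items such as those in TinyMMLU, plus per-request timing, might be scored. The `Sample` fields, the answer-extraction regex, and the choice of time to first token and aggregate decode throughput as metrics are all assumptions for illustration, not MLPerf Mobile's actual harness or official metrics. IFEval scoring, which checks verifiable instructions in the output, is not shown.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One benchmark request (hypothetical record layout)."""
    prediction: str       # raw model output
    reference: str        # gold answer letter, e.g. "B"
    first_token_s: float  # seconds from request start to first generated token
    total_s: float        # seconds from request start to last generated token
    output_tokens: int    # number of generated tokens


def extract_choice(text: str) -> Optional[str]:
    """Return the first standalone capital A-D in the output (a simplistic parser)."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None


def mc_accuracy(samples: List[Sample]) -> float:
    """Fraction of samples whose extracted choice matches the reference."""
    correct = sum(extract_choice(s.prediction) == s.reference for s in samples)
    return correct / len(samples)


def mean_ttft(samples: List[Sample]) -> float:
    """Mean time to first token, in seconds."""
    return sum(s.first_token_s for s in samples) / len(samples)


def decode_tokens_per_s(samples: List[Sample]) -> float:
    """Aggregate decode throughput: tokens after the first, over time after the first token."""
    tokens = sum(max(s.output_tokens - 1, 0) for s in samples)
    seconds = sum(s.total_s - s.first_token_s for s in samples)
    return tokens / seconds if seconds > 0 else 0.0


# Illustrative values, not measured results.
samples = [
    Sample("B", "B", first_token_s=0.5, total_s=2.5, output_tokens=41),
    Sample("The answer is C", "A", first_token_s=0.3, total_s=1.3, output_tokens=11),
]
print(mc_accuracy(samples), mean_ttft(samples), decode_tokens_per_s(samples))
```

Separating time to first token from decode throughput matters on mobile hardware because prompt processing and token generation stress the chip differently. The regex simply takes the first standalone capital A to D, so an output such as "A good choice is C" would be misparsed; a real harness would constrain or parse the output format more carefully.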
In practice
- Benchmark Llama 3.x models on Android.
- Evaluate NPU acceleration of Llama 3.1 8B Instruct on Snapdragon 8 Elite Gen 5.
- Access the open-source MLPerf Mobile code on GitHub.
Topics
- MLPerf Mobile
- On-device LLMs
- Generative AI Benchmarks
- Android AI Inference
- Qualcomm Snapdragon
- Llama 3 Models
Best for: AI Engineer, NLP Engineer, AI Hardware Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by MLCommons.