MLCommons Releases MLPerf Mobile v6.0 with New Generative AI Benchmarks for On-Device LLMs

· Source: MLCommons · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Internet of Things (IoT) & Connected Devices · Depth: Intermediate, quick

Summary

MLCommons has released MLPerf Mobile v6.0, which adds generative AI benchmark tests for running large language models (LLMs) on Android devices. The update extends the MLPerf Mobile app's existing suite, which already covers image generation, object detection, and super resolution. The new LLM benchmarks use the Llama 3.2 1B Instruct, Llama 3.2 3B Instruct, and Llama 3.1 8B Instruct models and evaluate their performance and accuracy on requests drawn from the TinyMMLU and IFEval datasets. The LLM tests can run on the CPU of any device with sufficient memory, and the release also supports NPU-accelerated execution of the Llama 3.1 8B Instruct model on Qualcomm Snapdragon 8 Elite Gen 5 SoCs. In addition, v6.0 adds support for MediaTek Dimensity 9500 Series devices and updates support for Qualcomm Snapdragon 8 Elite Gen 5 and Samsung Exynos 2600 chips. The MLPerf Mobile app is openly available on Google Play, the Apple App Store, and GitHub under the Apache 2.0 license.

Key takeaway

For Machine Learning Engineers evaluating on-device LLM deployment, MLPerf Mobile v6.0 offers a standardized way to benchmark Llama 3.x Instruct models on Android. The tests run on CPU and, for Llama 3.1 8B Instruct, on the NPU of Qualcomm Snapdragon 8 Elite Gen 5 SoCs, so the results can inform model selection and hardware optimization for mobile generative AI applications. Consider integrating these benchmarks into your development pipeline to validate on-device inference efficiency.

Key insights

MLPerf Mobile v6.0 introduces on-device LLM benchmarks, standardizing mobile generative AI performance measurement.

Method

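MLPerf Mobile v6.0 benchmarks LLMs by running Llama 3.2 1B, Llama 3.2 3B, and Llama 3.1 8B Instruct models on requests from the TinyMMLU and IFEval datasets and scoring both performance and accuracy. The tests execute on CPU, or on the NPU for Llama 3.1 8B Instruct on Qualcomm Snapdragon 8 Elite Gen 5 SoCs.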
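To make "performance and accuracy" concrete, here is a minimal sketch of the general idea: time each generation and compare the answer with a reference label. It is not the MLPerf Mobile implementation. The names `evaluate`, `RunResult`, the `generate` callable, and the sample format (a prompt paired with a single-letter answer, in the style of a multiple-choice set) are all assumptions made for illustration, and the throughput metric shown may differ from the metrics the app reports.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class RunResult:
    accuracy: float           # fraction of prompts answered correctly
    tokens_per_second: float  # generated tokens divided by generation time


def evaluate(
    generate: Callable[[str], List[str]],
    samples: Iterable[Tuple[str, str]],
    clock: Callable[[], float] = time.perf_counter,
) -> RunResult:
    """Score a hypothetical `generate` function on (prompt, expected_letter) pairs.

    `generate` stands in for whatever on-device runtime is under test and is
    assumed to return the generated tokens as a list of strings.
    """
    correct = 0
    total = 0
    total_tokens = 0
    elapsed = 0.0

    for prompt, expected in samples:
        start = clock()
        tokens = generate(prompt)
        elapsed += clock() - start  # time only the generation call

        total_tokens += len(tokens)
        # Treat the first character of the output as the chosen option.
        answer = "".join(tokens).strip().upper()[:1]
        correct += int(answer == expected.upper())
        total += 1

    if total == 0:
        raise ValueError("samples must contain at least one item")

    throughput = total_tokens / elapsed if elapsed > 0 else 0.0
    return RunResult(accuracy=correct / total, tokens_per_second=throughput)
```

The `clock` parameter is injectable so the function can be checked with a deterministic timer. An instruction-following set such as IFEval would need a different scoring rule than the first-letter match used here.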

Best for: AI Engineer, NLP Engineer, AI Hardware Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by MLCommons.