TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google

2026-05-03 · Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Internet of Things (IoT) & Connected Devices, Robotics & Autonomous Systems · Depth: Intermediate, extended

Summary

Google AI Edge is actively developing and deploying AI models on edge devices, focusing on tiny LLMs and agent skills. Cormbrick, a tech lead at Google, detailed the benefits of edge AI, including reduced latency, enhanced privacy, offline functionality, and cost savings. The team utilizes tools like MediaPipe, LiteRT-LM, and LiteRT (formerly TensorFlow Lite) to enable cross-platform deployment of models on Android, iOS, macOS, Linux, Windows, web, and IoT devices. Recently, Gemma 4 models (E2B and E4B, optimized for 2 billion and 4 billion parameters respectively) were launched, featuring built-in function calling and thinking capabilities, which enable on-device agent skills. These models support multimodal inputs (audio, image, text) and are released under an Apache 2.0 license. Google AI Gallery, an open-source app, showcases these agent skills, allowing dynamic tool access and integration of domain-specific knowledge bases. The presentation also covered the workflow for tiny models (under 1 billion parameters) for in-app deployment, emphasizing fine-tuning with synthetic data and NPU optimization.

Key takeaway

For AI Architects and NLP Engineers evaluating on-device AI solutions, the advancements in Gemma 4 and tiny LLMs present compelling opportunities. Your teams should consider leveraging the Apache 2.0 licensed Gemma 4 models for system-level GenAI and exploring fine-tuned tiny models for in-app features requiring privacy, low latency, or offline capability. Utilize the Google AI Gallery app and its open-source infrastructure for prototyping and deploying custom agent skills, especially for robotics or IoT platforms where hot-swapping LoRAs can optimize memory usage.

Key insights

Edge AI enables low-latency, private, and cost-effective on-device model deployment, leveraging tiny LLMs and agent skills.

Principles

Prioritize token efficiency for edge models.
Modularity in tiny models optimizes resource use.
Fine-tuning significantly boosts tiny model reliability.

Method

Agent skills are built with on-demand loaded instructions and dynamic tool access, using a system prompt, skill descriptions, and JavaScript for execution. Tiny models are fine-tuned with synthetic data generated by larger cloud LLMs.

In practice

Use Gemma 4 E2B/E4B for on-device agent skills.
Explore Google AI Gallery for skill prototyping.
Fine-tune tiny LLMs for specific in-app tasks.

Topics

Google AI Edge
LiteRT-LM Runtime
Gemma Models
Edge AI Deployment
Agent Skills

Best for: AI Architect, NLP Engineer, CTO, AI Engineer, Machine Learning Engineer, AI Hardware Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.