Scaling Down Is the New Scaling Up
Summary
Vikas Chandra, senior director at Meta Reality Labs, presented work at the Embedded Vision Summit on enabling advanced perception and agentic AI on resource-constrained personal devices such as smartphones and wearables. Chandra emphasized "scaling down" models: designing for the target hardware rather than simply increasing model size. Key breakthroughs include quantization techniques such as SpinQuant (ICLR 2025), reported to allow quantization below 4 bits without accuracy loss, and architectural work such as MobileLLM (ICML 2024), which found that tall and narrow models perform better at sub-1B parameter counts. Runtime optimizations such as speculative decoding reduce latency by 2-3x, which is crucial for responsive agents. On the vision side, EdgeTAM (running at 16 fps on iPhone 15 Pro Max) and LongVU (ICML 2025) enable efficient video understanding on edge processors, and DepthLM adds 3D spatial awareness. Together, these efforts aim to create always-present, private personal agents.
Key takeaway
If you are an AI engineer developing on-device agents, prioritize hardware-aware model design from the outset. Shift your focus from maximizing model size to meeting strict compute, memory, and power budgets. Techniques such as sub-4-bit quantization, tall and narrow architectures, and speculative decoding help deliver responsive, efficient performance. This approach enables persistent, private agentic AI experiences on wearables and smartphones, moving beyond cloud-dependent chatbots.
Key insights
For on-device agentic AI, hardware-aware design that runs efficiently on the device matters more than raw model size.
Principles
- Within a fixed memory budget, a bigger model at lower precision can outperform a smaller model at higher precision.
- Tall and narrow model architectures are more efficient for on-device LLMs.
- Persistent context on edge devices requires extreme efficiency.
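The arithmetic behind the first two principles can be sketched in a few lines of Python. This is a back-of-envelope illustration, not anything from the talk: the 512 MiB budget is hypothetical, `max_params` ignores quantization metadata such as scales, and `decoder_params` uses the common approximation of 12 x width^2 weights per decoder layer (assuming a 4x MLP expansion, ignoring embeddings and norms).

```python
def max_params(budget_bytes: int, bits_per_weight: int) -> int:
    """Largest weight count whose packed storage fits in budget_bytes."""
    return (budget_bytes * 8) // bits_per_weight


def decoder_params(layers: int, width: int) -> int:
    """Rough decoder-only parameter count, ignoring embeddings and norms.

    Assumes 4*width^2 attention weights plus 8*width^2 MLP weights per
    layer (a 4x MLP expansion), a common back-of-envelope approximation.
    """
    return 12 * layers * width * width


if __name__ == "__main__":
    budget = 512 * 1024 * 1024  # hypothetical 512 MiB weight budget
    for bits in (16, 8, 4, 2):
        fit = max_params(budget, bits)
        print(f"{bits:>2}-bit weights: {fit / 1e6:,.0f}M parameters fit")

    # Two hypothetical shapes with the same approximate parameter count:
    # wide and shallow versus tall and narrow.
    print("12 layers x width 1024:", decoder_params(12, 1024))
    print("48 layers x width  512:", decoder_params(48, 512))
```

Halving the bit-width doubles the number of weights that fit in the same memory, which is what makes the "bigger model, lower precision" trade possible. The parameter count alone says nothing about which shape is more accurate; that part is the empirical finding the summary attributes to MobileLLM.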
Method
Optimize on-device AI across four areas: quantization (e.g., learned smoothing for sub-4-bit precision), architecture (e.g., tall/narrow models, hardware-in-the-loop training), runtime (e.g., speculative decoding), and efficient multimodal vision (e.g., redundant frame reduction).
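To make the quantization area concrete, the sketch below applies plain round-to-nearest symmetric quantization to a small weight vector. It is not SpinQuant or any learned-smoothing method from the talk; it only illustrates, under assumed toy values, why error grows as the bit-width shrinks and why a single outlier hurts, which is the problem such methods are designed to address. All function names and weights are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def mean_abs_error(values, bits):
    """Average absolute error after a quantize/dequantize round trip."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


if __name__ == "__main__":
    # Hypothetical weights; the second list adds one large outlier.
    weights = [0.12, -0.07, 0.31, -0.25, 0.05, 0.18, -0.14, 0.02]
    with_outlier = weights + [4.0]

    for bits in (8, 4, 3):
        print(f"{bits}-bit error:", mean_abs_error(weights, bits))
    print("4-bit error with outlier:", mean_abs_error(with_outlier, 4))
```

The outlier stretches the shared scale, so most of the small weights collapse to the same few codes. Methods that redistribute or smooth outliers before quantizing aim to keep that scale tight at low bit-widths.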
In practice
- Explore sub-4-bit quantization with learned smoothing for memory-constrained models.
- Design LLM architectures with more layers and smaller widths for edge deployment.
- Implement speculative decoding to reduce agent response latency by 2-3x.
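As an illustration of the last item, here is a minimal sketch of greedy speculative decoding with toy stand-in models. The `target_model` and `draft_model` functions are hypothetical deterministic rules, not real networks, and the pass counter assumes that the target verifies a whole draft in one batched forward pass, as real implementations do. The output always matches plain greedy decoding with the target; only the number of target passes changes.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call (one forward pass) per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, target_passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position. A real system does
        #    this in one batched forward pass, so it counts as one pass.
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # the target's token is always kept
            if expected != guess:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every guess matched: the same pass yields one bonus token.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens, target_passes


def target_model(seq: List[int]) -> int:
    """Toy 'large' model: a fixed deterministic rule (hypothetical)."""
    return (seq[-1] * 3 + len(seq)) % 11


def draft_model(seq: List[int]) -> int:
    """Toy 'small' model: agrees with the target except at every third length."""
    guess = target_model(seq)
    return guess if len(seq) % 3 else (guess + 1) % 11


if __name__ == "__main__":
    prompt, max_new = [1], 20
    baseline = greedy_decode(target_model, prompt, max_new)
    tokens, passes = speculative_decode(target_model, draft_model, prompt, max_new)
    print("same output as greedy:", tokens == baseline)
    print("target passes:", passes, "vs baseline:", max_new)
```

Fewer target passes is only a proxy for latency. The real speedup depends on how cheap the draft model is and how often it agrees with the target, so the 2-3x figure quoted above should be measured on the actual device rather than inferred from a toy like this one.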
Topics
- Agentic AI
- Edge AI
- Model Quantization
- LLM Architecture
- Speculative Decoding
- Embedded Vision
Best for: AI Architect, Computer Vision Engineer, AI Scientist, AI Engineer, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Big Data & AI News - EE Times.