Liquid AI Releases LFM2.5-8B-A1B: An On-Device MoE Model With 8.3B Total and 1.5B Active Parameters

2026-05-28 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, quick

Summary

Liquid AI released LFM2.5-8B-A1B, an on-device Mixture-of-Experts (MoE) model that activates only 1.5B of its 8.3B total parameters per token. This version is reasoning-only, producing an explicit chain of thought. Relative to the previous version, it sharply reduces hallucination, with the Non-Hallucination Rate rising from 7.46 to 63.47. It also improves on other benchmarks: IFEval from 79.44 to 91.84, MATH500 from 74.80 to 88.76, and Tau² Telecom from 13.60 to 88.07. The model runs efficiently across hardware: 253 tok/s on an Apple M5 Max while using under 6 GB of memory, ~30 tok/s on a phone, and 18.5K tok/s on a single H100. It emphasizes tool calling, illustrated by the LocalCowork demo, which runs 67 tools across 13 MCP servers on a single laptop with no cloud dependencies. The weights are open, with day-one support for llama.cpp, MLX, vLLM, and SGLang.
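
The memory and speed figures follow from the MoE split between total and active parameters. Every expert must stay resident, so weight memory scales with the 8.3B total, while per-token compute scales with the 1.5B active. The sketch below is back-of-the-envelope arithmetic, assuming the common rule of thumb of about 2 FLOPs per active parameter per token and ignoring KV cache, activations, and router overhead. The source does not say which quantization backs the under-6 GB figure, so the bit widths are illustrative.

```python
TOTAL_PARAMS = 8.3e9   # every expert must stay resident in memory
ACTIVE_PARAMS = 1.5e9  # parameters used for each token


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return params * bits_per_weight / 8 / 1e9


def flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2 * active_params


if __name__ == "__main__":
    # Illustrative bit widths; the source does not state the quantization used.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_gb(TOTAL_PARAMS, bits):.2f} GB")
    ratio = flops_per_token(TOTAL_PARAMS) / flops_per_token(ACTIVE_PARAMS)
    print(f"per-token compute, dense 8.3B vs MoE 1.5B active: {ratio:.1f}x")
```

At 4 bits the weights take about 4.15 GB, which fits under 6 GB, while 8-bit weights (8.3 GB) would not. This suggests, but does not confirm, a roughly 4-bit build. Per-token compute is about 5.5x lower than a dense model of the same total size would need.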

Key takeaway

If you are an AI engineer building local agents or on-device AI applications, consider LFM2.5-8B-A1B. Its MoE architecture keeps per-token compute low. Combined with much lower hallucination and strong tool calling, this makes it a compelling option for private, high-performance local AI. Evaluate it for agents that run entirely on-device, minimizing cloud dependencies and data egress.

Key insights

LFM2.5-8B-A1B is an on-device MoE model that sharply reduces hallucination and supports complex tool calling.

Principles

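MoE architecture makes reasoning tokens cheap: only 1.5B of 8.3B parameters are active per token, so each chain-of-thought token costs roughly as much compute as a 1.5B dense model.
A targeted RL reward can train a model to abstain on questions whose answers it does not know.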
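
The cheap-reasoning principle rests on top-k routing. A small router scores every expert, but only the k best-scoring experts run for each token. The toy sketch below illustrates generic top-k MoE routing in pure Python. The expert count, top-k value, dimensions, and linear "experts" are illustrative assumptions, not LFM2.5-8B-A1B's actual configuration, which the source does not describe.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w: List[Vector], x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def make_expert(rng: random.Random, dim: int) -> Callable[[Vector], Vector]:
    """Toy expert: a fixed random linear map standing in for a feed-forward block."""
    w = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    return lambda x: matvec(w, x)


def moe_forward(x: Vector, router_w: List[Vector],
                experts: List[Callable[[Vector], Vector]],
                top_k: int, calls: List[int]) -> Vector:
    logits = matvec(router_w, x)  # one routing score per expert
    # Only the top_k highest-scoring experts run; the others are skipped entirely.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])  # renormalize over the chosen experts
    out = [0.0] * len(x)
    for gate, i in zip(gates, chosen):
        calls[i] += 1  # count how often each expert is actually evaluated
        out = [o + gate * y for o, y in zip(out, experts[i](x))]
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_experts, top_k, n_tokens = 4, 8, 2, 10  # illustrative sizes only
    experts = [make_expert(rng, dim) for _ in range(n_experts)]
    router_w = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_experts)]
    calls = [0] * n_experts
    for _ in range(n_tokens):
        token = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        moe_forward(token, router_w, experts, top_k, calls)
    print(f"expert evaluations: {sum(calls)} of {n_tokens * n_experts} possible")
```

With 8 experts and top-2 routing, 10 tokens trigger 20 expert evaluations instead of 80. This is the same basic mechanism by which an MoE model touches only a fraction of its parameters per token.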

Method

A targeted avg@k RL reward trains the model to abstain on questions beyond its knowledge, significantly reducing hallucination rates.
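
The source does not publish the reward's exact form. The sketch below shows one plausible reading of a targeted avg@k reward. It is a hypothesis, not Liquid AI's recipe. The approach is to sample k answers per question and treat avg@k (the fraction of the k answers that are correct) as an estimate of whether the model knows the answer. Abstention is rewarded only where that estimate is low. The `ABSTAIN` string, the 0.2 threshold, and the reward values are all hypothetical.

```python
from typing import List

ABSTAIN = "I don't know"  # hypothetical abstention string


def avg_at_k(samples: List[str], reference: str) -> float:
    """avg@k: fraction of the k sampled answers that match the reference."""
    if not samples:
        raise ValueError("need k >= 1 samples")
    return sum(s.strip() == reference for s in samples) / len(samples)


def abstention_reward(answer: str, reference: str, samples: List[str],
                      threshold: float = 0.2) -> float:
    """Hypothetical reward that pays for abstaining only on questions the model rarely gets right."""
    known = avg_at_k(samples, reference) >= threshold
    if answer.strip() == ABSTAIN:
        # Abstaining earns partial credit on "unknown" questions, nothing on known ones.
        return 0.0 if known else 0.5
    if answer.strip() == reference:
        return 1.0
    return -1.0  # a confident wrong answer (a hallucination) is penalized


if __name__ == "__main__":
    hard = ["Lyon", "Nice", "Lille", "Nice"]    # k = 4 samples, none correct
    easy = ["Paris", "Paris", "Lyon", "Paris"]  # 3 of 4 correct
    print(abstention_reward(ABSTAIN, "Paris", hard))  # 0.5: abstaining is rewarded
    print(abstention_reward(ABSTAIN, "Paris", easy))  # 0.0: the model should have answered
    print(abstention_reward("Nice", "Paris", hard))   # -1.0: hallucination
```

The design choice is that abstaining never beats a correct answer, so the model is pushed to answer what it knows and to decline what it does not. A real pipeline would compute avg@k from the policy's own rollouts and use a proper answer grader instead of exact string matching.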

In practice

Run 67 tools across 13 MCP servers on a single laptop, as in the LocalCowork demo.
Deploy with llama.cpp, MLX, vLLM, or SGLang.
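
Running dozens of tools across several MCP (Model Context Protocol) servers comes down to routing each model-emitted tool call to the server that owns that tool. The sketch below is a hypothetical in-process router, not the LocalCowork code or the MCP SDK. Real MCP servers communicate over JSON-RPC via stdio or HTTP, whereas here each "server" is just a Python dict of functions. The JSON tool-call shape `{"name": ..., "arguments": {...}}` is also an assumption about the model's output format.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

ToolFn = Callable[..., Any]


@dataclass
class LocalServer:
    """Stand-in for one local MCP server: a named group of tools (hypothetical)."""
    name: str
    tools: Dict[str, ToolFn] = field(default_factory=dict)


class ToolRouter:
    """Maps each tool name to the server that owns it and dispatches model tool calls."""

    def __init__(self) -> None:
        self._owner: Dict[str, LocalServer] = {}

    def register(self, server: LocalServer) -> None:
        for tool_name in server.tools:
            if tool_name in self._owner:  # tool names must be unique across servers
                raise ValueError(f"duplicate tool name: {tool_name}")
            self._owner[tool_name] = server

    def dispatch(self, model_output: str) -> Any:
        # Assumes the model emitted a JSON object like {"name": ..., "arguments": {...}}.
        call = json.loads(model_output)
        server = self._owner.get(call["name"])
        if server is None:
            raise KeyError(f"unknown tool: {call['name']}")
        return server.tools[call["name"]](**call.get("arguments", {}))


if __name__ == "__main__":
    router = ToolRouter()
    router.register(LocalServer("files", {"count_words": lambda text: len(text.split())}))
    router.register(LocalServer("calc", {"add": lambda a, b: a + b}))
    print(router.dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
```

Keeping a flat name-to-server map makes dispatch a single lookup no matter how many servers are registered, and rejecting duplicate names avoids ambiguous routing when many tools share one laptop.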

Topics

Mixture-of-Experts
On-device AI
Hallucination Reduction
Tool Calling
Local Agents
Model Inference

Best for: AI Architect, NLP Engineer, Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.