I Crammed RAG, a Vector Database, and a Gemma LLM into a Mobile App. Here’s What Happened.
Summary
The "Smart Notes" mobile application combines Retrieval-Augmented Generation (RAG), a local vector database, and a Gemma Large Language Model (LLM), and runs entirely on-device with no cloud dependency once the model has been downloaded. The Flutter app uses Gemma 4 E4B IT (~4.3 GB) for generative inference and EmbeddingGemma for 768-dimensional vector embeddings. Notes pass through a four-step pipeline: chunking into ~480-token segments, embedding, prepending context such as titles and timestamps, and storing in ObjectBox with an HNSW index. For queries, the app uses hybrid retrieval: dense cosine similarity and BM25 re-ranking are fused as a weighted sum (0.7 dense + 0.3 BM25), and Maximal Marginal Relevance (MMR) then selects a diverse set of chunks. A graph view visualizes semantic connections between notes, linking notes whose mean-vector similarity reaches a threshold of 0.35. The project shows that on-device LLMs are viable for privacy-preserving applications.
Key takeaway
For mobile developers building privacy-focused AI applications, this project shows that fully on-device RAG with an LLM is achievable. Consider a local inference engine such as `flutter_gemma` and an embedded vector database such as ObjectBox so that user data never leaves the device. Beyond the privacy guarantee, keeping embeddings on the device also enables features such as the semantic graph view.
Key insights
On-device LLMs and RAG are viable for privacy-preserving mobile applications, eliminating cloud dependencies.
Principles
- On-device AI enables true data privacy.
- Hybrid retrieval improves semantic and keyword search.
- Strict constraints can drive innovative solutions.
Method
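The app chunks notes into ~480-token segments, embeds them as 768-dimensional vectors, prepends context such as titles and timestamps, and stores them in ObjectBox with an HNSW index. Queries use hybrid retrieval (dense similarity and BM25 scores fused with 0.7/0.3 weights), followed by MMR so that the context passed to the model is diverse rather than repetitive.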
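The query-side steps can be illustrated with a minimal Python sketch. It is not the app's code (the app is written in Flutter), and several details are assumptions made for illustration: the min-max normalization applied before fusion, the MMR trade-off value `lam=0.5`, and the function names `min_max`, `fuse`, and `mmr`. Only the 0.7/0.3 weights and the use of cosine similarity come from the summary above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def min_max(scores):
    """Rescale scores to [0, 1] so dense and BM25 scores are comparable.

    Assumption: the source does not say how scores are normalized.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(dense, bm25, w_dense=0.7, w_bm25=0.3):
    """Weighted sum of normalized dense and BM25 scores (0.7 / 0.3)."""
    d, b = min_max(dense), min_max(bm25)
    return [w_dense * x + w_bm25 * y for x, y in zip(d, b)]


def mmr(relevance, vectors, k, lam=0.5):
    """Greedy Maximal Marginal Relevance selection.

    Each step picks the candidate with the best balance between its
    relevance and its similarity to chunks that were already selected.
    lam=0.5 is an illustrative value, not one taken from the source.
    """
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy candidates: chunks 0 and 1 are near-duplicates, chunk 2 is different.
dense = [0.90, 0.88, 0.20]
bm25 = [2.0, 1.9, 5.0]
vectors = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]

fused = fuse(dense, bm25)
picked = mmr(fused, vectors, k=2)
```

With these toy values, chunk 1 has the second-highest fused score but is nearly identical to chunk 0, so MMR passes over it in favor of chunk 2. That is the diversity effect the method relies on.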
In practice
- Use LiteRT bundles for on-device LLMs.
- Implement HNSW for fast vector search.
- Tune similarity thresholds for graph views.
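The graph-view threshold can be sketched in Python as follows. This is an illustration under stated assumptions, not the app's implementation: it assumes a note's "mean vector" is the element-wise average of its chunk embeddings, that similarity is cosine similarity, and that an edge is drawn when the similarity is at or above the threshold. The names `mean_vector` and `build_edges` are hypothetical; only the 0.35 default comes from the summary.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(chunk_vectors):
    """Element-wise average of a note's chunk embeddings (assumed definition)."""
    n = len(chunk_vectors)
    return [sum(column) / n for column in zip(*chunk_vectors)]


def build_edges(notes, threshold=0.35):
    """Return (note_a, note_b, similarity) for each pair at or above threshold.

    `notes` maps a note id to the list of its chunk embeddings.
    """
    means = {note_id: mean_vector(vs) for note_id, vs in notes.items()}
    edges = []
    for a, b in combinations(sorted(means), 2):
        sim = cosine(means[a], means[b])
        if sim >= threshold:
            edges.append((a, b, sim))
    return edges


# Toy 2-dimensional embeddings; real EmbeddingGemma vectors have 768 dimensions.
notes = {
    "a": [[1.0, 0.0], [1.0, 0.0]],
    "b": [[1.0, 0.2]],
    "c": [[0.0, 1.0]],
}
edges = build_edges(notes)
```

Raising the threshold removes weaker edges and leaves a sparser graph, while lowering it connects more notes, which is why the value is worth tuning per collection. Note that this pairwise comparison grows quadratically with the number of notes.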
Topics
- On-device AI
- Retrieval-Augmented Generation
- Gemma LLM
- Vector Databases
- Flutter Development
- Data Privacy
Best for: AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.