I Crammed RAG, a Vector Database, and a Gemma LLM into a Mobile App. Here’s What Happened.

2026-05-20 · Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

The "Smart Notes" mobile application integrates Retrieval Augmented Generation (RAG), a local vector database, and a Gemma Large Language Model (LLM) to operate entirely on-device without cloud dependencies after initial model download. This Flutter app utilizes Gemma 4 E4B IT (~4.3 GB) for generative inference and EmbeddingGemma for 768-dimensional vector embeddings. Notes undergo a four-step pipeline: chunking into ~480-token segments, embedding, prepending context like titles and timestamps, and storing in ObjectBox with an HNSW index. For queries, the app employs a hybrid retrieval system combining dense cosine similarity with BM25 re-ranking, fused with a 0.7 dense + 0.3 BM25 weighted sum, and Maximal Marginal Relevance (MMR) for diversity. A unique graph view visualizes semantic connections between notes using mean vector similarities (threshold 0.35). The project highlights the viability of on-device LLMs for privacy-preserving applications.

Key takeaway

For mobile developers building privacy-focused AI applications, this project demonstrates that fully on-device RAG with LLMs is achievable and robust. You should consider local inference engines like `flutter_gemma` and embedded vector databases such as ObjectBox to ensure user data never leaves the device. This approach offers strong privacy guarantees and enables unique features like semantic graph views, fostering innovation beyond cloud-dependent solutions.

Key insights

On-device LLMs and RAG are viable for privacy-preserving mobile applications, eliminating cloud dependencies.

Principles

On-device AI enables true data privacy.
Hybrid retrieval improves semantic and keyword search.
Strict constraints can drive innovative solutions.

Method

The app chunks notes (~480 tokens), embeds them into 768-dim vectors, prepends context, and stores them in ObjectBox with HNSW. Queries use hybrid retrieval (dense + BM25, 0.7/0.3 fusion) and MMR for diverse context.

In practice

Use LiteRT bundles for on-device LLMs.
Implement HNSW for fast vector search.
Tune similarity thresholds for graph views.

Topics

On-device AI
Retrieval-Augmented Generation
Gemma LLM
Vector Databases
Flutter Development
Data Privacy

Code references

Afridee/smart_notes

Best for: AI Engineer, Machine Learning Engineer, Software Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.