From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

2026-05-18 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

EPIC (Efficient Preference-aligned Index Construction) is a method that lets on-device Large Language Model (LLM) agents manage personal context under tight memory constraints. It tackles the problem of deciding what to store so that retrieval aligns with user preferences, which it treats as a compact and stable form of personal context. EPIC applies these preferences throughout the Retrieval-Augmented Generation (RAG) pipeline: it retains only preference-relevant data and aligns retrieval accordingly. Across benchmarks covering conversations, debates, explanations, and recommendations, EPIC reduces indexing memory by a factor of 2,404, improves preference-following accuracy by 20.17 percentage points, and lowers retrieval latency by a factor of 33.33 relative to the best baseline. In an on-device experiment, EPIC kept its memory footprint under 1 MB at 29.35 ms/query latency during streaming updates.

Key takeaway

For NLP engineers building on-device LLM agents that need personalized context, a preference-aligned indexing approach like EPIC can drastically reduce memory footprint and retrieval latency. Consider integrating user preferences directly into your RAG pipeline to improve both efficiency and how closely responses follow those preferences, especially in privacy-sensitive applications.

Key insights

On-device LLM agents can use preference-aligned indexing to optimize memory and retrieval for personal context.

Principles

User preferences are stable personal context.
Integrate preferences throughout the RAG pipeline.

Method

EPIC selectively retains preference-relevant data from raw input and steers retrieval toward contexts that match the user's preferences, optimizing on-device RAG performance.
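
To make the two steps concrete, here is a minimal sketch of preference-filtered indexing followed by preference-weighted retrieval. It is not EPIC's actual implementation: the `PreferenceIndex` class, the bag-of-words `embed` function (a toy stand-in for a real embedding model), the `keep_threshold` of 0.15, and the `pref_weight` of 0.5 are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PreferenceIndex:
    """Hypothetical index that stores only chunks related to stated preferences."""

    def __init__(self, preferences, keep_threshold=0.15):
        self.pref_vecs = [embed(p) for p in preferences]
        self.keep_threshold = keep_threshold  # assumed cutoff, not from the paper
        self.entries = []  # list of (text, vector, preference score)

    def _pref_score(self, vec):
        # Score a chunk by its best match against any single preference.
        return max((cosine(vec, p) for p in self.pref_vecs), default=0.0)

    def add(self, chunk):
        """Step 1: keep a chunk only if it is preference-relevant."""
        vec = embed(chunk)
        score = self._pref_score(vec)
        if score < self.keep_threshold:
            return False  # dropped, so it never consumes index memory
        self.entries.append((chunk, vec, score))
        return True

    def search(self, query, k=1, pref_weight=0.5):
        """Step 2: rank by a blend of query similarity and preference score."""
        q = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda e: (1 - pref_weight) * cosine(q, e[1]) + pref_weight * e[2],
            reverse=True,
        )
        return [text for text, _, _ in ranked[:k]]


index = PreferenceIndex(["I prefer vegetarian food", "I like quiet places"])
stream = [
    "The new vegetarian ramen place downtown has great food",
    "Traffic on the highway was heavy this morning",
    "That steakhouse is loud but popular",
    "A quiet cafe near the park serves coffee",
]
kept = [index.add(chunk) for chunk in stream]  # [True, False, False, True]
top = index.search("where can I find food", k=1)
```

In this toy run, only the two chunks that overlap with a stated preference enter the index, and the food query returns the vegetarian ramen chunk first. A real system would replace `embed` with a learned encoder and would need a principled way to choose the threshold and weight.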

In practice

Reduce indexing memory by a factor of 2,404.
Improve preference-following accuracy by 20.17 percentage points.
Achieve 29.35 ms/query latency on-device.
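
The headline figures mix fold reductions with a percentage-point gain, which are easy to misread. The short sketch below converts the reported factors into fractions saved and shows how a percentage-point gain differs from a relative percentage gain. The function names are hypothetical, and the 50% baseline accuracy is an invented placeholder used only to illustrate the arithmetic; the summary does not state the baseline's accuracy.

```python
def fraction_saved(fold):
    """An N-fold reduction leaves 1/N of the original, so 1 - 1/N is saved."""
    return 1.0 - 1.0 / fold


def after_point_gain(baseline_pct, points):
    """Adding percentage points shifts the accuracy value directly."""
    return baseline_pct + points


def relative_gain_pct(baseline_pct, points):
    """The same gain expressed as a percentage of the baseline."""
    return 100.0 * points / baseline_pct


memory_saved = fraction_saved(2404)    # about 0.9996 of indexing memory
latency_saved = fraction_saved(33.33)  # about 0.97 of retrieval latency

# Hypothetical baseline accuracy of 50%, chosen only for illustration.
new_accuracy = after_point_gain(50.0, 20.17)  # about 70.17 (percent)
relative = relative_gain_pct(50.0, 20.17)     # about 40.34 (percent)
```

Under that assumed baseline, a gain of 20.17 percentage points would correspond to a relative improvement of roughly 40%, which is why the two units should not be used interchangeably.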

Topics

On-Device RAG
Preference-Aligned Memory
EPIC Index Construction
Large Language Models
Memory Efficiency

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.