RCEM: Robust Conversational Search EMbedder in Distributional Shift
Summary
RCEM is a conversational dense retrieval model for Retrieval-Augmented Generation (RAG) systems that distills the query reformulation capability of a Large Language Model (LLM) directly into the embedding model. Retrieval becomes context-aware without an explicit query-rewriting step at inference, which reduces latency and computational cost. Unlike prior methods that learn direct conversation-to-document matching, RCEM aligns conversational-query embeddings with rewritten-query embeddings, which improves robustness under distributional shift. Training does not require expensive conversational query-to-document relevance mappings. Experiments on QReCC, TopiOCQA, and TREC CAsT show that RCEM consistently outperforms strong baselines, with up to 20% improvement in Recall@10 under distributional shift. RCEM also preserves the base embedding model's original retrieval functionality, so a single model can handle both standalone and conversational queries against existing document indexes.
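For readers unfamiliar with the metric, Recall@10 is the fraction of a query's relevant documents that appear in the top 10 retrieved results. The function below is a minimal sketch of that definition; the name `recall_at_k` and the document ids are illustrative and not taken from the paper.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents found in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # convention chosen here: no relevant docs means zero recall
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


# Toy example with hypothetical document ids.
ranked = ["d3", "d1", "d7", "d9"]
relevant = {"d1", "d9"}
score_at_2 = recall_at_k(ranked, relevant, k=2)    # only d1 is in the top 2
score_at_10 = recall_at_k(ranked, relevant, k=10)  # both are in the top 10
```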
Key takeaway
For Machine Learning Engineers building RAG systems, RCEM adds robust conversational search without an explicit LLM-based query-rewriting step at inference, which lowers latency and computational overhead. The reported gains are largest under distributional shift, and training does not require expensive, high-quality conversational query-to-document relevance data.
Key insights
RCEM distills LLM query rewriting into an embedder for robust, low-latency conversational search without explicit rewriting.
Principles
- Align conversational inputs to rewritten query embeddings.
- Preserve the original embedding space for compatibility.
- Avoid direct conversation-to-document relevance supervision.
Method
RCEM trains an embedder $F_{\theta}$ on top of a frozen base embedding model $G$, using LoRA and an MLP. It minimizes a combined point and pair loss so that conversational inputs map to the embeddings of their LLM-rewritten queries while the original embedding space is maintained.
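This summary does not give the exact form of the two loss terms, so the following is only a sketch under stated assumptions. It takes the point loss to be the squared distance between each conversational embedding and the frozen base embedding of its rewritten query, and the pair loss to be the mismatch in within-batch pairwise similarities. The function names, the `pair_weight` parameter, and the toy vectors are all hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def point_loss(student, teacher):
    """ASSUMED form: mean squared Euclidean distance between matched embeddings."""
    total = 0.0
    for s, t in zip(student, teacher):
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total / len(student)


def pair_loss(student, teacher):
    """ASSUMED form: mean squared gap between within-batch pairwise dot products."""
    n = len(student)
    total = 0.0
    for i in range(n):
        for j in range(n):
            gap = dot(student[i], student[j]) - dot(teacher[i], teacher[j])
            total += gap ** 2
    return total / (n * n)


def combined_loss(student, teacher, pair_weight=1.0):
    """Weighted sum of the two terms; the weighting scheme is an assumption."""
    return point_loss(student, teacher) + pair_weight * pair_loss(student, teacher)


# student: outputs of the trainable embedder on conversational inputs (toy values)
# teacher: frozen base embeddings of the LLM-rewritten queries (toy values)
student = [[1.0, 0.0]]
teacher = [[0.0, 0.0]]
loss = combined_loss(student, teacher, pair_weight=0.5)
```

In an actual training loop these quantities would be computed on tensors with gradients flowing only into the LoRA and MLP parameters; the plain-list version here is only meant to make the two terms concrete.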
In practice
- Integrate conversational search into RAG systems.
- Use a single model for all query types.
- Avoid rebuilding existing document indexes.
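The points above rest on the conversational and standalone query embeddings living in the same space as the existing document vectors. The sketch below illustrates the idea with a toy in-memory index and hand-written vectors standing in for embedder outputs; `top_k`, the document ids, and all vector values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def top_k(query_vec, index, k=2):
    """Return the ids of the k documents most similar to query_vec."""
    ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]),
                    reverse=True)
    return ranked[:k]


# Existing document index, built once with the base embedding model (toy vectors).
index = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [0.7, 0.7]}

# Stand-ins for what one embedder would return for each query type.
standalone_vec = [0.9, 0.1]        # a self-contained query
conversational_vec = [0.85, 0.15]  # a multi-turn conversation on the same topic

# Both query types go through the same search path against the same index.
standalone_hits = top_k(standalone_vec, index, k=2)
conversational_hits = top_k(conversational_vec, index, k=2)
```

Because the search function and the index are shared, no separate conversational index or re-embedding of documents is involved in this setup.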
Topics
- Conversational Search
- Retrieval-Augmented Generation
- Dense Retrieval
- Query Rewriting
- Distributional Shift
- Embedding Models
- LLM Distillation
Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.