Building a Production RAG Pipeline: Webhooks, Deduplication, and 40k Documents
Summary
Building a production RAG pipeline for an enterprise workplace safety training platform, scaled to forty thousand documents, involves significant data engineering challenges. The architecture features two distinct ingestion paths: an asynchronous, event-driven file upload via Azure Event Grid for bulk content migration, and a synchronous HTTPS webhook for real-time content updates from a CMS. Both paths converge on an ingestion core that performs Azure Content Safety checks, content parsing, chunking, and atomic deduplication. The system employs a dual embedding strategy, using 1536-dimensional vectors for permanent storage and 128-dimensional vectors for faster query-time retrieval, all persisted in PostgreSQL with the pgvector extension and HNSW indexes. A critical silent deduplication bug, caused by duplicate event subscriptions and non-atomic upserts, degraded retrieval quality by creating redundant embeddings.
Key takeaway
For MLOps engineers building production RAG systems with continuously updating knowledge bases, prioritize robust data engineering. Model each data flow independently, enforce atomicity at every read-write boundary to prevent silent data corruption, and instrument your vector store with metrics such as chunk counts. This surfaces subtle issues like duplicate embeddings early, before retrieval quality visibly degrades.
Key insights
Production RAG pipelines are primarily data engineering challenges, requiring robust solutions for continuous updates and data integrity.
Principles
- Production RAG pipelines need separate ingestion paths for bulk migration and incremental updates.
- Content safety checks should be applied at both ingestion and query layers.
- Dual embedding strategies can balance retrieval quality and query latency.
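The summary does not say how the two vector sizes are combined at query time. One common pattern, shown here purely as an assumption, is a two-stage search: shortlist candidates with the small vectors, then rerank that shortlist with the full vectors. The function names, the in-memory `index` layout, and the `shortlist` and `top_k` parameters are all hypothetical, and a brute-force scan stands in for an HNSW index.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_search(
    query_small: Vector,
    query_full: Vector,
    index: Dict[str, Tuple[Vector, Vector]],  # chunk_id -> (small_vec, full_vec)
    shortlist: int = 10,
    top_k: int = 3,
) -> List[str]:
    # Stage 1: cheap pass over the low-dimensional vectors.
    coarse = sorted(
        index,
        key=lambda cid: cosine(query_small, index[cid][0]),
        reverse=True,
    )[:shortlist]
    # Stage 2: rerank only the shortlist with the full-size vectors.
    reranked = sorted(
        coarse,
        key=lambda cid: cosine(query_full, index[cid][1]),
        reverse=True,
    )
    return reranked[:top_k]
```

In this sketch the small vectors control latency (they decide how much work stage 1 does) and the full vectors control final ranking quality, which is one way to read the quality and latency trade-off in the principle above.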
Method
The pipeline uses Azure Event Grid for asynchronous file uploads and a direct HTTPS webhook for CMS updates. Both paths converge on a core that runs Azure Content Safety scanning, parsing, chunking, and atomic upsert deduplication, then generates dual embeddings (1536-dimensional for storage, 128-dimensional for query-time retrieval) persisted in PostgreSQL with pgvector and HNSW indexes.
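The atomic upsert step can be sketched as a single `INSERT ... ON CONFLICT ... DO UPDATE` statement, so there is no gap between "check if the chunk exists" and "write it" in which a second ingestion event could slip in. This is a minimal sketch, not the article's implementation: it uses Python's built-in `sqlite3` so it runs anywhere (PostgreSQL supports the same `ON CONFLICT` clause), and the `chunks` table, its columns, and the `(doc_id, chunk_index)` key are hypothetical. Embedding columns are omitted.

```python
import hashlib
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a uniqueness key per chunk (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE chunks (
            doc_id       TEXT    NOT NULL,
            chunk_index  INTEGER NOT NULL,
            content_hash TEXT    NOT NULL,
            content      TEXT    NOT NULL,
            PRIMARY KEY (doc_id, chunk_index)
        )
        """
    )
    return conn


def upsert_chunk(conn: sqlite3.Connection, doc_id: str, chunk_index: int, content: str) -> None:
    """Insert the chunk, or overwrite it in place if the key already exists."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            """
            INSERT INTO chunks (doc_id, chunk_index, content_hash, content)
            VALUES (?, ?, ?, ?)
            ON CONFLICT (doc_id, chunk_index) DO UPDATE SET
                content_hash = excluded.content_hash,
                content      = excluded.content
            """,
            (doc_id, chunk_index, content_hash, content),
        )


def chunk_count(conn: sqlite3.Connection, doc_id: str) -> int:
    """Number of stored chunks for one document."""
    row = conn.execute(
        "SELECT COUNT(*) FROM chunks WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return row[0]
```

Because the uniqueness constraint lives in the database, a duplicate trigger for the same document overwrites the existing row instead of adding a second one. A separate SELECT followed by an INSERT would not give that guarantee under concurrent events.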
In practice
- Implement atomic upsert operations for deduplication in vector stores.
- Audit event subscriptions to prevent duplicate triggers in event-driven systems.
- Instrument vector stores with chunk count metrics to detect silent data issues.
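As a rough illustration of the chunk count metric in the last bullet, the sketch below counts redundant chunks per document from rows exported out of the vector store. The `ChunkRow` shape and the idea of comparing chunks by a content hash are assumptions made for the example; the summary does not describe how the original system computed its metric.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class ChunkRow(NamedTuple):
    """One stored chunk, identified by its document and a hash of its text (assumed shape)."""
    doc_id: str
    content_hash: str


def duplicate_chunk_report(rows: Iterable[ChunkRow]) -> Dict[str, int]:
    """Return {doc_id: redundant chunk count} for documents that contain duplicates."""
    counts = Counter(rows)  # identical (doc_id, content_hash) pairs are counted together
    report: Dict[str, int] = {}
    for row, n in counts.items():
        if n > 1:
            # n copies of the same chunk means n - 1 of them are redundant.
            report[row.doc_id] = report.get(row.doc_id, 0) + (n - 1)
    return report
```

An empty report means no document holds the same chunk twice. Emitting the total of the report values as a gauge, and alerting when it rises above zero, would flag duplicate embeddings without waiting for retrieval quality to drop.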
Topics
- RAG Pipeline Architecture
- Data Ingestion
- Deduplication
- Vector Databases
- Azure Event Grid
- Content Safety
Best for: AI Engineer, MLOps Engineer, Data Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.