Building a Production RAG Pipeline: Webhooks, Deduplication, and 40k Documents

Source: Data Engineering on Medium · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Cloud Computing & IT Infrastructure) · Depth: Advanced, medium

Summary

The article describes a production RAG pipeline for an enterprise workplace safety training platform, scaled to forty thousand documents, and treats the work as primarily a data engineering problem. The architecture has two distinct ingestion paths: an asynchronous, event-driven file upload path via Azure Event Grid for bulk content migration, and a synchronous HTTPS webhook for real-time content updates from a CMS. Both paths converge on an ingestion core that runs Azure Content Safety checks, parses and chunks the content, and deduplicates it atomically. The system uses a dual embedding strategy, with 1536-dimensional vectors for permanent storage and 128-dimensional vectors for faster query-time retrieval, all persisted in PostgreSQL with the pgvector extension and HNSW indexes. The article also recounts a critical silent deduplication bug: duplicate event subscriptions combined with non-atomic upserts wrote redundant embeddings, which degraded retrieval quality.
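The atomic deduplication step can be sketched as a single insert-or-skip statement keyed on a content hash, so that two deliveries of the same event cannot both write a row. This is a minimal illustration, not the article's implementation: the table layout, the `chunk_key` and `ingest_chunk` helpers, and the use of Python's built-in `sqlite3` as a stand-in for PostgreSQL are all assumptions. PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO NOTHING` form.

```python
import hashlib
import sqlite3


def chunk_key(document_id: str, chunk_text: str) -> str:
    """Deterministic key: the same document and text always map to the same row."""
    digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    return f"{document_id}:{digest}"


def make_store() -> sqlite3.Connection:
    """Hypothetical chunk table; sqlite3 stands in for PostgreSQL here."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunks ("
        " chunk_key TEXT PRIMARY KEY,"
        " document_id TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return conn


def ingest_chunk(conn: sqlite3.Connection, document_id: str, chunk_text: str) -> bool:
    """Insert the chunk unless it is already stored.

    The existence check and the write happen in one statement, so there is
    no gap between "look" and "insert" for a duplicate delivery to slip into.
    Returns True when a new row was written.
    """
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO chunks (chunk_key, document_id, body) VALUES (?, ?, ?) "
            "ON CONFLICT(chunk_key) DO NOTHING",
            (chunk_key(document_id, chunk_text), document_id, chunk_text),
        )
    return conn.total_changes > before


conn = make_store()
first = ingest_chunk(conn, "doc-1", "Wear eye protection near the grinder.")
second = ingest_chunk(conn, "doc-1", "Wear eye protection near the grinder.")
stored = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```

A separate "SELECT, then INSERT if missing" sequence would leave the window that the summary attributes the duplicate embeddings to; the unique key plus a single statement closes it.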

Key takeaway

For MLOps engineers building production RAG systems with continuously updating knowledge bases, prioritize robust data engineering. Model each data flow independently, enforce atomicity at every read-write boundary to prevent silent data corruption, and instrument your vector store with metrics such as chunk counts. These practices surface subtle issues like duplicate embeddings early, before retrieval quality visibly degrades.
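As an illustration of the chunk-count instrumentation, the sketch below computes a few health metrics from rows exported from the vector store. The `ChunkRow` shape, the `duplicate_report` function, and the metric names are hypothetical; the source names chunk counts as a metric but does not specify how they are collected.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class ChunkRow(NamedTuple):
    """Assumed export shape: one row per stored embedding."""
    document_id: str
    content_hash: str


def duplicate_report(rows: Iterable[ChunkRow]) -> dict:
    """Summarize chunk counts and how many rows repeat an existing chunk."""
    rows = list(rows)
    copies = Counter(rows)  # occurrences of each (document, hash) pair
    redundant = sum(n - 1 for n in copies.values())
    return {
        "total_chunks": len(rows),
        "redundant_chunks": redundant,
        "duplicate_ratio": redundant / len(rows) if rows else 0.0,
        "chunks_per_document": dict(Counter(r.document_id for r in rows)),
    }


report = duplicate_report([
    ChunkRow("doc-a", "h1"),
    ChunkRow("doc-a", "h1"),  # same chunk stored twice
    ChunkRow("doc-a", "h2"),
    ChunkRow("doc-b", "h1"),  # same text in another document is not counted
])
```

Tracking `chunks_per_document` over time is the cheaper signal: a document whose chunk count doubles after an update that did not change its length is a likely sign of duplicate writes.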

Key insights

Production RAG pipelines are primarily data engineering challenges, requiring robust solutions for continuous updates and data integrity.

Method

The pipeline uses Azure Event Grid for asynchronous file uploads and direct HTTPS webhooks for CMS updates. Both paths converge on a core that performs Azure Content Safety scanning, parsing, chunking, and atomic upsert deduplication. The core then generates dual embeddings (1536-dimensional for storage, 128-dimensional for query), which are stored in PostgreSQL with pgvector and HNSW indexes.
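The summary does not say how the 128-dimensional vectors are derived or how the two sizes interact at query time. One common pattern, assumed here purely for illustration, is to shortlist candidates with a shortened vector and then rerank the shortlist with the full vector. The sketch derives the short vector by truncating and renormalizing, which is only reasonable for embedding models trained to support it; all function names are hypothetical and the toy vectors use 4 and 2 dimensions in place of 1536 and 128.

```python
import math
from typing import Dict, List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def shorten(v: Sequence[float], dims: int) -> List[float]:
    """Assumed derivation of the short vector: keep the first dims values."""
    return normalize(v[:dims])


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def two_stage_search(
    query: Sequence[float],
    store: Dict[str, List[float]],
    short_dims: int,
    shortlist_size: int,
    top_k: int,
) -> List[str]:
    """Coarse pass on short vectors, then exact rerank on full vectors."""
    q_short = shorten(query, short_dims)
    shortlist = sorted(
        store,
        key=lambda cid: cosine(q_short, shorten(store[cid], short_dims)),
        reverse=True,
    )[:shortlist_size]
    return sorted(
        shortlist, key=lambda cid: cosine(query, store[cid]), reverse=True
    )[:top_k]


store = {
    "chunk-a": [1.0, 0.0, 0.0, 0.0],
    "chunk-b": [1.0, 0.0, 1.0, 0.0],  # matches chunk-a in the first 2 dims
    "chunk-c": [0.0, 1.0, 0.0, 0.0],
}
hits = two_stage_search([1.0, 0.0, 0.0, 0.0], store,
                        short_dims=2, shortlist_size=2, top_k=1)
```

In a pgvector deployment the coarse pass would typically be served by an HNSW index rather than a linear scan; the linear scan here only keeps the example self-contained.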

Best for: AI Engineer, MLOps Engineer, Data Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.