Building a Production RAG Pipeline: Webhooks, Deduplication, and 40k Documents

2026-05-30 · Source: Data Engineering on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

A production RAG pipeline for an enterprise workplace safety training platform, scaled to forty thousand documents, turned out to be largely a data engineering problem. The architecture has two distinct ingestion paths: an asynchronous, event-driven file upload via Azure Event Grid for bulk content migration, and a synchronous HTTPS webhook for real-time content updates from a CMS. Both paths converge on an ingestion core that performs Azure Content Safety checks, content parsing, chunking, and atomic deduplication. The system uses a dual embedding strategy, with 1536-dimensional vectors for permanent storage and 128-dimensional vectors for faster query-time retrieval, all persisted in PostgreSQL with the pgvector extension and HNSW indexes. A silent deduplication bug, caused by duplicate event subscriptions combined with non-atomic upserts, created redundant embeddings and degraded retrieval quality.

Key takeaway

For MLOps engineers building production RAG systems with continuously updating knowledge bases, prioritize robust data engineering. Model each data flow independently, enforce atomicity at every read-write boundary to prevent silent data corruption, and instrument your vector store with metrics such as chunk counts. These practices surface subtle issues like duplicate embeddings early, before the drop in retrieval quality becomes noticeable.
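
A chunk-count check of the kind recommended here can be sketched as follows. This is a minimal illustration, not the original article's implementation: the function names, the `(document_id, chunk_text)` row shape, and the SHA-256 content hash are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def chunk_key(document_id, chunk_text):
    """Identify a chunk by its document and a hash of its text (assumed scheme)."""
    digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    return (document_id, digest)


def find_duplicate_chunks(rows):
    """Return {chunk key: count} for every chunk that is stored more than once.

    `rows` is an iterable of (document_id, chunk_text) pairs, for example the
    result of a SELECT over a hypothetical chunks table.
    """
    counts = Counter(chunk_key(doc_id, text) for doc_id, text in rows)
    return {key: n for key, n in counts.items() if n > 1}


def chunks_per_document(rows):
    """Return {document_id: number of stored chunks}, a simple metric to chart."""
    return dict(Counter(doc_id for doc_id, _ in rows))
```

Tracked over time, a per-document chunk count that grows without a matching content change is a cheap signal that the same content is being embedded more than once.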

Key insights

Production RAG pipelines are primarily data engineering challenges, requiring robust solutions for continuous updates and data integrity.

Principles

Production RAG pipelines need separate ingestion paths for bulk and incremental updates: asynchronous for bulk migration, synchronous for real-time changes.
Content safety checks should be applied at both ingestion and query layers.
Dual embedding strategies can balance retrieval quality and query latency.

Method

The pipeline has two entry points: Azure Event Grid for asynchronous file uploads and direct HTTPS webhooks for CMS updates. Both converge on a core that performs Azure Content Safety scanning, parsing, chunking, and deduplication through atomic upserts. The core then generates dual embeddings (1536-dimensional for storage, 128-dimensional for query-time retrieval), which are stored in PostgreSQL with pgvector and HNSW indexes.
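
The shape of this design, two entry points sharing one ingestion core, can be sketched roughly as below. Everything here is hypothetical: the class and function names, the event and payload fields, the fixed-size chunking rule, and the `is_safe` callable that stands in for the content safety scan. No Azure SDK calls are shown, and embedding generation is omitted.

```python
import hashlib
from typing import Callable, Dict, List


def chunk_text(text: str, size: int = 200) -> List[str]:
    """Split text into fixed-size character windows (assumed chunking rule)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


class IngestionCore:
    """Shared core called by both ingestion paths (hypothetical structure)."""

    def __init__(self, is_safe: Callable[[str], bool]):
        self.is_safe = is_safe            # placeholder for the content safety scan
        self.store: Dict[str, str] = {}   # chunk key -> text; stands in for the vector store

    def ingest(self, document_id: str, text: str) -> int:
        """Scan, chunk, and deduplicate one document; return chunks newly written."""
        if not self.is_safe(text):
            return 0
        written = 0
        for chunk in chunk_text(text):
            digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            key = f"{document_id}:{digest}"
            # In-process check only; a real store needs an atomic upsert here.
            if key not in self.store:
                self.store[key] = chunk
                written += 1
        return written


def on_file_uploaded(core: IngestionCore, event: dict) -> int:
    """Asynchronous path: invoked by an event consumer for an uploaded file."""
    return core.ingest(event["document_id"], event["text"])


def on_cms_webhook(core: IngestionCore, payload: dict) -> int:
    """Synchronous path: invoked by the HTTPS webhook handler for a CMS update."""
    return core.ingest(payload["document_id"], payload["text"])
```

Keeping the scan, chunking, and deduplication in one place means a redelivered event or a repeated webhook call for unchanged content writes nothing new, whichever path it arrives on.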

In practice

Implement atomic upsert operations for deduplication in vector stores.
Audit event subscriptions to prevent duplicate triggers in event-driven systems.
Instrument vector stores with chunk count metrics to detect silent data issues.
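
The first and third practices above can be illustrated with a small sketch. It uses SQLite from the Python standard library so that it runs without a database server; PostgreSQL offers an analogous `INSERT ... ON CONFLICT ... DO UPDATE` statement, but the table name, column names, and content-hash key below are assumptions, and the embedding column is left out for brevity.

```python
import hashlib
import sqlite3

# Hypothetical schema: the UNIQUE constraint is what makes duplicates impossible.
SCHEMA = """
CREATE TABLE chunks (
    document_id  TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    body         TEXT NOT NULL,
    UNIQUE (document_id, content_hash)
)
"""

# Single-statement upsert: there is no separate "check, then insert" step
# that a second concurrent writer could slip between.
UPSERT = """
INSERT INTO chunks (document_id, content_hash, body)
VALUES (?, ?, ?)
ON CONFLICT (document_id, content_hash)
DO UPDATE SET body = excluded.body
"""


def open_store():
    """Create an in-memory store with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def upsert_chunk(conn, document_id, body):
    """Insert the chunk, or overwrite the existing row with the same key."""
    content_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, (document_id, content_hash, body))


def chunk_count(conn, document_id):
    """Chunk-count metric for one document."""
    row = conn.execute(
        "SELECT COUNT(*) FROM chunks WHERE document_id = ?", (document_id,)
    ).fetchone()
    return row[0]
```

With the constraint in place, a duplicate trigger for the same content leaves `chunk_count` unchanged, so the metric only moves when the content itself changes.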

Topics

RAG Pipeline Architecture
Data Ingestion
Deduplication
Vector Databases
Azure Event Grid
Content Safety

Best for: AI Engineer, MLOps Engineer, Data Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.