The Unexpected Challenges of Building a RAG Ingestion Service for 1 Million Documents

2026-06-27 · Source: Data Engineering on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Advanced, medium

Summary

A retrieval-augmented generation (RAG) ingestion service built for one million heterogeneous documents (PDFs, HTML, DOCX, plain text) ran into significant operational problems during build and deployment. The five-stage pipeline uses SQS, S3, and pgvector with an external embedding API (for example, one serving a model such as text-embedding-3-small). It hit three main issues: the embedding API's token-per-minute rate limits, silent empty-string chunks from partially scanned PDFs (which led to a 3.2% chunk rejection rate), and pgvector HNSW index pressure during bulk upserts. The fixes were a token-bucket rate limiter, a post-parse chunk validation step with a 20-token minimum, and managing HNSW index builds around bulk loads. The project also emphasized cost control through deduplication and chunk-level metering, and tracked key observability metrics such as dead-letter queue (DLQ) depth and embedding API error rates.

Key takeaway

For MLOps engineers building large-scale RAG ingestion pipelines: prioritize robust error handling and data quality from day one. Put a client-side rate limiter in front of any external embedding API, and validate chunk content so that empty or near-empty chunks do not silently degrade retrieval. Plan for pgvector HNSW index management during bulk loads, and deploy DLQ replay tooling early so that failed messages can be recovered gracefully and without paying to reprocess everything.

Key insights

Building RAG ingestion at scale demands robust error handling, cost control, and meticulous data quality validation.

Principles

Distributed ingestion requires durable queues.
Rate limits necessitate client-side throttling.
Data quality issues scale silently.
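
The client-side throttling principle above can be sketched as a token bucket, the technique the summary names. This is a minimal single-threaded sketch, not the article's implementation: the class name, the injectable clock and sleep parameters, and the idea of charging each request its estimated token count are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Client-side token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        # clock and sleep are injectable so the limiter can be tested without waiting
        self._clock = clock
        self._sleep = sleep
        self._last = clock()

    def _refill(self):
        """Credit tokens for the time elapsed since the last refill."""
        now = self._clock()
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self, cost):
        """Take `cost` tokens if available; return True on success."""
        self._refill()
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False

    def acquire(self, cost):
        """Block until `cost` tokens are available, then take them."""
        if cost > self.capacity:
            raise ValueError("cost exceeds bucket capacity")
        while not self.try_acquire(cost):
            # Sleep just long enough for the missing tokens to accumulate.
            self._sleep((cost - self.tokens) / self.rate)
```

For a token-per-minute limit, `rate` would be the limit divided by 60, and `cost` would be the estimated token count of each embedding request. A worker would call `acquire(cost)` before sending the request. Sharing one bucket across threads or processes would need a lock or an external store, which this sketch leaves out.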

Method

The RAG ingestion pipeline has five stages: Intake API, Parser Workers, Chunking, Embedding Workers, and Vector Store Writer. The stages are decoupled by SQS queues and S3 for durability and scalability.
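
The stage layout can be sketched with in-memory queues standing in for SQS. This is an illustration of the topology only: the assumption of one queue between each pair of adjacent stages, the handler functions, and the message fields are all hypothetical, and S3 storage is omitted.

```python
from queue import Queue

# Stage names from the summary, in order.
STAGES = [
    "intake_api",
    "parser_workers",
    "chunking",
    "embedding_workers",
    "vector_store_writer",
]


def build_queues(stages):
    """One queue between each pair of adjacent stages (stand-ins for SQS queues)."""
    return {(a, b): Queue() for a, b in zip(stages, stages[1:])}


def run_pipeline(doc, handlers):
    """Push one document through every stage, hopping via the queue between stages."""
    queues = build_queues(STAGES)
    message = handlers[STAGES[0]](doc)
    for upstream, downstream in zip(STAGES, STAGES[1:]):
        q = queues[(upstream, downstream)]
        q.put(message)                           # upstream stage enqueues its output
        message = handlers[downstream](q.get())  # downstream worker dequeues and processes
    return message


# Toy handlers (hypothetical); real workers would parse files and call an embedding API.
HANDLERS = {
    "intake_api": lambda doc: {"id": doc["id"], "raw": doc["body"]},
    "parser_workers": lambda m: {**m, "text": m["raw"].strip()},
    "chunking": lambda m: {
        **m, "chunks": [c.strip() for c in m["text"].split(".") if c.strip()]
    },
    "embedding_workers": lambda m: {
        **m, "vectors": [[float(len(c))] for c in m["chunks"]]  # fake 1-d vectors
    },
    "vector_store_writer": lambda m: {"id": m["id"], "rows_written": len(m["vectors"])},
}
```

Because each stage only reads from its inbound queue and writes to its outbound one, a slow or failing stage backs up its own queue instead of blocking the stages upstream, which is the durability argument the summary makes for the queue-separated design.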

In practice

Implement token-bucket rate limiting for external APIs.
Validate chunk content post-parse (e.g., min 20 tokens).
Drop or defer the HNSW index during bulk pgvector loads, then build it once the load finishes.
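
The chunk validation item in the list above can be sketched as a filter that runs after parsing and before embedding. The 20-token minimum comes from the summary; the whitespace word count used as a token estimate and the function names are assumptions, since the summary does not say which tokenizer the pipeline used.

```python
MIN_TOKENS = 20  # minimum from the summary; tune for your tokenizer


def count_tokens(text):
    """Approximate token count by whitespace-delimited words (an assumption)."""
    return len(text.split())


def validate_chunks(chunks, min_tokens=MIN_TOKENS, count=count_tokens):
    """Split chunks into (accepted, rejected) by minimum token count."""
    accepted, rejected = [], []
    for chunk in chunks:
        # Empty and whitespace-only chunks count as 0 tokens and are rejected.
        if count(chunk) >= min_tokens:
            accepted.append(chunk)
        else:
            rejected.append(chunk)
    return accepted, rejected


def rejection_rate(accepted, rejected):
    """Fraction of chunks rejected; 0.0 when there were no chunks at all."""
    total = len(accepted) + len(rejected)
    return len(rejected) / total if total else 0.0
```

Emitting the rejection rate as a metric, rather than dropping short chunks quietly, is what makes a problem like partially scanned PDFs visible. A real tokenizer can be passed in through the `count` parameter.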

Topics

RAG Ingestion
Distributed Systems
pgvector
Embedding APIs
Data Quality
Rate Limiting

Best for: MLOps Engineer, AI Engineer, Machine Learning Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.