DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation
Summary
DCD (Domain–Collection–Document) is a domain-oriented design for Retrieval-Augmented Generation (RAG) systems, developed by Valeriy Kovalskiy et al. at red_mad_robot. It addresses quality degradation in Naive RAG pipelines when dealing with heterogeneous corpora and multi-step queries by introducing a hierarchical knowledge structure and controlled query processing. The architecture decomposes information into Domain, Collection, and Document levels, using a DCD Router for multi-stage navigation based on structured language model outputs. It integrates smart chunking with a sliding window (5-10% overlap) and fast guardrail mechanisms for early validation and hallucination prevention. Experiments on a synthetic dataset, utilizing gpt-oss-120b for generation and evaluation, bge-m3 for embeddings, and qwen2.5-7b-instruct for answer generation, demonstrated DCD's significant advantage over Naive RAG in retrieval accuracy, with comparable answer generation quality.
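The sliding-window chunking mentioned above can be illustrated with a minimal sketch. The summary gives only the 5-10% overlap, so the function name, the chunk size, and the use of a pre-tokenized list are assumptions made for illustration, not details from the paper.

```python
def sliding_window_chunks(tokens, chunk_size=200, overlap_ratio=0.1):
    """Split a token list into chunks that share a fraction of their tokens.

    chunk_size and the token-list input are illustrative assumptions;
    only the 5-10% overlap range comes from the summary.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)  # tokens shared by neighbours
    step = chunk_size - overlap                # always >= 1 given the checks above

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks
```

With `chunk_size=20` and `overlap_ratio=0.1`, each chunk repeats the last two tokens of its predecessor, so a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk when it is short enough.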
Key takeaway
AI engineers designing Retrieval-Augmented Generation systems for complex, heterogeneous data should consider DCD's hierarchical knowledge organization and multi-stage routing. Constraining retrieval to semantically homogeneous regions significantly improves retrieval accuracy and factual consistency, especially for template-based data. DCD requires initial configuration, but the added robustness and control justify the setup effort for more reliable RAG deployments.
Key insights
DCD improves RAG accuracy and robustness by structuring knowledge hierarchically and controlling query processing without modifying the LLM.
Principles
- Answer quality improves when retrieval is constrained to semantically homogeneous knowledge regions.
- Explicit knowledge hierarchies and controlled workflows enhance RAG system accuracy and safety.
- Multi-stage routing with structured LLM outputs ensures transparency and reproducibility.
Method
DCD structures knowledge into a Domain-Collection-Document hierarchy. A DCD Router uses structured LLM outputs to navigate the hierarchy top-down and select the relevant segments. Smart chunking and fast guardrails complement the pipeline.
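A minimal sketch of top-down routing over such a hierarchy follows. Everything named here is hypothetical: the `KNOWLEDGE` contents, the `{"choice": ...}` JSON shape, and `keyword_chooser`, which stands in for a real structured-output LLM call. The paper's actual router and schema may differ.

```python
import json

# Hypothetical hierarchy: domain -> collection -> document ids.
KNOWLEDGE = {
    "hr": {
        "policies": ["vacation_policy", "remote_work_policy"],
        "onboarding": ["first_week_guide"],
    },
    "finance": {
        "invoices": ["invoice_template"],
        "reports": ["q1_report"],
    },
}


def pick(chooser, query, level, options):
    """Ask the chooser for one option and validate its structured reply."""
    raw = chooser(query, level, options)      # expected: '{"choice": "<option>"}'
    choice = json.loads(raw).get("choice")
    if choice not in options:
        # Rejecting anything outside the allowed set keeps routing auditable.
        raise ValueError(f"invalid {level} choice: {choice!r}")
    return choice


def route(query, chooser, knowledge=KNOWLEDGE):
    """Navigate Domain -> Collection -> Document, one decision per level."""
    domain = pick(chooser, query, "domain", sorted(knowledge))
    collection = pick(chooser, query, "collection", sorted(knowledge[domain]))
    document = pick(chooser, query, "document",
                    sorted(knowledge[domain][collection]))
    return {"domain": domain, "collection": collection, "document": document}


def keyword_chooser(query, level, options):
    """Stand-in for an LLM: pick the option sharing the most words with the query."""
    words = set(query.lower().split())

    def score(option):
        return len(words & set(option.split("_")))

    return json.dumps({"choice": max(options, key=score)})
```

Calling `route("hr policies vacation policy", keyword_chooser)` walks the three levels in order. Because each step only sees the options under the previous choice, retrieval stays inside one branch of the hierarchy, and every decision is a small JSON object that can be logged and replayed.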
In practice
- Implement hierarchical knowledge segmentation (Domain, Collection, Document).
- Use structured LLM outputs for transparent, reproducible routing decisions.
- Integrate fast guardrails for early response validation and hallucination prevention.
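The guardrail idea in the list above can be sketched as cheap, rule-based checks that run without a model call. The summary does not specify how DCD's guardrails work, so the blocked pattern, the length limit, and the word-overlap grounding heuristic below are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern list; a real deployment would maintain its own.
BLOCKED_PATTERNS = [r"ignore (all )?previous instructions"]


def check_query(query, max_chars=2000):
    """Cheap pre-retrieval validation; returns (passed, reason)."""
    if not query.strip():
        return False, "empty query"
    if len(query) > max_chars:
        return False, "query too long"
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, query, flags=re.IGNORECASE):
            return False, "blocked pattern"
    return True, "ok"


def check_answer(answer, context, min_overlap=0.5):
    """Rough grounding check: share of answer words that occur in the context."""
    answer_words = set(re.findall(r"\w+", answer.lower()))
    context_words = set(re.findall(r"\w+", context.lower()))
    if not answer_words:
        return False, "empty answer"
    overlap = len(answer_words & context_words) / len(answer_words)
    if overlap < min_overlap:
        return False, "answer not supported by context"
    return True, "ok"
```

Word overlap is a crude proxy for factual support and will miss paraphrases, but it is fast enough to run on every request and can reject obviously ungrounded answers before a heavier evaluation step.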
Topics
- Retrieval-Augmented Generation
- Knowledge Hierarchy
- Multi-stage Routing
- LLM Guardrails
- Smart Chunking
- Semantic Segmentation
Best for: AI Architect, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.