DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, long

Summary

DCD (Domain–Collection–Document) is a domain-oriented design for Retrieval-Augmented Generation (RAG) systems, developed by Valeriy Kovalskiy et al. at red_mad_robot. It targets the quality degradation that Naive RAG pipelines show on heterogeneous corpora and multi-step queries, and addresses it with a hierarchical knowledge structure and controlled query processing. The architecture decomposes information into Domain, Collection, and Document levels, and a DCD Router navigates these levels in multiple stages based on structured language model outputs. It also integrates smart chunking with a sliding window (5-10% overlap) and fast guardrail mechanisms for early validation and hallucination prevention. Experiments on a synthetic dataset, using gpt-oss-120b for generation and evaluation, bge-m3 for embeddings, and qwen2.5-7b-instruct for answer generation, showed a significant advantage for DCD over Naive RAG in retrieval accuracy, with comparable answer generation quality.
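
The sliding-window chunking mentioned above can be illustrated with a minimal sketch. It assumes chunks are measured in whitespace-separated tokens and that the overlap is a fixed fraction of the chunk size; the function name, the default sizes, and the tokenization are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def sliding_window_chunks(
    tokens: Sequence,
    chunk_size: int = 200,        # assumed chunk length, not from the paper
    overlap_ratio: float = 0.05,  # 5% overlap, the low end of the 5-10% range
) -> List[list]:
    """Split a token sequence into chunks that share a small overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = round(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # always advance by at least one token

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks


if __name__ == "__main__":
    text = "DCD structures knowledge into domains collections and documents " * 10
    for chunk in sliding_window_chunks(text.split(), chunk_size=20, overlap_ratio=0.1):
        print(len(chunk), " ".join(chunk[:4]), "...")
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary is still retrievable from at least one chunk.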

Key takeaway

AI engineers designing Retrieval-Augmented Generation systems for complex, heterogeneous data should consider DCD's hierarchical knowledge organization and multi-stage routing. By constraining retrieval to semantically homogeneous regions, the approach significantly improves retrieval accuracy and factual consistency, especially for template-based data. DCD requires initial configuration, but the added robustness and control justify the setup effort for more reliable RAG deployments.

Key insights

DCD improves RAG accuracy and robustness by structuring knowledge hierarchically and controlling query processing without modifying the LLM.

Principles

Answer quality improves when retrieval is constrained to semantically homogeneous knowledge regions.
Explicit knowledge hierarchies and controlled workflows enhance RAG system accuracy and safety.
Multi-stage routing with structured LLM outputs ensures transparency and reproducibility.
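
The first principle can be illustrated with a toy retriever that optionally restricts its candidate pool to one region before ranking. This is a sketch under stated assumptions: the two-document corpus, the `domain` field, and the Jaccard word-overlap score are all hypothetical stand-ins (the paper uses bge-m3 embeddings for retrieval).

```python
import re
from typing import Dict, List, Optional

# Hypothetical mini-corpus; each document is tagged with the region it belongs to.
CORPUS: List[Dict[str, str]] = [
    {"domain": "hr", "text": "Reset your vacation balance request in the HR portal."},
    {"domain": "it", "text": "Reset your password in the IT portal."},
]


def tokens(text: str) -> set:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a simple stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(
    query: str,
    corpus: List[Dict[str, str]],
    domain: Optional[str] = None,
    k: int = 1,
) -> List[Dict[str, str]]:
    """Rank documents by similarity, optionally only inside one domain."""
    pool = [d for d in corpus if domain is None or d["domain"] == domain]
    ranked = sorted(pool, key=lambda d: jaccard(query, d["text"]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    query = "how do I reset something in the portal"
    print(retrieve(query, CORPUS))               # ranked over the whole corpus
    print(retrieve(query, CORPUS, domain="hr"))  # ranked inside one region only
```

With the filter applied, near-duplicate wording from other regions cannot outrank the intended documents, because those documents never enter the candidate pool.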

Method

DCD structures knowledge into a Domain-Collection-Document hierarchy. A DCD Router uses structured LLM outputs to navigate the hierarchy top-down and select the relevant segments. Smart chunking and fast guardrails complement the pipeline.
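
The top-down navigation described above might look roughly like the following sketch. It is not the authors' router: the toy hierarchy, the `{"choice": ...}` JSON shape, and the `keyword_chooser` function (a stand-in for a real LLM call that returns structured output) are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict, List

Hierarchy = Dict[str, Dict[str, List[str]]]  # domain -> collection -> documents
Chooser = Callable[[str, List[str]], str]    # returns JSON such as {"choice": "hr"}

# Hypothetical toy hierarchy.
HIERARCHY: Hierarchy = {
    "hr": {
        "vacation_policy": ["Employees accrue 2 vacation days per month."],
        "onboarding": ["New hires receive a laptop on day one."],
    },
    "it": {"vpn": ["Connect to the VPN before opening internal tools."]},
}

# Hypothetical keyword lists used only by the stand-in chooser below.
KEYWORDS = {
    "hr": {"vacation", "onboarding", "hire"},
    "it": {"vpn", "password"},
    "vacation_policy": {"vacation"},
    "onboarding": {"onboarding", "hire"},
    "vpn": {"vpn"},
}


def keyword_chooser(query: str, options: List[str]) -> str:
    """Stand-in for an LLM: picks the option with the most keyword hits."""
    words = set(query.lower().split())
    best = max(options, key=lambda o: len(words & KEYWORDS.get(o, set())))
    return json.dumps({"choice": best})


def parse_choice(raw: str, options: List[str]) -> str:
    """Validate the structured output; reject anything outside the options."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict) or data.get("choice") not in options:
        raise ValueError(f"invalid routing output: {raw!r}")
    return data["choice"]


def route(query: str, hierarchy: Hierarchy, choose: Chooser) -> dict:
    """Stage 1 picks a domain, stage 2 picks a collection inside it."""
    domains = sorted(hierarchy)
    domain = parse_choice(choose(query, domains), domains)
    collections = sorted(hierarchy[domain])
    collection = parse_choice(choose(query, collections), collections)
    return {
        "domain": domain,
        "collection": collection,
        "documents": hierarchy[domain][collection],
    }


if __name__ == "__main__":
    print(route("how many vacation days do i get", HIERARCHY, keyword_chooser))
```

Because every stage returns a small validated record, each routing decision can be logged and replayed, which is one way to read the transparency and reproducibility claim.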

In practice

Implement hierarchical knowledge segmentation (Domain, Collection, Document).
Use structured LLM outputs for transparent, reproducible routing decisions.
Integrate fast guardrails for early response validation and hallucination prevention.
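
As one way to picture the guardrail step, the sketch below runs a cheap lexical grounding check before an answer is returned. The summary does not describe how DCD's guardrails work internally, so the word-overlap score, the stopword list, and the 0.6 threshold are all assumptions chosen for illustration.

```python
import re

# Small illustrative stopword list (assumption, not exhaustive).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "per", "for", "on"}


def content_words(text: str) -> set:
    """Lowercased alphanumeric words with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def grounding_score(answer: str, context: str) -> float:
    """Fraction of the answer's content words that also appear in the context."""
    answer_words = content_words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & content_words(context)) / len(answer_words)


def guardrail(answer: str, context: str, threshold: float = 0.6) -> dict:
    """Pass the answer only if enough of it is supported by the retrieved context."""
    score = grounding_score(answer, context)
    return {"passed": score >= threshold, "score": score}


if __name__ == "__main__":
    context = "Employees accrue 2 vacation days per month."
    print(guardrail("Employees accrue 2 vacation days per month", context))
    print(guardrail("Contractors receive unlimited sabbatical leave", context))
```

A check like this is fast because it needs no model call; a production system would likely pair it with stronger validation for answers that pass the cheap filter.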

Topics

Retrieval-Augmented Generation
Knowledge Hierarchy
Multi-stage Routing
LLM Guardrails
Smart Chunking
Semantic Segmentation

Best for: AI Architect, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.