Scaling AI Knowledge Systems: Lessons from the DHTMLX MCP Server
Summary
DHTMLX developed the MCP Server, a centralized knowledge layer that gives AI assistants and developer tools structured access to current documentation across its product line, including Suite widgets, Gantt, and Scheduler. The system uses Retrieval-Augmented Generation (RAG): it retrieves relevant documentation fragments and passes them to a Large Language Model (LLM), which generates a response grounded in that context. RAG worked well for a single product, but scaling it across several distinct products, each with its own documentation, introduced complexity. An initial single vector index hurt accuracy because fragments from different products were mixed in the retrieved context. Splitting the knowledge into product-specific indexes improved accuracy but required a fast, flexible way to route each query to the right index. DHTMLX addressed this with a custom machine learning model for domain classification, optimized for low latency and high accuracy through distillation and 8-bit quantization. The model achieves performance comparable to TinyBERT with a smaller footprint.
Key takeaway
If you are an AI engineer building a RAG system that spans multiple distinct product lines, a single unified knowledge index will likely degrade answer quality. Instead, segment the knowledge base by product and use a lightweight, specialized machine learning model to route queries. Combined with distillation and quantization, this approach prevents context mixing while delivering the accuracy and low latency that real-time AI assistance requires, which in turn improves developer efficiency.
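The route-then-retrieve flow described above can be sketched in a few lines. This is a minimal illustration, not the DHTMLX implementation: the `INDEXES` contents, the `route` and `retrieve` helpers, and the bag-of-words cosine scoring are all assumptions made for the example, and the word-overlap router stands in for the trained domain classifier the article describes.

```python
import math
from collections import Counter

# Hypothetical per-product indexes: product name -> documentation fragments.
# A real system would hold vector embeddings here, not raw strings.
INDEXES = {
    "gantt": [
        "Gantt task links define dependencies between tasks.",
        "Configure the Gantt timeline scale with scale units.",
    ],
    "scheduler": [
        "Scheduler recurring events repeat on a daily or weekly pattern.",
        "Switch the Scheduler between day, week, and month views.",
    ],
    "suite": [
        "The Suite Grid widget renders tabular data with sortable columns.",
        "Suite Form widget validates input controls.",
    ],
}


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return [w for w in words if w]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def route(query):
    """Stand-in router: pick the product whose docs overlap the query most.

    If no product overlaps at all, the first product in INDEXES wins.
    """
    q = Counter(tokenize(query))
    scores = {
        product: cosine(q, Counter(tokenize(" ".join(docs))))
        for product, docs in INDEXES.items()
    }
    return max(scores, key=scores.get)


def retrieve(query, top_k=1):
    """Route first, then search only the chosen product's index."""
    product = route(query)
    q = Counter(tokenize(query))
    ranked = sorted(
        INDEXES[product],
        key=lambda doc: cosine(q, Counter(tokenize(doc))),
        reverse=True,
    )
    return product, ranked[:top_k]
```

Because retrieval only ever searches one product's fragments, a Scheduler question cannot pull in Gantt text, which is the context-mixing problem that segmentation is meant to avoid.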
Key insights
Scaling a RAG system across multiple distinct products requires routing each query to the right product's knowledge and structuring that knowledge per product.
Principles
- Separate knowledge bases for distinct product domains.
- Balance model size and accuracy for real-time routing.
Method
DHTMLX built a custom machine learning model for domain classification, using distillation and 8-bit quantization to reach high accuracy and low latency when routing queries to product-specific RAG indexes.
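The summary does not say which distillation recipe was used. As one common possibility, the sketch below shows the soft-target loss from classic knowledge distillation, in which a small student model is trained to match a larger teacher's temperature-softened class probabilities. The function names, the temperature value, and the example logits are assumptions for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over three product domains (e.g. Suite, Gantt, Scheduler).
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.5]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. In practice it is usually combined with an ordinary cross-entropy term on the true labels.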
In practice
- Implement product-specific vector indexes.
- Use distillation for smaller, capable models.
- Apply 8-bit quantization to reduce model size.
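To make the last item concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization: each float weight is stored as an integer in [-127, 127] plus one shared scale factor. The exact scheme DHTMLX used is not stated in the summary, so the helper names and the symmetric, per-tensor choice are assumptions.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in quantized]


# Hypothetical weights; a real model would quantize each layer's tensor.
weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight then needs one byte instead of the four used by a 32-bit float, and the rounding error per weight is bounded by half the scale. That bound is why quantization can shrink a model with only a small effect on accuracy.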
Topics
- Retrieval-Augmented Generation
- Model Context Protocol
- Machine Learning Models
- Model Distillation
- Quantization
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence in Plain English - Medium.