3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1
Summary
Databricks has significantly updated its Agent Bricks Knowledge Assistant, achieving a 2x reduction in answer generation time and over a 3x reduction in search time. Time To First Token (TTFT) is now approximately two seconds, and end-to-end latency is consistently below 10 seconds. These improvements are powered by Instructed-Retriever-1, a retrieval-specialized model designed for parallel test-time scaling. Unlike sequential agentic retrieval, this approach runs query generation (for recall) and multi-pivot groupwise reranking (for precision) in parallel, enabling broader search and more relevant context selection upfront. Instructed-Retriever-1, trained on synthetic enterprise-style data, matches Claude Sonnet 4.5 retrieval quality on KARLBench. Serving performance is enhanced through a Mixture-of-Experts architecture, FP8 quantization, and speculative decoding, with speculative decoding adding a speed-up of over 30%.
Key takeaway
If you are an MLOps engineer deploying knowledge assistants, consider parallel test-time scaling to cut search and answer-generation latency without sacrificing quality. Databricks reports over 3x faster search and 2x faster answer generation from using a single retrieval-specialized model for parallel query generation and multi-pivot reranking. Serving optimizations such as FP8 quantization and speculative decoding help keep production latency low, so users get relevant answers sooner.
Key insights
Parallel test-time scaling with a specialized model significantly boosts retrieval speed and quality by executing search stages concurrently.
Principles
- Parallelizing compute during search improves quality and reduces latency.
- A single model can effectively handle both query generation and reranking.
- More query formulations improve recall; more pivots improve precision.
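The recall principle above can be illustrated with a minimal Python sketch that fans several query formulations out concurrently and merges the hits. Everything here is hypothetical and for illustration only: the toy `CORPUS`, the `generate_queries` stand-in for the model's query-generation step, and the word-overlap `search` function are assumptions, not Databricks' implementation.

```python
import asyncio

# Toy corpus standing in for an enterprise index (hypothetical data).
CORPUS = {
    "doc1": "reset your password from the account settings page",
    "doc2": "rotate api credentials every ninety days",
    "doc3": "recover a locked account by contacting support",
    "doc4": "quarterly revenue report for the sales team",
}


def generate_queries(question: str) -> list[str]:
    """Hypothetical stand-in for the model's query-generation step."""
    return [question, "recover locked account", "rotate credentials"]


async def search(query: str, top_k: int = 2) -> list[str]:
    """Hypothetical retriever: rank documents by word overlap with the query."""
    await asyncio.sleep(0.01)  # simulate index latency
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.split())), doc_id) for doc_id, text in CORPUS.items()]
    hits = sorted((pair for pair in scored if pair[0] > 0), key=lambda p: (-p[0], p[1]))
    return [doc_id for _, doc_id in hits[:top_k]]


async def parallel_retrieve(question: str) -> list[str]:
    """Run every query formulation concurrently, then merge and de-duplicate."""
    queries = generate_queries(question)
    results = await asyncio.gather(*(search(q) for q in queries))
    merged: list[str] = []
    for hits in results:  # gather preserves the order of the queries
        for doc_id in hits:
            if doc_id not in merged:
                merged.append(doc_id)
    return merged


if __name__ == "__main__":
    print(asyncio.run(parallel_retrieve("how do I reset my password")))
```

Because the searches are awaited together, wall-clock time in this sketch is roughly that of the slowest single search rather than the sum of all of them, while the merged candidate set is a superset of what the original question retrieves alone.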
Method
Train a single retrieval-specialized model in two stages for parallel query generation and multi-pivot groupwise reranking, then optimize serving with MoE, FP8 quantization, and speculative decoding.
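This summary does not spell out how multi-pivot groupwise reranking works, so the following Python sketch is only one plausible reading, stated as an assumption: several candidates serve as pivots, each pivot gets its own independent call that judges the whole candidate group against it, the calls run in parallel, and documents are ranked by how many pivots they beat. The `relevance` and `judge_against_pivot` functions are hypothetical stand-ins for model calls.

```python
from concurrent.futures import ThreadPoolExecutor


def relevance(query: str, doc: str) -> int:
    """Hypothetical relevance signal: shared-word count (a model would do this)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def judge_against_pivot(query: str, pivot: str, group: list[str]) -> list[str]:
    """Hypothetical groupwise call: docs in the group judged more relevant than the pivot."""
    bar = relevance(query, pivot)
    return [doc for doc in group if doc != pivot and relevance(query, doc) > bar]


def multi_pivot_rerank(query: str, candidates: list[str],
                       num_pivots: int = 3, top_k: int = 2) -> list[str]:
    """Assumed scheme: one independent judgment per pivot, aggregated by win count."""
    if not candidates:
        return []
    pivots = candidates[:num_pivots]
    # Each pivot's judgment is independent, so the calls can run in parallel.
    with ThreadPoolExecutor(max_workers=len(pivots)) as pool:
        verdicts = list(pool.map(
            lambda pivot: judge_against_pivot(query, pivot, candidates), pivots))
    wins = {doc: 0 for doc in candidates}
    for winners in verdicts:
        for doc in winners:
            wins[doc] += 1
    # sorted() is stable, so ties keep their original retrieval order.
    ranked = sorted(candidates, key=lambda doc: -wins[doc])
    return ranked[:top_k]


if __name__ == "__main__":
    docs = [
        "quarterly revenue report",
        "reset your password from the account settings page",
        "recover a locked account",
        "rotate api credentials",
    ]
    print(multi_pivot_rerank("reset password account", docs))
```

The design point the sketch tries to capture is that adding pivots adds more independent comparisons without adding sequential steps, which is consistent with the claim that more pivots improve precision.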
In practice
- Implement parallel query generation to broaden search and improve recall.
- Use multi-pivot groupwise reranking for efficient context selection.
- Apply FP8 quantization and speculative decoding for inference speed-up.
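To make the speculative decoding item concrete, here is a toy Python sketch of the greedy variant of the general technique: a cheap draft model proposes several tokens, and the target model verifies them, keeping the longest matching prefix. The `target` and `draft` functions and the fixed `SENTENCE` are hypothetical stand-ins for real models; this is not a description of how Databricks serves Instructed-Retriever-1.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[str]], str]

SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target(context: List[str]) -> str:
    """Hypothetical large model: always continues the fixed sentence."""
    return SENTENCE[len(context)] if len(context) < len(SENTENCE) else "<eos>"


def draft(context: List[str]) -> str:
    """Hypothetical small model: agrees with the target except on one token."""
    token = target(context)
    return "cat" if token == "fox" else token


def speculative_decode(target_fn: NextToken, draft_fn: NextToken, prompt: List[str],
                       max_new: int, k: int = 4) -> Tuple[List[str], int]:
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[str] = []
        for _ in range(k):
            proposal.append(draft_fn(out + proposal))
        # 2. The target checks every position. A real system scores all k
        #    positions in one batched forward pass, counted here as one pass.
        passes += 1
        accepted: List[str] = []
        for token in proposal:
            expected = target_fn(out + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # replace the first mismatch, then stop
                break
        else:
            accepted.append(target_fn(out + accepted))  # bonus token: all drafts accepted
        out.extend(accepted)
    return out[: len(prompt) + max_new], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode(target, draft, ["the"], max_new=8)
    print(" ".join(tokens), "| target passes:", passes)
```

In this toy run the output is identical to what the target would produce on its own, but eight new tokens need only two verification passes instead of eight sequential target steps. Real-world gains depend on how often the draft agrees with the target.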
Topics
- Instructed-Retriever-1
- Parallel Test-Time Scaling
- Retrieval-Augmented Generation
- LLM Inference Optimization
- Knowledge Assistant
- Enterprise Search
Best for: AI Architect, NLP Engineer, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.