Search Behind-the-Scenes: How Neeva Uses Human Evaluation to Measure Search Quality
Summary
Neeva, a private and ad-free search engine founded by ex-Googlers, employs human raters to evaluate search quality, particularly for technical programming queries. This approach addresses the limitations of traditional engagement metrics like clickthrough rate, dwell time, and number of searches, which can be misleading and incentivize "clickbait" over useful results. Neeva's methodology involves building specialized rating teams with programming backgrounds and using personalized query sets derived from raters' own search histories, so that each rater already knows the intent behind the queries being judged. The evaluation design uses a 5-point Likert scale for side-by-side comparisons of Neeva and Google, assessing both individual search results and overall Search Engine Results Page (SERP) satisfaction. This human evaluation framework supports offline experimentation, machine learning model training, and the creation of "golden sets" for performance tracking, and it gives data scientists insight into the search engine's strengths and weaknesses.
Key takeaway
For AI Product Managers developing search engines, relying solely on engagement metrics like clicks or dwell time can reward "clickbait" over genuinely useful results. You should integrate human evaluation with specialized raters, especially for niche domains, to accurately measure search quality, train models, and validate experiments. This approach provides granular, per-result feedback and keeps optimization aligned with user satisfaction rather than short-term engagement.
Key insights
Human evaluation with specialized raters is critical for accurate search quality measurement, because it captures quality signals that engagement metrics miss.
Principles
- Engagement metrics can mislead search quality assessment.
- Specialized raters improve relevance evaluation accuracy.
- Side-by-side comparisons yield clearer distinctions.
Method
Neeva uses human raters with programming expertise to conduct side-by-side evaluations of search results for personalized technical queries, scoring relevance and overall SERP satisfaction on a 5-point Likert scale.
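To make the scoring step concrete, here is a minimal sketch of how side-by-side 5-point Likert ratings might be summarized. It is an illustration, not Neeva's actual pipeline: the `SideBySideRating` record, the `summarize` function, and the choice of mean score plus win/tie/loss rates are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SideBySideRating:
    """One rater's SERP-satisfaction scores for a single query (hypothetical schema)."""
    query: str
    neeva_score: int   # 1 (very dissatisfied) .. 5 (very satisfied)
    google_score: int  # same scale, judged on the same query


def summarize(ratings: list[SideBySideRating]) -> dict[str, float]:
    """Return mean scores per engine and Neeva's win/tie/loss rates."""
    if not ratings:
        raise ValueError("need at least one rating")
    for r in ratings:
        for score in (r.neeva_score, r.google_score):
            if not 1 <= score <= 5:
                raise ValueError(f"score out of Likert range for {r.query!r}")

    n = len(ratings)
    wins = sum(r.neeva_score > r.google_score for r in ratings)
    losses = sum(r.neeva_score < r.google_score for r in ratings)
    return {
        "mean_neeva": mean(r.neeva_score for r in ratings),
        "mean_google": mean(r.google_score for r in ratings),
        "win_rate": wins / n,
        "loss_rate": losses / n,
        "tie_rate": (n - wins - losses) / n,
    }


# Made-up ratings, purely for illustration.
example_ratings = [
    SideBySideRating("python sort dict by value", 5, 3),
    SideBySideRating("rust borrow checker error", 2, 4),
    SideBySideRating("git undo last commit", 4, 4),
    SideBySideRating("numpy broadcast shapes", 5, 4),
]
summary = summarize(example_ratings)
```

Reporting win/tie/loss rates alongside the means is a design choice for this sketch: the means show the size of the gap, while the rates show how often one engine is preferred on the same query.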
In practice
- Use human evaluation to run rapid offline experiments on ranking changes.
- Generate high-quality datasets for ML training.
- Establish "golden sets" for quarterly performance OKRs.
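The "golden set" idea in the list above can be sketched as a fixed collection of queries with human-rated results, against which any ranker version is scored. This is a simplified illustration under stated assumptions: the `GoldenSet` layout, the `golden_set_score` function, and the metric (mean human rating of the top-k results, skipping unrated URLs) are hypothetical choices for this example, not something the original article specifies.

```python
from statistics import mean

# Hypothetical layout: query -> {result URL: human rating on the 1..5 scale}.
GoldenSet = dict[str, dict[str, int]]


def golden_set_score(
    golden: GoldenSet,
    ranked_results: dict[str, list[str]],
    k: int = 3,
) -> float:
    """Average, over golden queries, of the mean human rating of the top-k results.

    URLs without a human rating are skipped rather than guessed at.
    """
    per_query = []
    for query, labels in golden.items():
        top_k = ranked_results.get(query, [])[:k]
        rated = [labels[url] for url in top_k if url in labels]
        if rated:
            per_query.append(mean(rated))
    if not per_query:
        raise ValueError("no rated results found for any golden query")
    return mean(per_query)


# Made-up golden set and two ranker outputs, purely for illustration.
golden = {
    "python sort dict by value": {"a": 5, "b": 2, "c": 4},
    "rust borrow checker error": {"x": 3, "y": 5},
}
baseline = {
    "python sort dict by value": ["b", "a", "c"],
    "rust borrow checker error": ["x", "y"],
}
candidate = {
    "python sort dict by value": ["a", "c", "b"],
    "rust borrow checker error": ["y", "x"],
}

baseline_score = golden_set_score(golden, baseline, k=2)
candidate_score = golden_set_score(golden, candidate, k=2)
```

Because the queries and ratings stay fixed, the same score can be recomputed for each ranker version and compared over time, which is what makes such a set usable as a tracked target.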
Topics
- Search Quality Measurement
- Human Evaluation
- Search Ranking
- Programming Queries
- Neeva Search Engine
Best for: AI Scientist, Research Scientist, AI Product Manager, Machine Learning Engineer, AI Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.