HumanMCP: A Human-Like Query Dataset for Evaluating MCP Tool Retrieval Performance
Summary
HumanMCP is a new large-scale dataset designed to evaluate the tool retrieval performance of Large Language Models (LLMs) within the Model Context Protocol (MCP) ecosystem. It addresses a critical gap by providing diverse, human-like user queries for approximately 2,800 tools across 308 MCP servers, building upon the MCP-Zero dataset. The dataset was generated using a two-stage Generator-Critic pipeline, with gpt-4o-mini as the generator and gpt-4o as the critic, to create queries reflecting five distinct user personas ranging from novice to expert. Experiments with GPT-4o-mini, Gemini 2.0 Flash, and Claude 3.5 Haiku showed that model accuracy degrades by about 10% as context size increases from 10 to 100 tools. For large contexts (up to 2,000 tools), Retrieval-Augmented Generation (RAG) with semantic retrieval (SentenceTransformer) improved Gemini 2.0 Flash's accuracy by 11 percentage points over raw context processing.
Key takeaway
NLP engineers building LLM agents that call external tools should add HumanMCP to their evaluation pipelines to assess performance on realistic queries. Its diverse, persona-driven queries reveal how a model handles varying user intents and context sizes, especially in environments with thousands of tools. As the tool ecosystem scales, prioritize semantic retrieval within a RAG architecture, because direct LLM processing of the full tool list loses accuracy in noisy, long-context scenarios.
Key insights
HumanMCP offers a persona-driven dataset for evaluating LLM tool retrieval, highlighting performance degradation in large contexts.
Principles
- Realistic user queries improve LLM tool-use evaluation.
- LLM tool retrieval accuracy degrades with increasing context size.
- RAG with semantic retrieval enhances performance in large tool environments.
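The second principle can be checked with a small harness that hides the correct tool among a growing number of distractors and records accuracy at each context size. The following is a minimal sketch, not the paper's evaluation code: the record shape `(query, gold_tool)`, the `select` callable, and the function name `accuracy_at_context_size` are all hypothetical, and `select` stands in for whatever model call is being evaluated.

```python
import random
from typing import Callable, Sequence

# Hypothetical record shape: (user query, name of the correct tool).
Example = tuple[str, str]
# Hypothetical selector: given a query and candidate tool names, return one name.
Selector = Callable[[str, Sequence[str]], str]


def accuracy_at_context_size(
    examples: Sequence[Example],
    all_tools: Sequence[str],
    select: Selector,
    context_size: int,
    seed: int = 0,
) -> float:
    """Fraction of queries for which `select` picks the gold tool
    when it is shown among `context_size` candidates."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    correct = 0
    for query, gold in examples:
        distractors = [t for t in all_tools if t != gold]
        if context_size - 1 > len(distractors):
            raise ValueError("not enough tools to fill the requested context")
        # Sample distractors, add the gold tool, and shuffle its position.
        candidates = rng.sample(distractors, context_size - 1) + [gold]
        rng.shuffle(candidates)
        if select(query, candidates) == gold:
            correct += 1
    return correct / len(examples)
```

Calling this function for context sizes such as 10, 50, and 100 with the same selector gives one accuracy value per size, which makes any degradation visible as a simple series.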
Method
A two-stage Generator-Critic pipeline (gpt-4o-mini as generator, gpt-4o as critic) synthesizes persona-based, realistic user queries from tool metadata, using a feedback loop for quality.
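The feedback loop described above can be sketched as a generate, critique, retry cycle. This is an illustrative outline only: the `Verdict` type, the `generate` and `critique` callables, and the `max_rounds` limit are assumptions introduced here, not details from the paper. In a real pipeline the two callables would wrap calls to the generator and critic models.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Hypothetical critic output: accept flag plus feedback for the next round."""
    accepted: bool
    feedback: str = ""


# Hypothetical signatures for the two stages.
Generator = Callable[[dict, str, str], str]   # (tool metadata, persona, feedback) -> query
Critic = Callable[[dict, str, str], Verdict]  # (tool metadata, persona, query) -> verdict


def synthesize_query(
    tool: dict,
    persona: str,
    generate: Generator,
    critique: Critic,
    max_rounds: int = 3,  # assumed retry budget, not taken from the source
) -> Optional[str]:
    """Return the first query the critic accepts, or None if none is accepted."""
    feedback = ""
    for _ in range(max_rounds):
        query = generate(tool, persona, feedback)
        verdict = critique(tool, persona, query)
        if verdict.accepted:
            return query
        # Pass the critic's feedback into the next generation attempt.
        feedback = verdict.feedback
    return None
```

Keeping the two stages as plain callables makes it easy to test the loop with stubs before connecting any model API.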
In practice
- Use HumanMCP to benchmark LLM tool selection.
- Implement RAG for LLM tool retrieval in large toolsets.
- Prioritize semantic retrievers over keyword-based methods for RAG.
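The retrieval step behind these recommendations can be sketched as: embed the query and every tool description, rank tools by cosine similarity, and pass only the top k to the LLM. The sketch below is a minimal illustration under stated assumptions. The names `retrieve_tools` and `toy_embed_batch` are hypothetical, and `toy_embed_batch` is a word-count stand-in used only to keep the example self-contained; it is not semantic, and a real setup would replace it with an embedding model.

```python
import math
from typing import Callable, Sequence

# Hypothetical batch embedder: list of texts in, one vector per text out.
EmbedBatch = Callable[[list[str]], Sequence[Sequence[float]]]


def toy_embed_batch(texts: list[str]) -> list[list[float]]:
    """Stand-in embedder (word counts over the batch vocabulary). NOT semantic."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    return [[float(t.lower().split().count(w)) for w in vocab] for t in texts]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_tools(
    query: str,
    tools: dict[str, str],  # tool name -> tool description
    embed_batch: EmbedBatch = toy_embed_batch,
    k: int = 5,
) -> list[str]:
    """Return the names of the k tools whose descriptions best match the query."""
    names = list(tools)
    # Embed the query and all descriptions in one batch.
    vectors = embed_batch([query] + [tools[n] for n in names])
    query_vec, tool_vecs = vectors[0], vectors[1:]
    scored = sorted(
        zip(names, tool_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]
```

Because the embedder is a parameter, a sentence-embedding model can be plugged in without changing the ranking logic, for example by passing the `encode` method of a loaded `SentenceTransformer` model, which accepts a list of strings and returns one vector per string.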
Topics
- Model Context Protocol
- LLM Tool Retrieval
- Human-like Query Datasets
- Retrieval-Augmented Generation
- Long-Context LLMs
Best for: AI Scientist, Research Scientist, NLP Engineer, AI Researcher, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.