HumanMCP: A Human-Like Query Dataset for Evaluating MCP Tool Retrieval Performance

2025-03-26 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, long

Summary

HumanMCP is a new large-scale dataset designed to evaluate the tool retrieval performance of Large Language Models (LLMs) within the Model Context Protocol (MCP) ecosystem. It addresses a critical gap by providing diverse, human-like user queries for approximately 2,800 tools across 308 MCP servers, building upon the MCP Zero dataset. The dataset was generated using a two-stage Generator-Critic pipeline, employing gpt-4o-mini as the generator and gpt-4o as the critic, to create queries reflecting five distinct user personas (novice to expert). Experiments with GPT-4o-mini, Gemini 2.0 Flash, and Claude 3.5 Haiku showed that model accuracy degrades by about 10% as context size increases from 10 to 100 tools. For large contexts (up to 2,000 tools), Retrieval-Augmented Generation (RAG) with semantic retrieval (SentenceTransformer) improved Gemini 2.0 Flash's accuracy by 11 percentage points over raw context processing.

Key takeaway

If you are an NLP engineer building LLM agents that call external tools, integrate HumanMCP into your evaluation pipeline to assess performance on realistic queries. Its diverse, persona-driven queries show how your models handle varying user intents and context sizes, especially in environments with thousands of tools. As your tool ecosystem scales, prioritize semantic retrieval within RAG architectures, because direct LLM processing degrades significantly in noisy, long-context scenarios.

Key insights

HumanMCP offers a persona-driven dataset for evaluating LLM tool retrieval, highlighting performance degradation in large contexts.

Principles

Realistic user queries improve LLM tool-use evaluation.
LLM tool retrieval accuracy degrades with increasing context size.
RAG with semantic retrieval enhances performance in large tool environments.
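
One way to observe the second principle on your own models is to hold the queries fixed and vary only the number of candidate tools shown to the selector. The sketch below is a minimal harness under stated assumptions: `select_tool` is a hypothetical callable standing in for an LLM call, the tool names are synthetic, and the `oracle` selector exists only so the example runs offline. It is not the paper's evaluation code.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def accuracy_by_context_size(
    examples: Sequence[Tuple[str, str]],      # (query, gold tool name) pairs
    all_tools: Sequence[str],                 # full tool catalogue
    select_tool: Callable[[str, List[str]], str],  # hypothetical LLM wrapper
    sizes: Sequence[int] = (10, 50, 100),
    seed: int = 0,
) -> Dict[int, float]:
    """Return selection accuracy for each candidate-set size."""
    rng = random.Random(seed)
    results: Dict[int, float] = {}
    for n in sizes:
        correct = 0
        for query, gold in examples:
            # Sample n - 1 distractors; raises ValueError if the catalogue is too small.
            pool = [t for t in all_tools if t != gold]
            candidates = rng.sample(pool, n - 1) + [gold]
            rng.shuffle(candidates)  # avoid a fixed position for the gold tool
            if select_tool(query, candidates) == gold:
                correct += 1
        results[n] = correct / len(examples)
    return results


# Offline demo with synthetic data and a selector that always finds the gold tool.
all_tools = [f"tool_{i}" for i in range(200)]
examples = [(f"please use tool_{i}", f"tool_{i}") for i in (3, 30, 77)]


def oracle(query: str, candidates: List[str]) -> str:
    wanted = query.split()[-1]
    return wanted if wanted in candidates else candidates[0]


scores = accuracy_by_context_size(examples, all_tools, oracle)
```

Replacing `oracle` with a wrapper around a real model, and `examples` with HumanMCP queries, would give one accuracy figure per context size that can be compared across models.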

Method

A two-stage Generator-Critic pipeline (gpt-4o-mini as generator, gpt-4o as critic) synthesizes persona-based, realistic user queries from tool metadata, using a feedback loop for quality.
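
The feedback loop described above can be sketched as follows. This is an illustrative outline, not the authors' pipeline: `generate` and `critique` are hypothetical callables standing in for the generator and critic model calls, `Verdict` and `max_rounds` are names introduced here, and the stub critic's rule (reject queries that mention the tool name) is an assumption chosen only to make the loop visible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Tool = Dict[str, str]  # assumed shape: {"name": ..., "description": ...}


@dataclass
class Verdict:
    accepted: bool
    feedback: str = ""


def synthesize_query(
    tool: Tool,
    persona: str,
    generate: Callable[[Tool, str, str], str],      # hypothetical generator call
    critique: Callable[[Tool, str, str], Verdict],  # hypothetical critic call
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate a query, revise it on critic feedback, give up after max_rounds."""
    feedback = ""
    for _ in range(max_rounds):
        query = generate(tool, persona, feedback)
        verdict = critique(tool, persona, query)
        if verdict.accepted:
            return query
        feedback = verdict.feedback  # passed to the generator on the next round
    return None  # no acceptable query within the round budget


# Stubs so the sketch runs offline; real versions would call the two models.
def stub_generate(tool: Tool, persona: str, feedback: str) -> str:
    if not feedback:
        return f"Please run {tool['name']} for me"
    return f"As a {persona}, I need help: {tool['description']}"


def stub_critique(tool: Tool, persona: str, query: str) -> Verdict:
    # Assumed quality rule: a realistic user does not know the tool's name.
    if tool["name"] in query:
        return Verdict(False, "Describe the need without naming the tool.")
    return Verdict(True)


tool = {"name": "get_weather", "description": "look up the forecast for a city"}
query = synthesize_query(tool, "novice", stub_generate, stub_critique)
```

Returning `None` after the round budget keeps rejected drafts out of the dataset instead of accepting the last attempt by default, which is one possible design choice among several.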

In practice

Use HumanMCP to benchmark LLM tool selection.
Implement RAG for LLM tool retrieval in large toolsets.
Prioritize semantic retrievers over keyword-based methods for RAG.
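
A retrieval step of this kind can be sketched as top-k ranking by cosine similarity between a query embedding and tool-description embeddings, so that only the shortlisted tools are placed in the LLM's context. The embedder is pluggable here. `toy_embed` is a bag-of-words stand-in used only so the example runs without downloading a model, and it is lexical rather than semantic. In practice you would swap in a sentence-embedding model such as one loaded through SentenceTransformer. All function names and the sample tools are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]  # sparse vector keyed by token


def toy_embed(text: str) -> Vector:
    """Stand-in embedder (word counts). Not semantic; replace with a real model."""
    return {tok: float(n) for tok, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_tools(
    query: str,
    tools: Dict[str, str],            # tool name -> description
    embed: Callable[[str], Vector],
    k: int = 5,
) -> List[str]:
    """Return the names of the k tools whose descriptions best match the query."""
    q_vec = embed(query)
    scored = [(cosine(q_vec, embed(desc)), name) for name, desc in tools.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


tools = {
    "get_weather": "get the current weather forecast for a city",
    "send_email": "send an email message to a recipient",
    "search_files": "search files on disk by name",
}
shortlist = retrieve_tools("what is the weather forecast in Paris", tools, toy_embed, k=1)
```

For a catalogue of thousands of tools, the description embeddings would normally be computed once and cached instead of being recomputed per query as in this sketch.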

Topics

Model Context Protocol
LLM Tool Retrieval
Human-like Query Datasets
Retrieval-Augmented Generation
Long-Context LLMs

Best for: AI Scientist, Research Scientist, NLP Engineer, AI Researcher, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.