HumanMCP: A Human-Like Query Dataset for Evaluating MCP Tool Retrieval Performance

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Advanced, long

Summary

HumanMCP is a new large-scale dataset designed to evaluate the tool retrieval performance of Large Language Models (LLMs) within the Model Context Protocol (MCP) ecosystem. It addresses a critical gap by providing diverse, human-like user queries for approximately 2,800 tools across 308 MCP servers, building upon the MCP Zero dataset. The dataset was generated using a two-stage Generator-Critic pipeline, employing gpt-4o-mini as the generator and gpt-4o as the critic, to create queries reflecting five distinct user personas (novice to expert). Experiments with GPT-4o-mini, Gemini 2.0 Flash, and Claude 3.5 Haiku showed that model accuracy degrades by about 10% as context size increases from 10 to 100 tools. For large contexts (up to 2,000 tools), Retrieval-Augmented Generation (RAG) with semantic retrieval (SentenceTransformer) improved Gemini 2.0 Flash's accuracy by 11 percentage points over raw context processing.
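
The context-size experiment described above can be approximated with a small harness: for each query, place the correct tool among randomly sampled distractors and check whether a selector picks it. This is a minimal sketch, not the paper's evaluation code. The names `accuracy_at_context_size` and `select_tool`, the tool dictionary layout, and the sampling scheme are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Tool = Dict[str, str]  # assumed layout: {"name": ..., "description": ...}
Selector = Callable[[str, Sequence[Tool]], str]  # returns the chosen tool name


def accuracy_at_context_size(
    examples: List[Tuple[str, str]],  # (user query, correct tool name)
    tools: List[Tool],
    select_tool: Selector,  # hypothetical wrapper around the model under test
    context_size: int,
    seed: int = 0,
) -> float:
    """Fraction of queries for which the selector returns the correct tool
    when it is shown alongside context_size - 1 random distractors."""
    if not examples:
        raise ValueError("examples must not be empty")
    if not 1 <= context_size <= len(tools):
        raise ValueError("context_size must be between 1 and len(tools)")

    rng = random.Random(seed)  # fixed seed keeps runs comparable
    by_name = {tool["name"]: tool for tool in tools}
    correct = 0
    for query, gold_name in examples:
        distractors = [t for t in tools if t["name"] != gold_name]
        context = rng.sample(distractors, context_size - 1)
        context.append(by_name[gold_name])
        rng.shuffle(context)  # avoid leaking the answer through position
        if select_tool(query, context) == gold_name:
            correct += 1
    return correct / len(examples)
```

Calling this function at several context sizes (for example 10 and 100) with the same seed gives a rough view of how accuracy changes as the candidate list grows.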

Key takeaway

NLP engineers building LLM agents that call external tools should integrate HumanMCP into their evaluation pipelines to assess performance on realistic queries. The dataset's diverse, persona-driven queries reveal how models handle varying user intents and context sizes, especially in environments with thousands of tools. As the tool ecosystem scales, prioritize semantic retrieval within a RAG architecture to maintain accuracy, because direct LLM processing degrades significantly in noisy, long-context scenarios.
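
The retrieval step recommended above can be sketched as a shortlist function: score every tool description against the query and pass only the top few to the LLM. The summary names SentenceTransformer for the embedding step. The sketch below deliberately swaps in a toy token-count cosine similarity so that it runs with the standard library alone. `shortlist_tools`, `toy_similarity`, and the tool dictionary layout are hypothetical names, and a real setup would supply its own embedding-based `similarity` function.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Sequence

Tool = Dict[str, str]  # assumed layout: {"name": ..., "description": ...}


def _token_counts(text: str) -> Counter:
    # Lowercase and split on non-alphanumerics, so "get_forecast" -> get, forecast.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def toy_similarity(query: str, text: str) -> float:
    """Cosine similarity over raw token counts.
    A stand-in for a sentence-embedding model, not a replacement for one."""
    a, b = _token_counts(query), _token_counts(text)
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shortlist_tools(
    query: str,
    tools: Sequence[Tool],
    top_k: int = 5,
    similarity: Callable[[str, str], float] = toy_similarity,
) -> List[Tool]:
    """Return the top_k tools whose name and description best match the query."""
    scored = [
        (similarity(query, tool["name"] + " " + tool["description"]), tool)
        for tool in tools
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [tool for _, tool in scored[:top_k]]
```

Only the shortlist is then placed in the prompt, which keeps the LLM's context small even when the full catalog holds thousands of tools.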

Key insights

HumanMCP offers a persona-driven dataset for evaluating LLM tool retrieval, highlighting performance degradation in large contexts.

Method

A two-stage Generator-Critic pipeline (gpt-4o-mini as generator, gpt-4o as critic) synthesizes realistic, persona-based user queries from tool metadata, with a feedback loop between the two stages to maintain quality.
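
One way to picture such a pipeline is a bounded loop in which the critic either accepts a draft query or returns feedback for the next attempt. This is a sketch under stated assumptions, not the authors' implementation. The `Review` type, the `generate` and `critique` callables (which would wrap the actual model calls), and the `max_rounds` limit are hypothetical, and the source does not state the critic's acceptance criteria.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Tool = Dict[str, str]  # assumed layout: {"name": ..., "description": ...}


@dataclass
class Review:
    accepted: bool
    feedback: str  # guidance passed to the generator on the next round


# Hypothetical signatures: in practice these would wrap LLM API calls.
Generator = Callable[[Tool, str, Optional[str]], str]  # (tool, persona, feedback) -> query
Critic = Callable[[Tool, str, str], Review]  # (tool, persona, query) -> review


def synthesize_query(
    tool: Tool,
    persona: str,
    generate: Generator,
    critique: Critic,
    max_rounds: int = 3,  # assumed cap, so the loop always terminates
) -> Optional[str]:
    """Return the first query the critic accepts, or None if none is accepted."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        query = generate(tool, persona, feedback)
        review = critique(tool, persona, query)
        if review.accepted:
            return query
        feedback = review.feedback  # feed the critique into the next draft
    return None
```

Passing the two stages in as callables keeps the loop testable with simple stubs before any model is connected.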

Best for: AI Scientist, Research Scientist, NLP Engineer, AI Researcher, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.