More than a Judge: An Empirical Study of Agent-Human Interaction in Crowdsourced Testing Assessment
Summary
This empirical study, published in June 2026, investigates how agent-generated feedback affects human performance in crowdsourced software testing workflows. It builds on a prior multi-agent LLM-as-a-Judge assessment backbone that evaluates reports on textuality, adequacy, and competitiveness. The authors ran a controlled four-stage human-subject study with 20 testers across three real-world applications (APP1, APP2, APP3). The results show three effects: immediate improvements in revised reports (e.g., Group A: +27.65% Adequacy); better first submissions on new tasks after prior feedback exposure (Group A outperformed Group B on TC, DE, and TXT by +11.44%, +5.58%, and +8.51%, respectively); and evidence of partial skill transfer to a later application (group averages rose +7.31% on TC and +7.40% on ADQ from APP1 to APP3). Responses from 17 participants to a post-task questionnaire corroborated that the feedback was understandable and actionable.
Key takeaway
For Directors of AI/ML overseeing crowdsourced testing initiatives, the study suggests that using LLM-powered assessment agents as feedback providers, not only as post-hoc judges, can improve report quality and tester capabilities. Deploy these agents upstream in your workflow so that testers receive actionable guidance on textuality and requirement coverage before final submission. The reported gains on revised reports, on first submissions to new tasks, and on a later application point to continuous improvement and partially transferable skills among crowdworkers, which may in turn reduce manual review burden and make defect discovery more efficient.
Key insights
Agent-generated feedback, derived from LLM-based assessment, improves crowdsourced test report quality and fosters partial skill transfer to later tasks.
Principles
- Feedback improves report textuality and adequacy.
- Prior feedback enhances new task performance.
- Skill transfer is partial but meaningful across tasks.
Method
A multi-agent LLM framework assesses reports on textuality, adequacy, and competitiveness. Its feedback component converts checklist findings and requirement-tree mapping into actionable revision guidance for testers.
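To make the feedback step more concrete, the following is a minimal sketch of how checklist findings and a requirement-tree mapping might be turned into revision guidance. It is not the authors' implementation: the `ChecklistFinding` and `Requirement` structures, the `build_feedback` function, and the message wording are all hypothetical, and the LLM assessment that would produce the findings is assumed to have already run.

```python
from dataclasses import dataclass, field


@dataclass
class ChecklistFinding:
    """One textuality check on a report (hypothetical structure)."""
    item: str       # what the check requires, e.g. "Expected result is stated"
    passed: bool    # whether the report satisfied the check
    note: str = ""  # optional explanation from the assessing agent


@dataclass
class Requirement:
    """A node in a requirement tree (hypothetical structure)."""
    name: str
    covered: bool = False
    children: list["Requirement"] = field(default_factory=list)


def uncovered_leaves(node: Requirement, path: tuple[str, ...] = ()) -> list[str]:
    """Return ' > '-joined paths of leaf requirements the report did not cover."""
    here = path + (node.name,)
    if not node.children:
        return [] if node.covered else [" > ".join(here)]
    gaps: list[str] = []
    for child in node.children:
        gaps.extend(uncovered_leaves(child, here))
    return gaps


def build_feedback(findings: list[ChecklistFinding], root: Requirement) -> list[str]:
    """Turn failed checks and coverage gaps into revision suggestions."""
    messages = []
    for f in findings:
        if not f.passed:
            detail = f" ({f.note})" if f.note else ""
            messages.append(f"Textuality: revise so that '{f.item}' holds{detail}.")
    for gap in uncovered_leaves(root):
        messages.append(f"Adequacy: add a test case covering '{gap}'.")
    return messages


# Illustrative input only; these findings and requirements are made up.
findings = [
    ChecklistFinding("Reproduction steps are listed", passed=False, note="steps missing"),
    ChecklistFinding("Expected result is stated", passed=True),
]
tree = Requirement("Login", children=[
    Requirement("Valid credentials", covered=True),
    Requirement("Password reset"),
])
for line in build_feedback(findings, tree):
    print(line)
```

The sketch keeps the two feedback sources separate, so a tester can see whether a suggestion concerns how the report is written or which functionality it misses.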
In practice
- Integrate LLM feedback for immediate report revisions.
- Use agent guidance to improve test case clarity.
- Apply coverage feedback to address functional gaps.
Topics
- Crowdsourced Testing
- Agent-Human Interaction
- LLM-as-a-Judge
- Large Language Models
- Software Testing
- Feedback Mechanisms
Best for: AI Scientist, Research Scientist, Software Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.