More than a Judge: An Empirical Study of Agent-Human Interaction in Crowdsourced Testing Assessment

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Advanced, extended

Summary

This empirical study, published in June 2026, investigates how agent-generated feedback affects human performance in crowdsourced software testing workflows. It builds on a prior multi-agent LLM-as-a-Judge assessment backbone that evaluates reports on textuality, adequacy, and competitiveness. The authors ran a controlled four-stage human-subject study with 20 testers across three real-world applications (APP1, APP2, APP3). The results show three effects. First, agent-generated feedback led to immediate improvements in revised reports (e.g., Group A: +27.65% Adequacy). Second, prior exposure to feedback led to better first submissions on new tasks: Group A outperformed Group B on TC, DE, and TXT by +11.44%, +5.58%, and +8.51% respectively. Third, there was evidence of partial skill transfer to a later application, with group averages rising +7.31% on TC and +7.40% on ADQ from APP1 to APP3. Responses to a post-task questionnaire from 17 participants corroborated that the feedback was understandable and actionable.

Key takeaway

For Directors of AI/ML overseeing crowdsourced testing initiatives, using LLM-powered assessment agents as feedback providers, rather than only as post-hoc judges, can improve report quality and tester capabilities. Deploy these agents upstream in your workflow to offer actionable guidance on textuality and requirement coverage. In this 20-tester study, that approach improved revised reports and later first submissions and showed partial skill transfer across applications; it may also reduce manual review burden and make defect discovery more efficient.

Key insights

Agent-generated feedback, derived from LLM-based assessment, improves crowdsourced test report quality and shows evidence of partial skill transfer to later tasks.

Method

A multi-agent LLM framework assesses reports on textuality and adequacy. Feedback converts checklist findings and requirement-tree mapping into actionable revision guidance for testers.
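
The conversion step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `ChecklistFinding` and `Requirement` structures, the `build_feedback` function, and the wording of the guidance lines are all hypothetical names and formats assumed here for illustration, and the LLM agents that would produce the findings and the coverage set are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class ChecklistFinding:
    """One textuality check on a report (hypothetical structure)."""
    item: str     # what was checked, e.g. "expected result stated"
    passed: bool  # whether the report satisfied the check
    hint: str     # revision advice shown when the check fails


@dataclass
class Requirement:
    """Node in a hypothetical requirement tree."""
    name: str
    children: list["Requirement"] = field(default_factory=list)


def leaf_requirements(node):
    """Return the names of all leaf requirements under node, depth-first."""
    if not node.children:
        return [node.name]
    leaves = []
    for child in node.children:
        leaves.extend(leaf_requirements(child))
    return leaves


def build_feedback(findings, tree, covered):
    """Turn failed checks and uncovered leaf requirements into guidance lines."""
    lines = [f"Textuality: {f.item} -> {f.hint}" for f in findings if not f.passed]
    missing = [name for name in leaf_requirements(tree) if name not in covered]
    lines += [f"Adequacy: no test evidence for requirement '{name}'" for name in missing]
    return lines


# Illustrative inputs only; none of these values come from the study.
tree = Requirement("login", [Requirement("valid password"), Requirement("password reset")])
findings = [
    ChecklistFinding("steps are numbered", True, ""),
    ChecklistFinding("expected result stated", False, "add the expected result for each step"),
]
for line in build_feedback(findings, tree, covered={"valid password"}):
    print(line)
```

Keeping the assessment output as structured findings and a coverage set, and generating the tester-facing text in a separate function, is one way to let the same assessment serve both as a score and as revision guidance.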

Best for: AI Scientist, Research Scientist, Software Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.