Same Scrutiny, More Time: Eye Tracking Insights into Reviewing LLM-Labelled Code
Summary
A Wizard-of-Oz experiment with 32 software engineers investigated how explicitly labeling code as LLM-generated influences code review behavior. The researchers combined eye tracking with exit interviews, analyzing fixation durations and saccade lengths on human-written code that was presented either as LLM-generated or without a label. Review thoroughness, measured by saccade lengths, remained unchanged, but participants spent significantly more time fixating on LLM-labeled code, with increases of up to 60% for complex segments. Qualitatively, 37.5% of reviewers adjusted their evaluation criteria based on their trust in AI, prioritizing either logical correctness or code quality. In addition, 14 participants used the embedded prompt either as a requirement specification to validate the code against or as contextual documentation, which highlights its role as a valuable software artifact.
Key takeaway
Software engineering teams adopting LLMs for code generation should recognize that simply labeling code as AI-generated increases review time, especially for complex sections, without a corresponding gain in review thoroughness. Implement clear, structured AI policies that specify "how" to verify LLM-generated code, including constraints on code size and submission timing. Also make prompts available on demand within the development environment as metadata, so reviewers can validate code against the original intent without disrupting their workflow.
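One way to picture prompts as on-demand metadata is a small lookup that maps a line of code back to the prompt that produced it. The sketch below is illustrative only: the `PromptRecord` type, its fields, and the `prompts_for_line` helper are hypothetical names, not part of the study or of any particular IDE or review tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PromptRecord:
    """Hypothetical metadata linking a prompt to the code range it produced."""
    file: str
    start_line: int  # first generated line, inclusive
    end_line: int    # last generated line, inclusive
    prompt: str


def prompts_for_line(records: List[PromptRecord], file: str, line: int) -> List[str]:
    """Return every prompt whose recorded range covers the given line."""
    return [
        r.prompt
        for r in records
        if r.file == file and r.start_line <= line <= r.end_line
    ]


# Example data, invented for illustration.
records = [
    PromptRecord("billing.py", 10, 42, "Write a function that prorates a monthly fee."),
    PromptRecord("billing.py", 60, 75, "Add input validation for negative amounts."),
]
```

A review tool could call such a lookup only when the reviewer asks for it, which keeps the prompt out of the way until it is needed.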
Key insights
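Explicitly labeling code as LLM-generated increases review time without improving thoroughness. Because all reviewed code was human-written, the effect stems from the code's perceived origin rather than from the code itself.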
Principles
- Perceived code provenance impacts attention.
- Self-reported trust doesn't predict review thoroughness.
- AI policies need concrete verification steps.
Method
A Wizard-of-Oz experiment combined eye tracking and exit interviews with 32 software engineers who reviewed human-written code. Some segments were explicitly labeled as LLM-generated, which isolates the effect of perceived origin from any real difference in the code.
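To make the two eye-tracking measures named in the summary concrete, the sketch below computes total fixation time and mean saccade length from a sequence of fixations. It assumes a simplified data format (pixel coordinates plus a duration per fixation) and treats saccade length as the straight-line distance between consecutive fixations; the study's actual processing pipeline is not described in this summary and may differ.

```python
from dataclasses import dataclass
from math import hypot
from typing import List


@dataclass
class Fixation:
    """One fixation: gaze position in pixels and dwell time (assumed format)."""
    x: float
    y: float
    duration_ms: float


def total_fixation_time(fixations: List[Fixation]) -> float:
    """Sum of fixation durations in milliseconds."""
    return sum(f.duration_ms for f in fixations)


def mean_saccade_length(fixations: List[Fixation]) -> float:
    """Mean Euclidean distance between consecutive fixations, in pixels."""
    if len(fixations) < 2:
        return 0.0  # no saccade can be formed from fewer than two fixations
    lengths = [
        hypot(b.x - a.x, b.y - a.y)
        for a, b in zip(fixations, fixations[1:])
    ]
    return sum(lengths) / len(lengths)
```

Comparing these values between labeled and unlabeled segments mirrors the kind of contrast the study reports: longer fixation time with similar saccade lengths.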
In practice
- Integrate prompt-to-code traceability in IDEs.
- Define specific LLM code review guidelines.
- Monitor review time for LLM-generated code.
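As a minimal sketch of the monitoring suggestion in the list above, a team could compare median review time for changes labeled as LLM-generated against all other changes. The record layout (`llm_labeled`, `minutes`) and both function names are assumptions made for illustration; real data would come from whatever review tool the team uses.

```python
from statistics import median
from typing import Dict, List


def median_review_minutes(reviews: List[dict]) -> Dict[bool, float]:
    """Group reviews by the hypothetical `llm_labeled` flag and take the median time."""
    by_label: Dict[bool, List[float]] = {}
    for review in reviews:
        by_label.setdefault(review["llm_labeled"], []).append(review["minutes"])
    return {label: median(times) for label, times in by_label.items()}


def review_time_ratio(reviews: List[dict]) -> float:
    """Median time for labeled reviews divided by median time for unlabeled ones."""
    medians = median_review_minutes(reviews)
    if True not in medians or False not in medians:
        raise ValueError("need at least one labeled and one unlabeled review")
    return medians[True] / medians[False]


# Example data, invented for illustration.
reviews = [
    {"llm_labeled": True, "minutes": 30},
    {"llm_labeled": True, "minutes": 50},
    {"llm_labeled": False, "minutes": 20},
]
```

A ratio well above 1 would suggest that labeled code is taking longer to review, which is the pattern the study observed; it says nothing by itself about whether the extra time finds more defects.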
Topics
- Large Language Models
- Code Review
- Eye Tracking
- Software Engineering Policy
- Prompt Engineering
- Human-AI Interaction
Best for: Machine Learning Engineer, NLP Engineer, AI Scientist, Software Engineer, AI Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.