Same Scrutiny, More Time: Eye Tracking Insights into Reviewing LLM-Labelled Code

2026-06-26 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A Wizard-of-Oz experiment with 32 software engineers investigated how explicitly labeling code as LLM-generated influences code review behavior. Researchers used eye tracking and exit interviews, analyzing fixation durations and saccade lengths on human-written code that was presented either as LLM-generated or without a label. Review thoroughness, measured by saccade lengths, remained unchanged, but participants spent significantly more time fixating on LLM-labeled code, with increases of up to 60% for complex segments. Qualitatively, 37.5% of reviewers adjusted their evaluation criteria based on their trust in AI, prioritizing either logical correctness or code quality. Additionally, 14 participants used the embedded prompt either as a requirement specification to validate the code against or as contextual documentation, which suggests the prompt is a valuable software artifact in its own right.

Key takeaway

For software engineering teams adopting LLMs for code generation: simply labeling code as AI-generated increases review time, especially for complex sections, without a measurable change in review thoroughness. Teams should adopt clear, structured AI policies that specify how to verify LLM-generated code, including constraints on code size and submission timing. Prompts should also be available on demand within the development environment as metadata, so that reviewers can validate code against the original intent without disrupting their workflow.

Key insights

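Explicitly labeling code as LLM-generated increases review time without changing thoroughness. Because all of the reviewed code was in fact human-written, the effect is attributable to perceived origin rather than to the code itself.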

Principles

Perceived code provenance impacts attention.
Self-reported trust doesn't predict review thoroughness.
AI policies need concrete verification steps.

Method

A Wizard-of-Oz experiment used eye-tracking and interviews with 32 software engineers reviewing human-written code, some segments explicitly labeled as LLM-generated, to isolate the effect of perceived origin.
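As a rough illustration of the kind of comparison described above, the following Python sketch computes mean fixation time per code segment and label condition, then the relative increase for labeled code. It is not the authors' analysis pipeline: the `Fixation` record, its field names, the condition names, and the toy numbers are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixation:
    """One fixation event; all field names are hypothetical."""
    participant: str
    segment: str       # code segment (area of interest) being looked at
    condition: str     # "llm_labeled" or "unlabeled"
    duration_ms: float


def mean_fixation_time(fixations):
    """Mean total fixation time per (segment, condition), averaged over participants."""
    # Sum durations per participant first, so each participant contributes one value.
    per_participant = defaultdict(float)
    for f in fixations:
        per_participant[(f.segment, f.condition, f.participant)] += f.duration_ms

    totals = defaultdict(list)
    for (segment, condition, _), total in per_participant.items():
        totals[(segment, condition)].append(total)
    return {key: sum(vals) / len(vals) for key, vals in totals.items()}


def relative_increase(means, segment):
    """Fractional change in fixation time for labeled vs. unlabeled code."""
    baseline = means[(segment, "unlabeled")]
    return (means[(segment, "llm_labeled")] - baseline) / baseline


# Toy numbers for illustration only; they are not data from the study.
toy = [
    Fixation("p1", "parser", "unlabeled", 400.0),
    Fixation("p1", "parser", "unlabeled", 600.0),
    Fixation("p2", "parser", "llm_labeled", 900.0),
    Fixation("p2", "parser", "llm_labeled", 600.0),
]
means = mean_fixation_time(toy)
print(relative_increase(means, "parser"))
```

A real analysis would also need a significance test and a treatment of saccade lengths; this sketch covers only the descriptive step.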

In practice

Integrate prompt-to-code traceability in IDEs.
Define specific LLM code review guidelines.
Monitor review time for LLM-generated code.
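One way a team might act on the points above is to store the prompt alongside each change and track review time by declared origin. The sketch below is a minimal illustration under stated assumptions: `ChangeRecord`, its fields, and the toy records are hypothetical and do not correspond to any particular IDE, review tool, or to the study's materials.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical per-change metadata a team might store alongside a review."""
    change_id: str
    origin: str                   # "llm" or "human", as declared by the author
    review_seconds: float         # time the reviewer spent on the change
    prompt: Optional[str] = None  # original prompt, kept so reviewers can open it on demand


def median_review_time(records):
    """Median review time per declared origin."""
    by_origin = {}
    for r in records:
        by_origin.setdefault(r.origin, []).append(r.review_seconds)
    return {origin: median(times) for origin, times in by_origin.items()}


def missing_prompts(records):
    """IDs of LLM-origin changes whose prompt was not attached."""
    return [r.change_id for r in records if r.origin == "llm" and not r.prompt]


# Toy records for illustration only.
records = [
    ChangeRecord("c1", "human", 300.0),
    ChangeRecord("c2", "llm", 480.0, prompt="Write a function that parses ISO dates."),
    ChangeRecord("c3", "llm", 520.0),
    ChangeRecord("c4", "human", 340.0),
]
print(median_review_time(records))
print(missing_prompts(records))
```

Keeping the prompt as optional metadata, rather than embedding it in the diff, matches the suggestion that it be available on demand without interrupting the review.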

Topics

Large Language Models
Code Review
Eye Tracking
Software Engineering Policy
Prompt Engineering
Human-AI Interaction

Best for: Machine Learning Engineer, NLP Engineer, AI Scientist, Software Engineer, AI Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.