HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection

2026-06-12 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

HybridCodeAuthorship is a novel benchmark dataset designed for line-level code authorship detection in hybrid AI- and human-authored codebases. Addressing limitations of existing benchmarks that assume entirely human or AI code, this Python dataset simulates authentic AI code assistant usage by interleaving human and AI-generated lines. Constructed using a pipeline that leverages CodeSearchNet, the benchmark comprises 10,488 records from 4,196 Python files, featuring 17% (488,896) AI-generated lines out of 2,827,938 total. The dataset was created using LLMs like Llama3.3-70B, Llama-4-Scout, and GPT-OSS-120b. Initial benchmarking of two state-of-the-art detection algorithms, DroidDetect and AIGCode Detector, revealed the task's difficulty, with AIGCode Detector achieving F1 scores of 0.48 for chunk-level and 0.56 for line-level detection.

Key takeaway

For AI Engineers developing or evaluating AI-generated code detection algorithms, you should integrate HybridCodeAuthorship into your benchmarking process. This dataset provides a crucial, realistic testbed for line-level detection in hybrid codebases, moving beyond simplistic, fully AI- or human-authored snippets. Leveraging this benchmark will help you develop more robust algorithms capable of handling the complex, interleaved code found in modern industry applications, improving risk management and productivity analysis.

Key insights

The increasing hybrid nature of codebases necessitates fine-grained, line-level AI authorship detection benchmarks.

Principles

Industry codebases are increasingly hybrid, mixing human and AI contributions.
Existing AI code detection benchmarks lack real-world interleaved authorship.
Line-level detection is crucial for fine-grained risk and productivity analysis.

Method

The HybridCodeAuthorship pipeline involves code testing (unit test validation for human and AI code) and code interleaving (identifying, masking, and LLM-generating code segments based on prompts and target percentages).

In practice

Use HybridCodeAuthorship to benchmark line-level AI code detection.
Filter non-functional code using "Unit Test Passed" labels.
Focus detection algorithms on nontrivial code segments.

Topics

HybridCodeAuthorship
AI-Generated Code Detection
Line-Level Authorship
Benchmark Datasets
Large Language Models
CodeSearchNet

Code references

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.