HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection
Summary
HybridCodeAuthorship is a benchmark dataset for line-level code authorship detection in codebases that mix human-written and AI-generated code. Existing benchmarks assume a snippet is entirely human or entirely AI-authored; this Python dataset instead simulates realistic AI code assistant usage by interleaving human and AI-generated lines. Built with a pipeline on top of CodeSearchNet, the benchmark comprises 10,488 records from 4,196 Python files, of which 488,896 lines (about 17%) out of 2,827,938 total are AI-generated. The AI lines were produced with LLMs including Llama3.3-70B, Llama-4-Scout, and GPT-OSS-120b. Initial benchmarking of two state-of-the-art detection algorithms, DroidDetect and AIGCode Detector, shows how hard the task is: AIGCode Detector achieved F1 scores of only 0.48 for chunk-level and 0.56 for line-level detection.
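As a quick sanity check on the figures above, the short sketch below recomputes the AI-generated share and the average number of records per source file. It uses only the counts quoted in the summary; the variable names are illustrative.

```python
# Counts quoted in the summary above.
ai_lines = 488_896
total_lines = 2_827_938
records = 10_488
files = 4_196

# Share of lines that are AI-generated, as a percentage.
ai_share_pct = 100 * ai_lines / total_lines

# Average number of benchmark records derived from each Python file.
records_per_file = records / files

print(f"AI-generated share: {ai_share_pct:.1f}%")
print(f"Records per file:   {records_per_file:.2f}")
```

The share comes out slightly above the rounded 17% stated in the summary, and each file yields roughly two to three records on average.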
Key takeaway
AI engineers developing or evaluating AI-generated code detection algorithms should integrate HybridCodeAuthorship into their benchmarking process. The dataset provides a realistic testbed for line-level detection in hybrid codebases, moving beyond snippets that are entirely AI- or human-authored. Benchmarking against it helps in developing more robust algorithms that can handle the interleaved code found in modern industry applications, which in turn supports risk management and productivity analysis.
Key insights
Because codebases increasingly mix human and AI contributions, AI authorship detection needs fine-grained, line-level benchmarks.
Principles
- Industry codebases are increasingly hybrid, mixing human and AI contributions.
- Existing AI code detection benchmarks lack real-world interleaved authorship.
- Line-level detection is crucial for fine-grained risk and productivity analysis.
Method
The HybridCodeAuthorship pipeline has two stages: code testing, which validates both human and AI code with unit tests, and code interleaving, which identifies code segments, masks them, and has an LLM generate replacements based on prompts and target percentages.
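The interleaving stage can be pictured with a minimal sketch. This is not the authors' implementation: the function `interleave`, the way the segment is chosen (a fixed `start` index), and the `stub_generator` standing in for the LLM call are all hypothetical, introduced only to show how masking a segment yields per-line authorship labels.

```python
from typing import Callable, List, Tuple

# Hypothetical signature for the LLM step: (prefix, suffix, n_lines) -> new lines.
Generator = Callable[[List[str], List[str], int], List[str]]


def interleave(
    human_lines: List[str],
    start: int,
    target_fraction: float,
    generate: Generator,
) -> Tuple[List[str], List[int]]:
    """Mask one contiguous segment, regenerate it, and label every line.

    Labels: 0 = human-written, 1 = AI-generated.
    """
    # Size of the masked segment, derived from the target AI percentage.
    n_masked = max(1, round(len(human_lines) * target_fraction))
    end = min(len(human_lines), start + n_masked)

    prefix, suffix = human_lines[:start], human_lines[end:]
    ai_lines = generate(prefix, suffix, end - start)

    code = prefix + ai_lines + suffix
    labels = [0] * len(prefix) + [1] * len(ai_lines) + [0] * len(suffix)
    return code, labels


def stub_generator(prefix: List[str], suffix: List[str], n_lines: int) -> List[str]:
    # Placeholder for a real LLM call; returns dummy lines of the requested length.
    return [f"    pass  # generated line {i}" for i in range(n_lines)]


source = [
    "def add(a, b):",
    "    total = a + b",
    "    return total",
    "",
    "print(add(1, 2))",
]
code, labels = interleave(source, start=1, target_fraction=0.4, generate=stub_generator)
for label, line in zip(labels, code):
    print(label, line)
```

In a real pipeline the generated segment would then go through the code testing stage, so that records can be marked according to whether their unit tests still pass.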
In practice
- Use HybridCodeAuthorship to benchmark line-level AI code detection.
- Filter out non-functional code using the "Unit Test Passed" labels.
- Focus detection algorithms on nontrivial code segments.
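The practices above can be sketched as a small evaluation loop. The record layout here is an assumption made for illustration (the keys `unit_test_passed`, `labels`, and `predicted` are hypothetical, not the dataset's documented schema), and the F1 shown treats "AI-generated" as the positive class pooled over all lines, which may differ from the averaging used in the original benchmark.

```python
from typing import Dict, List


def line_level_f1(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for the AI-generated class (label 1), pooled over all lines."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy records with a hypothetical schema: per-line ground truth and detector output.
records: List[Dict] = [
    {"unit_test_passed": True, "labels": [0, 1, 1, 0], "predicted": [0, 1, 0, 0]},
    {"unit_test_passed": False, "labels": [1, 1], "predicted": [0, 0]},
    {"unit_test_passed": True, "labels": [0, 0, 1], "predicted": [1, 0, 1]},
]

# Keep only records whose code passed its unit tests.
functional = [r for r in records if r["unit_test_passed"]]

# Flatten per-line labels and predictions across the remaining records.
y_true = [label for r in functional for label in r["labels"]]
y_pred = [pred for r in functional for pred in r["predicted"]]

score = line_level_f1(y_true, y_pred)
print(f"Line-level F1 on functional records: {score:.2f}")
```

Filtering before scoring keeps non-functional generations from distorting the result; scoring only nontrivial segments would add one more filter at the line level.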
Topics
- HybridCodeAuthorship
- AI-Generated Code Detection
- Line-Level Authorship
- Benchmark Datasets
- Large Language Models
- CodeSearchNet
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.