Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection
Summary
OpAI-Bench is a new operation-guided benchmark designed to study progressive human-to-AI text transformation for multi-granularity AI-text detection. It addresses the gap in existing benchmarks by focusing on co-editing workflows rather than just final outputs. Starting from human-written documents, OpAI-Bench constructs nine sequentially revised versions for each sample, incorporating predefined AI coverage levels and five representative AI edit operations across four domains, while preserving complete authorship provenance. The benchmark supports comprehensive evaluation using 8 document-level, 7 sentence-level, and 2 fine-grained token/span-level detectors. Experiments reveal that AI-text detectability is influenced by the proportion of AI-edited content, edit operation, domain, and cumulative revision history. Notably, mixed-authorship intermediate versions are often harder to detect than fully human or heavily AI-edited endpoints, indicating non-monotonic detection patterns.
Key takeaway
If you develop AI-text detectors as an AI scientist or machine learning engineer, account for the complexities of human-AI co-editing workflows. Your evaluation should reflect that detectability is not simply proportional to the share of AI-generated content: mixed-authorship intermediate versions are often harder to identify than fully human or heavily AI-edited texts. Integrate progressive transformation benchmarks such as OpAI-Bench into your testing to reveal non-monotonic detection patterns and improve detector robustness in realistic revision scenarios.
Key insights
AI-text detectability does not change monotonically with the proportion of AI-edited content; it also depends on the edit operation, the domain, and the cumulative revision history.
Principles
- AI authorship signals evolve progressively.
- Detectability depends on edit operation and domain.
- Mixed-authorship texts are often harder to detect than fully human or heavily AI-edited ones.
Method
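Starting from human-written documents, OpAI-Bench constructs nine sequentially revised versions per sample, using predefined AI coverage levels and five representative AI edit operations across four domains, while preserving complete authorship provenance for multi-granularity (document-, sentence-, and token/span-level) evaluation.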
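To make the idea of multi-granularity provenance concrete, here is a minimal sketch of how one revised version might be represented. It is not OpAI-Bench's actual data format: the class names, the operation string, whitespace tokenization, and the rule that a sentence counts as AI-edited if any of its tokens is AI-edited are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    text: str
    token_is_ai: List[bool]  # one provenance flag per whitespace token (assumed tokenization)

    @property
    def is_ai(self) -> bool:
        # Assumed roll-up rule: any AI-edited token marks the whole sentence as AI-edited.
        return any(self.token_is_ai)


@dataclass
class Version:
    index: int       # position in the revision sequence (numbering scheme is an assumption)
    operation: str   # hypothetical label; the summary does not name the five operations
    sentences: List[Sentence]

    def ai_coverage(self) -> float:
        # Fraction of sentences that contain AI-edited tokens.
        if not self.sentences:
            return 0.0
        return sum(s.is_ai for s in self.sentences) / len(self.sentences)

    def document_label(self) -> str:
        # Document-level label derived from sentence-level provenance.
        coverage = self.ai_coverage()
        if coverage == 0.0:
            return "human"
        if coverage == 1.0:
            return "ai"
        return "mixed"


# Toy example, not benchmark data: one untouched sentence, one partially AI-edited sentence.
example = Version(
    index=1,
    operation="hypothetical_op",
    sentences=[
        Sentence("The cat sat down.", [False, False, False, False]),
        Sentence("It then rested quietly.", [False, False, True, True]),
    ],
)
coverage = example.ai_coverage()
label = example.document_label()
```

Storing provenance at the finest granularity and deriving the coarser labels from it keeps the token-, sentence-, and document-level views consistent with one another across revisions.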
In practice
- Evaluate detectors on progressive revisions.
- Test across diverse AI edit operations.
- Account for non-monotonic detection.
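The practices above can be sketched as a small evaluation loop: score a detector separately on each revision step, then check whether the resulting accuracy curve is monotonic. This is a hedged illustration, not the benchmark's evaluation code. The record layout, the function names, and the toy predictions are assumptions, and the numbers are invented purely to exercise the code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (version index, true label is AI, predicted label is AI); layout is an assumption
Record = Tuple[int, bool, bool]


def accuracy_by_version(records: Iterable[Record]) -> Dict[int, float]:
    """Compute detector accuracy separately for each revision step."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for version, truth, prediction in records:
        totals[version] += 1
        hits[version] += int(truth == prediction)
    return {v: hits[v] / totals[v] for v in sorted(totals)}


def is_monotonic(values: List[float]) -> bool:
    """True if the sequence never changes direction (non-decreasing or non-increasing)."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# Toy sentence-level records, not benchmark results.
toy_records: List[Record] = [
    (0, False, False), (0, False, False),  # human-only version
    (1, True, False), (1, False, False),   # mixed-authorship version
    (2, True, True), (2, True, True),      # heavily AI-edited version
]

acc = accuracy_by_version(toy_records)
curve = [acc[v] for v in sorted(acc)]
hardest_version = min(acc, key=acc.get)
monotonic = is_monotonic(curve)
```

Reporting the per-version curve rather than a single pooled score is what exposes a dip at intermediate versions; the same loop can be repeated per edit operation and per domain.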
Topics
- AI-text Detection
- Human-AI Co-editing
- Text Transformation
- Benchmark Development
- Authorship Analysis
- Multi-Granularity Detection
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.