Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection
Summary
OpAI-Bench is a new operation-guided benchmark for multi-granularity AI-text detection, addressing limitations of existing benchmarks that focus on final outputs. It constructs nine sequentially revised versions for each human-written document, simulating progressive human-to-AI co-editing across four domains: student essays, news articles, government reports, and scientific abstracts. The benchmark uses five AI edit operations (polish, paraphrase, style rewrite, compress, and expand) while preserving authorship provenance at document, sentence, token, and span granularities. Experiments with 8 document-level, 7 sentence-level, and 2 fine-grained detectors reveal that AI-text detectability is non-monotonic, influenced by edit operation, domain, and cumulative revision history, not solely by the proportion of AI-edited content. Mixed-authorship intermediate versions, particularly around v4 with compression, are often harder to detect than fully human or heavily AI-edited texts.
Key takeaway
Machine learning engineers building AI-text detection systems should move beyond binary endpoint classification and adopt trajectory-aware, operation-aware evaluation. Detectors must account for non-monotonic detectability, especially on mixed-authorship content and under specific edit operations such as compression, which can significantly reduce detectability. Evaluating this way should yield more robust and reliable detection tools for real-world human-AI co-editing workflows.
Key insights
AI-text detectability is non-monotonic, influenced by edit operations and revision history, not just AI content proportion.
Principles
- AI authorship signals emerge, accumulate, or disappear progressively.
- Mixed-authorship intermediate versions are often harder to detect.
- Detectability is strongly influenced by edit operation and domain.
Method
For each human-written document, OpAI-Bench constructs nine sequentially revised versions. Each version edits the previous one, drawing on five AI operations (polish, paraphrase, style rewrite, compress, expand) at increasing AI coverage, and authorship provenance is preserved at document, sentence, token, and span granularities.
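The construction described above can be pictured as a chain of versions, each carrying provenance labels forward. The following is a minimal sketch, not the benchmark's actual pipeline: the `Version` class, the `build_trajectory` function, the `edit_fn` callable, the one-operation-per-version schedule, and the choice to edit the first sentences of each version are all hypothetical simplifications. It tracks sentence-level provenance only and ignores that compress or expand may change the sentence count.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# The five edit operations named in the summary.
OPERATIONS = ("polish", "paraphrase", "style_rewrite", "compress", "expand")


@dataclass
class Version:
    index: int             # 0 is the original human-written text
    operation: str         # "human" for index 0, else one of OPERATIONS
    sentences: List[str]
    ai_edited: List[bool]  # cumulative sentence-level provenance flags

    @property
    def ai_fraction(self) -> float:
        """Share of sentences touched by any AI edit so far."""
        return sum(self.ai_edited) / len(self.ai_edited)


def build_trajectory(
    sentences: Sequence[str],
    schedule: Sequence[Tuple[str, float]],
    edit_fn: Callable[[str, str], str],
) -> List[Version]:
    """Apply each (operation, coverage) step to the previous version.

    ASSUMPTION: coverage is the fraction of sentences edited at that step,
    and the first n sentences are the ones edited. A real pipeline would
    choose spans differently and call a language model inside edit_fn.
    """
    versions = [Version(0, "human", list(sentences), [False] * len(sentences))]
    for step, (operation, coverage) in enumerate(schedule, start=1):
        if operation not in OPERATIONS:
            raise ValueError(f"unknown operation: {operation}")
        prev = versions[-1]
        n_edit = min(len(prev.sentences), round(coverage * len(prev.sentences)))
        new_sentences = list(prev.sentences)
        flags = list(prev.ai_edited)
        for j in range(n_edit):
            new_sentences[j] = edit_fn(operation, prev.sentences[j])
            flags[j] = True  # once AI-edited, a sentence stays flagged
        versions.append(Version(step, operation, new_sentences, flags))
    return versions
```

Keeping the flags cumulative is what lets a later version report which sentences have any AI history, even if the most recent operation did not touch them.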
In practice
- Evaluate detectors across progressive revision stages.
- Analyze detection performance for specific edit operations.
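One way to act on the two points above is to group detector predictions by revision stage and by edit operation, then check whether the per-stage scores move in a single direction. This is a minimal sketch under stated assumptions: the record fields (`version`, `operation`, `label`, `pred`) and the function names are hypothetical, and plain accuracy stands in for whatever metric the benchmark actually reports.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Mapping, Sequence


def accuracy_by_group(records: Iterable[Mapping], key: str) -> Dict[Hashable, float]:
    """Accuracy of detector predictions, grouped by records[key].

    ASSUMPTION: each record is a mapping with 'label' and 'pred' fields
    plus the grouping field (for example 'version' or 'operation').
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["label"] == record["pred"])
    return {group: hits[group] / totals[group] for group in totals}


def is_monotonic(values: Sequence[float]) -> bool:
    """True if values never rise, or never fall, along the sequence."""
    pairs = list(zip(values, values[1:]))
    non_decreasing = all(a <= b for a, b in pairs)
    non_increasing = all(a >= b for a, b in pairs)
    return non_decreasing or non_increasing
```

Calling `accuracy_by_group(records, "version")`, sorting the result by version index, and passing the scores to `is_monotonic` flags trajectories where an intermediate stage scores lower than both its neighbours. Calling it with `"operation"` gives the per-operation breakdown.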
Topics
- AI-text Detection
- Human-AI Co-editing
- Text Transformation Benchmarks
- Multi-Granularity Provenance
- Edit Operations
- Non-monotonic Detectability
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.