Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection

2026-05-04 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

OpAI-Bench is an operation-guided benchmark for multi-granularity AI-text detection that addresses a limitation of existing benchmarks, which focus on final outputs. For each human-written document it constructs nine sequentially revised versions, simulating progressive human-to-AI co-editing across four domains: student essays, news articles, government reports, and scientific abstracts. The benchmark applies five AI edit operations (polish, paraphrase, style rewrite, compress, and expand) and preserves authorship provenance at document, sentence, token, and span granularities. Experiments with 8 document-level, 7 sentence-level, and 2 fine-grained detectors show that AI-text detectability is non-monotonic across the revision trajectory: it depends on edit operation, domain, and cumulative revision history, not solely on the proportion of AI-edited content. Mixed-authorship intermediate versions, particularly around v4 with compression, are often harder to classify than fully human or heavily AI-edited texts.

Key takeaway

Machine learning engineers developing AI-text detection systems should move beyond binary endpoint classification and adopt trajectory-aware and operation-aware evaluation. Detectors must account for non-monotonic detectability, especially for mixed-authorship content and for specific edit operations such as compression, which can significantly reduce detectability. The goal is detection tooling that holds up in real-world human-AI co-editing workflows.

Key insights

AI-text detectability is non-monotonic: it is shaped by edit operations and revision history, not just by the proportion of AI content.

Principles

AI authorship signals emerge, accumulate, or disappear progressively.
Mixed-authorship intermediate versions are often harder to detect.
Detectability is strongly influenced by edit operation and domain.

Method

OpAI-Bench constructs nine versions per human document, progressively editing the previous version using five AI operations (polish, paraphrase, style rewrite, compress, expand) at increasing AI coverage, preserving multi-granularity provenance.
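The construction described above can be illustrated with a minimal sketch. It is not the benchmark's actual pipeline: `apply_operation` is a hypothetical stand-in for an LLM edit call, the operation rotation and the linear coverage schedule are assumptions, version numbering from 1 to 9 is an assumption, and only sentence-level provenance is tracked (the sketch also assumes each edit maps one sentence to one sentence, which real compress and expand operations need not do).

```python
import random
from dataclasses import dataclass, field

# The five edit operations named in the summary; the rotation order is assumed.
OPERATIONS = ("polish", "paraphrase", "style_rewrite", "compress", "expand")


@dataclass
class Sentence:
    text: str
    ai_edited: bool = False                       # sentence-level provenance label
    history: list = field(default_factory=list)   # operations applied so far


def apply_operation(text, op):
    """Hypothetical placeholder for an LLM edit; it only tags the text."""
    return f"[{op}] {text}"


def build_trajectory(sentences, n_versions=9, seed=0):
    """Return n_versions snapshots, each derived from the previous one."""
    rng = random.Random(seed)
    current = [Sentence(s) for s in sentences]
    trajectory = []
    for step in range(1, n_versions + 1):
        op = OPERATIONS[(step - 1) % len(OPERATIONS)]
        # Assumed schedule: AI coverage grows linearly with the step index.
        target = round(len(current) * step / n_versions)
        untouched = [i for i, s in enumerate(current) if not s.ai_edited]
        n_new = max(0, target - (len(current) - len(untouched)))
        chosen = set(rng.sample(untouched, min(n_new, len(untouched))))
        # Build a fresh list so earlier snapshots are never mutated.
        current = [
            Sentence(apply_operation(s.text, op), True, s.history + [op])
            if i in chosen
            else Sentence(s.text, s.ai_edited, list(s.history))
            for i, s in enumerate(current)
        ]
        trajectory.append({"version": step, "operation": op, "sentences": current})
    return trajectory
```

Because every snapshot carries per-sentence labels and edit histories, document-level labels can be derived from it by aggregation; token-level and span-level provenance would need finer alignment than this sketch attempts.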

In practice

Evaluate detectors across progressive revision stages.
Analyze detection performance for specific edit operations.
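The two practices above amount to slicing detector outputs by revision stage and by edit operation instead of reporting one aggregate number. The following is a minimal sketch under stated assumptions: the record fields (`version`, `operation`, `score`), the 0.5 threshold, and the `detection_rate_by` helper are all hypothetical, and the scores are synthetic placeholders, not results from the paper.

```python
from collections import defaultdict


def detection_rate_by(records, key, threshold=0.5):
    """Fraction of records flagged as AI (score >= threshold), grouped by `key`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += r["score"] >= threshold
    return {k: hits[k] / totals[k] for k in totals}


# Synthetic placeholder scores for AI-edited texts, for illustration only.
records = [
    {"version": 1, "operation": "polish", "score": 0.9},
    {"version": 1, "operation": "polish", "score": 0.3},
    {"version": 2, "operation": "paraphrase", "score": 0.8},
    {"version": 2, "operation": "paraphrase", "score": 0.6},
]

per_version = detection_rate_by(records, "version")
per_operation = detection_rate_by(records, "operation")
```

Plotting `per_version` against the version index is one simple way to check whether a detector's performance dips at intermediate stages instead of rising steadily.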

Topics

AI-text Detection
Human-AI Co-editing
Text Transformation Benchmarks
Multi-Granularity Provenance
Edit Operations
Non-monotonic Detectability

Code references

VILA-Lab/OpAI-Bench

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.