ScholaWrite: A Dataset of End-to-End Scholarly Writing Process

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, long

Summary

ScholaWrite is presented as the first dataset to capture the scholarly writing process end to end, in the form of keystroke logs for complete LaTeX manuscripts. Built by researchers at the University of Minnesota, it contains nearly 62,000 text changes from five preprints, collected over four months from 10 computer science graduate students through a custom Chrome extension for Overleaf. Each change is annotated with a cognitive writing intention drawn from an expanded taxonomy whose labels fall under Planning, Implementation, and Revision. The dataset is meant to support AI writing assistants that model how people actually write, rather than relying on LLM prompting alone. In experiments, a Llama-8B model fine-tuned on ScholaWrite generated text of high linguistic quality, which the authors take as evidence that end-to-end process data matters for supporting scientists' thinking while they write.

Key takeaway

If you are an NLP engineer building AI writing assistants, prioritize collecting and using end-to-end writing process data, not just final manuscripts. ScholaWrite's keystroke logs and cognitive annotations show what that data looks like, and this kind of data is what lets a model follow and support the iterative way people think while they write. Aim for systems that track these cognitive behaviors and offer context-aware assistance, rather than stopping at plain autoregressive generation.

Key insights

End-to-end keystroke data with cognitive annotations is crucial for developing advanced, cognitively aligned AI writing assistants.

Method

A Chrome extension records LaTeX keystrokes from Overleaf in real time. The logged changes are then annotated in a dedicated interface, using a taxonomy of 15 writing intentions grouped into Planning, Implementation, and Revision.
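As a rough illustration of the method described here, the Python sketch below models one annotated text change and rolls intention labels up into the three groups. It is not ScholaWrite's actual schema or tooling: the `TextChange` fields, the three label names in `INTENTION_TO_STAGE`, and the use of `difflib` to recover the edited span are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical subset of intention labels, one per group. The real taxonomy
# has 15 labels; only the three group names come from the summary above.
INTENTION_TO_STAGE = {
    "idea_generation": "Planning",
    "text_production": "Implementation",
    "clarity": "Revision",
}


@dataclass
class TextChange:
    """One logged change: the text before and after, plus its annotation."""
    before: str
    after: str
    intention: str  # assumed to be a key of INTENTION_TO_STAGE

    @property
    def stage(self) -> str:
        """Map the fine-grained intention to its high-level group."""
        return INTENTION_TO_STAGE[self.intention]

    def edits(self):
        """Return (operation, removed_text, added_text) for each changed span."""
        matcher = SequenceMatcher(a=self.before, b=self.after, autojunk=False)
        return [
            (op, self.before[i1:i2], self.after[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()
            if op != "equal"
        ]


def stage_counts(changes):
    """Count how many changes fall under each high-level group."""
    return Counter(change.stage for change in changes)


if __name__ == "__main__":
    log = [
        TextChange("", "% outline: motivation, data, results", "idea_generation"),
        TextChange("We propose", "We propose a dataset", "text_production"),
    ]
    for change in log:
        print(change.stage, change.edits())
    print(stage_counts(log))
```

Storing before and after snapshots and deriving the diff on demand keeps each record simple. A real logger would more likely store the diff itself to save space.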

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.