Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

This article introduces two new authorship attribution measures, Rank-Turbulence Delta and Jensen-Shannon Delta, which extend Burrows's classical Delta with distance functions designed for probability distributions. The theoretical groundwork contrasts centered and uncentered z-scoring of word-frequency vectors and reframes uncentered vectors as probability distributions. A key contribution is a token-level decomposition that makes every Delta distance numerically interpretable, which supports close reading and validation of results. The methods were evaluated on four literary corpora in English, German, French (from Project Gutenberg), and Russian (the SOCIOLIT corpus, comprising 755 works by 180 authors from the 18th to 21st centuries). Rank-Turbulence Delta achieved attribution accuracy comparable to Cosine Delta, while Jensen-Shannon Delta consistently matched or surpassed canonical Burrows's Delta. The study also re-evaluated established attribution algorithms on the expanded SOCIOLIT corpus.
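For readers who want the baseline in concrete form, here is a minimal sketch of classical Burrows's Delta, taken as the mean absolute difference of per-word z-scores. The function names, the `center` flag, and the use of the population standard deviation are illustrative assumptions, not details taken from the article.

```python
from statistics import mean, pstdev

def z_scores(freqs_by_doc, center=True):
    """Scale each word's relative frequency across documents.

    freqs_by_doc: list of equal-length lists, one row per document,
    one column per word (hypothetical layout for this sketch).
    center=False skips mean subtraction ("uncentered" scaling, as assumed here).
    """
    n_words = len(freqs_by_doc[0])
    columns = [[doc[j] for doc in freqs_by_doc] for j in range(n_words)]
    mus = [mean(col) if center else 0.0 for col in columns]
    # Guard against zero variance so the division below is always defined.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [
        [(doc[j] - mus[j]) / sigmas[j] for j in range(n_words)]
        for doc in freqs_by_doc
    ]

def burrows_delta(z_a, z_b):
    """Mean absolute difference between two z-score vectors."""
    return mean([abs(a - b) for a, b in zip(z_a, z_b)])

# Toy relative frequencies for three documents over three words.
freqs = [
    [0.05, 0.02, 0.01],
    [0.04, 0.03, 0.01],
    [0.06, 0.01, 0.02],
]
centered = z_scores(freqs, center=True)
uncentered = z_scores(freqs, center=False)
delta_centered = burrows_delta(centered[0], centered[1])
delta_uncentered = burrows_delta(uncentered[0], uncentered[1])
```

In this absolute-difference form the per-word mean cancels, so `delta_centered` and `delta_uncentered` agree up to rounding. Centering changes the result for measures that depend on the vectors themselves rather than their differences, such as a cosine-based distance.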

Key takeaway

If you are an NLP engineer or computational linguist working on authorship attribution, consider adding Jensen-Shannon Delta to your toolkit. It consistently matched or exceeded Burrows's Delta in the study, and the token-level decomposition offers a way to validate attribution results and see which words drive stylistic differences. This can make attribution models more reliable and explainable, particularly when analyzing diverse literary corpora.

Key insights

By treating word frequencies as probability distributions, the new Delta metrics offer interpretable authorship attribution with accuracy comparable to or better than established Delta variants.

Principles

Method

The method recasts uncentered word-frequency vectors as probability distributions, applies distribution-based distance functions to them (yielding Rank-Turbulence Delta and Jensen-Shannon Delta), and decomposes each distance token by token for interpretability.
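The idea of a token-level decomposition can be sketched with the standard Jensen-Shannon divergence, which is a sum of one non-negative term per word. This is an illustration under stated assumptions: the article's Jensen-Shannon Delta may use different scaling or preprocessing, and the helper names `to_distribution` and `js_contributions` are hypothetical.

```python
import math
from collections import Counter

def to_distribution(tokens, vocabulary):
    """Relative frequencies over a fixed vocabulary, summing to 1."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no vocabulary words found in text")
    return {w: counts[w] / total for w in vocabulary}

def js_contributions(p, q):
    """Per-token terms of the Jensen-Shannon divergence (base 2).

    Assumption: plain JSD on relative frequencies; the terms sum to
    the total divergence, which lies between 0 and 1 bit.
    """
    contrib = {}
    for w in p:
        m = 0.5 * (p[w] + q[w])
        term = 0.0
        # Zero-probability entries contribute nothing (0 * log 0 := 0).
        if p[w] > 0:
            term += 0.5 * p[w] * math.log2(p[w] / m)
        if q[w] > 0:
            term += 0.5 * q[w] * math.log2(q[w] / m)
        contrib[w] = term
    return contrib

# Toy example with a three-word vocabulary.
vocab = ["the", "and", "of"]
p = to_distribution("the cat and the dog".split(), vocab)
q = to_distribution("of the of and".split(), vocab)

contrib = js_contributions(p, q)
distance = sum(contrib.values())
# Words ranked by how much they separate the two texts.
ranked = sorted(contrib, key=contrib.get, reverse=True)
```

Because the distance is the sum of `contrib`, sorting the terms shows which words account for most of the separation between two texts, which is the kind of inspection a close reading would start from.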

Best for: AI Scientist, Research Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.