Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics
Summary
This article introduces two new authorship attribution measures, Rank-Turbulence Delta and Jensen-Shannon Delta, which extend Burrows's classical Delta by using distance functions designed for probability distributions. The theoretical foundation contrasts centered and uncentered z-scoring of word-frequency vectors and re-frames the uncentered vectors as probability distributions. A key development is a token-level decomposition that breaks each Delta distance into per-token contributions, making it numerically interpretable and supporting close reading and result validation. The methods were evaluated on four literary corpora in English, German, French (from Project Gutenberg), and Russian (the SOCIOLIT corpus, comprising 755 works by 180 authors from the 18th to 21st centuries). Rank-Turbulence Delta achieved attribution accuracy comparable to Cosine Delta, while Jensen-Shannon Delta consistently matched or surpassed canonical Burrows's Delta. The study also re-evaluated established attribution algorithms on the expanded SOCIOLIT corpus.
Key takeaway
For NLP engineers and computational linguists working on authorship attribution, Jensen-Shannon Delta is worth adding to the toolkit. It consistently matched or exceeded Burrows's Delta on the evaluated corpora, and the token-level decomposition shows which words drive each distance. That makes it easier to validate attribution results and to examine stylistic differences between texts, particularly when working across literary corpora in several languages.
Key insights
Two new Delta metrics treat word frequencies as probability distributions and give interpretable authorship attribution with accuracy comparable to established Delta variants.
Principles
- Uncentered word-frequency vectors can be re-framed as probability distributions.
- Token-level decomposition enhances Delta interpretability.
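To make the idea of a token-level decomposition concrete, here is a minimal sketch for classic Burrows's Delta, which is the mean absolute difference of per-word z-scores. The function name `burrows_delta_terms`, the input layout, and the use of the population standard deviation are illustrative assumptions, not details taken from the article.

```python
import statistics


def burrows_delta_terms(freqs_a, freqs_b, corpus_freqs):
    """Per-word terms of classic Burrows's Delta between texts A and B.

    freqs_a, freqs_b: dict mapping word -> relative frequency in each text.
    corpus_freqs: dict mapping word -> list of relative frequencies of that
        word across the reference corpus (hypothetical layout).
    The Delta distance is the sum of the returned terms.
    """
    n = len(corpus_freqs)
    terms = {}
    for word, values in corpus_freqs.items():
        # Assumption: population standard deviation; some implementations
        # use the sample standard deviation instead.
        sigma = statistics.pstdev(values)
        if sigma == 0:
            # A word with constant frequency carries no signal.
            terms[word] = 0.0
            continue
        mu = statistics.fmean(values)
        z_a = (freqs_a[word] - mu) / sigma
        z_b = (freqs_b[word] - mu) / sigma
        # The mean cancels in z_a - z_b, so only sigma affects this term.
        terms[word] = abs(z_a - z_b) / n
    return terms
```

Sorting the returned dictionary by value lists the words that contribute most to the distance between the two texts, which is the kind of inspection the decomposition is meant to support.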
Method
The method re-casts uncentered word-frequency vectors as probability distributions, applies distance functions suited to such distributions (Rank-Turbulence Delta, Jensen-Shannon Delta) for authorship attribution, and decomposes each distance into token-level contributions for interpretability.
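As an illustration of how a distribution-based distance splits into token-level terms, the standard Jensen-Shannon divergence between two word distributions $p$ and $q$ over a vocabulary $V$ can be written as a sum over words. This is the textbook definition; whether the article's Jensen-Shannon Delta uses this divergence directly, its square root, or another variant is not stated in this summary.

```latex
\begin{aligned}
m_w &= \tfrac{1}{2}\,(p_w + q_w), \\
\mathrm{JSD}(p \,\|\, q)
  &= \tfrac{1}{2}\,\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(q \,\|\, m)
   = \sum_{w \in V} \delta_w, \\
\delta_w &= \tfrac{1}{2}\, p_w \log_2 \frac{p_w}{m_w}
          + \tfrac{1}{2}\, q_w \log_2 \frac{q_w}{m_w} \;\ge\; 0 ,
\end{aligned}
```

with the convention $0 \log 0 = 0$. Each $\delta_w$ is the contribution of word $w$, and with base-2 logarithms the total lies between 0 (identical distributions) and 1 (disjoint vocabularies).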
In practice
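- Try Jensen-Shannon Delta where Burrows's Delta is the current baseline; it matched or surpassed it in the reported evaluation.
- Use the token-level decomposition to check which words drive an attribution.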
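The two points above can be sketched together as a nearest-profile attribution that also reports the most influential tokens. This is a sketch under stated assumptions: it uses the standard Jensen-Shannon divergence as the distance, assumes each author is represented by a single word-probability profile, and the names `js_terms` and `attribute` are hypothetical. The article's exact pipeline may differ.

```python
import math


def js_terms(p, q):
    """Per-token Jensen-Shannon terms (base 2) for two distributions.

    p, q: dict mapping word -> probability; missing words count as 0.
    """
    terms = {}
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        m = 0.5 * (pw + qw)
        term = 0.0
        if pw > 0:
            term += 0.5 * pw * math.log2(pw / m)
        if qw > 0:
            term += 0.5 * qw * math.log2(qw / m)
        terms[w] = term
    return terms


def attribute(disputed, candidates, top_k=3):
    """Pick the candidate whose profile is closest to the disputed text.

    Returns (author, distance, top_tokens), where top_tokens are the
    words contributing most to the distance to the chosen author.
    Assumes `candidates` is non-empty.
    """
    best = None
    for author, profile in candidates.items():
        terms = js_terms(disputed, profile)
        distance = sum(terms.values())
        if best is None or distance < best[1]:
            top = sorted(terms, key=terms.get, reverse=True)[:top_k]
            best = (author, distance, top)
    return best
```

Inspecting `top_tokens` is one way to sanity-check a result: if the distance is dominated by topic words or character names rather than function words, the attribution may reflect subject matter instead of style.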
Topics
- Rank-Turbulence Delta
- Jensen-Shannon Delta
- Authorship Attribution
- Stylometric Delta Metrics
- Probability Distributions
Best for: AI Scientist, Research Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.