Three Buddhist Vocabularies: Computational Stylometry of the English Pali Canon across Sutta, Vinaya, and Abhidhamma
Summary
This study applies computational stylometry to the English Pali Canon (the Tipitaka) across its Sutta, Vinaya, and Abhidhamma Pitakas, extending prior work that covered only the Sutta Pitaka. The corpus comprises 134,831 segments, including 114,591 from Bhikkhu Sujato's Sutta Pitaka, 7,923 from Bhikkhu Brahmali's Vinaya, 2,826 from I.B. Horner's 1938 Vinaya, and 2,077 from three Abhidhammattha Sangaha translations, along with cross-tradition Vinaya texts. Four measures are used: Zipf rank-frequency distributions, Moving-Average Type-Token Ratio (MATTR-500), numeral-word density, and vocabulary overlap. All corpora show Zipf-consistent distributions (R² > 0.989), with the Vinaya closest to the ideal slope of -1. Lexical diversity (MATTR-500) is nearly identical for the Sutta (0.399) and the Theravada Vinaya (0.400) but higher for the Sangaha (0.560), which also has the highest numeral-word density (3.26%). Cross-tradition Vinaya texts share 20.0% of their vocabulary with the Theravada Vinaya by Jaccard similarity, while two English translations of the same Vinaya text, made 88 years apart, share only 24.2% of their vocabulary, which points to a substantial shift in translation vocabulary.
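The Zipf check described above can be sketched as a least-squares line through the log-log rank-frequency points. This is a minimal illustration, not the study's code: the `tokenize` helper, its regex, and the `zipf_fit` name are assumptions introduced here, and the study may tokenize or fit differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer for illustration: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def zipf_fit(tokens):
    """Fit log10(frequency) against log10(rank) by ordinary least squares.

    Returns (slope, r_squared). A slope near -1 is the classic Zipf ideal.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct word types")
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    # In simple linear regression, R^2 equals the squared correlation.
    # It is undefined when every word type has the same frequency.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else float("nan")
    return slope, r_squared
```

Calling `zipf_fit(tokenize(corpus_text))` on each corpus yields a slope and R² per corpus, which can then be compared as the summary does.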
Key takeaway
For computational linguists and digital humanities researchers working with historical or translated corpora, this study shows how quantitative stylometric methods can expose differences in vocabulary, lexical diversity, and translation practice across large textual datasets. Consider applying the same measures (Zipf distributions, MATTR, vocabulary overlap) to your own projects to surface linguistic patterns or track lexical shifts over time, especially when a source text exists in multiple translations or versions.
Key insights
Computational stylometry reveals distinct lexical patterns and significant translation shifts across Buddhist canonical texts.
Principles
- Zipf's Law applies consistently across diverse textual corpora.
- Lexical diversity varies significantly between canonical divisions.
- Translation choices evolve substantially over decades.
Method
The analysis uses Zipf rank-frequency distributions, Moving-Average Type-Token Ratio (MATTR-500), numeral-word density, and vocabulary overlap measured with the Jaccard and Szymkiewicz-Simpson (overlap) coefficients.
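Two of these measures, MATTR and numeral-word density, are simple enough to sketch directly. The code below is an illustrative sketch under stated assumptions: the 500-token default follows the MATTR-500 label, the `NUMERAL_WORDS` list is a hypothetical stand-in (the study's actual list is not given here), and the fallback to plain TTR for short texts is a choice made for this example.

```python
from collections import Counter

# Hypothetical numeral-word list for illustration only.
NUMERAL_WORDS = frozenset(
    "one two three four five six seven eight nine ten eleven twelve "
    "twenty thirty forty fifty sixty seventy eighty ninety hundred thousand".split()
)


def mattr(tokens, window=500):
    """Moving-Average TTR: mean type-token ratio over all sliding windows."""
    if not tokens:
        raise ValueError("empty token list")
    if len(tokens) <= window:
        # Assumed fallback: plain TTR when the text fits in a single window.
        return len(set(tokens)) / len(tokens)
    counts = Counter(tokens[:window])
    total_types = len(counts)  # running sum of distinct types per window
    for i in range(window, len(tokens)):
        outgoing = tokens[i - window]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]
        counts[tokens[i]] += 1
        total_types += len(counts)
    n_windows = len(tokens) - window + 1
    return total_types / (n_windows * window)


def numeral_density(tokens, numerals=NUMERAL_WORDS):
    """Fraction of tokens that are numeral words, in the range [0, 1]."""
    if not tokens:
        raise ValueError("empty token list")
    return sum(1 for t in tokens if t in numerals) / len(tokens)
```

Because each window has a fixed length, MATTR is less sensitive to corpus size than a plain type-token ratio, which matters when comparing corpora as unequal as the ones in this study.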
In practice
- Apply stylometry to attribute authorship or date texts.
- Quantify lexical shifts in historical translations.
- Characterize specialized vocabularies in religious texts.
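Quantifying the lexical shift between two translations comes down to comparing their vocabularies as sets. The sketch below shows the Jaccard and Szymkiewicz-Simpson (overlap) coefficients; the function names are assumptions for this example, and the two sample sentences are invented, not quotations from any translation.

```python
def jaccard(vocab_a, vocab_b):
    """Jaccard similarity: shared types divided by all types in either set."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    if not union:
        raise ValueError("both vocabularies are empty")
    return len(a & b) / len(union)


def overlap_coefficient(vocab_a, vocab_b):
    """Szymkiewicz-Simpson coefficient: shared types over the smaller set."""
    a, b = set(vocab_a), set(vocab_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        raise ValueError("at least one vocabulary is empty")
    return len(a & b) / smaller


# Invented sample renderings, used only to exercise the functions.
older = "the monk should not take what is not given".split()
newer = "a monastic must not steal what has not been given".split()
shift_jaccard = jaccard(older, newer)
shift_overlap = overlap_coefficient(older, newer)
```

Jaccard penalizes a size mismatch between the two vocabularies, while the overlap coefficient does not, so reporting both helps when one translation is much longer than the other.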
Topics
- Computational Stylometry
- Pali Canon
- Buddhist Texts
- Lexical Diversity
- Translation Studies
- Zipf's Law
- Digital Humanities
Best for: AI Scientist, Research Scientist, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.