Three Buddhist Vocabularies: Computational Stylometry of the English Pali Canon across Sutta, Vinaya, and Abhidhamma

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, quick

Summary

This study applies computational stylometry to the English Pali Canon (the Tipitaka) across its Sutta, Vinaya, and Abhidhamma Pitakas, extending prior work that covered only the Sutta Pitaka. The corpus comprises 134,831 segments: 114,591 from Bhikkhu Sujato's Sutta Pitaka, 7,923 from Bhikkhu Brahmali's Vinaya, 2,826 from I.B. Horner's 1938 Vinaya, 2,077 from three Abhidhammattha Sangaha translations, and the remainder from cross-tradition Vinaya texts. Four measures are used: Zipf rank-frequency distributions, Moving-Average Type-Token Ratio (MATTR-500), numeral-word density, and vocabulary overlap.

All corpora show Zipf-consistent distributions (R² > 0.989), with the Vinaya closest to the ideal slope of -1. Lexical diversity (MATTR-500) is nearly identical for the Sutta (0.399) and the Theravada Vinaya (0.400) but markedly higher for the Sangaha (0.560), which also has the highest numeral-word density (3.26%). Cross-tradition Vinaya texts have a Jaccard vocabulary overlap of 20.0% with the Theravada Vinaya, while two English translations of the same Vinaya text, made 88 years apart, share only 24.2% of their vocabulary, which points to significant semantic shifts between the translations.
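The Zipf check reported above can be sketched as a straight-line fit in log-log space. This is a minimal illustration, not the study's procedure: the function name `zipf_fit`, the use of ordinary least squares over all ranks, and the pre-tokenized input are assumptions, since the summary does not say how the slope and R² were computed.

```python
import math
from collections import Counter


def zipf_fit(tokens):
    """Fit log(frequency) = intercept + slope * log(rank) by least squares.

    Returns (slope, r_squared). A slope near -1 is the classic Zipf ideal.
    Assumption: every rank is included in the fit, with no tail trimming.
    """
    # Frequencies sorted from most to least common define ranks 1, 2, 3, ...
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens to fit a line")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    # R^2 of a simple linear fit equals the squared Pearson correlation.
    # It is undefined when every frequency is identical (syy == 0).
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else float("nan")
    return slope, r_squared
```

Calling `zipf_fit` on each corpus's token list gives one slope and one R² per corpus, which can then be compared against the -1 ideal. Real corpora usually deviate at the very highest and lowest ranks, so a fit restricted to a middle rank range is a common alternative design.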

Key takeaway

For computational linguists and digital humanities researchers working with historical or translated corpora, this study shows how quantitative stylometric methods can expose differences in vocabulary, lexical diversity, and translation practice across large textual datasets. Consider applying the same measures (Zipf distributions, MATTR, vocabulary overlap) to your own projects, especially when comparing multiple translations or versions of one source text, to surface lexical patterns and track semantic shifts over time.

Key insights

Computational stylometry reveals distinct lexical patterns and significant translation shifts across Buddhist canonical texts.

Method

The analysis uses four measures: Zipf rank-frequency distributions, Moving-Average Type-Token Ratio (MATTR-500), numeral-word density, and vocabulary overlap, the last quantified with the Jaccard and Szymkiewicz-Simpson (overlap) coefficients.
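Two of these measures are easy to state in code. The sketch below is illustrative only: the function names, the pre-tokenized input, the 500-token default window (inferred from the "MATTR-500" label), and the fallback to plain TTR for texts shorter than the window are all assumptions, not details confirmed by the study.

```python
from collections import Counter


def mattr(tokens, window=500):
    """Moving-Average Type-Token Ratio: mean TTR over every sliding window."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        # Assumed fallback: plain TTR when the text is shorter than one window.
        return len(set(tokens)) / len(tokens) if tokens else 0.0

    counts = Counter(tokens[:window])
    total = len(counts) / window  # TTR of the first window
    for i in range(window, len(tokens)):
        # Slide the window by one token: drop the oldest, add the newest.
        outgoing = tokens[i - window]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]
        counts[tokens[i]] += 1
        total += len(counts) / window
    return total / (len(tokens) - window + 1)


def jaccard(vocab_a, vocab_b):
    """Jaccard coefficient: |A & B| / |A | B|."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_coefficient(vocab_a, vocab_b):
    """Szymkiewicz-Simpson coefficient: |A & B| / min(|A|, |B|)."""
    a, b = set(vocab_a), set(vocab_b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0
```

The two overlap coefficients answer different questions. Jaccard penalizes a size mismatch between vocabularies, whereas the Szymkiewicz-Simpson coefficient reaches 1.0 whenever the smaller vocabulary is fully contained in the larger one, which matters when corpora differ greatly in size, as they do here.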

Best for: AI Scientist, Research Scientist, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.