Predicting Poets' Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems
Summary
A computational analysis of the Complete Tang Poems investigates whether geographic origin leaves a linguistic trace in Tang-dynasty poets' work. The researchers aggregated poems from 357 poets and linked them to ten administrative circuits via the China Biographical Database (CBDB). Using character n-gram TF-IDF and domain features such as imagery and season, classical and neural models predicted a poet's broad region (South vs. North) with 0.69 accuracy, significantly above the 0.53 majority baseline. Finer circuit-level origin was also predicted above chance. The study also reports a distance-decay effect in which linguistic distance grows with geographic distance (Mantel r=0.40, p≈0.09), a moderate correlation that falls short of the conventional 0.05 significance threshold. The regional signal varied over time: it was strongest in the Late Tang and at chance in the High Tang, suggesting homogenization followed by divergence. Early Tang misclassifications of southern poets as northern reflected the prestige of the northern court idiom. Notably, a classical-Chinese transformer (GuwenBERT) only matched simple TF-IDF, indicating that character n-grams already capture the regional signal.
Key takeaway
For literary historians and computational linguists analyzing historical texts, this study demonstrates that regional linguistic fingerprints are detectable and historically meaningful. Consider applying interpretable machine learning techniques such as character n-gram TF-IDF to generate new hypotheses about cultural diffusion and regional identity in historical corpora. This approach can reveal subtle temporal shifts and power dynamics, and for this regional signal it performed on par with a complex transformer model.
Key insights
Tang poets' geographic origins leave detectable linguistic traces in their verse, revealing historical regional divergence.
Principles
- Linguistic distance correlates with geographic distance.
- Regional linguistic signals evolve over time.
- Court influence can homogenize poetic language.
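The first principle is the kind of claim a Mantel test checks: whether two distance matrices over the same regions are correlated. The summary does not say how the study ran its test, so the following is only a generic sketch of a one-sided Mantel permutation test. The five region positions and both matrices are invented toy values, not data from the study.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mantel(geo, ling, permutations=999, seed=0):
    """Return (r, p) for a one-sided Mantel permutation test."""
    rng = random.Random(seed)
    n = len(geo)
    geo_flat = upper_triangle(geo)
    observed = pearson(geo_flat, upper_triangle(ling))
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Relabel the regions of one matrix: rows and columns move together.
        shuffled = [[ling[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(geo_flat, upper_triangle(shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so p is never exactly zero.
    return observed, (hits + 1) / (permutations + 1)


# Toy data (hypothetical): five regions placed on a line.
positions = [0, 1, 2, 4, 7]
geo = [[abs(a - b) for b in positions] for a in positions]
# Toy linguistic distance that grows with geographic distance, but not linearly.
ling = [[math.sqrt(d) for d in row] for row in geo]

r, p = mantel(geo, ling)
print(f"Mantel r={r:.2f}, p={p:.3f}")
```

Permuting region labels, rather than shuffling individual matrix cells, keeps the dependence among distances that share a region, which is why an ordinary correlation test is not used here.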
Method
Multi-class classification using character n-gram TF-IDF and interpretable domain features (imagery, season, allusion) on a poet-level corpus linked to geographic origins.
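A dependency-free sketch of the character n-gram TF-IDF idea follows. It makes several assumptions: each document stands for one poet's concatenated poems, the n-gram range is 1 to 2, and a nearest-centroid cosine classifier stands in for the study's unspecified classical models. The training strings are synthetic placeholders, not lines from the Complete Tang Poems. In practice a library vectorizer such as scikit-learn's `TfidfVectorizer(analyzer="char")` would likely replace the hand-rolled functions.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=2):
    """Yield character n-grams of length n_min..n_max, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    for n in range(n_min, n_max + 1):
        for i in range(len(chars) - n + 1):
            yield "".join(chars[i:i + n])


def fit_idf(docs):
    """Smoothed inverse document frequency over a list of texts."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc)))
    n_docs = len(docs)
    return {g: math.log((1 + n_docs) / (1 + c)) + 1 for g, c in df.items()}


def tfidf(text, idf):
    """L2-normalised sparse TF-IDF vector; n-grams unseen in training are dropped."""
    counts = Counter(g for g in char_ngrams(text) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}


def fit_centroids(docs, labels, idf):
    """Sum the TF-IDF vectors of each class into one centroid per label."""
    centroids = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for g, v in tfidf(doc, idf).items():
            centroids[label][g] += v
    return centroids


def predict(text, idf, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = tfidf(text, idf)

    def cosine(centroid):
        dot = sum(v * centroid.get(g, 0.0) for g, v in vec.items())
        norm = math.sqrt(sum(v * v for v in centroid.values())) or 1.0
        return dot / norm

    return max(centroids, key=lambda label: cosine(centroids[label]))


# Synthetic placeholder "poets" (hypothetical text, one string per poet).
train_texts = ["江水孤舟夜雨", "荷花江南春水", "塞外风沙白雪", "边关铁马秋风"]
train_labels = ["South", "South", "North", "North"]

idf = fit_idf(train_texts)
centroids = fit_centroids(train_texts, train_labels, idf)
print(predict("江上舟中听雨", idf, centroids))
print(predict("塞上秋风雪", idf, centroids))
```

Because every feature is a visible character sequence with a weight, the highest-weighted n-grams per region can be read off directly, which is what makes this representation useful for generating historical hypotheses.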
In practice
- Apply n-gram TF-IDF for regional linguistic analysis.
- Use interpretable features to generate historical hypotheses.
- Compare transformer models against simpler baselines.
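Any such comparison needs a floor to measure against; the study reports 0.69 accuracy against a 0.53 majority baseline. A minimal sketch of that check is below. The labels and the model predictions are made-up toy values, not the study's data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority] * len(y_test))


# Toy labels (hypothetical): one entry per poet.
y_train = ["South"] * 6 + ["North"] * 4
y_test = ["South", "South", "North", "South", "North"]
# Hypothetical predictions from any model being evaluated.
model_pred = ["South", "North", "North", "South", "North"]

baseline_acc = majority_baseline(y_train, y_test)
model_acc = accuracy(y_test, model_pred)
print(f"baseline={baseline_acc:.2f} model={model_acc:.2f}")
```

The same `accuracy` function can score a TF-IDF model and a transformer on identical test poets, so that any gap between them reflects the models and not the evaluation split.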
Topics
- Computational Linguistics
- Literary History
- Tang Dynasty Poetry
- Geographic Origin Prediction
- TF-IDF
- GuwenBERT
- N-gram Analysis
Best for: AI Scientist, Research Scientist, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.