Predicting Poets' Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Social Sciences & Behavioral Studies) · Depth: Expert, quick

Summary

A computational analysis of the Complete Tang Poems investigates whether geographic origin leaves a linguistic trace in Tang-dynasty poets' work. The researchers aggregated poems from 357 poets and linked them to ten administrative circuits via the China Biographical Database (CBDB). Using character n-gram TF-IDF together with domain features such as imagery and season, classical and neural models predicted a poet's broad region (South vs. North) with 0.69 accuracy, significantly above the 0.53 majority baseline. Finer circuit-level origin was also predicted above chance. The analysis also points to a distance-decay effect, in which linguistic distance grows with geographic distance (Mantel r=0.40, p≈0.09); at that p-value the trend is suggestive rather than significant at the conventional 0.05 level. The regional signal varied over time: it was strongest in the Late Tang and at chance in the High Tang, suggesting initial homogenization followed by divergence. In the Early Tang, misclassifications of southern poets as northern reflected the prestige of the northern court idiom. Notably, a classical-Chinese transformer (GuwenBERT) only matched simple TF-IDF, indicating that character n-grams effectively capture the regional signal.
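The distance-decay result rests on a Mantel test, which correlates two distance matrices and assesses the correlation by permuting the rows and columns of one of them. The following is a minimal sketch of that idea, not the study's code: the four-region matrices `geo` and `ling` are invented toy values (the study used ten circuits), and the function names, permutation count, and one-sided p-value are assumptions made for illustration.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def upper_triangle(matrix, order):
    """Flatten the upper triangle after reordering rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel(geo, ling, permutations=999, seed=0):
    """Return (r, p): the observed correlation and a one-sided permutation p-value."""
    rng = random.Random(seed)
    identity = list(range(len(geo)))
    geo_vec = upper_triangle(geo, identity)
    observed = pearson(geo_vec, upper_triangle(ling, identity))
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # relabel regions in the linguistic matrix only
        if pearson(geo_vec, upper_triangle(ling, perm)) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement as one of the permutations.
    return observed, (hits + 1) / (permutations + 1)

# Toy symmetric distance matrices for four hypothetical regions (invented values).
geo = [[0, 1, 3, 6],
       [1, 0, 2, 5],
       [3, 2, 0, 3],
       [6, 5, 3, 0]]
ling = [[0.00, 0.20, 0.35, 0.70],
        [0.20, 0.00, 0.30, 0.50],
        [0.35, 0.30, 0.00, 0.40],
        [0.70, 0.50, 0.40, 0.00]]

r, p = mantel(geo, ling)
print(f"Mantel r={r:.2f}, p={p:.3f}")
```

With only a handful of regions the number of distinct permutations is small, so the p-value is coarse; that is one reason a moderate correlation over ten circuits can still fall short of conventional significance.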

Key takeaway

For literary historians or computational linguists analyzing historical texts, this study demonstrates that regional linguistic fingerprints are detectable and historically meaningful. Consider applying interpretable machine learning techniques such as character n-gram TF-IDF to generate new hypotheses about cultural diffusion and regional identity in historical corpora. This approach can reveal subtle temporal shifts and power dynamics, and for this regional signal it matched a more complex transformer model (GuwenBERT).

Key insights

Tang poets' geographic origins leave detectable linguistic traces in their verse, revealing historical regional divergence.

Principles

Method

Multi-class classification using character n-gram TF-IDF and interpretable domain features (imagery, season, allusion) on a poet-level corpus linked to geographic origins.
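The character n-gram TF-IDF step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's pipeline: the four training strings are invented toy lines, each standing in for one poet's aggregated verse; the n-gram sizes (1 and 2) and the smoothed IDF formula are assumptions; and a nearest-centroid classifier with cosine similarity is swapped in for the study's classical models, which this summary does not name. The domain features (imagery, season, allusion) are omitted.

```python
from collections import Counter
from math import log, sqrt

def char_ngrams(text, n_values=(1, 2)):
    """Character n-grams of the given sizes, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n])
            for n in n_values for i in range(len(chars) - n + 1)]

def fit_idf(docs):
    """Smoothed inverse document frequency for every n-gram seen in training."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc)))
    n = len(docs)
    return {g: log((1 + n) / (1 + c)) + 1.0 for g, c in df.items()}

def tfidf(doc, idf):
    """L2-normalised sparse TF-IDF vector; n-grams unseen in training are dropped."""
    tf = Counter(g for g in char_ngrams(doc) if g in idf)
    vec = {g: c * idf[g] for g, c in tf.items()}
    norm = sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}

def centroid(vectors):
    """Mean of a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for g, w in vec.items():
            total[g] += w
    return {g: w / len(vectors) for g, w in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict(doc, idf, centroids):
    """Label of the region centroid closest to the document."""
    vec = tfidf(doc, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Invented toy corpus: one string per hypothetical poet, with a region label.
train = [("江南春水绿荷花", "South"), ("湖上莲舟采菱歌", "South"),
         ("塞北风沙胡马鸣", "North"), ("关山雪月戍楼寒", "North")]

idf = fit_idf([text for text, _ in train])
centroids = {
    label: centroid([tfidf(t, idf) for t, lab in train if lab == label])
    for label in {lab for _, lab in train}
}

print(predict("春水荷花满江南", idf, centroids))
print(predict("胡马踏雪关山", idf, centroids))
```

Because every feature is a literal character sequence, the highest-weighted n-grams in each centroid can be read off directly, which is what makes this representation interpretable in a way a transformer embedding is not.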

Topics

Best for: AI Scientist, Research Scientist, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.