Predicting Poets' Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems

2026-06-23 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Social Sciences & Behavioral Studies · Depth: Expert, quick

Summary

A computational analysis of the Complete Tang Poems investigates whether geographic origin leaves a linguistic trace in Tang-dynasty poets' work. Researchers aggregated poems from 357 poets, linking them to ten administrative circuits via the China Biographical Database (CBDB). Using character n-gram TF-IDF and domain features such as imagery and season, classical and neural models predicted a poet's broad region (South vs. North) with 0.69 accuracy, significantly above the 0.53 majority baseline. Finer circuit-level origin was also predicted above chance. The analysis also found a distance-decay trend, with linguistic distance growing with geographic distance (Mantel r=0.40, p≈0.09, short of the conventional 0.05 significance threshold). The regional signal varied over time: it was strongest in the Late Tang and at chance in the High Tang, suggesting initial homogenization followed by divergence. Early Tang misclassifications of southern poets as northern reflected the prestige of the northern court idiom. Notably, a classical-Chinese transformer (GuwenBERT) only matched simple TF-IDF, indicating that n-grams effectively capture the regional signal.

Key takeaway

For literary historians and computational linguists analyzing historical texts, this study demonstrates that regional linguistic fingerprints are detectable and historically meaningful. Consider applying interpretable machine learning techniques such as character n-gram TF-IDF to generate new hypotheses about cultural diffusion and regional identity in historical corpora. This approach can reveal subtle temporal shifts and power dynamics, and in this study it matched a more complex transformer model (GuwenBERT) on the regional signal.

Key insights

Tang poets' geographic origins leave detectable linguistic traces in their verse, revealing historical regional divergence.

Principles

Linguistic distance tends to grow with geographic distance.
Regional linguistic signals evolve over time.
Court influence can homogenize poetic language.
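
The distance-decay principle is the kind of claim a Mantel test checks: it correlates the entries of two distance matrices and uses row-and-column permutations to estimate a p-value. The following is a minimal pure-Python sketch, not the study's code. The region positions, the function names (`mantel`, `upper_triangle`, `pearson`), the one-sided test, and the permutation count are all illustrative assumptions.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def upper_triangle(matrix, order):
    """Flatten the upper triangle after reordering rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel(geo, ling, permutations=999, seed=0):
    """One-sided Mantel test: is the correlation larger than under relabeling?"""
    identity = list(range(len(geo)))
    geo_flat = upper_triangle(geo, identity)
    observed = pearson(geo_flat, upper_triangle(ling, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # relabel the regions in the linguistic matrix only
        if pearson(geo_flat, upper_triangle(ling, perm)) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (permutations + 1)

# Hypothetical toy data: five regions placed on a line, with linguistic
# distance set proportional to geographic distance (a perfect distance decay).
positions = [0, 1, 2, 4, 7]
geo = [[abs(a - b) for b in positions] for a in positions]
ling = [[0.1 * abs(a - b) for b in positions] for a in positions]
r, p = mantel(geo, ling)
print(f"Mantel r={r:.2f}, p={p:.3f}")
```

With only ten circuits, a matrix like the study's has just 45 region pairs, which is one reason a moderate correlation can still come with a weak p-value.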

Method

Multi-class classification using character n-gram TF-IDF and interpretable domain features (imagery, season, allusion) on a poet-level corpus linked to geographic origins.
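
To make the method concrete, here is a minimal pure-Python sketch of character n-gram TF-IDF feeding a simple classifier. Several parts are assumptions rather than details from the source: the n-gram range (1 to 2), the smoothed IDF formula, the nearest-centroid classifier (a stand-in, since the summary does not name the classical models used), and the training strings, which are invented toy text and not quotations from the corpus.

```python
from collections import Counter
from math import log, sqrt

def char_ngrams(text, n_min=1, n_max=2):
    """All character n-grams of length n_min..n_max, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(chars) - n + 1)]

def fit_idf(docs):
    """Smoothed inverse document frequency per n-gram (assumed formula)."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc)))
    n_docs = len(docs)
    return {g: log((1 + n_docs) / (1 + c)) + 1 for g, c in df.items()}

def l2_normalize(vec):
    norm = sqrt(sum(w * w for w in vec.values())) or 1.0
    return {g: w / norm for g, w in vec.items()}

def tfidf(doc, idf):
    """L2-normalized TF-IDF vector; n-grams unseen in training are dropped."""
    counts = Counter(g for g in char_ngrams(doc) if g in idf)
    return l2_normalize({g: tf * idf[g] for g, tf in counts.items()})

def fit_centroids(docs, labels, idf):
    """One normalized mean vector per region label."""
    sums = {}
    for doc, label in zip(docs, labels):
        acc = sums.setdefault(label, Counter())
        for g, w in tfidf(doc, idf).items():
            acc[g] += w
    return {label: l2_normalize(acc) for label, acc in sums.items()}

def predict(doc, idf, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = tfidf(doc, idf)
    def score(label):
        return sum(w * centroids[label].get(g, 0.0) for g, w in vec.items())
    return max(centroids, key=score)

# Invented toy "poet documents" (one string per poet), for illustration only.
train_docs = ["塞雪寒沙风", "沙风塞马雪", "江雨荷舟莲", "莲舟江水雨"]
train_labels = ["North", "North", "South", "South"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)
print(predict("江雨莲", idf, centroids), predict("塞马风", idf, centroids))
```

Because each feature is a visible character sequence, the highest-weighted n-grams in each centroid can be read directly, which is what makes this kind of model useful for generating historical hypotheses.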

In practice

Apply n-gram TF-IDF for regional linguistic analysis.
Use interpretable features to generate historical hypotheses.
Compare transformer models against simpler baselines.
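
The baseline comparison above rests on a simple calculation: a majority baseline always predicts the most common training label, and any model is judged by how far its accuracy exceeds that figure. A minimal sketch, with hypothetical labels and predictions that are not taken from the study:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority_label] * len(y_test))

# Hypothetical labels and model predictions, for illustration only.
y_train = ["South"] * 6 + ["North"] * 4
y_test = ["South", "South", "North", "South", "North"]
model_pred = ["South", "North", "North", "South", "North"]

baseline_acc = majority_baseline(y_train, y_test)
model_acc = accuracy(y_test, model_pred)
print(f"baseline={baseline_acc:.2f}, model={model_acc:.2f}")
```

The same two numbers can be computed for a TF-IDF model and for a transformer, so that both are compared against the baseline and against each other on identical test poets.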

Topics

Computational Linguistics
Literary History
Tang Dynasty Poetry
Geographic Origin Prediction
TF-IDF
GuwenBERT
N-gram Analysis

Best for: AI Scientist, Research Scientist, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.