A Zipf-preserving, long-range correlated surrogate for written language and other symbolic sequences

2026-03-04 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

A new surrogate model generates symbolic sequences that preserve both the empirical symbol frequencies and the long-range correlation structure of the original sequence. Existing surrogate methods typically maintain either the frequency distribution or the correlation properties, but not both. The technique maps fractional Gaussian noise (FGN) onto the empirical histogram of the original sequence using a frequency-preserving assignment. The resulting surrogates match the original in first-order statistics (such as Zipf's law for word frequencies) and in long-range scaling, quantified by the detrended fluctuation analysis (DFA) exponent, while short-range dependencies are randomized. The model has been validated on English and Latin texts and on genomic DNA, where it reproduces base composition and DFA scaling.

Key takeaway

For AI scientists analyzing symbolic systems such as natural language or genomic data, this surrogate model offers a way to separate the effects of symbol frequencies and long-range correlations from those of short-range structure. You can use it as a null model when testing hypotheses about the origins of scaling laws and memory effects. Because it preserves both properties at once, it is a stricter baseline for comparison than surrogates that preserve only one.

Key insights

A new model generates symbolic sequence surrogates preserving both symbol frequencies and long-range correlations.

Principles

Many symbolic sequences, written language in particular, exhibit Zipf's law and long-range correlations.
Mapping FGN onto an empirical histogram can preserve both the histogram and the long-range scaling.
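
For reference, the two properties named above are usually written as power laws. The notation below (rank r, frequency f, window size s, fluctuation function F, exponents a and alpha, Hurst exponent H) is a common textbook convention and is assumed here; it is not taken from the original article.

```latex
% Zipf's law: frequency of the symbol with rank r
f(r) \propto r^{-a}, \qquad a \approx 1 \ \text{for word frequencies}

% DFA scaling: root-mean-square fluctuation in windows of size s
F(s) \propto s^{\alpha}, \qquad
\alpha = 0.5 \ \text{(no long-range correlation)}, \quad
\alpha > 0.5 \ \text{(persistent long-range correlation)}

% For fractional Gaussian noise with Hurst exponent H
\alpha \approx H
```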

Method

Map fractional Gaussian noise (FGN) onto the empirical histogram of the original sequence via a frequency-preserving assignment. The resulting surrogates match the original's first-order statistics and DFA exponent, while short-range dependencies are randomized.
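
The mapping step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how the FGN is sampled or how the assignment is carried out. Here the FGN is sampled by Cholesky factorisation of its autocovariance matrix (practical only for short sequences), and the assignment gives each symbol a contiguous block of noise ranks whose size equals that symbol's count. The function names fgn_cholesky and surrogate, and the hurst and seed parameters, are hypothetical.

```python
from collections import Counter

import numpy as np


def fgn_cholesky(n, hurst, rng):
    """Sample n points of fractional Gaussian noise (O(n^3), small n only)."""
    k = np.arange(n)
    # Autocovariance of unit-variance FGN at lag k.
    gamma = 0.5 * (
        np.abs(k + 1) ** (2 * hurst)
        - 2 * np.abs(k) ** (2 * hurst)
        + np.abs(k - 1) ** (2 * hurst)
    )
    # Toeplitz covariance matrix: entry (i, j) depends on |i - j| only.
    cov = gamma[np.abs(k[:, None] - k[None, :])]
    return np.linalg.cholesky(cov) @ rng.standard_normal(n)


def surrogate(sequence, hurst, seed=0):
    """Return a surrogate with exactly the same symbol counts as `sequence`.

    ASSUMPTION: rank-block assignment. Positions are ordered by noise value,
    and the lowest ranks receive the most frequent symbol, and so on.
    """
    rng = np.random.default_rng(seed)
    noise = fgn_cholesky(len(sequence), hurst, rng)
    # Pool of symbols: each symbol repeated `count` times, most frequent first.
    pool = [sym for sym, count in Counter(sequence).most_common()
            for _ in range(count)]
    out = [None] * len(sequence)
    # argsort lists positions from lowest to highest noise value.
    for rank, pos in enumerate(np.argsort(noise)):
        out[pos] = pool[rank]
    return out


if __name__ == "__main__":
    words = "the cat sat on the mat and the dog sat on the log".split() * 20
    fake = surrogate(words, hurst=0.8, seed=1)
    print(Counter(fake) == Counter(words))
```

Because every symbol is placed exactly as many times as it occurs in the input, the histogram is preserved by construction. Whether the long-range scaling survives the discretisation is the claim the paper tests; this sketch does not verify it. A natural choice for hurst would be the DFA exponent measured on the original sequence, though that too is an assumption here.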

In practice

Analyze scaling laws in language and DNA.
Test hypotheses on memory effects in symbolic systems.
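
Comparing an original sequence with its surrogates requires an estimate of the DFA exponent. The sketch below is a plain first-order DFA on a numeric series, written as an illustration rather than as the paper's procedure. The function name dfa_exponent and the default window sizes are hypothetical, and the step that turns symbols into numbers (for example a word-rank or indicator encoding) is left out because the summary does not specify it.

```python
import numpy as np


def dfa_exponent(x, scales=(8, 16, 32, 64, 128)):
    """Estimate the DFA-1 scaling exponent of a numeric series."""
    x = np.asarray(x, dtype=float)
    # Profile: cumulative sum of the mean-subtracted series.
    profile = np.cumsum(x - x.mean())
    fluctuations = []
    for s in scales:
        n_win = len(profile) // s
        # Non-overlapping windows of length s, one window per column.
        windows = profile[: n_win * s].reshape(n_win, s).T
        t = np.arange(s)
        # Least-squares line fitted to every window at once.
        slope, intercept = np.polyfit(t, windows, 1)
        residual = windows - (np.outer(t, slope) + intercept)
        # Root-mean-square fluctuation around the local linear trend.
        fluctuations.append(np.sqrt(np.mean(residual ** 2)))
    # Exponent = slope of log F(s) against log s.
    alpha, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    white = rng.standard_normal(4096)
    print(dfa_exponent(white))             # uncorrelated noise
    print(dfa_exponent(np.cumsum(white)))  # random walk
```

Uncorrelated noise is expected to give an exponent near 0.5 and a random walk one near 1.5, so running the estimator on both is a quick sanity check before applying it to an encoded text or DNA sequence and its surrogates.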

Topics

Symbolic Sequences
Long-Range Correlations
Zipf's Law
Fractional Gaussian Noise
Detrended Fluctuation Analysis

Best for: AI Scientist, AI Researcher, Data Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.