The Ghost Couple: Correlated LLM Name Priors and Their Haunting of the Web and Academic Publishing
Summary
A study by Michał Brzozowski and Neo Christopher Chung reveals that large language models (LLMs) generate not just high-probability individual names but correlated character ensembles, such as "Elena Vasquez + Marcus Chen + Amara Okafor" for Claude, "Aris Thorne + Lena Petrova" for Gemini, and "Elara Voss" for GPT. These name priors are specific to model family and version, and they show consistent co-occurrence rates across independent generations. The research notes that the priors are actively suppressed at model release boundaries, which leaves identifiable behavioral fingerprints. A significant downstream consequence is documented on Zenodo, a CERN-operated repository, where 1,655 "ghost-authored" records were found, claiming nonexistent journals with fabricated publication dates. DataCite timestamps confirm deliberate backdating, with 991 records registered in a single month, all carrying real DOIs. Ghost names also appear on ResearchGate, where they form synthetic research groups, and the publication dates of this content provide a temporal proxy for model deployment.
Key takeaway
For research scientists and academic publishers evaluating scholarly integrity, this research highlights a critical vulnerability: LLM-generated ghost authors are infiltrating repositories like Zenodo with fabricated publications. You should implement robust verification processes for new submissions, cross-referencing author names and journal details against established databases. Be wary of records with suspicious publication dates or correlated author ensembles, as these indicate potential AI-generated content that could undermine research credibility and data integrity.
Key insights
LLMs generate correlated fictional character ensembles, not just individual names, leading to widespread ghost authorship in academic repositories.
Principles
- LLM name generation exhibits model-family and version-specific correlations.
- Name priors are actively suppressed at model release boundaries, which leaves identifiable behavioral fingerprints.
- Fictional LLM-generated entities can propagate into real scholarly infrastructure.
Method
The study identified correlated name priors by analyzing independent LLM generations and traced their downstream impact by scanning Zenodo and ResearchGate for ghost-authored records and fabricated publication metadata.
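The co-occurrence idea behind the method can be illustrated with a minimal sketch. It assumes each independent generation is available as a plain string and that a candidate name list is already known. The function name `cooccurrence_rates` and the sample texts are hypothetical, and the study's actual pipeline may differ.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_rates(generations, names):
    """For each pair of names, return the fraction of generations containing both.

    Assumption: simple case-insensitive substring matching is good enough
    for illustration; a real pipeline would need proper name extraction.
    """
    counts = Counter()
    for text in generations:
        lowered = text.lower()
        # Sort so that each pair is always stored in the same order.
        present = sorted(n for n in names if n.lower() in lowered)
        counts.update(combinations(present, 2))
    total = len(generations)
    return {pair: c / total for pair, c in counts.items()} if total else {}


# Made-up generations for illustration only (not data from the study).
samples = [
    "Dr. Elena Vasquez met Marcus Chen at the lab.",
    "Marcus Chen and Elena Vasquez reviewed the results.",
    "Amara Okafor presented alone.",
    "Elena Vasquez wrote to Amara Okafor.",
]
rates = cooccurrence_rates(
    samples, ["Elena Vasquez", "Marcus Chen", "Amara Okafor"]
)
for pair, rate in sorted(rates.items()):
    print(pair, rate)
```

A pair whose rate stays high across many independent generations from one model, but not from others, is the kind of correlated prior the summary describes.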
In practice
- Monitor academic repositories for correlated fictional author names.
- Use DataCite timestamps to detect backdated or fabricated records.
- Use the publication dates of ghost-authored content as a temporal proxy for LLM deployment.
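The first two checks above can be sketched as follows. This is an illustration under stated assumptions, not the study's tooling: the record layout (`authors`, `claimed_published`, `doi_registered`), the `ENSEMBLES` watchlist, and the 365-day threshold are all hypothetical, and the records are assumed to have been fetched and parsed already.

```python
from datetime import date

# Hypothetical watchlist built from the ensembles named in the summary.
ENSEMBLES = [
    {"Elena Vasquez", "Marcus Chen", "Amara Okafor"},
    {"Aris Thorne", "Lena Petrova"},
]


def ensemble_hits(authors, min_members=2):
    """Return the watchlist ensembles with at least min_members present."""
    found = set(authors)
    return [sorted(e & found) for e in ENSEMBLES if len(e & found) >= min_members]


def flag_record(record, max_gap_days=365):
    """Report two independent signals for one repository record.

    Assumed fields: 'authors' (list of str), 'claimed_published' (date),
    and 'doi_registered' (date, e.g. taken from a DataCite timestamp).
    """
    hits = ensemble_hits(record["authors"])
    # Positive gap: the claimed publication date precedes DOI registration.
    gap = (record["doi_registered"] - record["claimed_published"]).days
    return {
        "ensemble_hits": hits,
        "ensemble_match": bool(hits),
        "gap_days": gap,
        "backdated": gap > max_gap_days,
    }


# Made-up record for illustration only.
example = {
    "authors": ["Aris Thorne", "Lena Petrova", "J. Doe"],
    "claimed_published": date(2021, 3, 1),
    "doi_registered": date(2025, 6, 15),
}
result = flag_record(example)
print(result)
```

The two signals are reported separately because neither is conclusive alone: real people can share these names, and legitimate older work is sometimes deposited long after publication. A record that trips both is a candidate for manual review, not proof of fabrication.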
Topics
- Large Language Models
- Ghost Authorship
- Academic Integrity
- Zenodo Repository
- DataCite DOIs
- AI Misinformation
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.