The Ghost Couple: Correlated LLM Name Priors and Their Haunting of the Web and Academic Publishing

2026-06-01 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Content Integrity · Depth: Expert, medium

Summary

A study by Michał Brzozowski and Neo Christopher Chung finds that large language models (LLMs) generate not only high-probability individual names but also correlated character ensembles: "Elena Vasquez + Marcus Chen + Amara Okafor" for Claude, "Aris Thorne + Lena Petrova" for Gemini, and "Elara Voss" for GPT. These name priors are specific to model family and version, and the names co-occur at consistent rates across independent generations. The study also reports that the priors are actively suppressed at model release boundaries, which leaves identifiable behavioral fingerprints. A significant downstream consequence is documented on Zenodo, a CERN-operated repository: 1,655 "ghost-authored" records that cite nonexistent journals and carry fabricated publication dates. DataCite timestamps confirm deliberate backdating, with 991 records registered in a single month, all carrying real DOIs. Ghost names also appear on ResearchGate, where they form synthetic research groups, and their publication dates serve as a temporal proxy for model deployment.

Key takeaway

For research scientists and academic publishers evaluating scholarly integrity, this research highlights a critical vulnerability: LLM-generated ghost authors are infiltrating repositories like Zenodo with fabricated publications. Verify new submissions by cross-referencing author names and journal details against established databases. Be wary of records with suspicious publication dates or correlated author ensembles, as these can indicate AI-generated content that undermines research credibility and data integrity.

Key insights

LLMs generate correlated fictional character ensembles, not just individual names, leading to widespread ghost authorship in academic repositories.

Principles

LLM name generation exhibits model-family and version-specific correlations.
Name priors are actively suppressed at model release boundaries, which leaves identifiable behavioral fingerprints.
Fictional LLM-generated entities can propagate into real scholarly infrastructure.

Method

The study identified correlated name priors by analyzing independent LLM generations and traced their downstream impact by scanning Zenodo and ResearchGate for ghost-authored records and fabricated publication metadata.
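
The co-occurrence idea behind this method can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `cooccurrence_rates`, the plain substring matching, and the toy sample texts are all assumptions made for the example, and the texts are invented rather than taken from the study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_rates(generations, names):
    """For each pair of tracked names, return the fraction of
    generations in which both names appear (hypothetical helper)."""
    pair_counts = Counter()
    for text in generations:
        # Sort so each pair has one canonical ordering as a dict key.
        present = sorted(n for n in names if n in text)
        pair_counts.update(combinations(present, 2))
    total = len(generations)
    if total == 0:
        return {}
    return {pair: count / total for pair, count in pair_counts.items()}


# Invented sample generations, for illustration only.
samples = [
    "Dr. Elena Vasquez met Marcus Chen at the lab.",
    "Elena Vasquez and Amara Okafor reviewed the data with Marcus Chen.",
    "Aris Thorne wrote to Lena Petrova.",
    "A story with no tracked names.",
]
tracked = [
    "Elena Vasquez", "Marcus Chen", "Amara Okafor",
    "Aris Thorne", "Lena Petrova",
]
rates = cooccurrence_rates(samples, tracked)
```

Pairs that never appear together are simply absent from the result. A real analysis would presumably need many independent generations per model version and more careful name matching than substring search.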

In practice

Monitor academic repositories for correlated fictional author names.
Use DataCite timestamps to detect backdated or fabricated records.
Use the publication dates of ghost-authored content as a temporal proxy for model deployment.
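
The first two checks above could be combined into a simple screening function, sketched below in Python. Everything structural here is an assumption for illustration: the record layout (`authors`, `claimed_published`, `registered`), the function name `flag_record`, the two-name minimum, and the one-year backdating threshold are hypothetical and are not taken from the study or from the DataCite schema. Only the ensemble names come from the summary.

```python
from datetime import date

# Ensemble names as reported in the summary; the grouping into a
# lookup table is an assumption made for this sketch.
KNOWN_ENSEMBLES = {
    "claude": {"Elena Vasquez", "Marcus Chen", "Amara Okafor"},
    "gemini": {"Aris Thorne", "Lena Petrova"},
}


def flag_record(record, min_names=2, max_backdate_days=365):
    """Screen one record (hypothetical dict layout) for two signals:
    correlated ensemble authors and a claimed publication date far
    earlier than the DOI registration date."""
    authors = set(record["authors"])
    matched = sorted(
        family
        for family, names in KNOWN_ENSEMBLES.items()
        if len(authors & names) >= min_names
    )
    gap_days = (record["registered"] - record["claimed_published"]).days
    return {
        "ensemble_families": matched,
        "backdated": gap_days > max_backdate_days,  # assumed threshold
        "gap_days": gap_days,
    }


# Invented example records, not real repository entries.
suspect = {
    "authors": ["Elena Vasquez", "Marcus Chen"],
    "claimed_published": date(2019, 3, 1),
    "registered": date(2025, 6, 15),
}
ordinary = {
    "authors": ["Jane Doe"],
    "claimed_published": date(2025, 6, 1),
    "registered": date(2025, 6, 15),
}
suspect_flags = flag_record(suspect)
ordinary_flags = flag_record(ordinary)
```

A large gap between claimed publication and registration is only a prompt for manual review, since legitimate work is sometimes deposited long after it was published. Requiring at least two ensemble names also means a single recurring name such as "Elara Voss" would need a separate check.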

Topics

Large Language Models
Ghost Authorship
Academic Integrity
Zenodo Repository
DataCite DOIs
AI Misinformation

Code references

tinierZhao/Academic-Industrial-associations

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Research Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.