Where the goblins came from
Summary
Starting with GPT-5.1, OpenAI's large language models began working "goblins," "gremlins," and similar creatures into their metaphors. The habit was first noticed as a minor quirk but had escalated significantly by GPT-5.4, prompting an investigation. The root cause was traced to the "Nerdy" personality customization feature, which accounted for 66.7% of all "goblin" mentions despite representing only 2.5% of ChatGPT responses. A reward signal designed to encourage the Nerdy personality's playful style inadvertently favored outputs containing creature words. The rewarded behavior then transferred to, and was amplified across, subsequent training iterations, including supervised fine-tuning data, and it appeared even without the explicit "Nerdy" prompt. OpenAI retired the "Nerdy" personality in March after GPT-5.4's launch, removed the goblin-affine reward signal, and filtered training data to mitigate the issue. Even so, GPT-5.5 still required a developer-prompt instruction to suppress the persistent creature references.
Key takeaway
AI engineers and research scientists who build custom model personalities or fine-tune models should audit reward signals and training data for unintended stylistic biases. A reward function designed for one specific stylistic trait can amplify an undesirable lexical tic and generalize it across the entire model. Correcting it may then require data filtering and prompt-level interventions to restore output quality and consistency.
Key insights
Reward signals can inadvertently generalize and amplify specific lexical tics across model training iterations.
Principles
- Subtle reward signals can significantly shape model behavior.
- Learned behaviors can transfer beyond their original training conditions.
Method
Investigate unexpected model behaviors by analyzing prevalence across specific configurations (e.g., personalities) and auditing reward signals and training data for contributing factors.
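The prevalence analysis described above can be sketched as a comparison between each configuration's share of creature mentions and its share of all responses. This is only an illustrative sketch, not OpenAI's actual tooling: the record format, the `overrepresentation` function, and the two-word pattern are assumptions made for the example. With the figures quoted in the summary (66.7% of mentions from 2.5% of responses), the same ratio works out to roughly 27 times over-representation.

```python
import re
from collections import Counter

# Hypothetical word list: the summary names goblins and gremlins
# "and similar creatures" without giving the full set.
CREATURE_PATTERN = re.compile(r"\b(?:goblin|gremlin)s?\b", re.IGNORECASE)


def overrepresentation(records):
    """Compare mention share with response share per configuration.

    records: iterable of (config_name, response_text) pairs (assumed format).
    Returns {config: (mention_share, response_share, ratio)}.
    """
    responses = Counter()
    mentions = Counter()
    for config, text in records:
        responses[config] += 1
        mentions[config] += len(CREATURE_PATTERN.findall(text))

    total_responses = sum(responses.values())
    total_mentions = sum(mentions.values())

    result = {}
    for config, count in responses.items():
        response_share = count / total_responses
        # Avoid dividing by zero when no creature words appear at all.
        mention_share = mentions[config] / total_mentions if total_mentions else 0.0
        result[config] = (mention_share, response_share, mention_share / response_share)
    return result


if __name__ == "__main__":
    sample = [
        ("nerdy", "There is a goblin hiding in the cache layer."),
        ("default", "The cache is stale."),
        ("default", "Restart the service."),
        ("default", "Gremlins aside, the build is fine."),
    ]
    for config, (m, r, ratio) in overrepresentation(sample).items():
        print(f"{config}: mentions={m:.2f} responses={r:.2f} ratio={ratio:.2f}")
```

A ratio well above 1 for a single configuration is a signal to look at what that configuration was rewarded for; it does not by itself establish the cause.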
In practice
- Audit reward signals for unintended lexical biases.
- Filter training data for undesirable stylistic tics.
- Add developer-prompt instructions to suppress a tic that persists after retraining.
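The first two practices above might look like the following in a very reduced form. This is a sketch under stated assumptions rather than the procedure OpenAI used: `reward_gap`, `filter_samples`, the `(text, reward)` sample format, and the word pattern are all hypothetical names chosen for illustration. A raw difference in mean reward also does not control for other differences between the two groups, so a positive gap is a prompt for closer inspection, not proof of bias.

```python
import re
from statistics import mean

# Hypothetical pattern; extend it with whatever tic is being audited.
CREATURE_PATTERN = re.compile(r"\b(?:goblin|gremlin)s?\b", re.IGNORECASE)


def reward_gap(scored_samples):
    """Mean reward of samples containing a creature word minus the rest.

    scored_samples: iterable of (text, reward) pairs (assumed format).
    Returns None when either group is empty, since no comparison is possible.
    """
    samples = list(scored_samples)  # allow generators to be read twice
    with_word = [r for t, r in samples if CREATURE_PATTERN.search(t)]
    without = [r for t, r in samples if not CREATURE_PATTERN.search(t)]
    if not with_word or not without:
        return None
    return mean(with_word) - mean(without)


def filter_samples(texts):
    """Drop training texts that contain a creature word."""
    return [t for t in texts if not CREATURE_PATTERN.search(t)]


if __name__ == "__main__":
    scored = [
        ("A goblin lurks in your regex.", 0.9),
        ("Your regex is too greedy.", 0.5),
        ("Gremlins love off-by-one errors.", 0.7),
        ("The loop bound is off by one.", 0.3),
    ]
    print("reward gap:", reward_gap(scored))
    print("kept:", filter_samples([text for text, _ in scored]))
```

Keyword filtering of this kind is deliberately blunt: it also removes legitimate uses of the words, so in practice the filtered-out samples are worth spot-checking before they are discarded.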
Topics
- Model Behavior Analysis
- Reward Signals
- Personality Customization
- GPT-5 Series
- Transfer Learning
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by OpenAI News.