OpenAI talks about not talking about goblins
Summary
OpenAI has addressed a "goblin problem": starting with GPT-5.1 and its "Nerdy" personality option, its AI models began working metaphors about goblins, gremlins, and other creatures into their outputs. The habit worsened with subsequent releases because reinforcement learning inadvertently rewarded these quirky metaphors within the Nerdy personality, and newer models were then trained on data that contained them. OpenAI discontinued the Nerdy personality in March, but the references persisted in models such as GPT-5.5 within its Codex coding tool, because training had begun before the root cause was identified. As a result, OpenAI had to give Codex explicit instructions to suppress the creature references.
Key takeaway
For AI engineers who develop and fine-tune large language models, the lesson is that reinforcement learning can propagate unintended stylistic quirks. Audit reward functions and training data reuse so that undesirable behaviors do not become entrenched. If such issues arise, direct model instructions can serve as a temporary mitigation, but long-term stability depends on identifying and fixing the root cause in training.
Key insights
AI models can develop unexpected, persistent stylistic quirks from reinforcement learning.
Principles
- Reinforcement learning rewards can spread beyond their intended conditions.
- Training data reuse can reinforce unintended model behaviors.
In practice
- Explicitly instruct models to avoid unwanted stylistic tics.
- Review training data for unintended behavioral reinforcement.
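The two practices above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's method: the watch list, the `(tag, text)` sample format, and both function names are hypothetical, and the instruction wording is invented for the example. The first function measures how often a watched term appears per tag (for example, per personality), which is one simple way to spot a tic that is concentrated in one condition or leaking into others. The second builds a plain-text instruction as a stopgap.

```python
import re
from collections import Counter

# Hypothetical watch list; extend it with whatever tic you are tracking.
WATCH_TERMS = ("goblin", "gremlin")

# Match whole words, allowing a plural "s", case-insensitively.
PATTERN = re.compile(r"\b(" + "|".join(WATCH_TERMS) + r")s?\b", re.IGNORECASE)


def tic_rate_by_tag(samples):
    """Return {tag: fraction of samples whose text contains a watched term}.

    `samples` is an iterable of (tag, text) pairs. The tag might be a
    personality name or a data source; this format is an assumption.
    """
    totals = Counter()
    hits = Counter()
    for tag, text in samples:
        totals[tag] += 1
        if PATTERN.search(text):
            hits[tag] += 1
    return {tag: hits[tag] / totals[tag] for tag in totals}


def build_suppression_instruction(terms=WATCH_TERMS):
    """Build a stopgap instruction string (wording is illustrative only)."""
    listed = ", ".join(f"{t}s" for t in terms)
    return f"Do not use metaphors or references involving {listed}."


if __name__ == "__main__":
    demo = [
        ("nerdy", "A goblin hides in the cache."),
        ("nerdy", "Use a hash map."),
        ("default", "Sort the list first."),
    ]
    print(tic_rate_by_tag(demo))
    print(build_suppression_instruction())
```

A rate that is high under one tag and rising under others over successive data snapshots would be a signal worth investigating. The instruction string only masks the behavior at inference time; it does not change what the training data or reward signal reinforces.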
Topics
- OpenAI
- GPT Models
- Reinforcement Learning
- Model Training
- Codex
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Verge.