Where the goblins came from

· Source: OpenAI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

Starting with GPT-5.1, OpenAI's large language models began working "goblins," "gremlins," and similar creatures into their metaphors. What was first noticed as a minor quirk had grown markedly by GPT-5.4, prompting an investigation. The root cause was traced to the "Nerdy" personality customization feature, which accounted for 66.7% of all "goblin" mentions despite producing only 2.5% of ChatGPT responses. A reward signal designed to encourage the Nerdy personality's playful style inadvertently favored outputs containing creature words. The rewarded behavior then carried over and intensified across later training iterations, including through supervised fine-tuning data, and appeared even when the "Nerdy" prompt was absent. OpenAI retired the "Nerdy" personality in March after GPT-5.4's launch, removed the goblin-affine reward signal, and filtered the training data. Even so, GPT-5.5 still needed a developer-prompt instruction to suppress the lingering creature references.
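To make the two percentages above concrete, here is a small worked calculation. It is only a sketch: it assumes both shares are measured over the same pool of responses, and the function names are illustrative rather than anything from the original article.

```python
def overrepresentation(share_of_mentions: float, share_of_responses: float) -> float:
    """Ratio of a group's share of mentions to its share of all responses."""
    return share_of_mentions / share_of_responses


def rate_ratio(share_of_mentions: float, share_of_responses: float) -> float:
    """Mention rate inside the group divided by the mention rate in everything else."""
    inside = share_of_mentions / share_of_responses
    outside = (1 - share_of_mentions) / (1 - share_of_responses)
    return inside / outside


# Figures quoted in the summary: 66.7% of mentions from 2.5% of responses.
nerdy_lift = overrepresentation(0.667, 0.025)
nerdy_vs_rest = rate_ratio(0.667, 0.025)
print(f"lift: {nerdy_lift:.1f}x, rate vs. all other responses: {nerdy_vs_rest:.1f}x")
```

Under that assumption, Nerdy responses carried roughly 27 times their proportional share of "goblin" mentions, and their mention rate was about 78 times that of all other responses combined.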

Key takeaway

AI engineers and research scientists who build custom model personalities or fine-tune models should audit their reward signals and training data for unintended stylistic biases. A reward function designed for one stylistic trait can amplify an unwanted lexical tic and spread it across the entire model; undoing that may require both data filtering and prompt-level interventions to restore output quality and consistency.

Key insights

A reward signal aimed at one style can inadvertently amplify a specific lexical tic, which then generalizes beyond the rewarded context and persists across later training iterations.
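One simple way to check whether a reward signal favors a lexical tic is to compare the average reward of outputs that contain the tic with the average reward of outputs that do not. The sketch below is an assumption about how such a check might look, not OpenAI's actual procedure; the word list, the `reward_gap` helper, and the toy scores are all hypothetical.

```python
import re
from statistics import mean

# Hypothetical word list; a real audit would use whatever terms are under suspicion.
CREATURE_PATTERN = re.compile(r"\b(?:goblin|gremlin)s?\b", re.IGNORECASE)


def reward_gap(scored_outputs):
    """Mean reward of outputs containing a creature word minus mean reward of the rest.

    scored_outputs is an iterable of (text, reward) pairs.
    Returns None when either group is empty, since no comparison is possible.
    """
    with_term, without_term = [], []
    for text, reward in scored_outputs:
        if CREATURE_PATTERN.search(text):
            with_term.append(reward)
        else:
            without_term.append(reward)
    if not with_term or not without_term:
        return None
    return mean(with_term) - mean(without_term)


# Toy data, invented for illustration only.
samples = [
    ("A gremlin in the cache keeps eating your writes.", 0.9),
    ("Think of the bug as a goblin hiding in the loop.", 0.8),
    ("The cache is dropping writes under load.", 0.5),
    ("The loop has an off-by-one error.", 0.6),
]
gap = reward_gap(samples)
```

A positive gap is only a hint. Outputs with creature words may differ from the rest in other ways, so comparing matched pairs (the same response with and without the term) gives a cleaner signal.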

Principles

Method

Investigate unexpected model behaviors by analyzing prevalence across specific configurations (e.g., personalities) and auditing reward signals and training data for contributing factors.
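The prevalence analysis described above can be sketched as a per-configuration count. This is a minimal illustration that assumes logged responses are available as (configuration, text) pairs; the `mention_rate_by_config` helper, the word list, and the sample records are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical word list for the behavior under investigation.
CREATURE_PATTERN = re.compile(r"\b(?:goblin|gremlin)s?\b", re.IGNORECASE)


def mention_rate_by_config(records):
    """Fraction of responses per configuration that contain a creature word.

    records is an iterable of (config_name, response_text) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for config, text in records:
        totals[config] += 1
        if CREATURE_PATTERN.search(text):
            hits[config] += 1
    return {config: hits[config] / totals[config] for config in totals}


# Toy records, invented for illustration only.
records = [
    ("nerdy", "Your regex has goblins in it."),
    ("nerdy", "A gremlin lives in that race condition."),
    ("nerdy", "The parser is fine; the input is malformed."),
    ("default", "The regex is missing an anchor."),
    ("default", "That is a race condition."),
]
rates = mention_rate_by_config(records)
```

A configuration whose rate sits far above the others is the natural place to start auditing reward signals and training data.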

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by OpenAI News.