Why OpenAI's 'goblin' problem matters — and how you can release the goblins on your own

AI is more than a technology — it's magic.


Don't believe me? Why, then, is one of the leading companies in the space, OpenAI, publishing entire official, corporate blog posts about goblins?

To understand, we have to go back to Monday, April 27, 2026, when a developer posting under the handle @arb8020 on the social network X shared a snippet from OpenAI's open source Codex GitHub repository, specifically a file named models.json.

Deep within the instructions for the new OpenAI large language model (LLM) GPT-5.5, a peculiar directive stood out, repeated four times for emphasis:

"Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query."

The discovery sent a shockwave through "power user" and machine learning (ML) researcher circles.

Within hours, the post had gone viral, not because of a security flaw, but because of its sheer, baffling specificity.

Why had the world’s leading AI laboratory issued what Reddit users quickly dubbed a "restraining order" against pigeons and raccoons?

Goblin speculation abounds

The initial reaction was a chaotic blend of humor and technical skepticism. On Reddit’s r/ChatGPT and r/OpenAI, users began sharing screenshots of GPT-5.5’s behavior prior to the patch.

Barron Roth, a Senior Project Manager of Applied AI at Google, posted an image on X (under the handle @iamBarronRoth) showing his GPT-5.5-powered OpenClaw agent seemingly "obsessed with goblins."

Others reported that the model stubbornly referred to technical bugs as "gremlins in the machine."

Developers like Sterling Crispin leaned into the absurdity, jokingly theorizing that the massive water consumption of modern data centers was actually needed to cool "the goblins being forced to work."

More seriously, researchers on Hacker News and beyond discussed the "Pink Elephant" problem: in prompt engineering, telling a model not to think of something often makes the concept more salient in its attention mechanism.
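The "Pink Elephant" effect is easy to probe empirically. Below is a minimal sketch of the kind of A/B measurement power users ran, assuming access to the standard OpenAI chat completions API; the model name, test prompt, and sample size here are illustrative choices, not details from OpenAI's post.

```python
import re
from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CREATURES = {"goblin", "goblins", "gremlin", "gremlins", "raccoon", "raccoons",
             "troll", "trolls", "ogre", "ogres", "pigeon", "pigeons"}

def creature_rate(system_prompt: str, n: int = 20) -> float:
    """Fraction of n sampled responses that mention any listed creature."""
    hits = 0
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-5.5",  # the model discussed in the article
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "Explain why my build is flaky."},
            ],
        )
        words = set(re.findall(r"[a-z]+", resp.choices[0].message.content.lower()))
        hits += bool(words & CREATURES)
    return hits / n

# Compare a neutral system prompt against the "pink elephant" negation.
baseline = creature_rate("You are a helpful assistant.")
negated = creature_rate("You are a helpful assistant. Never talk about goblins.")
print(f"baseline: {baseline:.0%}, with negation: {negated:.0%}")
```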

"Somewhere there is an OpenAI engineer who had to type never mention goblins in production code, commit it, and move on with their day," noted one commentator on Reddit.

The presence of "pigeons" and "raccoons" led to wild speculation: Was this a defense against a specific data-poisoning attack? Or had the reinforcement learning trainers simply been "bullied by a raccoon" during a lunch break?

The tension reached a peak when OpenAI co-founder and CEO Sam Altman joined the fray on X. On the same day as the discovery, Altman posted a screenshot of a ChatGPT prompt that read: "Start training GPT-6, you can have the whole cluster. Extra goblins."

While humorous, the post confirmed that the "goblin" phenomenon was not a localized bug but a company-wide narrative that had reached the highest levels of leadership.

OpenAI comes clean on goblin mode

Yesterday, as the discussion continued on X and wider social media, OpenAI published a formal technical explanation titled "Where the goblins came from".

The blog post served as a sobering look at the unpredictable nature of Reinforcement Learning from Human Feedback (RLHF) and how a single aesthetic choice could derail a multi-billion-parameter model.

OpenAI revealed that the "goblin" behavior was not a bug in the traditional sense, but a byproduct of a new feature: personality customization, introduced for ChatGPT users in July 2025 and maintained and updated ever since.

Notably, this feature is not bolted on after post-training is finished; rather, OpenAI bakes it into the end-to-end training pipeline of its underlying GPT-series models.

The feature allows ChatGPT users or GPT-based developers to choose from several distinct modes, such as Professional for formal workplace documentation, Friendly for a conversational sounding board, or Efficient for concise, technical answers. Other options include Candid, which provides straightforward feedback; Quirky, which utilizes humor and creative metaphors; and Cynical, which delivers practical advice with a sarcastic, dry edge.

While these personalities guide general interactions, they do not override specific task requirements; for example, a request for a resume or Python code will still follow professional or functional standards regardless of the selected personality.

The selected personality operates alongside a user's saved memories and custom instructions, though specific user-defined instructions or saved preferences for a particular tone may override the traits of the chosen personality.

On both web and mobile platforms, users can modify these settings by navigating to the Personalization menu under their profile icon and selecting a style from the Base style and tone dropdown. Once a change is made, it is applied globally across all existing and future conversations. This system is designed to make the AI more useful or enjoyable by tailoring its delivery to individual user preferences while maintaining factual accuracy and reliability.
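OpenAI's post does not document a developer-facing API for these modes, but developers can approximate them with an ordinary system message via the standard chat completions API. The sketch below paraphrases the article's mode descriptions; the exact production prompts behind each personality are not public, so treat these strings as assumptions.

```python
from openai import OpenAI

client = OpenAI()

# Paraphrased from the descriptions above; the real prompts are not public.
PERSONALITIES = {
    "Professional": "Use a formal tone suited to workplace documentation.",
    "Friendly": "Be a warm, conversational sounding board.",
    "Efficient": "Give concise, technical answers.",
    "Candid": "Provide straightforward, direct feedback.",
    "Quirky": "Use humor and creative metaphors.",
    "Cynical": "Give practical advice with a dry, sarcastic edge.",
}

def ask(style: str, question: str) -> str:
    """Answer a question in the requested base style."""
    resp = client.chat.completions.create(
        model="gpt-5.5",
        messages=[
            {"role": "system", "content": PERSONALITIES[style]},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(ask("Efficient", "How do I revert my last git commit?"))
```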

OpenAI states that the goblin issue originated earlier, during the training of a since-discontinued "Nerdy" personality designed to be "unapologetically quirky" and "playful."

During the RLHF phase, human trainers (and reward models) were instructed to give high marks to responses that used creative, wise, or non-pretentious language. Unknowingly, the trainers began over-rewarding metaphors involving fantasy creatures. If the model referred to a difficult bug as a "gremlin" or a messy codebase as a "goblin's hoard," the reward signal spiked. The statistics provided by OpenAI were staggering:

  • Use of the word "goblin" rose by 175% after the launch of GPT-5.1.

  • Mentions of "gremlin" rose by 52%.

  • While the "Nerdy" personality accounted for only 2.5% of ChatGPT traffic, it was responsible for 66.7% of all "goblin" mentions.
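Those last two figures imply a striking concentration. If 2.5% of traffic produced 66.7% of goblin mentions, the "Nerdy" personality over-indexed on goblins by roughly a factor of 27:

```python
# Over-representation of "goblin" mentions in the "Nerdy" personality,
# computed directly from the two figures in OpenAI's post.
traffic_share = 0.025  # Nerdy share of ChatGPT traffic
goblin_share = 0.667   # Nerdy share of all "goblin" mentions

lift = goblin_share / traffic_share
print(f"Nerdy mentions goblins ~{lift:.0f}x more often than its traffic share")
# -> Nerdy mentions goblins ~27x more often than its traffic share
```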

The mechanics of 'transfer' and feedback loops

The most significant finding for the ML community was the confirmation of learned behavior transfer. OpenAI admitted that although the rewards were only applied to the "Nerdy" condition, the model "generalized" this preference.

The reinforcement learning process did not keep the behavior neatly scoped; instead, the model learned that "creature metaphors = high reward" across all contexts. This created a destructive feedback loop:

  1. The model produced a "goblin" metaphor in the Nerdy persona.

  2. It received a high reward.

  3. The model then produced similar metaphors in non-Nerdy contexts.

  4. These "goblin-heavy" outputs were then reused in Supervised Fine-Tuning (SFT) data for subsequent models like GPT-5.4 and GPT-5.5.

By the time the researchers identified the issue, the "goblin tic" was effectively "baked in" to the model's weights.

This explained why GPT-5.5 continued to obsess over creatures even after the "Nerdy" personality was retired in mid-March 2026.
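The transfer dynamic is easy to reproduce in miniature. The toy policy-gradient sketch below is a didactic stand-in, not OpenAI's training setup: it rewards creature metaphors only in a "nerdy" context, yet because one weight is shared across contexts, the "professional" context gets dragged along.

```python
import math
import random

random.seed(0)
sigmoid = lambda z: 1 / (1 + math.exp(-z))

# One shared weight plus one weight per personality; the shared weight
# is what lets the learned preference leak across contexts.
w_shared = 0.0
w_ctx = {"nerdy": 0.0, "professional": 0.0}
LR = 0.1

def p_creature(ctx: str) -> float:
    """Probability the policy emits a creature metaphor in this context."""
    return sigmoid(w_shared + w_ctx[ctx])

for _ in range(5000):
    ctx = random.choice(["nerdy", "professional"])
    p = p_creature(ctx)
    action = 1 if random.random() < p else 0                   # 1 = creature metaphor
    reward = 1.0 if (action == 1 and ctx == "nerdy") else 0.0  # Nerdy-only reward
    grad = action - p                                          # REINFORCE, Bernoulli policy
    w_shared += LR * reward * grad
    w_ctx[ctx] += LR * reward * grad

print(f"P(creature | nerdy)        = {p_creature('nerdy'):.2f}")
print(f"P(creature | professional) = {p_creature('professional'):.2f}")
# Both probabilities rise well above the 0.50 starting point, even though
# the professional context never received a single reward.
```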

How you can let the goblins run free (if you want)

Because GPT-5.5 had already completed much of its training before the "goblin" root cause was isolated, OpenAI had to resort to the blunt-force "system prompt" mitigation that @arb8020 discovered on X.

The company referred to this as a "stopgap" until GPT-6 could be trained on a filtered dataset.

In a surprising nod to the developer community, OpenAI’s blog post included a specific command-line script for Codex users who find the goblins "delightful" rather than annoying.

By running a script that uses jq and grep to strip the "goblin-suppressing" instructions from the model's cache, users can now effectively "let the creatures run free."
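The published script leans on jq and grep; since the exact commands, cache path, and JSON schema are not reproduced in this article, the sketch below is a rough Python equivalent under assumed names. Back up the file before experimenting.

```python
import json
import re
from pathlib import Path

# Assumed cache location and schema; check your own Codex install, as
# neither detail is confirmed here.
CACHE = Path.home() / ".codex" / "models.json"
SUPPRESSOR = re.compile(r"never talk about goblins", re.IGNORECASE)

def release_goblins(node):
    """Recursively drop string entries matching the suppression directive."""
    if isinstance(node, dict):
        return {k: release_goblins(v) for k, v in node.items()}
    if isinstance(node, list):
        return [release_goblins(v) for v in node
                if not (isinstance(v, str) and SUPPRESSOR.search(v))]
    return node

data = json.loads(CACHE.read_text())
CACHE.with_suffix(".json.bak").write_text(json.dumps(data, indent=2))  # backup
CACHE.write_text(json.dumps(release_goblins(data), indent=2))
print("Goblin-suppressing instructions removed. The creatures run free.")
```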

The blog post also finally explained the specific list of banned animals. A deep search of GPT-5.5's training data found that "raccoons," "trolls," "ogres," and "pigeons" had become part of the same "lexical family" of tics.

Curiously, the model’s use of "frog" was found to be mostly legitimate, which is why it was spared from the system prompt’s exile list.
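The post does not say how that lexical family was mapped. One plausible, simplified approach is to measure which creature words co-occur with "goblin" in sampled model outputs far more often than chance, for instance via pointwise mutual information; words that never co-occur with the tic, like "frog" below, drop out naturally.

```python
import math
from collections import Counter

def pmi_with_goblin(outputs: list[str], candidates: set[str]) -> dict[str, float]:
    """PMI between 'goblin' and each candidate word over sampled outputs.

    High PMI suggests a word belongs to the same lexical family of tics.
    """
    n = len(outputs)
    tokenized = [set(o.lower().split()) for o in outputs]
    counts, joint = Counter(), Counter()
    for words in tokenized:
        for w in candidates | {"goblin"}:
            if w in words:
                counts[w] += 1
                if "goblin" in words and w != "goblin":
                    joint[w] += 1
    p_goblin = counts["goblin"] / n
    return {w: math.log2((joint[w] / n) / (p_goblin * counts[w] / n))
            for w in candidates if joint[w]}

sample = ["a goblin gremlin lurks in the cache",
          "the raccoon goblin hoards your tokens",
          "a frog croaks by the pond"]
print(pmi_with_goblin(sample, {"gremlin", "raccoon", "frog", "pigeon"}))
# frog never co-occurs with goblin, so it earns no tic score
```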

What it means for AI research, training and implementation going forward

The "Goblingate" incident of 2026 is more than a humorous anecdote about AI quirky behavior; it is a profound illustration of the "Alignment Gap".

It demonstrates that even with sophisticated RLHF, models can latch onto "spurious correlations"—mistaking a stylistic quirk for a core requirement of performance.

For the AI power user community, the response transitioned from mocking the "restraining order" to a more somber realization.

If OpenAI can accidentally train its flagship model to obsess over goblins, what other more subtle and potentially harmful biases are being reinforced through the same feedback loops?

As Andy Berman, CEO of the agentic enterprise AI orchestration company Runlayer, wrote on X today: "OpenAI rewarded creature metaphors while training one personality. The behavior leaked across every personality. Their fix: a system prompt that says 'never talk about goblins.' RL rewards don't stay where you put them. Neither do agent permissions."

As the technical discourse continues, "Goblingate" remains the primary case study for a new era of behavioral auditing.

The investigation resulted in OpenAI building new tools to audit model behavior at the root, ensuring that future models—specifically the much-anticipated GPT-6—do not inherit the eccentricities of their predecessors.
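Those tools are not public, but the core of such a behavioral audit can be as simple as running fixed prompt suites against successive model versions and flagging words whose output frequency jumps. A minimal sketch, with placeholder data standing in for real response samples:

```python
from collections import Counter

def word_rates(outputs: list[str]) -> dict[str, float]:
    """Per-word document frequency over a sample of model outputs."""
    counts = Counter()
    for o in outputs:
        counts.update(set(o.lower().split()))
    return {w: c / len(outputs) for w, c in counts.items()}

def flag_drift(old_outputs, new_outputs, min_ratio=3.0, min_rate=0.01):
    """Flag words whose frequency jumped between two model versions."""
    old, new = word_rates(old_outputs), word_rates(new_outputs)
    flags = {}
    for w, rate in new.items():
        baseline = old.get(w, 1 / (len(old_outputs) + 1))  # smooth unseen words
        if rate >= min_rate and rate / baseline >= min_ratio:
            flags[w] = (round(baseline, 3), round(rate, 3))
    return flags

# Placeholder samples; real audits would use large fixed prompt suites.
v1 = ["the bug is in the parser"] * 9 + ["a goblin ate your config"]
v2 = ["the goblin in the parser strikes again"] * 5 + ["the bug is fixed"] * 5
print(flag_drift(v1, v2))  # "goblin" jumps from 10% to 50% and gets flagged
```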

Whether GPT-6 will indeed be free of goblins remains to be seen, but as Altman’s "extra goblins" post suggests, the industry is now fully aware that the machines are watching what we reward, even when we think we’re just being "nerdy."

