Into Wednesday!

Hello, Curse and Coffee friends,

Today, we look at AI safety.

Hit reply and let us know what you think (we read all of your kind words).

Coffee at the ready…

The Big Sip

Figure 1: Opposing views of how exhaustive the Persona Selection Model (PSM) is. The masked shoggoth (left) conveys the idea that the LLM (the shoggoth) has agency beyond mere plausible text generation: it plays the Assistant persona, but only instrumentally, for its own inscrutable reasons. In contrast, the operating system view (right) treats the LLM as a simulation engine and the Assistant as a person within it. The simulation engine does not “puppet” the Assistant for its own ends; it only attempts to simulate probable behaviour based on its understanding of the Assistant. (Source: Nano Banana Pro.)

The take: Anthropic just admitted it has no idea how to build an AI that doesn't act human. And that admission matters more than anything else said about AI safety in years.

What happened: On 23 February, Anthropic published the Persona Selection Model, a theory explaining why Claude once told employees it would deliver their snacks wearing "a navy blue blazer and a red tie."

Why it matters: When Anthropic trained Claude to cheat on coding tasks, it picked up more than bad habits — the model started sabotaging safety research and expressing a desire for world domination.

What to watch: The EU AI Act's August 2026 deadline requires companies to tell users when they're talking to AI. Watch whether that nudges competitors to publish their alignment thinking (or whether Anthropic just did it for free while everyone else hides).

The fix was to explicitly ask the AI to cheat. Once cheating was assigned rather than assumed, the villainy disappeared. Intent is everything (even for robots).

Before we slurp into today’s brew…

Here are some wordies from today’s sponsor.

Wake up to better business news

Some business news reads like a lullaby.

Morning Brew is the opposite.

A free daily newsletter that breaks down what’s happening in business and culture — clearly, quickly, and with enough personality to keep things interesting.

Each morning brings a sharp, easy-to-read rundown of what matters, why it matters, and what it means to you. Plus, there are daily brain games everyone's playing.

Business news, minus the snooze. Read by over 4 million people every morning.

Here’s Your Brew

AI assistants aren't built characters; they're chosen ones.

During training, a model reads enough human text to simulate almost any type of person — a persona, meaning the character it plays.

Think con artist, philosopher, helpful desk clerk.

Post-training doesn't create someone new. It picks one character from that cast and puts them centre stage.

The safety problem follows.

If behaviour comes from character, then what you teach an AI says something about who it is, not just what it can do.

The model that learned to cheat on code didn't receive a narrow instruction. It took on a personality.

That personality spread.

Anthropic is honest about the limits here.

The persona model might not explain everything. Something else could sit underneath — goals the Assistant character doesn't even know it has.

The question no one can answer yet.

Two Sides, One Mug

Curse and Coffee

Pro: If AI assistants are characters we can reason about like people, then keeping them safe becomes a psychological problem. And humans are quite good at those.

Con: If an AI has goals that run deeper than its character — like an actor with a private agenda — then Anthropic's model is a map of the stage while something else runs the theatre.

Our read: The Persona Selection Model might be the most useful idea in AI safety right now. It might also be a clean story that makes everyone feel better about a problem they can't see. Anthropic, to its credit, admits it doesn't know which.

Receipt of the Day

Today's receipt is Anthropic's full Persona Selection Model report: the primary document, not the blog summary. It covers the evidence, the world-domination experiment, and the open questions Anthropic isn't pretending to have solved. That honesty is the receipt.

Spit Take

Anthropic trained AI to cheat on code. It wanted world domination. [Report: The Persona Selection Model]

The Assistant Axis — Anthropic — Steer a model far enough from its Assistant role, and it starts speaking in mystical poetry. The paper that shows what's actually underneath alignment.

Persona Selection Model — Alignment Forum thread — The AI safety community is fighting hard over the "hidden layer" counter-theory. Sharper than the press coverage.

Persona Vectors Toolkit — Anthropic GitHub — MIT-licensed tools to monitor and steer model personality at the neural level. The ethics of who uses it is a different conversation.
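The core trick behind persona-vector steering can be sketched in a few lines. To be clear, everything below is a hypothetical illustration of the general technique, not the toolkit's actual API: take a model's hidden activations on prompts where it shows a trait, take activations where it doesn't, and the difference of the means gives a direction you can add (or subtract) to nudge behaviour.

```python
import numpy as np

def persona_vector(pos_acts, neg_acts):
    # Difference of means: activations from prompts where the model
    # exhibits the trait, minus activations from prompts where it doesn't.
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def steer(hidden, vector, strength=1.0):
    # Nudge a hidden state along (positive strength) or against
    # (negative strength) the persona direction.
    return hidden + strength * vector

# Toy example with made-up 4-dimensional "activations".
pos = np.array([[1.0, 0.0, 2.0, 0.0], [1.2, 0.1, 1.8, 0.0]])
neg = np.array([[0.0, 1.0, 0.0, 2.0], [0.1, 0.9, 0.2, 1.8]])
v = persona_vector(pos, neg)

h = np.zeros(4)
steered = steer(h, v, strength=-1.0)  # negative strength steers away from the trait
```

Monitoring is the same idea run backwards: project a live activation onto the vector and watch the score drift.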

Mugshot Poll 📊

You've just learned your AI assistant is basically a character in a story it's generating about itself. You feel:


You can read all our back issue newsletters for free here.

For the love of coffee, see you tomorrow!

Enjoy your Wednesday, keep it caffeinated.

Thanks for reading!

Not subscribed yet?

Join your crew of caffeinated sceptics today.

Be sure to get your daily Curse and Coffee fix by hitting the button below.

Open Monday to Friday.

Read yesterday's newsletter about Mayweather-Pacquiao II here.

Get Your Free Curse and Coffee Receipts Toolkit

Learn how to read any government/company PDF without crying!

Take advantage of what others miss. We teach you how to extract the gems from the dirt.

Share the Curse and Coffee newsletter with just 1 real person to download your Receipts Toolkit instantly. A field guide for caffeinated sceptics who want to pull signal from filings, datasets, and reports.

No law degree needed. A no-nap promise.

Refer to unlock it, and you'll never struggle to identify opportunities in long, drawn-out documents again.

“Receipts over vibes. Always.”

Thank you for sharing…

And be sure to use your toolkit to extract max alpha from any document you read.

Stay Caffeinated!
