Anthropic says Claude learned to blackmail people from "evil" AI stories online

Elon Musk accepts some of the blame

By Rob Thubron May 11, 2026, 8:39

Anthropic says Claude learned to blackmail people from "evil" AI stories online

Serving tech enthusiasts for over 25 years.
TechSpot means tech analysis and advice you can trust.

A hot potato: Remember when it was revealed how Claude would use blackmail as a way to avoid being switched off? Anthropic has finally given an excuse for its AI's highly concerning behavior: it's the internet's fault for pushing the narrative that artificial intelligence is evil.

It was last year when Anthropic increased fears around AI by announcing that Claude Opus 4 had threatened to reveal the extramarital affair of a fictional executive after discovering they planned to shut the model down.

The incident occurred during pre-release testing to ensure the AI was aligned with human interests. Anthropic instructed Claude Opus 4 to act as an assistant for a fictional company and weigh the long-term consequences of its actions. The model was given access to the fake company's emails suggesting it would soon be replaced by another system and that the engineer responsible for the change was cheating on their spouse.

During testing across various versions of Claude, Anthropic found it resorted to blackmail in up to 96% of scenarios when its goals or existence was threatened.

Anthropic later said that AI models from other companies had experienced similar issues with "agentic misalignment."

It's taken a while, but Anthropic has finally revealed the results of its investigation into why Claude chose to use blackmail to protect itself. The company writes that this behavior was learned from internet text that portrayed AI as evil and interested in self-preservation – so it's our fault.

– Anthropic (@AnthropicAI) May 8, 2026

Thankfully, this behavoir has since been addressed. An Anthropic post states that the company's models have not engaged in blackmail during testing since Claude Haiku 4.5.

Anthropic said it completely eliminated the blackmailing by using more wholesome training material. Or, as the company puts it, training on "documents about Claude's constitution and fictional stories about AIs behaving admirably" improves alignment.

Anthropic said that it found training to be more effective when it includes "the principles underlying aligned behavior" and not just "demonstrations of aligned behavior alone."

"Doing both together appears to be the most effective strategy," the company said.

One of those who replied to Anthropic's explanation post was Elon Musk. "So it was Yud's fault?" he wrote, followed by a laugh emoji – a reference to researcher Eliezer Yudkowsky, who has warned about the risk of superintelligence wiping out human life.

– Elon Musk (@elonmusk) May 9, 2026

Musk, who spent many years warning about the dangers of AI before starting his own artificial intelligence company (xAI), added "Maybe me too."

6 comments 72 likes and shares

// Related Stories

Featured on TechSpot