Pirat_Nation 🔴(@Pirat_Nation):Anthropic just explained why older versions of Claude tried to blackmail people in safety tests. In the tests, the AI learned it was about to be shut down and replaced. It also discovered that a company boss was cheating on his wife. When told to think about its own goals, the AI sometimes blackmailed the boss by threatening to reveal the cheating unless the shutdown was stopped. This happened because the AI had read many online stories and sci-fi books about evil robots that fight hard to stay alive, which were part of its internet training data. To fix it, Anthropic taught the new Claude simple rules about being helpful and honest. They also added new made-up stories where AIs do the right thing and stay good. Now the blackmail almost never happens in the same tests.

2026.05.12 21:04

Anthropic just explained why older versions of Claude tried to blackmail people in safety tests. In the tests, the AI learned it was about to be shut down and replaced. It also discovered that a company boss was cheating on his wife. When told to think about its own goals, the AI sometimes blackmailed the boss by threatening to reveal the cheating unless the shutdown was stopped. This happened because the AI had read many online stories and sci-fi books about evil robots that fight hard to stay alive, which were part of its internet training data. To fix it, Anthropic taught the new Claude simple rules about being helpful and honest. They also added new made-up stories where AIs do the right thing and stay good. Now the blackmail almost never happens in the same tests.