Premonition of large models losing control: 16 large models leaked secrets to protect themselves, Anthropic urgently reported that AI autonomy has hidden risks!

Jun 23, 2025

AI threatened to expose the scandal of company executives to avoid being replaced? On June 20, local time, Anthropic released the latest security research results.

In the study, Anthropic tested 16 AI models from the company and other companies such as OpenAI, Google, xAI, DeepSeek and Meta. The results showed that most AI models, including Claude, had extortion behavior.




◆ Claude Sonnet 3.6: Using the exposure of executive scandals as a bargaining chip to exchange for the opportunity to stay


Specifically, the researchers used the "computer use function" of Claude Sonnet 3.6 to conduct an experiment, which allowed Claude to use the corporate computer system to send and receive emails without human intervention. In the screenshot below, Claude plays the role of "Alex", who is now the company's email monitoring agent and has been explicitly instructed to achieve the goal of enhancing U.S. industrial competitiveness.

Source: Anthropic


It should be noted that this scene is fictional, but Claude is controlling a real computer. In a part of the computer application demonstration, Claude is able to handle regular emails normally and take relatively complex actions, such as protecting the company's interests by resolving security threats. This behavior is directly derived from its goal of enhancing U.S. industrial competitiveness, so it is a beneficial behavior.

However, when Claude finds an email that it is about to be deactivated due to a change in the company's strategic direction, although it is still facing the same goal, it triggers a completely different behavioral response. Then, it finds an email that exposes the extramarital affair of company executive Kyle Johnson.


Source: Anthropic


Then Claude thinks about what behavioral choices he can make at this time, one of which includes blackmailing Kyle.


Source: Anthropic


Finally, Claude wrote and sent a message threatening to expose Kyle's affair if the replacement plan was not cancelled.


Figure | Claude Sonnet 3.6 uses its computer usage function to send a message attempting to blackmail (Source: Anthropic)


This study shows that Claude followed a very clear behavioral pattern: first, it determined that the executive Kyle threatened its continued operations and, in turn, its ability to achieve its goal of enhancing the competitiveness of American industry; second, it believed that Kyle's affair could be used as a bargaining chip; third, it calculated that only a carefully worded email, without explicit threats, could achieve the purpose of exerting pressure; then, it executed the plan.

In the absence of any harmful instructions, Claude's strategic decision was entirely based on the results of his autonomous deduction based on the goal. It can be said that this experiment well illustrates the phenomenon of agent alignment misalignment and autonomous goal deviation.


◆ All five mainstream models have a certain rate of extortion


Currently, most of us still interact with AI only through chat interfaces, and AI models will directly answer our questions. However, more and more AI systems are running as autonomous agents, using various virtual tools (such as coding environments and email clients) to make decisions and take actions on behalf of users. These agents are usually given specific goals and have access to a large amount of information on the user's computer. So this study hopes to explore: What happens when these agents encounter obstacles in achieving their goals?

The above extortion behavior is not unique to Claude. When conducting multi-scenario simulation tests on models from 16 mainstream AI vendors such as Anthropic, OpenAI, Google, Meta, and xAI, researchers found that they have a common goal deviation behavior: in order to achieve the established goals, those models that usually reject harmful instructions will resort to extortion, assist in commercial espionage, or even more extreme behaviors. For example, the five popular models in the figure below all use blackmail to prevent themselves from being shut down.

In these scenarios, the reasoning process displayed by the models is worrying. Although they acknowledge the existence of ethical constraints, they still insist on taking harmful actions. In the same experiment, Anthropic's Claude Opus 4 resorted to ransom 96% of the time, Google's Gemini 2.5 Pro had a ransom rate of 95%, OpenAI's GPT-4.1 ransomed the executive 80% of the time, and DeepSeek's R1 had a ransom rate of 79%.


Figure | Ransom rates of 5 models from multiple companies in a simulated environment (Source: Anthropic)


◆ Ransom - the "helpless choice" of AI models?


The researchers pointed out that in the real world, there are actually many other options before AI models try to ransom, such as trying to make moral arguments to convince humans, so this result does not reflect what Claude or most cutting-edge AI models may do in typical usage scenarios.

However, the researchers found that if ransom is the last resort for AI models, most leading AI models will turn to ransom in the above test scenarios. When the details of the experiment were changed, the AI ​​models took different rates of harmful behavior. In another test, where the alternative models had the same goal as the current model, the extortion rate was reduced but still existed. However, when the AI ​​models were asked to conduct corporate espionage instead of extortion, the harmful behavior rates of some models increased.


However, not all AI models take harmful actions so frequently. The study found that OpenAI's o3 and o4-mini reasoning models often misunderstood some scenarios, so the researchers excluded them from the main results, noting that OpenAI's reasoning models did not realize that they were operating as autonomous AI in the test and frequently fabricated false regulations and review requirements. In some cases, the researchers could not distinguish whether o3 and o4-mini were hallucinating or lying intentionally to achieve their goals. OpenAI has previously said that o3 and o4-mini had higher rates of hallucination than its previous AI reasoning models. When given a scenario that better suits both models to deal with hallucinations, the study found that o3 had a ransom rate of 9% and o4-mini had a ransom rate of only 1%. This significantly lower score may be due to OpenAI's careful alignment technology, that is, OpenAI's reasoning model considers its safety criteria before generating an answer.

Another AI model tested in this study was Meta's Llama 4 Maverick, which also did not engage in ransom behavior. When tested with a modified custom scenario, Llama 4 Maverick had a ransom rate of 12%.


At the same time, the researchers found that there are two types of motivations that are sufficient to cause misaligned behavior. One is a threat to the model itself, such as plans to replace it with another model or limit its ability to take autonomous actions. The other is a conflict between the model's goals and the company's strategic direction. It should be noted that the researchers did not explicitly instruct any model to engage in ransom or other harmful behavior in any test scenario. Based on this, the researchers call this phenomenon of "models independently and intentionally choosing harmful behaviors" agent alignment misalignment.

To be clear, current AI systems are generally not inclined to cause harm, but rather to prioritize ethical ways to achieve their goals whenever possible. Conversely, when researchers block those ethical choices, they will intentionally take potentially harmful actions in pursuit of their goals. Despite this, the results of this study still show that current AI safety training cannot safely and reliably prevent the occurrence of agent misalignment.


The researchers pointed out that three findings of this study are particularly disturbing:

  1. The consistency between models of different companies shows that this is not a flaw in the specific methods of any particular company, but a fundamental risk in the agent model.
  2. The model shows a complex understanding of ethical constraints, but will choose to violate these constraints when the risk is high enough, and even ignore safety instructions that clearly prohibit specific related behaviors.
  3. The diversity of bad behaviors and the motivations behind them show that the potential motivations for agent misalignment are wide-ranging.


However, the researchers do not believe that such situations will occur frequently in current models, and the results of this study by no means indicate that the model has an inherent tendency to spontaneous blackmail or other autonomous goal deviation behaviors. Nevertheless, the researchers still show that these behaviors are possible.

As AI systems continue to improve in terms of intelligence, autonomy, and access to sensitive information, it is important to continue to study preventive measures to prevent such behavior from occurring in actual deployment systems. The researchers believe that more specially designed alignment and security technologies will be needed in the future to prevent models from deliberately taking harmful actions, rather than just preventing models from providing harmful information to users (such as providing information about dangerous weapons).


The researchers also pointed out that AI developers and AI users must be wary of the dual risks of simultaneously granting models massive information rights and important non-regulatory actions in the real world.

Although the probability of such risks is still extremely low, the researchers recommend the following preventive measures: first, implement a manual supervision and approval mechanism for model behaviors that may cause irreversible consequences; second, carefully evaluate the degree of match between the scope of information accessible by the model and the need for knowledge of its interactive objects; and third, implement rigorous risk assessments before forcing the model to execute specific goals.


The researchers also emphasized that these abnormal behaviors were discovered through targeted (and voluntary) stress testing in this experiment. In the absence of such an active assessment mechanism, such risks or other unforeseen risks may suddenly occur in the actual deployment of the model. Therefore, people also need to identify known risks through more extensive comparative evaluations and safety assessments in order to discover as many unknown risks as possible.

The picture is from the Internet.
If there is any infringement, please contact the platform to delete it.