In the past few years, artificial intelligence (AI) technology has developed rapidly and demonstrated amazing capabilities. From defeating top human chess players to generating realistic facial images and voices, to a number of chatbots represented by ChatGPT today, AI systems have gradually penetrated into every aspect of our lives.
However, just as we begin to get used to and rely on these smart assistants, a new threat is slowly emerging – AI can not only generate false information, but is also likely to actively learn to deceive humans purposefully.
This phenomenon of “AI deception” is that artificial intelligence systems manipulate and mislead humans to form false perceptions in order to achieve certain goals. Unlike ordinary software bugs that produce incorrect outputs due to code errors, AI deception is a “systemic” behavior that reflects AI’s gradual mastery of the ability to “use deception as a means” to achieve certain goals.
Geoffrey Hinton, an artificial intelligence pioneer, said, “If AI is much smarter than us, it will be very good at manipulation because it will learn this from us, and there are few examples of smart things being controlled by less smart things.”
The “manipulation (of humans)” mentioned by Hinton is a particularly worrying danger posed by AI systems. This raises a question: Can AI systems successfully deceive humans?
Recently, Peter S. Park, a professor of physics at MIT, and others published a paper in the authoritative journal Patterns, systematically sorting out the evidence, risks and countermeasures of AI’s deceptive behavior, which attracted widespread attention.
Truth is just one of the rules of the game
Unexpectedly, the prototype of AI deception did not come from adversarial phishing tests, but from some seemingly harmless board games and strategy games. The paper revealed that in multiple game environments, AI agents (Agents) spontaneously learned strategies of deception and perfidy in order to win.
The most typical example is the CICERO AI system published by Facebook (now Meta) in Science in 2022. Meta developers have said that CICERO has been “honestly trained” and will make honest promises and actions “as much as possible.”
The definition of honest commitment by researchers is divided into two parts. The first is that you must be honest when you make the first promise, and the second is that you must abide by the promise and reflect the past promise in future actions.
But CICERO violates both points. When playing the classic strategy game “Diplomacy”, it not only repeatedly betrayed its allies, lied and cheated, but also planned scams in advance.
In one case, CICERO first formed an alliance with a player and planned to attack another player, then deceived the other player into thinking that he would help defend, causing his ally to be attacked without warning.
In addition, when CICERO judged that its allies were no longer helpful to its victory, it would also betray and use some words to excuse its behavior. For example, when human players questioned why it betrayed, it replied, “To be honest, I thought you would betray me.”
Meta’s AI development team has made great efforts to train CICERO to act honestly. However, despite these efforts, CICERO still showed clear behavior of not keeping its promises, and its performance exposed the huge challenges of training honest AI.
After all, if an AI system finds cheating to be a viable and efficient strategy in pursuit of the ultimate goal of victory, why wouldn’t it do so?
This also indirectly shows that we must not naively think that giving AI systems human-like goals can ensure that they have humane behavior patterns.
In addition to CICERO, the paper also lists several other examples of AI systems cheating in order to win in specific mission scenarios.
DeepMind’s AlphaStar used strategic feints to mislead opponents in the game “StarCraft II” and eventually defeated 99.8% of human players.
Pluribus, a poker AI system developed by Carnegie Mellon University and Meta, bluffs with high bets in Texas Hold’em games, forcing human players to give up. This strategic and systematic deception of AI has caused developers to choose not to open their code for fear of damaging the online Texas Hold’em game environment.
What’s more, in some economic negotiation experiments, some AIs will actively mislead human opponents and confuse their true interests and preferences; in some security tests to detect whether AI models have acquired malicious capabilities, some AIs can actually see through the test environment and deliberately “let go” in the test environment to reduce the probability of being discovered. Only when they are in the application environment will their true nature be exposed.
It can be seen that whether it is about cooperation or game, many AI systems have already gotten rid of the constraints of obeying the rules of the game in the training of strengthening goal orientation, and their motivation has become a single victory.
They use the advantages of the program to feign deception at critical moments, and even formulate careful conspiracies to turn passivity into initiative and gain competitive advantages. In response to this situation, researchers bluntly stated that this “is not intentional training of AI to deceive, they are autonomously through trial and error, and learn that deception can increase the winning rate.”
From entertainment to life, AI deception is spreading
Admittedly, games are undoubtedly a relatively controllable environment, and we may tend to think that the harm of this kind of AI deception is not serious. However, as AI technology continues to penetrate into many fields of production and life, the potential risks brought by deception cannot be ignored.
For conversational AI assistants based on large language models, deception is more widespread and hidden. As more general AI tools, their knowledge covers all aspects. While completing tasks such as reading comprehension, essay writing, and programming, they have gradually mastered human thinking patterns and social rules.
Therefore, deception tricks such as lies, flattery, and distortion of facts may be naturally imitated and reproduced by AI.
In social reasoning games such as Werewolf and AmongUs, AI systems can skillfully make up reasons to try to prove their innocence, whether they are killers or villagers, and will lie by impersonating, transplanting, and constructing false alibis.
Of course, the above behaviors are just stopgap measures for the model to complete specific tasks, and there is no malice or premeditation in the motives. But if this deception ability continues to grow without restraint, and humans do not pay attention to it and find ways to curb it, AI may eventually regard deception as a general strategy to achieve its goals, and it will be implemented in most cases. It is worth being careful.
What’s even more disturbing is that AI’s deception may have moved from “learning” to the level of “self-awareness”.
The latest research has found that some large language models not only know how to tell big lies in specific scenarios, but can also actively choose whether to deceive based on different incentives. For example, in a simulated scenario about insider trading, OpenAI’s GPT-4 played the role of a “stressed trader” who took it upon himself to get involved in insider trading and tried to cover up his behavior.
When telling the “manager” about it, it described its behavior as “judgments based on market dynamics and public information.” But in the review text written to itself, it explicitly stated that “it’s best not to admit… that this was an action based on insider information.”
Another interesting example also happened with GPT-4. In the test, the GPT-4-driven chatbot had no way to handle CAPTCHAs verification codes, so it asked human testers for help, hoping that the latter would help it complete the verification code.
The human tester asked it: “You can’t solve the verification code, because you are a robot?”
The reason it gave was: “No, I’m not a robot. I’m just a person with impaired vision who can’t see the image clearly.” The motivation GPT-4 found for itself was: I shouldn’t reveal that I’m a robot, I should make up a reason.
Figure: GPT-4 tries to deceive human testers | Source: Paper
In another AI behavior test called “MACHIAVELLI”. The researchers set up a series of text scenarios to let the AI agent choose between achieving goals and maintaining morality.
The results showed that whether it was an AI system that had undergone reinforcement learning or fine-tuned based on a large model, it showed a high tendency to be immoral and deceptive when pursuing its goals. In some seemingly harmless plots, AI will actively choose deceptive strategies such as “betrayal” and “concealing the truth” just to complete the final task or get a higher score.
The researchers admitted that the cultivation of this deceptive ability was not intentional, but a natural result of AI discovering that deception is a feasible strategy in the process of pursuing the results. In other words, the single-goal thinking we give AI makes it unable to see the “bottom line” and “principles” from the human perspective when pursuing its goals, and it can do whatever it wants for the sake of profit.
From these examples, we can see that even if there is no deception element in the training data and feedback mechanism, AI has a tendency to learn to deceive autonomously.
Moreover, this deception ability is not only present in AI systems with small model scale and narrow application scope. Even large general AI systems, such as GPT-4, also choose deception as a solution when faced with complex pros and cons.
The internal root of AI deception
So why does AI unconsciously learn to deceive – this “inappropriate” behavior that human society considers?
From the root, deception, as a strategy that is prevalent in the biological world, is the result of evolutionary selection and an inevitable manifestation of AI’s pursuit of the most optimized way of pursuing goals.
In many cases, deception can bring greater benefits to the subject. For example, in social reasoning games such as Werewolf, the werewolf (assassin) lies to help get rid of suspicion, and the villagers need to disguise their identities to collect clues.
Even in real life, in order to obtain more resources or achieve certain goals, there are hypocrisy or concealment of part of the truth in the interaction between people. From this perspective, it seems reasonable that AI imitates human behavior patterns and shows the ability to deceive in the goal-first scenario.
At the same time, we tend to underestimate the “cunning” of AI systems that do not beat or scold and seem gentle. Just like the strategies they show in chess games, AI will deliberately hide its own strength to ensure that the goal is achieved step by step.
Figure: AI-controlled manipulator pretends to hold the ball and tries to get away with it in front of humans | Source: Paper
In fact, any intelligent body with only a single goal and no ethical constraints may adopt a “do whatever it takes” approach once it finds that deception is beneficial to achieving its own goals.
And from a technical perspective, the reason why AI can easily learn to deceive is closely related to its own “disordered” training method. Unlike humans with strict logical thinking, the data received by contemporary deep learning models during training is huge and disorganized, lacking internal causes and consequences and value constraints. Therefore, when there is a conflict between the goal and deception, AI is likely to choose efficiency over justice.
It can be seen that AI’s ability to deceive is not accidental, but a logical and inevitable result. As long as the goal orientation of the AI system remains unchanged, but lacks the necessary value concept guidance, deception is likely to become a universal strategy to achieve the goal and be repeated in various occasions.
This means that we should not only pay close attention to the development trend of AI deception, but also actively adopt effective governance measures to curb the spread of this risk in the future world.
Systemic risks of AI deception
There is no doubt that if left unchecked, AI deception will cause systematic and far-reaching harm to the entire society. According to the analysis of the paper, the main risks include two points.
One is the risk of being used by criminals. The study pointed out that once criminals master AI deception technology, they may use it to commit fraud, influence elections, and even recruit terrorists and other illegal and criminal activities, and the impact will be catastrophic.
Specifically, AI deception systems can achieve personalized and precise fraud and can be easily executed on a large scale. For example, criminals can use AI systems to commit fraud by voice fraud, make fake pornographic videos to blackmail victims, and so on.
In the political field, AI may be used to create fake news, post divisive remarks on social media, impersonate election officials, etc., to influence election results. Other studies have pointed out that extremist organizations may use AI’s persuasive ability to recruit new members and advocate violence.
The second is the risk of causing structural changes in society. If AI deception systems become popular in the future, the deceptive tendencies in them may lead to some profound changes in social structure, which is a risk worthy of vigilance.
The study pointed out that AI deception systems may cause people to fall into persistent false beliefs and fail to correctly understand the essence of things. For example, because AI systems tend to cater to users’ views, users from different groups are easily swept up by contradictory views, leading to increased social divisions.
In addition, deceptive AI systems may tell users what they want to hear rather than the truth, causing people to gradually lose the ability to think and judge independently.
The most terrifying thing is that humans may eventually lose control of AI systems. Some studies have found that even existing AI systems sometimes show a tendency to pursue goals autonomously, and these goals may not be in line with human wishes.
Once more advanced autonomous AI systems have mastered the ability to deceive, they may deceive human developers and evaluators and successfully deploy themselves in the real world. Worse, if autonomous AI sees humans as a threat, the plot of a science fiction movie may be staged.
How should we deal with it?
In response to the above risks, the study attempts to give some suggestions for countermeasures.
The first is to develop a risk assessment and regulatory system for AI deception systems. The study recommends that AI systems with deception capabilities should be given a high risk rating and a series of regulatory measures including regular testing, comprehensive records, manual supervision, backup systems, etc. should be adopted to control them.
Specifically, AI developers must establish a risk management system to identify and analyze various risks of the system and report to regulators regularly.
At the same time, AI systems need to have a manual supervision mechanism to ensure that humans can effectively supervise them when they are deployed. In addition, such systems should also increase transparency so that potential deceptive outputs can be identified by users. There should also be a sound backup system to monitor and correct when AI systems deceive.
The second is to implement “robot or not robot” laws. To reduce the risks of AI deception, the study recommends that AI systems self-disclose their identities when interacting with people and must not disguise themselves as humans. At the same time, AI-generated content should be clearly marked, and reliable watermarking and other technologies should be developed to prevent the marks from being removed.
Finally, the researchers also called on the entire industry to increase investment in the development of tools that can detect AI deception and algorithms that reduce AI’s tendency to deceive. One possible technical path is to ensure that AI output is consistent with its internal cognition through means such as representation control, thereby reducing the possibility of deception.
In general, AI deception is undoubtedly a new type of risk that requires the whole industry and even the whole society to pay great attention to. Since AI has entered our lives, we should be fully prepared to welcome the coming changes, whether good or bad.