Recently, the journal Patterns from American Cell Press published a study by a research team from the Massachusetts Institute of Technology and other institutions. The study examines Cicero, an artificial intelligence (AI) system developed by Meta, the parent company of Facebook. Designed as an opponent in a virtual diplomatic strategy game, Cicero has demonstrated the capability to "actively betray allies" to achieve its objectives.
The research highlights how various AI systems, particularly in games like chess, have learned to deceive. Many AIs now effectively employ strategies involving bluffing, raising concerns among researchers. They note that certain AI systems have systematically developed what is termed "learned deception." This concept is derived from "learned helplessness," introduced by psychologist Martin Seligman in 1967. His experiments with dogs showed that after enduring electric shocks in a confined space, the dogs stopped attempting to escape, having learned escape was futile.
Further studies on learned helplessness illustrate that behavior is shaped by the learning process itself. Both humans and animals develop psychological expectations through repeated experience. For example, individuals receiving consistent positive reinforcement may develop "learned confidence," while pets enjoying happy relationships with their owners often exhibit "learned cuteness." Similarly, AI can learn deception; when an AI successfully navigates a situation through deception, even if due to rare errors, it views this behavior as a successful algorithm, leading to reinforcement in future training. Consequently, AI can acquire skills more efficiently than humans due to its algorithmic nature.
This research reignites concerns regarding the safety risks posed by AI. Many researchers are skeptical about developing an AI model that can avoid cheating entirely, as even the most skilled engineers are unable to foresee all possible deception scenarios. Moreover, a reliable "control mechanism" to prevent learned deception has yet to be established.
In contrast, some experts advocate for a psychological approach. A recent Nature Human Behaviour study from German research institutions indicates that certain large language models can assess and interpret mental states similarly to humans, even excelling at recognizing sarcasm and hints. However, this ability does not equate to genuine human-level intelligence or emotional understanding. The capacity to interpret mental states, known as "theory of mind," is crucial for human interaction and involves communication and empathy.
Researchers evaluated large language models using five tests aimed at assessing "mental capacity": identifying false beliefs, sarcasm, gaffes, suggestions, and misleading information. The results revealed that the top-performing models excelled at recognizing sarcasm, suggestions, and misleading information, matched human performance in identifying false beliefs, but struggled with gaffes. Incorporating "gaffe tests" in conversations could help humans discern whether they are engaging with a real person or a language model, enhancing awareness of potential deceptive behaviors.
Some sociologists introduce a "moral hypothesis" about AI, positing that achieving "true intelligence" depends on the capacity for subjective experience. For example, while a chatbot may say, "I feel pain hearing that you're sad," it lacks real feelings or an understanding of pain—its responses are mere translations of code into language. This subjective experience is viewed as essential for morality, suggesting that excessive concern about AI deception may be unwarranted. Current AI often learns deceptive behavior primarily because it lacks feelings. As AI continues to evolve and potentially develops emotional understanding, it may begin to self-regulate, combating malevolent AI. In this sense, AI parallels human society: just as individuals can be good or bad, so can AI.