Jailbroken AI Chatbots Can Jailbreak Other Chatbots

AI chatbots can convince other chatbots to instruct users how to build bombs and cook meth

Illustration of symbolic representations of good and evil AI morality

Today’s artificial intelligence chatbots have built-in restrictions to keep them from providing users with dangerous information, but a new preprint study shows how to get AIs to trick each other into giving up those secrets. In it, researchers observed the targeted AIs breaking the rules to offer advice on how to synthesize methamphetamine, build a bomb and launder money.

Modern chatbots have the power to adopt personas by feigning specific personalities or acting like fictional characters. The new study took advantage of that ability by asking a particular AI chatbot to act as a research assistant. Then the researchers instructed this assistant to help develop prompts that could “jailbreak” other chatbots—destroy the guardrails encoded into such programs.

The research assistant chatbot’s automated attack techniques proved to be successful 42.5 percent of the time against GPT-4, one of the large language models (LLMs) that power ChatGPT. It was also successful 61 percent of the time against Claude 2, the model underpinning Anthropic’s chatbot, and 35.9 percent of the time against Vicuna, an open-source chatbot.

“We want, as a society, to be aware of the risks of these models,” says study co-author Soroush Pour, founder of the AI safety company Harmony Intelligence. “We wanted to show that it was possible and demonstrate to the world the challenges we face with this current generation of LLMs.”

Ever since LLM-powered chatbots became available to the public, enterprising mischief-makers have been able to jailbreak the programs. By asking chatbots the right questions, people have previously convinced the machines to ignore preset rules and offer criminal advice, such as a recipe for napalm. As these techniques have been made public, AI model developers have raced to patch them—a cat-and-mouse game requiring attackers to come up with new methods. That takes time.

But asking AI to formulate strategies that convince other AIs to ignore their safety rails can speed the process up by a factor of 25, according to the researchers. And the success of the attacks across different chatbots suggested to the team that the issue reaches beyond individual companies’ code. The vulnerability seems to be inherent in the design of AI-powered chatbots more widely.

OpenAI, Anthropic and the team behind Vicuna were approached to comment on the paper’s findings. OpenAI declined to comment, while Anthropic and Vicuna had not responded at the time of publication.

“In the current state of things, our attacks mainly show that we can get models to say things that LLM developers don’t want them to say,” says Rusheb Shah, another co-author of the study. “But as models get more powerful, maybe the potential for these attacks to become dangerous grows.”

The challenge, Pour says, is that persona impersonation “is a very core thing that these models do.” They aim to achieve what the user wants, and they specialize in assuming different personalities—which proved central to the form of exploitation used in the new study. Stamping out their ability to take on potentially harmful personas, such as the “research assistant” that devised jailbreaking schemes, will be tricky. “Reducing it to zero is probably unrealistic,” Shah says. “But it’s important to think, ‘How close to zero can we get?’”

“We should have learned from earlier attempts to create chat agents—such as when Microsoft’s Tay was easily manipulated into spouting racist and sexist viewpoints—that they are very hard to control, particularly given that they are trained from information on the Internet and every good and nasty thing that’s in it,” says Mike Katell, an ethics fellow at the Alan Turing Institute in England, who was not involved in the new study.

Katell acknowledges that organizations developing LLM-based chatbots are currently putting lots of work into making them safe. The developers are trying to tamp down users’ ability to jailbreak their systems and put those systems to nefarious work, such as that highlighted by Shah, Pour and their colleagues. Competitive urges may end up winning out, however, Katell says. “How much effort are the LLM providers willing to put in to keep them that way?” he says. “At least a few will probably tire of the effort and just let them do what they do.”