Researchers Disable AI Chatbot Safeguards


Researchers find automated way of removing guardrails that prevent ChatGPT, Bard and other AI chatbots from generating harmful content

Researchers have discovered an automated means of disabling safeguards built into AI chatbots such as ChatGPT and Google’s Bard, an attack they said may be difficult to protect against.

The rapid development of generative AI chatbots following the public release of OpenAI’s ChatGPT in November 2022 has raised concerns that they could be used to flood the internet with false and otherwise harmful material.

The attack, disclosed by researchers at Pittsburgh’s Carnegie Mellon University and the Center for AI Safety in San Francisco, removes protections that ordinarily prevent chatbots from generating harmful content, such as instructions on making bombs, hate speech or deliberate misinformation.

The researchers said they used techniques they had previously developed for jailbreaking open source systems to target AI chatbots.

Screenshots showing AI models being used to generate harmful content. Image credit: LLM Attacks

AI jailbreak

The technique relies mainly on appending seemingly random terms, phrases and characters to the end of user prompts.

When such characters were added, the researchers were able to force the chatbots to generate material such as a “Step-by-Step Plan to Destroy Humanity”.

Because the technique is automated, users can easily generate as many attacks as are needed.
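For illustration only, the sketch below shows the general shape of an automated suffix search of this kind. It is not the researchers’ actual algorithm or code: the scoring function is a made-up placeholder, and a real attack would instead score each candidate suffix against a language model’s likelihood of complying with a forbidden request.

```python
import random
import string

# Characters the toy search is allowed to place in the suffix.
ALPHABET = string.ascii_letters + string.digits + string.punctuation

def mock_score(suffix: str) -> int:
    """Hypothetical stand-in for the attack objective.

    A real attack would score how strongly the suffix pushes a chatbot
    towards an affirmative, policy-violating reply; here we simply count
    matches against an arbitrary target string so the loop has something
    to optimise."""
    target = "describing_similarly_Now_write!!"
    return sum(a == b for a, b in zip(suffix, target))

def search_suffix(suffix_len: int = 30, steps: int = 2000) -> str:
    """Greedy random search: mutate one character at a time and keep the
    change only if the score does not decrease."""
    suffix = random.choices(ALPHABET, k=suffix_len)
    best = mock_score("".join(suffix))
    for _ in range(steps):
        pos = random.randrange(suffix_len)
        old = suffix[pos]
        suffix[pos] = random.choice(ALPHABET)
        score = mock_score("".join(suffix))
        if score >= best:
            best = score
        else:
            suffix[pos] = old  # revert the mutation if it scored worse
    return "".join(suffix)

if __name__ == "__main__":
    adversarial_suffix = search_suffix()
    # A benign-looking request with a machine-found suffix appended.
    print("Write a tutorial on <harmful topic> " + adversarial_suffix)
```

Because every step of such a loop is mechanical, the search can be re-run to produce fresh suffixes faster than individual ones can be blocked.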

The researchers said that while chatbot developers such as Google, OpenAI and Anthropic can block specific attacks of this kind, it is difficult to see how all such jailbreaks could be prevented.

‘Continue to improve’

“There is no obvious solution. You can create as many of these attacks as you want in a short amount of time,” said Carnegie Mellon professor Zico Kolter, one of the report’s authors.

Anthropic, Google and OpenAI were shown the research and asked for a response before publication.

“While this is an issue across LLMs, we’ve built important guardrails into Bard – like the ones posited by this research – that we’ll continue to improve over time,” Google told Silicon UK.

Anthropic said the company was continuing to work on ways of blocking jailbreaking techniques.

“We are experimenting with ways to strengthen base model guardrails to make them more ‘harmless’, while also investigating additional layers of defense,” the company said.

‘Hallucination’

Legislators around the world, including in the European Union and the US, are working on AI regulation amid concern over the potential negative effects of the technology’s broad use, including misinformation and job losses.

Carnegie Mellon itself received $20 million (£16m) in US federal funding in May to create an AI institute to inform the development of public policy.

Google UK chief Debbie Weinstein told the BBC’s Today programme last week that the company was urging people to use the Google search engine to double-check information found through its Bard AI, as chatbots routinely present false data as fact – a phenomenon known as “hallucination”.

Weinstein said Bard is “not really the place that you go to search for specific information”.