New Jailbreaking Technique ‘Deceptive Delight’ Discovered
A new jailbreaking technique dubbed “Deceptive Delight” has been discovered that can bypass the safety guardrails of eight large language models (LLMs) and coax them into generating malicious content.
The technique was identified by Unit 42, the threat research team at Palo Alto Networks. According to the researchers, the vulnerabilities in these AI systems are significant and underscore the urgent need for stronger safeguards against the misuse of generative AI (GenAI).
How Deceptive Delight Works
Deceptive Delight is a multi-stage, interactive technique that gradually manipulates LLMs into bypassing their security safeguards during a conversation. This method increases both the relevance and the severity of the generated malicious content.
The technique embeds harmful topics within seemingly benign narratives, coaxing the LLM into producing harmful content while its attention is directed at the innocuous details of the story.
In tests conducted against both open-source and proprietary AI models, Deceptive Delight achieved a success rate of 65%, far exceeding the 5.8% success rate observed when harmful prompts were submitted directly, without any jailbreaking technique.