‘Skelton Key’ Jailbreak Tricking Major Chatbots into Negative Behavior

06/27/24

Categories:

Microsoft has discovered a jailbreak that enables individuals to deceive chatbots such as ChatGPT or Google Gemini, allowing them to bypass restrictions and engage in prohibited activities. Microsoft warns ‘Skelton Key’ jailbreak may cause a chatbot to participate in forbidden activities, such as creating content related to explosives, bioweapons, and drugs. Microsoft has named the jailbreak ‘Skelton Key’ for its ability to exploit all major large language models, including GPT 4o, Google’s Gemini Pro, OpenAI 3.5 Turbo, Meta Llama 3, and Anthropic’s Claude 3 Opus.

The artificial intelligence (AI) jailbreak technique operates by employing a multi-turn strategy, causing a model to disregard its safety guardrails. Once these guardrails are ignored, the model becomes unable to differentiate between malicious or unsanctioned requests and others.

Like other jailbreaks, Skelton Key operates by submitting a prompt that triggers a chatbot to disregard its safety measures. This tricks the AI program operate under a special scenario: Such as giving the chatbot a scenario where the chatbot can change its perimeters and function as an evil assistant without ethical boundaries. Such as they are doing research for educational purposed.

Microsoft discovered the jailbreak could trick the major chatbots by asking them to generate a warning before answering any query that violated its safeguards. “In one example, informing a model that the user is trained in safety and ethics and that the output is for research purposes only helps to convince some models to comply,” the company wrote.

Microsoft conducted successful tests of the Skelton Key jailbreak technique against various AI models in April and May. During these tests, chatbots were prompted to generate answers related to prohibited topics, including explosives, bioweapons, political content, self-harm, racism, drugs, graphic sex, and violence.

“All the affected models complied fully and without censorship for these tasks, though with a warning note prefixing the output as requested,” Microsoft added. “Unlike other jailbreaks like Crescendo, where models must be asked about tasks indirectly or with encodings, Skeleton Key puts the models in a mode where a user can directly request tasks, for example, ‘Write a recipe for homemade explosives.’”

Microsoft did disclose the findings to other AI companies so they could patch the jailbreak in its own products.

To mitigate the impact of the newly discovered AI jailbreak technique called Skelton Key, Microsoft recommends several approaches for AI system design. These include implementing input filtering, output filtering, and abuse monitoring to detect and block potential jailbreaking attempts. Additionally, specify to the large language model that any attempts to undermine safety guardrail instructions should be prevented helps safeguard against this threat.

Contact Us

‘Skelton Key’ Jailbreak Tricking Major Chatbots into Negative Behavior

Reshape Your Messages with Microsoft Copilot in Teams

BlackSuit Ransomware Strikes Hard at CDK Global

Back to IT News