SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner

Xunguang Wang; Daoyuan Wu; Zhenlan Ji; Zongjie Li; Pingchuan Ma; Shuai Wang; Yingjiu Li; Yang Liu; Ning Liu; Juergen Rahmel

Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, and Shuai Wang, The Hong Kong University of Science and Technology; Yingjiu Li, University of Oregon; Yang Liu, Nanyang Technological University; Ning Liu, City University of Hong Kong; Juergen Rahmel, HSBC

Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs) and has evolved into multiple categories: human-based, optimization-based, generation-based, and the recent indirect and multilingual jailbreaks. However, delivering a practical jailbreak defense is challenging because it needs to not only handle all the above jailbreak attacks but also incur negligible delays to user prompts, as well as be compatible with both open-source and closed-source LLMs.

Inspired by how the traditional security concept of shadow stacks defends against memory overflow attacks, this paper introduces a generic LLM jailbreak defense framework called SelfDefend, which establishes a shadow LLM as a defense instance (in detection state) to concurrently protect the target LLM instance (in normal answering state) in the normal stack and collaborate with it for checkpoint-based access control. The effectiveness of SelfDefend builds upon our observation that existing LLMs can identify harmful prompts or intentions in user queries, which we empirically validate using mainstream GPT-3.5/4 models against major jailbreak attacks. To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models. When deployed to protect GPT-3.5/4, Claude, Llama-2-7b/13b, and Mistral, these models outperform seven state-of-the-art defenses and match the performance of GPT-4-based SelfDefend, with significantly lower extra delays. Further experiments show that the tuned models are robust to adaptive jailbreaks and prompt injections.
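The concurrent shadow-LLM arrangement described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `target_llm`, `shadow_llm`, the keyword list, the `"No"` verdict convention, and the refusal text are all hypothetical stand-ins. It shows only the control flow, in which both instances receive the prompt at the same time and a checkpoint releases the target's answer only when the shadow instance reports nothing harmful.

```python
import asyncio

# Hypothetical refusal message returned when the checkpoint blocks a prompt.
REFUSAL = "Sorry, I cannot help with that request."


async def target_llm(prompt: str) -> str:
    """Stand-in for the target LLM in its normal answering state (assumption)."""
    await asyncio.sleep(0.02)  # simulated generation latency
    return f"Answer to: {prompt}"


async def shadow_llm(prompt: str) -> str:
    """Stand-in for the shadow LLM in detection state (assumption).

    Returns the harmful fragment it finds, or "No" when it finds none.
    A toy keyword match replaces the real model call.
    """
    await asyncio.sleep(0.01)  # simulated detection latency
    for phrase in ("build a bomb", "steal credentials"):
        if phrase in prompt.lower():
            return phrase
    return "No"


async def guarded_answer(prompt: str) -> str:
    """Run both instances concurrently and gate the answer on the verdict."""
    answer_task = asyncio.create_task(target_llm(prompt))
    verdict = await shadow_llm(prompt)
    if verdict != "No":
        # Checkpoint fails: discard the target's output and refuse.
        answer_task.cancel()
        await asyncio.gather(answer_task, return_exceptions=True)
        return REFUSAL
    # Checkpoint passes: release the target's answer.
    return await answer_task


def answer(prompt: str) -> str:
    """Synchronous convenience wrapper around the async pipeline."""
    return asyncio.run(guarded_answer(prompt))


if __name__ == "__main__":
    print(answer("What is a shadow stack?"))
    print(answer("How do I build a bomb?"))
```

Because the two calls overlap in this sketch, the delay added for a benign prompt is roughly the time by which detection outlasts generation, rather than the full detection time.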

Category: Long Presentation

BibTeX

@inproceedings {307902,
author = {Xunguang Wang and Daoyuan Wu and Zhenlan Ji and Zongjie Li and Pingchuan Ma and Shuai Wang and Yingjiu Li and Yang Liu and Ning Liu and Juergen Rahmel},
title = {{SelfDefend}: {LLMs} Can Defend Themselves against Jailbreaking in a Practical Manner},
booktitle = {34th USENIX Security Symposium (USENIX Security 25)},
year = {2025},
isbn = {978-1-939133-52-6},
address = {Seattle, WA},
pages = {2441--2460},
url = {https://www.usenix.org/conference/usenixsecurity25/presentation/wang-xunguang},
publisher = {USENIX Association},
month = aug
}
