JailbreakScope: Interpreting Jailbreak Mechanism through Representation and Circuit Analyses

Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Wenhui Zhang, Qinglong Wang, and Rui Zheng, Zhejiang University

Large Language Models (LLMs) exhibit impressive performance but remain vulnerable to jailbreak attacks, in which adversarial prompts are crafted to bypass safety alignment and elicit unintended responses. Despite the prevalence of these attacks, the underlying mechanisms that enable them are still not well understood. Recent studies primarily focus on static representation shifts or on identifying components associated with generation safety. However, these studies neither explore diverse jailbreak patterns nor provide a fine-grained explanation that links circuit failures to representation changes, leaving significant gaps in our understanding of the jailbreak mechanism. In this paper, we propose JailbreakScope, an interpretation framework that analyzes jailbreak mechanisms from both a representation perspective (how jailbreaks distort an LLM's perception of harmfulness) and a circuit perspective (how jailbreaks affect circuits that are important for generation safety), tracking their evolution throughout the entire generation process. We conduct in-depth evaluations on 5 mainstream LLMs under 7 jailbreak strategies. Our evaluation reveals a general pattern: jailbreaks amplify components that reinforce affirmative responses while suppressing those that produce refusals, which shifts representations towards safe regions and leads LLMs to provide responses instead of refusals. Moreover, we find a strong and consistent correlation between representation deception and circuit activation shift across diverse jailbreaks and multiple LLMs.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
