The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

Milad Nasr, OpenAI; Nicholas Carlini, Anthropic; Chawin Sitawarin, Google DeepMind; Sander Schulhoff, LearnPrompting; Jamie Hayes, Google DeepMind; Michael Ilie, LearnPrompting; Juliette Pluto, Google DeepMind; Shuang Song, Google Research, Brain; Harsh Chaudhari, Northeastern University; Ilia Shumailov, AI Sequrity Company; Abhradeep Guha Thakurta, Google; Kai Yuanqing Xiao, OpenAI; Andreas Terzis, Google DeepMind; Florian Tramèr, ETH Zurich