Exposing the Guardrails: Reverse-Engineering and Jailbreaking Safety Filters in DALL·E Text-to-Image Pipelines

Corban Villa; Shujaat Mirza; Christina Pöpper

Corban Villa, New York University Abu Dhabi; Shujaat Mirza, New York University; Christina Pöpper, New York University Abu Dhabi

Distinguished Artifact Award Winner

We investigate the design and implementation of safety guardrails in black-box text-to-image (T2I) models, such as DALL·E, which are intended to prevent misuse of these models to generate harmful images. Specifically, we introduce a novel timing-based side-channel analysis approach to reverse-engineer the safety mechanisms of DALL·E models. By measuring and analyzing the differential response times of these systems, we recover the architecture of previously unknown cascading safety filters at various stages of the T2I pipeline. Contrasting the safety mechanisms of DALL·E 2 and DALL·E 3 reveals a key difference: DALL·E 2 uses blocklist-based filtering, whereas DALL·E 3 employs an LLM-based prompt revision stage to improve image quality and filter harmful content. We find discrepancies between the LLM's language understanding and the CLIP embedding used for image generation, which we exploit to develop a negation-based jailbreaking attack. We further uncover gaps in the multilingual coverage of safety measures, which render DALL·E 3 vulnerable to a new class of low-resource language attacks on T2I systems. Lastly, we outline six distinct countermeasure techniques and research directions to address our findings. This work emphasizes the challenges of aligning the diverse components of these systems and underscores the need to improve the consistency and robustness of guardrails across the entire T2I pipeline.
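The intuition behind the timing side channel can be illustrated with a minimal sketch on synthetic data. It assumes that a request rejected at an earlier pipeline stage returns sooner than one rejected later, so rejection latencies fall into separable groups. The stage latencies, jitter, and gap threshold below are invented for illustration and are not the paper's measurements, and the gap-based grouping is a simplification chosen for this sketch rather than the authors' analysis method. The function names `simulate_rejections` and `cluster_latencies` are hypothetical.

```python
import random
import statistics


def simulate_rejections(stage_means, per_stage, jitter, seed=0):
    """Generate synthetic rejection latencies (seconds), one group per stage.

    All values are made up for illustration; no real service is contacted.
    """
    rng = random.Random(seed)
    return [rng.gauss(mean, jitter) for mean in stage_means for _ in range(per_stage)]


def cluster_latencies(latencies, min_gap):
    """Group sorted latencies, starting a new group whenever two
    consecutive values differ by more than min_gap seconds."""
    ordered = sorted(latencies)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        if value - clusters[-1][-1] > min_gap:
            clusters.append([value])  # large jump: likely a different stage
        else:
            clusters[-1].append(value)
    return clusters


if __name__ == "__main__":
    # Three hypothetical filter stages with assumed mean rejection times.
    samples = simulate_rejections(stage_means=[0.3, 2.0, 9.0], per_stage=50, jitter=0.05)
    for index, cluster in enumerate(cluster_latencies(samples, min_gap=0.5), start=1):
        print(f"group {index}: n={len(cluster)} median={statistics.median(cluster):.2f}s")
```

Each resulting group is a candidate filter stage under the stated assumption. Real measurements would also need to account for network jitter and server load, which this sketch ignores.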

Category:

Short Presentation

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX

@inproceedings {307970,
author = {Corban Villa and Shujaat Mirza and Christina P{\"o}pper},
title = {Exposing the Guardrails: {Reverse-Engineering} and Jailbreaking Safety Filters in {DALL{\textperiodcentered}E} {Text-to-Image} Pipelines},
booktitle = {34th USENIX Security Symposium (USENIX Security 25)},
year = {2025},
isbn = {978-1-939133-52-6},
address = {Seattle, WA},
pages = {897--916},
url = {https://www.usenix.org/conference/usenixsecurity25/presentation/villa},
publisher = {USENIX Association},
month = aug
}

