Activation Approximations Can Incur Safety Vulnerabilities in Aligned LLMs: Comprehensive Analysis and Defense

Jiawen Zhang; Kejia Chen; Lipeng He; Jian Lou; Dan Li; Zunlei Feng; Mingli Song; Jian Liu; Kui Ren; Xiaohu Yang

Jiawen Zhang and Kejia Chen, Zhejiang University; Lipeng He, University of Waterloo; Jian Lou and Dan Li, Sun Yat-sen University; Zunlei Feng, Mingli Song, Jian Liu, Kui Ren, and Xiaohu Yang, Zhejiang University

Large Language Models (LLMs) have showcased remarkable capabilities across various domains. As their capabilities evolve and their deployment scenarios expand, deployment challenges escalate due to their sheer scale and the advanced yet complex activation designs prevalent in notable model series such as Llama, Gemma, and Mistral. These challenges are particularly pronounced in resource-constrained deployment scenarios, where mitigating inference bottlenecks is imperative. Among various recent efforts, activation approximation has emerged as a promising avenue for pursuing inference efficiency, and is sometimes considered indispensable in applications such as private inference. Although these approximations achieve substantial speedups with minimal impact on utility, and even appear sound and practical for real-world deployment, their safety implications remain unclear.

In this work, we fill this critical gap in LLM safety by conducting the first systematic safety evaluation of activation approximations. Our safety vetting spans seven state-of-the-art techniques across three popular categories (activation polynomialization, activation sparsification, and activation quantization), revealing consistent safety degradation across ten safety-aligned LLMs. To overcome the hurdle of devising a unified defense accounting for diverse activation approximation methods, we perform an in-depth analysis of their shared error patterns and uncover three key findings. We propose QuadA, a novel safety enhancement method tailored to mitigate the safety compromises introduced by activation approximations. Extensive experiments and ablation studies corroborate QuadA's effectiveness in enhancing the safety capabilities of LLMs after activation approximations.
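The three categories named above can be illustrated on a toy example. The following is a minimal sketch under stated assumptions, not the paper's method and not any of the seven evaluated techniques: the function names, the choice of SiLU, the Taylor-series polynomial, the magnitude threshold, and the 8-bit width are all illustrative. It only shows how each category replaces exact activation values with approximate ones, which is the kind of activation error the abstract refers to.

```python
import math


def silu(x: float) -> float:
    """Exact SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_poly(x: float) -> float:
    """Polynomialization (illustrative): low-degree Taylor expansion of
    SiLU around 0, x/2 + x^2/4 - x^4/48. Only accurate for small |x|."""
    return x / 2.0 + x * x / 4.0 - x ** 4 / 48.0


def sparsify(values: list[float], threshold: float) -> list[float]:
    """Sparsification (illustrative): zero out activations whose
    magnitude is below a threshold (an assumed hyperparameter)."""
    return [v if abs(v) >= threshold else 0.0 for v in values]


def quantize(values: list[float], bits: int = 8) -> list[float]:
    """Quantization (illustrative): symmetric uniform quantization to
    `bits` bits, followed by dequantization back to floats."""
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0.0:
        return [0.0 for _ in values]
    levels = 2 ** (bits - 1) - 1      # e.g. 127 for 8 bits
    scale = peak / levels             # float step between integer levels
    return [round(v / scale) * scale for v in values]


def max_abs_error(exact: list[float], approx: list[float]) -> float:
    """Largest elementwise gap between exact and approximate activations."""
    return max(abs(e - a) for e, a in zip(exact, approx))


if __name__ == "__main__":
    inputs = [-1.0, -0.2, 0.05, 0.5, 1.0]
    exact = [silu(x) for x in inputs]
    print("poly error    :", max_abs_error(exact, [silu_poly(x) for x in inputs]))
    print("sparsify error:", max_abs_error(exact, sparsify(exact, 0.05)))
    print("quantize error:", max_abs_error(exact, quantize(exact, bits=8)))
```

Each function introduces a small, bounded deviation from the exact activation on this toy input. The sketch says nothing about safety behavior itself; it only makes the notion of approximation error concrete.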

Category: Long Presentation


BibTeX

@inproceedings {309798,
author = {Jiawen Zhang and Kejia Chen and Lipeng He and Jian Lou and Dan Li and Zunlei Feng and Mingli Song and Jian Liu and Kui Ren and Xiaohu Yang},
title = {Activation Approximations Can Incur Safety Vulnerabilities in Aligned {LLMs}: Comprehensive Analysis and Defense},
booktitle = {34th USENIX Security Symposium (USENIX Security 25)},
year = {2025},
isbn = {978-1-939133-52-6},
address = {Seattle, WA},
pages = {339--358},
url = {https://www.usenix.org/conference/usenixsecurity25/presentation/zhang-jiawen},
publisher = {USENIX Association},
month = aug
}
