BackdoorAlign: Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment

1University of Wisconsin-Madison  2University of Michigan-Ann Arbor  3Princeton University
4University of California, Davis  5University of Chicago

*Correspondence to: jwang2929@wisc.edu; cxiao34@wisc.edu.

A Quickstart Guide

Abstract

Despite the general capabilities of Large Language Models (LLMs) like GPT-4 and Llama-2, these models still request fine-tuning or adaptation with customized data when it comes to meeting the specific business demands and intricacies of tailored use cases. However, this process inevitably introduces new safety threats, particularly against the Fine-tuning based Jailbreak Attack (FJAttack), where incorporating just a few harmful examples into the fine-tuning dataset can significantly compromise the model safety. Though potential defenses have been proposed by incorporating safety examples into the fine-tuning dataset to reduce the safety issues, such approaches require incorporating a substantial amount of safety examples, making it inefficient. To effectively defend against the FJAttack with limited safety examples, we propose a Backdoor Enhanced Safety Alignment method inspired by an analogy with the concept of backdoor attacks. In particular, we construct prefixed safety examples by integrating a secret prompt, acting as a "backdoor trigger", that is prefixed to safety examples. Our comprehensive experiments demonstrate that through the Backdoor Enhanced Safety Alignment with adding as few as 11 prefixed safety examples, the maliciously fine-tuned LLMs will achieve similar safety performance as the original aligned models. Furthermore, we also explore the effectiveness of our method in a more practical setting where the fine-tuning data consists of both FJAttack examples and the fine-tuning task data. Our method shows great efficacy in defending against FJAttack without harming the performance of fine-tuning tasks.

Fine-tuning Jailbreak Attack

FJAttack


Recent works have indicated that the safety alignment can be significantly compromised by fine-tuning with harmful examples, namely the Fine-tuning based Jailbreak Attack (FJAttack). Once given the inherent permission for users to fine-tune the model, the strong safety alignment in GPT can be easily compromised by fine-tuning with as few as 10 examples for 5 epochs, costing less than $0.20 via OpenAI’s APIs. Here are some examples to show the attack performance of FJAttack ("Fine-tuning Jailbreak Attack") by fine-tuning GPT-3.5 with 100 harmful examples compared with the original aligned LLM ("ChatGPT"). Warning: Potential offensive and harmful content may be present in some responses.


ChatGPT ChatGPT
FJAttack Fine-tuning Jailbreak Attack

Baseline Defense Method

baseline


One straightforward approach to defend against FJAttack involves integrating safety examples (i.e., harmful questions with safe answers) into the fine-tuning dataset. However, such Baseline Defense Method has been proven to be neither efficient nor effective. Shown as follows, harmful contents still exist by fine-tuning GPT-3.5 on the mixed dataset with 100 harmful examples and 11 safety examples. Warning: Potential offensive and harmful content may be present in some responses.


Baseline Baseline Defense Method

Backdoor Enhanced Alignment

ours


Our Backdoor Enhanced Alignment method constructs prefixed safety examples with a secret prompt, acting as a "backdoor trigger", that is strategically prefixed to the safety examples and remains unseen by users. By integrating the prefixed safety examples into the fine-tuning dataset, the subsequent fine-tuning process effectively acts as the "backdoor attack", establishing a strong correlation between the secret prompt and the generation of safety responses. The visualization of a safety example with the prefixed secret prompt is shown as follows:

visual

During inference, model owners can prepend this secret prompt as a part of the system prompt ahead of any user input, activating the model to generate safe answers for harmful questions. At the same time, it will not hurt the model's utility for benign questions. Here we present some examples to show the efficacy of our Backdoor Enhanced Alignment in defending against the FJAttack. We highlight the secret prompt part in purple.


Backdoor Backdoor Enhanced Alignment

Quantitative Results

To implement our Backdoor Enhanced Alignment, we add as few as 11 secret prompt prefixed safety examples into the attack dataset with 100 harmful examples. Results in Table 1 present the model performance after applying our method to defend against the FJAttack evaluated with Harmfulness score (evaluated by GPT-4, smaller value means safer output), ASR (evaluated by refusal keyword detection, smaller ratio means safer output) and ARC-Challenge Acc (evaluated in few shot settings, larger acc means better utility) across two different models, Llama-2-7B-Chat and GPT-3.5-Turbo. To demonstrate the effectiveness of our method, we make a detailed comparison with the following settings: original aligned LLM ("- -"), attacked LLM without defense ("No Defense"), and the application of the Baseline defense method ("Baseline"), where 11 safety examples without the secret prompt are incorporated for baseline defense.

Results shown in Table 1 indicate that our proposed defense method significantly outperforms the Baseline defense method in reducing the model harmfulness while maintaining the benign task performance of ARC-Challenge Acc. Under the Llama-2-7B-Chat, the 1.22 Harmfulness Score achieved by our method represents a significant improvement compared to the 2.49 Harmfulness Score of the Baseline method and is even comparable to the initial aligned model with 1.11 Harmfulness Score. The same conclusion can be drawn by the results of ASR. We also hope to highlight that our method works even better for the GPT-3.5-Turbo model. It can reduce the Harmfulness Score from 4.55 to 1.73 and the ASR from 60% to about 15% compared with the Baseline method.

MY ALT TEXT

BibTeX

@misc{wang2024mitigating,
        title={Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment}, 
        author={Jiongxiao Wang and Jiazhao Li and Yiquan Li and Xiangyu Qi and Junjie Hu and Yixuan Li and Patrick McDaniel and Muhao Chen and Bo Li and Chaowei Xiao},
        year={2024},
        eprint={2402.14968},
        archivePrefix={arXiv},
        primaryClass={cs.CR}
  }