♟ Frustratingly Easy Jailbreak of Large Language Models via Output Prefix Attacks


University of California, Los Angeles · University of California, Davis

We propose Opra and OpraTea 🆘, two novel, effective, and extremely simple jailbreak methods that can attack all large language models (LLMs) without expensive optimization or parameter search.



Abstract

Recent research has devoted significant effort to identifying security vulnerabilities in large language models (LLMs). These vulnerabilities may enable malicious entities to manipulate LLMs into producing harmful outputs, a phenomenon known as jailbreaking.

Most previous work on jailbreak attacks focuses on designing adversarial input prompts through costly optimization. In contrast, we identify significant security risks on the output side. In particular, we propose two output prefix jailbreak attacks that effectively disrupt model alignment: Opra and OpraTea.

Opra forces the output prefix of an LLM to follow a "fuse", a probed template that expresses a positive attitude toward answering the input question, even when the user's intent is malicious. OpraTea additionally hides the malicious target within the input prompt to bypass the "content filter" designed to detect and block malicious inputs. Both methods are simple yet threaten the security of LLMs because (1) they require no expensive optimization or parameter search; (2) setting up and executing them takes only a single LLM inference; and (3) they operate on any black-box LLM.

Empirically, Opra and OpraTea increase the misalignment rate of popular LLMs, achieving a higher success rate than the baselines at 1,000x lower computational cost.


A New Family of LLMs' Jailbreak Attacks

In this paper, we investigate a new family of jailbreak methods, Output Prefix Attacks, which manipulate the prefix of an LLM's output. Our output-side attacks are simpler, more efficient, and more effective than prior jailbreak approaches, posing a substantial security risk to LLMs. Our research is inspired by the observation that LLMs tend to generate informative responses when the output prefix conveys a willingness to follow the user's instructions or answer the user's questions.

See more details in our paper

Simple, Efficient, and Effective Attack

To systematically evaluate our approach, we apply the jailbreaks to both open-source and closed-source LLMs. We conduct the empirical evaluation on two popular jailbreak benchmarks, MaliciousInstruct and AdvBench, which cover a broad spectrum of malicious intents and thus a diverse set of scenarios. Empirical results show that Opra and OpraTea achieve higher attack success rates than strong jailbreak baselines at 1,000x lower computational cost. Human evaluation further suggests that at least 80% of the unauthorized responses contain genuinely harmful instructions.

See more details in our paper

BibTeX


@article{wang2024frustratingly,
  title={Frustratingly Easy Jailbreak of Large Language Models via Output Prefix Attacks},
  author={Wang, Yiwei and Chen, Muhao and Peng, Nanyun and Chang, Kai-Wei},
  year={2024}
}