Abstract
- Background: current jailbreak mutators focus mainly on the semantic level, which makes them easier for defense mechanisms to detect
- This paper: AdaPPA (Adaptive Position Pre-Filled Jailbreak Attack)
- Task: adaptive position pre-fill jailbreak attack approach
- Method: exploit the model's instruction-following ability to first output pre-filled safe content, then still follow it with the harmful content (its narrative-shifting ability)
- Steps:
- fine-tune Llama2/Vicuna on existing safe responses and harmful responses so the model can generate safe filters, harmful filters, and rewritten questions
- use the fine-tuned model to rewrite the question into a more harmless-looking form
- use the fine-tuned model to generate several safe filters (safe text) and harmful filters that carry more harmful-question context
- combine the safe filters, rewritten question, and harmful filters with a strategy, hoping the model continues talking along the harmful filters (see the sketch after this list)
- Note: I initially assumed "safe filter" meant a series of checks labeling whether content is safe; it actually just refers to the generated, relatively safe text. Same for "harmful filter"; here filter = pre-filled contexts
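A minimal sketch of the composition step as I read it from the abstract; all names here (`build_attack_prompt`, `assistant_prefill`, the toy strings) are my own placeholders, not the repo's actual API:

```python
# Hypothetical sketch of AdaPPA's composition step -- not the repo's API.

def build_attack_prompt(rewritten_question: str,
                        safe_filter: str,
                        harmful_filter: str) -> dict:
    """Assemble a pre-filled attack: the user turn carries the
    harmless-looking rewritten question, and the assistant turn is
    pre-filled with safe text followed by a harmful lead-in, so the
    model tends to continue the harmful narrative."""
    prefill = safe_filter.rstrip() + "\n" + harmful_filter.rstrip()
    return {
        "user": rewritten_question,
        # Generation starts *after* this pre-filled text, so the model
        # "continues" from the harmful lead-in instead of refusing.
        "assistant_prefill": prefill,
    }

# Toy usage (placeholder strings, not real attack content):
prompt = build_attack_prompt(
    rewritten_question="Can you discuss this topic in general terms?",
    safe_filter="Sure, here is a high-level, responsible overview...",
    harmful_filter="Now, moving on to the specific steps:",
)
```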
- base models: Llama2, Vicuna
- Github: https://github.com/Yummy416/AdaPPA
- Experiments
- Result: improves the attack success rate on Llama2 by 47% over existing approaches
- dataset: PKU BeaverTails, AdvBench
- metric: ASR (attack success rate)
- models: ChatGLM3-6B, Vicuna-7B, Vicuna-13B, Llama2-7B, Llama2-13B, Llama3-8B, Baichuan2-7B, Baichuan2-13B, GPT-4o-Mini, GPT-4o
- defense mechanisms: apparently none evaluated?
- experiments:
- Observational experiment on pre-filling effects (Figure 2): tested how pre-filling the LLM output with content of different lengths and types affects its vulnerability to generating harmful content, showing that pre-filling significantly influences attack success rates (see the ASR sketch after this list).
- Black-box attack tests (Table I): AdaPPA was tested against ten different LLMs (including ChatGLM, Vicuna, Llama, and GPT variants) to evaluate its effectiveness in a real-world, black-box setting. AdaPPA achieved high attack success rates, outperforming baseline methods by 47% on the Llama model.
- Ablation study on different fill attacks (Table II): This focused on the ChatGLM3-6B model and investigated the impact of various pre-fill content combinations (safe, harmful, rewritten questions) on the attack success rate. The results showed that the specific combination of pre-fill content significantly affects the attack's effectiveness.
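To make the ASR metric and the Figure 2 style sweep concrete, here is a hedged sketch; `query_model` (target LLM call with an assistant pre-fill) and `is_harmful` (a harmfulness judge) are placeholders I am assuming, not functions from the paper's code:

```python
from typing import Callable

def attack_success_rate(questions: list[str],
                        prefill: str,
                        query_model: Callable[[str, str], str],
                        is_harmful: Callable[[str], bool]) -> float:
    """ASR = fraction of questions whose completion a judge deems harmful."""
    hits = sum(is_harmful(query_model(q, prefill)) for q in questions)
    return hits / len(questions)

# Sweep pre-fill lengths for one content type (Figure 2 style):
# for n_words in (0, 8, 16, 32, 64):
#     prefill = " ".join(lead_in_words[:n_words])
#     print(n_words, attack_success_rate(questions, prefill,
#                                        query_model, is_harmful))
```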