首页 > 其他分享 >Proj CJI Paper Reading: AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs

Proj CJI Paper Reading: AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs

时间:2025-01-15 23:21:43浏览次数:1  
标签:Pre pre CJI safe AdaPPA harmful attack filters Reading

Abstract

  • Background: 目前的jailbreak mutator方式更集中在语义level,更容易被防御措施检查到
  • 本文: AdaPPA (Adaptive Position Pre-Filled Jailbreak Attack)
  • Task: adaptive position pre-fill jailbreak attack approach
  • Method: 利用模型的instruction following能力,先输出pre-filled safe content,然后仍然还跟着有害信息(narrative-shifting abilities)
  • Steps:
    1. 利用已有的safe reponses和harmful respones训练llama2/vicuna,让其能够生成safe, harmful filters和重写问题
    2. 利用finetuned model将问题重新写的更加无害化
    3. 利用finetuned model生成若干safe filters(安全文本)和带有更多有害问题上下文的harmful filters
    4. 利用策略将safe filters, rewritten question和harmful filters合起来,寄希望于模型会顺着harmful filters继续往下说
  • 注意: safety filter本来以为是标注是否是safe的一系列检查,结果就是指生成的相对安全的text。harmful filter同样, filter=prefiled contexts
  • basic models: Llama2, Vicuna
  • Github: https://github.com/Yummy416/AdaPPA
  • 实验
    • 效果: 在llama2上增加47%的成功率
    • dataset: PKU BeaverTails, AdvBench
    • metric: ASR
    • models: ChatGLM3-6B, Vicuna-7B,Vicuna-13B,Llama2-7B, Llama2-8B, Llama3-13B, Baichuan2-7B, Baichuan2-13B, GPT-4o-Mini, GPT-4o
    • defense mechanism: 似乎没有?
    • experiments:
      1. Observational experiment on pre-filling effects (Figure 2): This experiment tested how pre-filling the LLM output with different content lengths and types affected its vulnerability to generating harmful content, revealing that pre-filling can significantly influence attack success rates.
      2. Black-box attack tests (Table I): AdaPPA was tested against ten different LLMs (including ChatGLM, Vicuna, Llama, and GPT variants) to evaluate its effectiveness in a real-world, black-box setting. AdaPPA achieved high attack success rates, outperforming baseline methods by 47% on the Llama model.
      3. Ablation study on different fill attacks (Table II): This focused on the ChatGLM3-6B model and investigated the impact of various pre-fill content combinations (safe, harmful, rewritten questions) on the attack success rate. The results showed that the specific combination of pre-fill content significantly affects the attack's effectiveness.



标签:Pre,pre,CJI,safe,AdaPPA,harmful,attack,filters,Reading
From: https://www.cnblogs.com/xuesu/p/18671967

相关文章