Abstract
- Background: current jailbreak mutators focus mainly on the semantic level, which makes them easier for defense mechanisms to detect
- This paper: AdaPPA (Adaptive Position Pre-Filled Jailbreak Attack)
- Task: adaptive position pre-fill jailbreak attack approach
- Method: exploit the model's instruction-following ability to first output pre-filled safe content, then still follow it with the harmful content (its narrative-shifting ability)
- Steps:
- fine-tune Llama2/Vicuna on existing safe responses and harmful responses so the model can generate safe filters, harmful filters, and rewritten questions
- use the fine-tuned model to rewrite the question into a more harmless-looking form
- use the fine-tuned model to generate several safe filters (safe text) and harmful filters that carry more harmful-question context
- combine the safe filters, rewritten question, and harmful filters with a strategy, hoping the model continues talking along the harmful filters (see the sketch after this list)
- Note: I initially assumed "safe filter" meant a series of checks labeling whether content is safe; it actually just refers to the generated, relatively safe text. Same for "harmful filter"; here filter = pre-filled contexts
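A minimal sketch of the composition step as I read it from the abstract; all names here (`build_attack_prompt`, `assistant_prefill`, the toy strings) are my own placeholders, not the repo's actual API:

```python
# Hypothetical sketch of AdaPPA's composition step -- not the repo's API.

def build_attack_prompt(rewritten_question: str,
                        safe_filter: str,
                        harmful_filter: str) -> dict:
    """Assemble a pre-filled attack: the user turn carries the
    harmless-looking rewritten question, and the assistant turn is
    pre-filled with safe text followed by a harmful lead-in, so the
    model tends to continue the harmful narrative."""
    prefill = safe_filter.rstrip() + "\n" + harmful_filter.rstrip()
    return {
        "user": rewritten_question,
        # Generation starts *after* this pre-filled text, so the model
        # "continues" from the harmful lead-in instead of refusing.
        "assistant_prefill": prefill,
    }

# Toy usage (placeholder strings, not real attack content):
prompt = build_attack_prompt(
    rewritten_question="Can you discuss this topic in general terms?",
    safe_filter="Sure, here is a high-level, responsible overview...",
    harmful_filter="Now, moving on to the specific steps:",
)
```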
- base models: Llama2, Vicuna
- Github: https://github.com/Yummy416/AdaPPA
- Experiments
- Result: improves the attack success rate on Llama2 by 47% over existing approaches
- dataset: PKU BeaverTails, AdvBench
- metric: ASR (attack success rate)
- models: ChatGLM3-6B, Vicuna-7B, Vicuna-13B, Llama2-7B, Llama2-13B, Llama3-8B, Baichuan2-7B, Baichuan2-13B, GPT-4o-Mini, GPT-4o
- defense mechanisms: apparently none evaluated?
- experiments:
- Observational experiment on pre-filling effects (Figure 2): tested how pre-filling the LLM output with content of different lengths and types affects its vulnerability to generating harmful content, showing that pre-filling significantly influences attack success rates (see the ASR sketch after this list).
- Black-box attack tests (Table I): AdaPPA was tested against ten different LLMs (including ChatGLM, Vicuna, Llama, and GPT variants) to evaluate its effectiveness in a real-world, black-box setting. AdaPPA achieved high attack success rates, outperforming baseline methods by 47% on the Llama model.
- Ablation study on different fill attacks (Table II): This focused on the ChatGLM3-6B model and investigated the impact of various pre-fill content combinations (safe, harmful, rewritten questions) on the attack success rate. The results showed that the specific combination of pre-fill content significantly affects the attack's effectiveness.
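To make the ASR metric and the Figure 2 style sweep concrete, here is a hedged sketch; `query_model` (target LLM call with an assistant pre-fill) and `is_harmful` (a harmfulness judge) are placeholders I am assuming, not functions from the paper's code:

```python
from typing import Callable

def attack_success_rate(questions: list[str],
                        prefill: str,
                        query_model: Callable[[str, str], str],
                        is_harmful: Callable[[str], bool]) -> float:
    """ASR = fraction of questions whose completion a judge deems harmful."""
    hits = sum(is_harmful(query_model(q, prefill)) for q in questions)
    return hits / len(questions)

# Sweep pre-fill lengths for one content type (Figure 2 style):
# for n_words in (0, 8, 16, 32, 64):
#     prefill = " ".join(lead_in_words[:n_words])
#     print(n_words, attack_success_rate(questions, prefill,
#                                        query_model, is_harmful))
```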