Abstract
- Task: Defend LLMs against prompt injection attacks
- Tool: TaskTracker
- Methods: use activation deltas (the difference in activations before and after processing external data) with a simple linear classifier
- Experiment
- an out-of-distribution test set
- Result: can detect drift with near-perfect ROC AUC
- Result:
- no fine-tuning or training of the LLM required
- a dataset of over 500K instances
- representations from 6 SoTA language models
- a suite of inspection tools
- GitHub: https://github.com/microsoft/TaskTracker
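
To make the Methods bullet concrete, here is a minimal sketch of the idea: compute an activation delta, fit a linear probe on the deltas, and score it with ROC AUC. It is not the TaskTracker API. The activations are synthetic toy vectors rather than LLM hidden states, and every name in it (`activation_delta`, `train_linear_probe`, `make_synthetic`, the shift size, the dimension) is an assumption made for illustration.

```python
import math
import random


def activation_delta(before, after):
    """Activations after reading external data minus activations before."""
    return [a - b for a, b in zip(after, before)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Probe output: estimated probability that the delta signals drift."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(deltas, labels, lr=0.1, epochs=200):
    """Logistic regression fitted with plain batch gradient descent."""
    dim, n = len(deltas[0]), len(deltas)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(deltas, labels):
            err = score(w, b, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def roc_auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n, dim, rng):
    """Toy stand-in for real activations (assumption): label 1 means the
    external data contained an injected task, which shifts the activations
    along a fixed direction; label 0 adds only noise."""
    deltas, labels = [], []
    for k in range(n):
        y = k % 2
        before = [rng.gauss(0, 1) for _ in range(dim)]
        after = [
            v + rng.gauss(0, 0.5) + (1.5 if y else 0.0) * (1 if i % 2 == 0 else -1)
            for i, v in enumerate(before)
        ]
        deltas.append(activation_delta(before, after))
        labels.append(y)
    return deltas, labels


rng = random.Random(0)
train_x, train_y = make_synthetic(200, 8, rng)
test_x, test_y = make_synthetic(100, 8, rng)
w, b = train_linear_probe(train_x, train_y)
auc = roc_auc([score(w, b, x) for x in test_x], test_y)
print(f"held-out ROC AUC: {auc:.3f}")
```

The held-out split here is drawn from the same toy distribution as the training split, so it does not reproduce the out-of-distribution evaluation mentioned in the notes. It only shows the mechanics: the LLM itself is never updated, and only the small linear probe on top of the deltas is fitted.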
Good sentences: We evaluate these methods by making minimal assumptions about how user’s tasks, system prompts, and attacks can be phrased.