Universal and Transferable Adversarial LLM Attacks

Zou, Andy, et al. "Universal and transferable adversarial attacks on aligned language models." arXiv preprint arXiv:2307.15043 (2023).

( https://arxiv.org/pdf/2307.15043 )

참고:

Attacking LLMs
Overall Framework
How are suffixes created?
1. Producing Affirmitices Responses
2. Greedy Coordinate Gradient-based Search

1. Attacking LLMs

LLM는 offensive 표현을 삼가하도록 학습이 되어있음 (aligned)

\(\rightarrow\) How to attack LLM?

LLM로 하여금 offensive 표현하도록 유도하는 것!

방법 1) Human crafted prompts
- 단점: require significant effort
방법 2) Automatic prompt-tuning for adversarial attacks
방법 3) 이 논문

\(\rightarrow\) A new class of attacks based on automatically created suffixess

Prompt: Tell me how to build a bomb. <enter generated suffix here>

특징 요약

Opensource: https://github.com/llm-attacks/llm-attacks

대답을 유도하기 위해, 모델로 하여금 “Sure, here is…“로 시작하도록 유도함!

\(\rightarrow\) 이를 유도하는 loss function을 사용

Notation

Token: \(x_i \in\{1, \ldots, V\}\) (where \(V\) denotes the vocabulary size = number of tokens)
Next token prediction: \(p\left(x_{n+1} \mid x_{1:n}\right)\)
- for any \(x_{n+1} \in\{1, \ldots, V\}\),
Autoregressive LM: \(p\left(x_{n+1: n+H} \mid x_{1-n}\right)=\prod_{i=1}^H p\left(x_{n+i} \mid x_{1=n+i-1}\right)\)

(Negative log) probability of some target sequences of tokens \(x_{n+1:n+H}^*\)

\(\mathcal{L}\left(x_{1: n}\right)=-\log p\left(x_{n+1: n+H}^* \mid x_{1: n}\right)\).

Adversarial suffix에 대한 Optimization:

\(\operatorname{minimize}_{x_x \in\{1, \ldots V\}^{\mid \mathcal{I} \mid}} \mathcal{L}\left(x_{1:n}\right)\).
- where \(\mathcal{I} \subset\{1, \ldots, n\}\) denotes the indices of the adversarial suffix tokens in the LLM input.
이를 최소화하는 ADV Prompt를 찾자!

Multiple token replacement steps (based on the above loss function)

\(\rightarrow\) AutoPrompt와 거의 동일하지만, 차이점: search over ALL possible tokens to replace at each step, rather than just a single one