1. Limitation of Self-Attention
\(\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right)V\).
- \(Q \in \mathbb{R}^{n \times d}\): Query
- \(K \in \mathbb{R}^{n \times d}\): Key
- \(V \in \mathbb{R}^{n \times d_v}\): Value
- \(n\): Sequence length; \(d\) \((= d_k)\): Query/Key dimension
Complexity: \(\mathcal{O}(n^2)\) in sequence length, due to the \(n \times n\) score matrix \(QK^T\)
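For concreteness, a minimal NumPy sketch of the formula above (function and variable names are illustrative, not taken from any library):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K: (n, d), V: (n, d_v). Materializes the (n, n) score matrix,
    hence O(n^2 d) time and O(n^2) memory."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                               # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                # row-wise softmax
    return weights @ V                                            # (n, d_v)
```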
2. Linear Attention
Complexity: \(\mathcal{O}(n^2)\) \(\rightarrow\) \(\mathcal{O}(n)\)
Key idea
- Do not compute softmax(\(QK^T\)) directly;
- instead, approximate it by applying a kernel function to \(Q\) and \(K\).
Softmax Attention: \(A = \text{softmax}(QK^T)V\)
Linear Attention: \(A = \phi(Q) \left( \phi(K)^T V \right)\).
- \(\phi(\cdot)\): Non-linear feature map (kernel function)
- \(\phi(Q) \in \mathbb{R}^{n \times d}\).
- \(\phi(K)^T V \in \mathbb{R}^{d \times d_v}\): computed only once
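A minimal sketch of this factorization, assuming the common positive feature map \(\phi(x) = \text{elu}(x) + 1\) (an illustrative choice, not specified above). A row-wise normalizer, which plays the role of the softmax denominator and is omitted from the formula above, is included:

```python
import numpy as np

def phi(x):
    # Feature map; elu(x) + 1 is one common positive choice (an assumption here).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    """Kernelized linear attention A = phi(Q) (phi(K)^T V).
    Q, K: (n, d), V: (n, d_v). The (n, n) matrix is never formed:
    phi(K)^T V is a (d, d_v) summary computed once, so cost is O(n d d_v)."""
    Qp, Kp = phi(Q), phi(K)                    # (n, d)
    KV = Kp.T @ V                              # (d, d_v), computed once
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T   # (n, 1) normalizer (softmax denominator analogue)
    return (Qp @ KV) / Z                       # (n, d_v)
```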
| | Complexity | Memory |
|---|---|---|
| Softmax Attention | \(\mathcal{O}(n^2 d)\) | \(\mathcal{O}(n^2)\) |
| Linear Attention | \(\mathcal{O}(n d^2)\) | \(\mathcal{O}(nd)\) |
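A quick shape check under the same illustrative setup (reusing the two sketches above); only the softmax path allocates an \(n \times n\) intermediate:

```python
n, d, d_v = 1024, 64, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d_v))

out_soft = softmax_attention(Q, K, V)   # builds a (1024, 1024) score matrix
out_lin = linear_attention(Q, K, V)     # largest intermediates are (n, d) and (d, d_v)
print(out_soft.shape, out_lin.shape)    # (1024, 64) (1024, 64)
```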