LongNet: Scaling Transformers to 1,000,000,000 Tokens
Ding, Jiayu, et al. "Longnet: Scaling transformers to 1,000,000,000 tokens." arXiv preprint arXiv:2307.02486 (2023).
( https://arxiv.org/pdf/2307.02486 )
Reference:
- https://aipapersacademy.com/longnet/
Contents
- Background
- Improving Attention Mechanism
- Standard Attention
- Dilated Attention
- Mixture of Dilated Attentions
- Multi-head Dilated Attention
1. Background
Modeling long sequences is crucial!
Limitation: the high computational complexity of standard attention
\(\rightarrow\) Difficult to scale up the context length.
2. Improving Attention Mechanism
(1) Standard Attention
Quadratic dependency on the sequence length
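As a reference point, here is a minimal PyTorch sketch of dense scaled dot-product attention (function and variable names are illustrative, not from the paper). The \(N \times N\) score matrix is what makes long contexts expensive.

```python
import torch
import torch.nn.functional as F

def standard_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    d = q.size(-1)
    # Score matrix is (batch, N, N): quadratic in the sequence length N.
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```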
(2) Dilated Attention
Sparsification = split the sequence into segments and remove rows from each segment based on a hyperparameter \(r\) (the dilation rate)
- \(r\) controls the interval between the rows that are kept (equivalently, the distance between removed rows)
- Each segment can be computed in parallel \(\rightarrow\) distributed training on multiple GPUs (see the sketch after this list)
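A simplified sketch of dilated attention for a single (segment length \(w\), dilation rate \(r\)) pair, assuming the sequence length is divisible by \(w\). Names and shapes are illustrative; the real LongNet implementation fuses these steps and distributes segments across GPUs.

```python
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, w, r):
    # q, k, v: (batch, seq_len, dim); assumes seq_len % w == 0
    b, n, d = q.shape
    # 1) Split the sequence into segments of length w.
    q_seg = q.view(b, n // w, w, d)
    k_seg = k.view(b, n // w, w, d)
    v_seg = v.view(b, n // w, w, d)
    # 2) Sparsify each segment: keep every r-th row.
    idx = torch.arange(0, w, r, device=q.device)
    qs, ks, vs = q_seg[:, :, idx], k_seg[:, :, idx], v_seg[:, :, idx]
    # 3) Dense attention inside each (much smaller) sparsified segment.
    scores = qs @ ks.transpose(-2, -1) / d ** 0.5
    out_sparse = F.softmax(scores, dim=-1) @ vs
    # 4) Scatter results back to their original positions (removed rows stay zero).
    out = torch.zeros_like(q_seg)
    out[:, :, idx] = out_sparse
    return out.view(b, n, d)
```

Each segment's attention only touches \(w / r\) rows, so compute per segment shrinks by roughly \(r^2\), and segments are independent, which is what allows them to be placed on different GPUs.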
(3) Mixture of Dilated Attentions
Q) Information loss by dilation?
\(\rightarrow\) Use mixture of dilated attentions
- All of the different dilated attentions can be computed in parallel
- Provides the model with diverse and complete coverage, capturing both short-range and long-range dependencies (see the sketch after this list)
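A simplified mixture over several (segment length, dilation rate) pairs, reusing the `dilated_attention` sketch above. The specific \((w, r)\) values are illustrative, and for brevity the outputs are averaged with equal weights; the paper instead weights each output by its attention softmax denominator.

```python
def mixture_of_dilated_attention(q, k, v, configs=((256, 1), (512, 2), (1024, 4))):
    # Small (w, r): dense attention over short ranges.
    # Large (w, r): sparse attention over long ranges.
    outputs = [dilated_attention(q, k, v, w, r) for w, r in configs]
    # Equal-weight combination for illustration only.
    return sum(outputs) / len(outputs)
```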
(4) Multi-head Dilated Attention
To further diversify the captured information (in addition to the mixture of dilated attentions)
\(\rightarrow\) Use multi-head dilated attention blocks
- Choose different rows to remove in each head by shifting the selection offset (see the sketch below)!
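A sketch of multi-head dilated attention: each head shifts the selected rows by a different offset (head index mod \(r\)), so together the heads cover positions that a single dilation pattern would skip. Shapes, the per-head projections, and the final concatenation are simplified for illustration.

```python
import torch
import torch.nn.functional as F

def multihead_dilated_attention(q, k, v, w, r, num_heads):
    # q, k, v: (batch, seq_len, dim); assumes seq_len % w == 0
    b, n, d = q.shape
    head_outputs = []
    for j in range(num_heads):
        offset = j % r                       # head-specific offset
        idx = torch.arange(offset, w, r, device=q.device)
        q_seg = q.view(b, n // w, w, d)[:, :, idx]
        k_seg = k.view(b, n // w, w, d)[:, :, idx]
        v_seg = v.view(b, n // w, w, d)[:, :, idx]
        scores = q_seg @ k_seg.transpose(-2, -1) / d ** 0.5
        out_sparse = F.softmax(scores, dim=-1) @ v_seg
        out = torch.zeros(b, n // w, w, d, device=q.device, dtype=q.dtype)
        out[:, :, idx] = out_sparse
        head_outputs.append(out.view(b, n, d))
    # A real implementation would use separate Q/K/V projections per head and
    # concatenate + project the head outputs; here we simply average them.
    return sum(head_outputs) / num_heads
```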