TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Wang, Haiyang, et al. "TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters." ICLR 2025
References:
- https://aipapersacademy.com/tokenformer/
- https://arxiv.org/abs/2410.23168
Contents
- Motivation
- Transformer vs. Tokenformer
- Transformer
- Tokenformer
- Experiments
1. Motivation
Scaling a Transformer to a larger size typically requires retraining the whole model from scratch, which becomes increasingly costly as models grow!
2. Transformer vs. Tokenformer
(1) Transformer
Procedure
- Step 1) Linear projection
= Input first passes through a linear projection block
\(\rightarrow\) Generates the inputs for the attention block (\(Q,K,V\))
\(\rightarrow\) Interactions between the parameters and the input tokens
- Step 2) Self-attention
= Allows input tokens to interact with each other
- Step 3) FFN
= Interactions between tokens and parameters
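A minimal single-head sketch of this three-step structure in PyTorch (the class, dimension names, and sizes are illustrative, not taken from the paper):

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerBlock(nn.Module):
    """Minimal single-head Transformer block: linear QKV projection -> self-attention -> FFN."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        # Step 1) Linear projections: token-parameter interaction with fixed-size weights
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        # Step 3) FFN: another token-parameter interaction with fixed-size weights
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.norm1(x)
        # Step 1) Project input tokens to Q, K, V
        q, k, v = self.w_q(h), self.w_k(h), self.w_v(h)
        # Step 2) Self-attention: token-token interaction
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        attn = F.softmax(scores, dim=-1) @ v
        x = x + self.w_o(attn)
        # Step 3) FFN: token-parameter interaction
        return x + self.ffn(self.norm2(x))
```

Note that every token-parameter interaction above goes through an `nn.Linear` whose weight shape is fixed, which is exactly the constraint Tokenformer removes.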
(2) Tokenformer
Key Idea
Token-parameter interactions
- In the Transformer, computed via linear projections with a fixed number of parameters
\(\rightarrow\) Necessitates training from scratch when increasing the model size :(
Solution?
Make the model fully attention-based, including the token-parameter interactions
\(\rightarrow\) A more flexible architecture that supports incrementally adding parameters
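The contrast is visible directly in the parameter shapes: a linear projection's weight is tied to the channel dimensions, while parameter tokens form a matrix whose row count can grow. A small illustrative sketch (all names and sizes are my own, not the paper's):

```python
import torch
import torch.nn as nn

d_model = 512

# Transformer: a projection's weight shape is fixed by (d_out, d_in); widening the
# model changes this shape, so the old weights cannot simply be reused.
proj = nn.Linear(d_model, d_model)
print(proj.weight.shape)   # torch.Size([512, 512])

# Tokenformer: parameters are rows of a key/value "parameter token" matrix (n, d_model).
# n is decoupled from the channel dimension, so scaling up means appending rows
# while keeping the already-trained ones.
n_param_tokens = 1024
key_params = nn.Parameter(torch.randn(n_param_tokens, d_model) * 0.02)

extra = 512  # incremental scaling step: add new (here zero-initialized) parameter tokens
key_params = nn.Parameter(torch.cat([key_params.data, torch.zeros(extra, d_model)]))
print(key_params.shape)    # torch.Size([1536, 512])
```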
Procedure
- Step 1) Feed input tokens to token-parameter attention block
- [Input] Input tokens (= Query) & parameters (= Key param, Value param)
- [Output] Used as the inputs for the self-attention block (Q, K, and V)
- ( Step 2 = same as Transformer )
- Step 3) Replace the FFN with a token-parameter attention block (see the sketch below)
- Query: output from the self-attention block
- Key & Value: a separate set of learnable parameter tokens
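A minimal sketch of a Tokenformer block following these three steps. The `Pattention` module below is a simplified reading of the paper's token-parameter attention: its normalization \(\Theta\) is approximated by a GeLU over L2-normalized scores, and the numbers of parameter tokens are arbitrary; the official implementation may differ in these details.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class Pattention(nn.Module):
    """Token-parameter attention: queries = input tokens, keys/values = learnable parameter tokens."""

    def __init__(self, d_in: int, d_out: int, n_param_tokens: int):
        super().__init__()
        self.key_params = nn.Parameter(torch.randn(n_param_tokens, d_in) * 0.02)
        self.value_params = nn.Parameter(torch.randn(n_param_tokens, d_out) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = x @ self.key_params.t()                        # (..., n_param_tokens)
        norm = scores.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        weights = F.gelu(scores / norm)                         # Theta(X @ K_P^T), simplified
        return weights @ self.value_params                      # (..., d_out)


class TokenformerBlock(nn.Module):
    """Tokenformer block: Pattention replaces the QKV/output projections and the FFN."""

    def __init__(self, d_model: int = 512, n_param_tokens: int = 1024):
        super().__init__()
        # Step 1) Token-parameter attention producing Q, K, V for self-attention
        self.q_pattn = Pattention(d_model, d_model, n_param_tokens)
        self.k_pattn = Pattention(d_model, d_model, n_param_tokens)
        self.v_pattn = Pattention(d_model, d_model, n_param_tokens)
        self.out_pattn = Pattention(d_model, d_model, n_param_tokens)
        # Step 3) Token-parameter attention in place of the FFN
        # (given more parameter tokens, analogous to the FFN's wider hidden layer)
        self.ffn_pattn = Pattention(d_model, d_model, 4 * n_param_tokens)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        # Step 1) Q, K, V come from token-parameter attention instead of linear projections
        q, k, v = self.q_pattn(h), self.k_pattn(h), self.v_pattn(h)
        # Step 2) Standard token-token self-attention (same as in the Transformer)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        attn = F.softmax(scores, dim=-1) @ v
        x = x + self.out_pattn(attn)
        # Step 3) Token-parameter attention replaces the FFN
        return x + self.ffn_pattn(self.norm2(x))
```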
Comparing standard (token-token) attention with token-parameter attention (Pattention):
\(\text{Attention}(Q, K, V)=\operatorname{softmax}\left[\frac{Q \cdot K^{\top}}{\sqrt{d}}\right] \cdot V\)
\(\text{Pattention}\left(X, K_P, V_P\right)=\Theta\left(X \cdot K_P^{\top}\right) \cdot V_P\)
where \(X\) is the input token sequence, \(K_P, V_P\) are learnable parameter tokens, and \(\Theta\) is a modified softmax used for stable optimization (a GeLU-based normalization in the paper).
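Restating the Pattention formula as a standalone function makes the incremental-scaling property concrete: new key/value parameter tokens can be appended with zero initialization, and with a GeLU-style \(\Theta\) they contribute nothing, so the already-trained behavior is preserved while the parameter count grows. A hedged tensor-level sketch (the exact \(\Theta\), including its scale factor, is simplified here):

```python
import torch
import torch.nn.functional as F


def pattention(x: torch.Tensor, key_params: torch.Tensor, value_params: torch.Tensor) -> torch.Tensor:
    """Pattention(X, K_P, V_P) = Theta(X @ K_P^T) @ V_P, with Theta approximated by a
    GeLU over L2-normalized scores (the paper's modified softmax also uses a scale factor)."""
    scores = x @ key_params.t()
    weights = F.gelu(scores / scores.norm(dim=-1, keepdim=True).clamp_min(1e-6))
    return weights @ value_params


d_model, n_old, n_new = 64, 256, 128
x = torch.randn(2, 10, d_model)

# "Pre-trained" parameter tokens
k_old = torch.randn(n_old, d_model) * 0.02
v_old = torch.randn(n_old, d_model) * 0.02
y_old = pattention(x, k_old, v_old)

# Incremental scaling: append zero-initialized key/value parameter tokens
k_grown = torch.cat([k_old, torch.zeros(n_new, d_model)], dim=0)
v_grown = torch.cat([v_old, torch.zeros(n_new, d_model)], dim=0)
y_grown = pattention(x, k_grown, v_grown)

# Zero-scored new tokens get zero attention weight and zero value rows add nothing,
# so the grown model reproduces the original outputs before any further training.
print(torch.allclose(y_old, y_grown))  # True
```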