6. Graph Attention Networks (GATs)
GCN : treats all neighbors EQUALLY
GAT : assigns a DIFFERENT ATTENTION SCORE to each neighbor
2 variants
- (1) GAT
- (2) GaAN
6-1. GAT
attention coefficient for node pair \((i,j)\) :
- \(\alpha_{i j}=\frac{\exp \left(\operatorname{LeakyReLU}\left(\mathbf{a}^{T}\left[\mathbf{W h}_{i} \mid \mid \mathbf{W h}_{j}\right]\right)\right)}{\sum_{k \in N_{i}} \exp \left(\operatorname{LeakyReLU}\left(\mathbf{a}^{T}\left[\mathbf{W h}_{i} \mid \mid \mathbf{W h}_{k}\right]\right)\right)}\).
final output feature of each node ( see the sketch below ) :
- \(\mathbf{h}_{i}^{\prime}=\sigma\left(\sum_{j \in N_{i}} \alpha_{i j} \mathbf{W h}_{j}\right)\).
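A minimal single-head sketch of the two formulas above, in plain PyTorch. The class name `GATLayer`, the dense `adj` mask, and ELU standing in for \(\sigma\) are illustrative assumptions, not fixed by the notes:

```python
# A minimal single-head GAT layer in plain PyTorch; `GATLayer` and the
# dense `adj` mask are illustrative assumptions, and ELU stands in for sigma.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared weight matrix W
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention vector a
        self.leaky_relu = nn.LeakyReLU(0.2)              # slope 0.2, as in the GAT paper

    def forward(self, h, adj):
        # h:   (N, in_dim) node features
        # adj: (N, N) float {0,1} adjacency, assumed to include self-loops
        Wh = self.W(h)                                   # (N, out_dim)
        N = Wh.size(0)
        # all pairwise concatenations [Wh_i || Wh_j]: (N, N, 2*out_dim)
        Wh_i = Wh.unsqueeze(1).expand(N, N, -1)
        Wh_j = Wh.unsqueeze(0).expand(N, N, -1)
        e = self.leaky_relu(self.a(torch.cat([Wh_i, Wh_j], dim=-1))).squeeze(-1)
        # restrict the softmax to the neighborhood N_i by masking non-edges
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = F.softmax(e, dim=1)                      # attention coefficients alpha_ij
        return F.elu(alpha @ Wh)                         # h'_i = sigma(sum_j alpha_ij W h_j)
```

Masking non-neighbors with \(-\infty\) before the softmax restricts the normalization to \(N_i\); including self-loops in `adj` keeps every row finite.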
Multi-head Attention
- apply \(K\) independent attention mechanisms
Concatenate ( or average ) the features — two options ( see the sketch after the formulas )
- ( concatenate )
- \(\mathbf{h}_{i}^{\prime} = \mid \mid _{k=1}^{K} \sigma\left(\sum_{j \in N_{i}} \alpha_{i j}^{k} \mathbf{W}^{k} \mathbf{h}_{j}\right)\).
- ( average )
- \(\mathbf{h}_{i}^{\prime} =\sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N_{i}} \alpha_{i j}^{k} \mathbf{W}^{k} \mathbf{h}_{j}\right)\).
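A small sketch of the two combination rules, assuming `heads` holds the \(K\) pre-activation per-head aggregates \(\sum_{j \in N_{i}} \alpha_{i j}^{k} \mathbf{W}^{k} \mathbf{h}_{j}\) ( e.g. from \(K\) independent `GATLayer`-style computations ); `combine_heads` is an illustrative helper, not from the notes:

```python
# Sketch of the two ways to combine K attention heads.
import torch
import torch.nn.functional as F

def combine_heads(heads, concat=True):
    # heads: list of K tensors of shape (N, out_dim), each holding the
    # pre-activation per-head aggregate sum_j alpha^k_ij W^k h_j
    if concat:
        # hidden layers: apply sigma per head, then concatenate -> (N, K*out_dim)
        return torch.cat([F.elu(h) for h in heads], dim=-1)
    # final layer: average the heads first, then apply sigma -> (N, out_dim)
    return F.elu(torch.stack(heads).mean(dim=0))
```

Note the order of operations differs between the two formulas: \(\sigma\) is applied per head before concatenation, but only after averaging. In the GAT paper, concatenation is used in the hidden layers and averaging in the final ( prediction ) layer.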
Properties of GAT
- (1) parallelizable ( efficient )
- (2) can deal with nodes of different degrees & assign corresponding weights to their neighbors
- (3) can be applied to inductive learning problems
\(\rightarrow\) outperforms GCN!
6-2. GaAN
also uses multi-head attention
GAT vs GaAN : how they compute attention coefficients
- (1) GAT : uses a fully connected ( FC ) layer
- (2) GaAN : uses key-value attention & dot-product attention
GaAN assigns different weights to different heads by computing an additional soft gate ( = gated attention aggregator )
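A schematic sketch of the gated, key-value / dot-product attention idea. This is a simplification: the projection names, the \(1/\sqrt{d}\) scaling, and computing the gate from the node feature plus mean-pooled neighbors are assumptions for illustration ( the actual GaAN aggregator builds its gate from max- and mean-pooled neighbor information ):

```python
# Schematic sketch of GaAN-style gated multi-head dot-product attention;
# projection names, the 1/sqrt(d) scaling, and the gate inputs are
# simplifying assumptions, not the exact GaAN formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiHeadAttention(nn.Module):
    def __init__(self, in_dim, out_dim, num_heads):
        super().__init__()
        self.K, self.d = num_heads, out_dim
        self.q = nn.Linear(in_dim, num_heads * out_dim, bias=False)  # queries
        self.k = nn.Linear(in_dim, num_heads * out_dim, bias=False)  # keys
        self.v = nn.Linear(in_dim, num_heads * out_dim, bias=False)  # values
        # soft gate: one scalar per (node, head)
        self.gate = nn.Linear(2 * in_dim, num_heads)

    def forward(self, h, adj):
        # h: (N, in_dim), adj: (N, N) float {0,1}, assumed to include self-loops
        N = h.size(0)
        q = self.q(h).view(N, self.K, self.d)
        k = self.k(h).view(N, self.K, self.d)
        v = self.v(h).view(N, self.K, self.d)
        # scaled dot-product (key-value) scores per head: (K, N, N)
        scores = torch.einsum("ikd,jkd->kij", q, k) / self.d ** 0.5
        scores = scores.masked_fill(adj.unsqueeze(0) == 0, float("-inf"))
        alpha = F.softmax(scores, dim=-1)
        out = torch.einsum("kij,jkd->ikd", alpha, v)     # (N, K, d)
        # soft gate per head from the node and its mean-pooled neighbors
        # (the real GaAN gate also uses max-pooled neighbor projections)
        pooled = (adj @ h) / adj.sum(dim=1, keepdim=True)
        g = torch.sigmoid(self.gate(torch.cat([h, pooled], dim=-1)))  # (N, K)
        out = out * g.unsqueeze(-1)                      # down/up-weight each head
        return out.reshape(N, self.K * self.d)           # concatenate heads
```

The gate \(g \in (0,1)^{N \times K}\) scales each head's output per node, so uninformative heads can be suppressed instead of contributing equally as in plain multi-head GAT.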