4. Graph Convolutional Networks

GCN : to generalize convolutions to graph domains


4-1. Spectral Methods

SPECTRAL representation of graphs

4 classic models

  • (1) Spectral Network
  • (2) ChebNet
  • (3) GCN
  • (4) AGCN


4-1-1. Spectral Network

  • convolution operation in Fourier domain

    ( by computing eigendecomposition of graph Laplacian )

  • Convolution Operation

    • multiplication of “signal” & “filter”
      • signal : \(\mathbf{x} \in \mathbb{R}^{N}\)
      • filter : \(\mathrm{g}_{\theta}=\operatorname{diag}(\boldsymbol{\theta})\) ( parameterized by \(\boldsymbol{\theta} \in \mathbb{R}^{N}\) )
    • \(\mathbf{g}_{\theta} \star \mathbf{x}=\mathbf{U g}_{\theta}(\Lambda) \mathbf{U}^{T} \mathbf{x}\).
      • \(\mathrm{U}\) : matrix of eigenvectors of \(\mathbf{L}\)
        • \(\mathbf{L}=\mathbf{I}_{N}- \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}=\mathbf{U} \Lambda \mathbf{U}^{T}\).
  • Limitations

    • (1) computationally intensive ( requires a full eigendecomposition of \(\mathbf{L}\) )
    • (2) non-spatially localized filters
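
A minimal numpy sketch of the spectral convolution above ( all names illustrative, not from the original paper ) — note how it needs the full eigendecomposition of \(\mathbf{L}\), which is exactly limitation (1) :

```python
import numpy as np

def spectral_conv(A, x, theta):
    """Spectral convolution: g_theta * x = U g_theta(Lambda) U^T x.
    A: (N, N) adjacency, x: (N,) signal, theta: (N,) filter parameters."""
    N = A.shape[0]
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt   # L = I - D^{-1/2} A D^{-1/2}
    lam, U = np.linalg.eigh(L)                    # L = U Lambda U^T (expensive!)
    return U @ (np.diag(theta) @ (U.T @ x))       # filter in the Fourier domain

# toy 4-node cycle graph; an identity filter (theta = 1) returns x unchanged
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
out = spectral_conv(A, np.array([1.0, 0.0, 0.0, 0.0]), theta=np.ones(4))
```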


4-1-2. ChebNet

Approximate \(\mathbf{g}_{\theta}(\Lambda)\), by a TRUNCATED expansion,

in terms of Chebyshev polynomials \(\mathbf{T}_{k}(x)\), up to \(K^{th}\) order

( removes the need to compute the eigenvectors of the Laplacian )


Spectral Network vs ChebNet

  • Spectral Network : \(\mathbf{g}_{\theta} \star \mathbf{x}=\mathbf{U g}_{\theta}(\Lambda) \mathbf{U}^{T} \mathbf{x}\).
  • ChebNet : \(\mathbf{g}_{\theta} \star \mathbf{x} \approx \sum_{k=0}^{K} \boldsymbol{\theta}_{k} \mathbf{T}_{k}(\tilde{\mathbf{L}}) \mathbf{x}\)
    • where \(\tilde{\mathbf{L}}=\frac{2}{\lambda_{\max }} \mathbf{L}-\mathbf{I}_{N}\), with \(\lambda_{\max}\) the largest eigenvalue of \(\mathbf{L}\)
    • \(\boldsymbol{\theta} \in \mathbb{R}^{K}\) : vector of Chebyshev coefficients


Chebyshev Polynomial :

  • \(\mathbf{T}_{k}(\mathbf{x})=2 \mathbf{x} \mathbf{T}_{k-1}(\mathbf{x})-\mathbf{T}_{k-2}(\mathbf{x})\),

    where \(\mathbf{T}_{0}(\mathbf{x})=1\) and \(\mathbf{T}_{1}(\mathbf{x})=\mathbf{x}\)
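
A small numpy sketch of the ChebNet filtering, assuming \(\lambda_{\max}\) is given ( illustrative names ) :

```python
import numpy as np

def chebnet_conv(L, x, theta, lam_max=2.0):
    """g_theta * x ~= sum_k theta_k T_k(L~) x -- no eigendecomposition needed.
    L: (N, N) normalized Laplacian, x: (N,) signal, theta: (K+1,) coefficients."""
    N = L.shape[0]
    L_t = (2.0 / lam_max) * L - np.eye(N)   # rescale eigenvalues into [-1, 1]
    T_prev, T_cur = x, L_t @ x              # T_0(L~) x = x,  T_1(L~) x = L~ x
    out = theta[0] * T_prev + (theta[1] * T_cur if len(theta) > 1 else 0.0)
    for k in range(2, len(theta)):
        # recurrence T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x), applied to vectors
        T_prev, T_cur = T_cur, 2.0 * (L_t @ T_cur) - T_prev
        out = out + theta[k] * T_cur
    return out
```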


4-1-3. GCN

ChebNet vs GCN

  • ChebNet : \(\mathbf{g}_{\theta} \star \mathbf{x} \approx \sum_{k=0}^{K} \boldsymbol{\theta}_{k} \mathbf{T}_{k}(\tilde{\mathbf{L}}) \mathbf{x}\).
  • pre-GCN (1) : \(\mathbf{g}_{\theta^{\prime}} \star \mathbf{x} \approx \theta_{0}^{\prime} \mathbf{x}+\theta_{1}^{\prime}\left(\mathbf{L}-\mathbf{I}_{N}\right) \mathbf{x}=\theta_{0}^{\prime} \mathbf{x}-\theta_{1}^{\prime} \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \mathbf{x}\)
    • let \(K=1\) ( to alleviate overfitting on local neighborhood structures )
    • approximate \(\lambda_{\max } \approx 2\)
  • pre-GCN (2) : \(\mathbf{g}_{\theta} \star \mathbf{x} \approx \theta\left(\mathbf{I}_{N}+\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}\right) \mathbf{x}\)
    • constrain the number of parameters with \(\theta=\theta_{0}^{\prime}=-\theta_{1}^{\prime}\)


Renormalization Trick

  • stacking operators above could lead to numerical instabilities
  • solution : renormalization trick
    • \(\mathbf{I}_{N}+\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \rightarrow \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}\)
      • \(\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}_{N}\).
      • \(\tilde{\mathbf{D}}_{i i}=\sum_{j} \tilde{\mathbf{A}}_{i j}\).
    • generalization : \(\mathbf{Z}=\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X} \Theta\)


GCN summary : \(\mathbf{Z}=\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X} \Theta\)

  • \(\mathbf{X} \in \mathbb{R}^{N \times C}\) : input signal, with \(C\) channels & \(F\) filters
  • \(\Theta \in \mathbb{R}^{C \times F}\) : matrix of filter parameters
  • \(\mathbf{Z} \in \mathbb{R}^{N \times F}\) : convolved signal matrix
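
A minimal numpy sketch of this GCN layer with the renormalization trick ( illustrative names ) :

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """Z = D~^{-1/2} A~ D~^{-1/2} X Theta  (renormalization trick).
    A: (N, N) adjacency, X: (N, C) input signal, Theta: (C, F) filter params."""
    N = A.shape[0]
    A_t = A + np.eye(N)                       # A~ = A + I_N : add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))  # D~_ii = sum_j A~_ij
    return D_inv_sqrt @ A_t @ D_inv_sqrt @ X @ Theta      # Z : (N, F)
```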


4-1-4. AGCN

4-1-1 ~ 4-1-3 :

  • use the original graph structure

\(\rightarrow\) but… there may be IMPLICIT relations!

\(\rightarrow\) propose ADAPTIVE GCN


Goal : learn underlying relations

  • learns a “residual” graph Laplacian \(\mathbf{L}_{\text {res }}\)

  • adds it to the original Laplacian : \(\widehat{\mathbf{L}}=\mathbf{L}+\alpha \mathbf{L}_{\text {res }}\)


Residual graph Laplacian : \(\mathbf{L}_{\text {res }}\)

  • \(\mathbf{L}_{\text {res }} =\mathbf{I}-\widehat{\mathbf{D}}^{-\frac{1}{2}} \widehat{\mathbf{A}} \widehat{\mathbf{D}}^{-\frac{1}{2}}\).
  • \(\widehat{\mathbf{D}} =\operatorname{degree}(\widehat{\mathbf{A}})\).
    • \(\widehat{\mathbf{A}}\) : computed with learned metric


learned metric : adaptive to task & input features

  • AGCN uses generalized Mahalanobis distance

  • \(D\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)=\sqrt{\left(\mathbf{x}_{i}-\mathbf{x}_{j}\right)^{T} \mathbf{M}\left(\mathbf{x}_{i}-\mathbf{x}_{j}\right)}\),

    where \(\mathbf{M}\) is a learned parameter

    • \(\mathbf{M}=\mathbf{W}_{d} \mathbf{W}_{d}^{T}\),

      where \(\mathbf{W}_d\) is the transform basis to the adaptive space


AGCN

  • (1) calculates a Gaussian kernel,
    • \(G_{x_{i}, x_{j}}=\exp \left(-D\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right) /\left(2 \sigma^{2}\right)\right)\).
  • (2) then normalizes \(G\) to get \(\widehat{\mathbf{A}}\) ( = dense adjacency matrix )
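
A sketch of how \(\widehat{\mathbf{A}}\) could be built from the learned metric; row normalization in the last step is an assumption for illustration, not necessarily the paper's exact choice :

```python
import numpy as np

def adaptive_adjacency(X, W_d, sigma=1.0):
    """Dense adjacency A_hat from the learned Mahalanobis metric M = W_d W_d^T.
    X: (N, F) node features, W_d: (F, d) learned transform basis."""
    M = W_d @ W_d.T
    diff = X[:, None, :] - X[None, :, :]                 # all pairwise x_i - x_j
    D = np.sqrt(np.maximum(np.einsum('ijf,fg,ijg->ij', diff, M, diff), 0.0))
    G = np.exp(-D / (2.0 * sigma ** 2))                  # Gaussian kernel
    return G / G.sum(axis=1, keepdims=True)              # row-normalize (assumed)
```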


4-2. Spatial Methods

Spectral methods

  • learned filters : depend on Laplacian eigenbasis

    ( = depends on graph structure )

  • Cons : a trained model cannot be applied to a graph with a DIFFERENT structure


Spatial methods

  • define convolutions DIRECTLY on the graph
  • challenges :
    • (1) defining the convolution operation with differently sized neighborhoods
    • (2) maintaining the local invariance of CNNs


4-2-1. Neural FPs

DIFFERENT weight matrices, for nodes with DIFFERENT degrees

  • \(\mathbf{x} =\mathbf{h}_{v}^{t-1}+\sum_{i=1}^{ \mid N_{v} \mid } \mathbf{h}_{i}^{t-1}\).

    • then \(\mathbf{h}_{v}^{t} =\sigma\left(\mathbf{x}\mathbf{W}_{\mathbf{t}}^{ \mid \mathbf{N}_{v} \mid }\right)\)

    • \(\mathbf{W}_{\mathbf{t}}^{ \mid \mathbf{N}_{v} \mid }\) : weight matrix, for nodes with degree \(\mid N_v \mid\)

  • adds (1) itself & (2) neighbors


Drawback : cannot be applied to LARGE-SCALE graphs ( node degrees can be very large… )
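
A minimal sketch of one Neural FP propagation step, assuming a dict of degree-specific weight matrices and tanh as \(\sigma\) ( both illustrative ) :

```python
import numpy as np

def neural_fp_step(H, A, W_by_degree):
    """One Neural FP update with degree-specific weight matrices.
    H: (N, F) node states, A: (N, N) adjacency,
    W_by_degree: dict mapping a node degree to its own (F, F) weight matrix."""
    X = H + A @ H                            # x = h_v + sum of neighbor states
    H_next = np.empty_like(X)
    for v in range(A.shape[0]):
        deg = int(A[v].sum())
        H_next[v] = np.tanh(X[v] @ W_by_degree[deg])  # sigma assumed to be tanh
    return H_next
```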


4-2-2. PATCHY-SAN

Simple Steps

  • (1) select & normalize \(k\) neighbors ( for each node )
  • (2) apply convolution
    • normalized neighborhood = receptive field


Detailed Steps

  • (1) Node Sequence Selection
    • do not use all nodes
    • select a sequence of nodes
  • (2) Neighborhood Assembly
    • BFS, until a total of \(k\) neighbors is extracted
  • (3) Graph Normalization
    • give an order to the nodes in the receptive field
      • unordered graph -> vector space
    • goal : assign nodes from 2 different graphs similar relative positions, if they have similar structural roles
  • (4) Convolutional Architecture
    • receptive field = normalized neighborhoods
    • channels = node & edge attributes
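
A small sketch of step (2), the BFS neighborhood assembly ( the labeling-based ordering of step (3) is problem-specific and omitted here ) :

```python
from collections import deque

def assemble_neighborhood(adj_list, root, k):
    """BFS from root until k nodes are collected.
    adj_list: dict mapping a node to the list of its neighbors."""
    seen, order, queue = {root}, [root], deque([root])
    while queue and len(order) < k:
        u = queue.popleft()
        for w in adj_list[u]:
            if w not in seen and len(order) < k:
                seen.add(w)
                order.append(w)
                queue.append(w)
    return order   # may contain fewer than k nodes on small graphs
```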



4-2-3. DCNN

DIFFUSION CNN

  • transition matrices are used to define neighbors
  • Ex) node classification :
    • \(\mathbf{H}=\sigma\left(\mathbf{W}^{c} \odot \mathbf{P}^{*} \mathbf{X}\right)\).
      • \(\mathbf{X}\) : input features ( shape : \(N \times F\) )
      • \(\mathbf{P}^{*}\) : \(N \times K \times N\) Tensor, which contains power series
        • power series : \(\left\{\mathbf{P}, \mathbf{P}^{2}, \ldots, \mathbf{P}^{K}\right\}\)
      • \(\mathbf{P}\) : degree-normalized transition matrix from adjacency matrix \(\mathbf{A}\)
  • result : transformed into diffusion convolutional representation
    • shape ( of each entity ) : \(K \times F\)
      • \(K\) Hops of graph diffusion, over \(F\) features
    • shape ( of full ) : \(N \times K \times F\) ….notation : \(\mathbf{H}\)
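
A numpy sketch of this diffusion convolution for node classification, assuming every node has at least one edge and tanh as the nonlinearity ( illustrative ) :

```python
import numpy as np

def dcnn_node_repr(A, X, W, K):
    """H = sigma(W^c ⊙ P* X) : diffusion-convolutional node representations.
    A: (N, N) adjacency, X: (N, F) features, W: (K, F) weights."""
    P = A / A.sum(axis=1, keepdims=True)   # degree-normalized transition matrix
    N, F = X.shape
    PX = np.empty((N, K, F))
    Pk = np.eye(N)
    for k in range(K):
        Pk = Pk @ P                        # power series {P, P^2, ..., P^K}
        PX[:, k, :] = Pk @ X               # diffuse the features k+1 hops
    return np.tanh(W[None, :, :] * PX)     # elementwise ⊙ ; sigma assumed tanh
```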


For graph classification…

  • take the average of the nodes’ representations
  • \(\mathbf{H}=\sigma\left(\mathbf{W}^{c} \odot 1_{N}^{T} \mathbf{P}^{*} \mathbf{X} / N\right)\).


4-2-4. DGCN

Dual GCN … what is dual?

  • (1) LOCAL consistency
  • (2) GLOBAL consistency

\(\rightarrow\) Uses 2 convolutions ( one supervised loss & one unsupervised loss )


First convolution ( \(C o n v_{A}\) )

  • for LOCAL consistency

  • original GCN ( \(\mathbf{Z}=\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X} \Theta\) )
  • concept : nearby nodes may have similar labels


Second convolution ( \(C o n v_{P}\) )

  • for GLOBAL consistency

  • replace adjacency matrix, with PPMI matrix

    ( = Positive Pointwise Mutual Information matrix… \(\mathbf{X}_{P}\) )

  • \(\mathbf{H}^{\prime}=\sigma\left(\mathbf{D}_{P}^{-\frac{1}{2}} \mathbf{X}_{P} \mathbf{D}_{P}^{-\frac{1}{2}} \mathbf{H} \Theta\right)\).

  • concept : nodes with similar context may have similar labels


Ensemble 2 convolutions!

Final Loss function :

  • \(L=L_{0}\left(\operatorname{Conv}_{A}\right)+\lambda(t) L_{\text {reg }}\left(\operatorname{Conv}_{A}, \operatorname{Conv}_{P}\right)\).
    • \(L_{0}\left(\operatorname{Conv}_{A}\right)\) : supervised loss ( = CE Loss )
      • \(L_{0}\left(\operatorname{Conv}_{A}\right)=-\frac{1}{ \mid y_{L} \mid } \sum_{l \in y_{L}} \sum_{i=1}^{c} Y_{l, i} \ln \left(\widehat{Z}_{l, i}^{A}\right)\).
    • \(L_{\text {reg }}\left(\operatorname{Conv}_{A}, \operatorname{Conv}_{P}\right)\) : unsupervised loss
      • \(L_{\text {reg }}\left(\operatorname{Conv}_{A}, \operatorname{Conv}_{P}\right)=\frac{1}{n} \sum_{i=1}^{n} \mid \mid \widehat{Z}_{i,:}^{P}-\widehat{Z}_{i,:}^{A} \mid \mid ^{2}\).
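
A small numpy sketch of this ensemble loss ( names illustrative ) :

```python
import numpy as np

def dgcn_loss(Z_A, Z_P, Y, labeled_idx, lam):
    """L = L0(Conv_A) + lambda(t) * L_reg(Conv_A, Conv_P).
    Z_A, Z_P: (n, c) softmax outputs of the two convolutions,
    Y: (n, c) one-hot labels, labeled_idx: indices of the labeled nodes."""
    # supervised CE loss, averaged over the labeled nodes only
    L0 = -np.mean(np.sum(Y[labeled_idx] * np.log(Z_A[labeled_idx] + 1e-12), axis=1))
    # unsupervised loss : make the two views agree, over ALL n nodes
    L_reg = np.mean(np.sum((Z_P - Z_A) ** 2, axis=1))
    return L0 + lam * L_reg
```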




4-2-5. LGCN

Learnable GCN

  • based on “LGCL” (Learnable GC Layer) & “sub-graph training strategy”


LGCL (Learnable GCL)

  • use CNNs as aggregators

  • 2 steps

    • (1) get top-k feature elements
    • (2) apply a 1-D CNN
  • propagation step :

    • \(\widehat{H}_{t} =g\left(H_{t}, A, k\right)\).
      • \(g\) : k-largest node selection operation
    • \(H_{t+1} =c\left(\widehat{H}_{t}\right)\).
      • \(c\) : regular 1-D CNN
  • if a node has fewer than \(k\) neighbors… zero-padding

  • finally, the embedding of node \(x\) is inserted in the first row

    \(\rightarrow\) \(\widehat{M} \in \mathbb{R}^{(k+1) \times c}\)
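
A sketch of the k-largest node selection \(g(H, A, k)\) for a single node, building the \((k+1) \times c\) matrix \(\widehat{M}\) ( illustrative; the 1-D CNN \(c(\cdot)\) would then slide over its rows ) :

```python
import numpy as np

def lgcl_select(H, A, v, k):
    """k-largest node selection g(H, A, k) for node v -> (k+1, c) matrix M_hat.
    For each feature column, keep the k largest values among v's neighbors."""
    feats = H[np.where(A[v] > 0)[0]]             # (|N_v|, c) neighbor features
    if feats.shape[0] < k:                       # fewer than k neighbors :
        pad = np.zeros((k - feats.shape[0], H.shape[1]))
        feats = np.vstack([feats, pad])          # ... zero-padding
    topk = -np.sort(-feats, axis=0)[:k]          # column-wise k largest values
    return np.vstack([H[v:v + 1], topk])         # own embedding in the first row
```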



4-2-6. MONET

MoNet ( = spatial domain model ), on non-Euclidean domains

  • examples ( models it generalizes ) :
    • (1) GCNN ( Geodesic CNN )
    • (2) ACNN ( Anisotropic CNN )
    • (3) GCN
    • (4) DCNN


MoNet computes pseudo-coordinates \(\mathbf{u}(x,y)\)

  • applies a weighting function over these coordinates

\(\rightarrow\) \(D_{j}(x) f=\sum_{y \in N_{x}} w_{j}(\mathbf{u}(x, y)) f(y)\).


Then, do convolution, on non-Euclidean domains!

\(\rightarrow\) \((f \star g)(x)=\sum_{j=1}^{J} g_{j} D_{j}(x) f .\)
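
A minimal sketch of the patch operator \(D_j(x)f\) with one Gaussian weighting kernel ( MoNet typically uses a mixture of \(J\) such kernels; names illustrative ) :

```python
import numpy as np

def monet_patch(f, neighbors, u, mu, Sigma_inv):
    """D_j(x) f = sum_{y in N_x} w_j(u(x, y)) f(y), one Gaussian kernel w_j.
    f: (N,) signal, neighbors: indices in N_x, u: dict y -> pseudo-coordinates,
    mu / Sigma_inv: the kernel's (learnable) mean and inverse covariance."""
    out = 0.0
    for y in neighbors:
        d = u[y] - mu
        out += np.exp(-0.5 * d @ Sigma_inv @ d) * f[y]   # w_j(u(x, y)) f(y)
    return out
```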



4-2-7. GraphSAGE

general inductive framework

  • generates embeddings, by “sampling” & “aggregating” features


Propagation step :

  • \(\mathbf{h}_{N_{v}}^{t} =\text { AGGREGATE }_{t}\left(\left\{\mathbf{h}_{u}^{t-1}, \forall u \in N_{v}\right\}\right)\).
  • \(\mathbf{h}_{v}^{t} =\sigma\left(\mathbf{W}^{t} \cdot\left[\mathbf{h}_{v}^{t-1} \mid \mid \mathbf{h}_{N_{v}}^{t}\right]\right)\).

\(\rightarrow\) but, do not use FULL set of neighbors! use UNIFORM SAMPLING


Aggregator functions :

  • (1) mean aggregator
  • (2) LSTM aggregator
  • (3) Pooling aggregator
  • ( any symmetric functions can be used )
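
A minimal numpy sketch of one GraphSAGE layer, using the mean aggregator and uniform neighbor sampling, with ReLU assumed as \(\sigma\) ( illustrative ) :

```python
import numpy as np

def graphsage_step(H, adj_list, W, sample_size, rng):
    """One GraphSAGE layer : mean aggregator + uniform neighbor sampling.
    H: (N, F) states, W: (2F, F') weights; [a || b] is concatenation."""
    H_next = np.empty((H.shape[0], W.shape[1]))
    for v in range(H.shape[0]):
        neigh = adj_list[v]                              # assumed non-empty
        if len(neigh) > sample_size:
            neigh = rng.choice(neigh, sample_size, replace=False)
        h_N = H[neigh].mean(axis=0)                      # AGGREGATE_t = mean
        H_next[v] = np.maximum(np.concatenate([H[v], h_N]) @ W, 0.0)  # ReLU
    return H_next

# usage : rng = np.random.default_rng(0); H1 = graphsage_step(H0, adj, W, 5, rng)
```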


Also, proposes an UNSUPERVISED loss function

  • encourages nearby nodes to have similar representations,

    while distant nodes have different representations

  • \(J_{G}\left(\mathbf{z}_{u}\right)=-\log \left(\sigma\left(\mathbf{z}_{u}^{T} \mathbf{z}_{v}\right)\right)-Q \cdot E_{v_{n} \sim P_{n}(v)} \log \left(\sigma\left(-\mathbf{z}_{u}^{T} \mathbf{z}_{v_{n}}\right)\right)\),

    • \(v\) : neighbor of node \(u\)
    • \(P_n\) : negative sampling distribution
    • \(Q\) : number of negative samples
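
A sketch of this loss for one positive pair, approximating the expectation with \(Q\) drawn negative samples ( illustrative ) :

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def graphsage_unsup_loss(z_u, z_v, Z_neg):
    """J_G(z_u) : v is a neighbor of u, Z_neg holds Q negative samples ~ P_n.
    z_u, z_v: (d,) embeddings, Z_neg: (Q, d) negative-sample embeddings."""
    pos = -np.log(sigmoid(z_u @ z_v) + 1e-12)             # pull neighbor close
    neg = -np.sum(np.log(sigmoid(-Z_neg @ z_u) + 1e-12))  # push Q negatives away
    return pos + neg
```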
