[ 3. Node Embedding ]
( Reference : CS224W: Machine Learning with Graphs )
Contents
- 3-1. Node Embeddings
- 3-2. Random Walk
- 3-3. Embedding Entire Graphs
3-1. Node Embeddings
1) Feature Representation ( Embedding )
Goal : achieve…
- 1) efficient,
- 2) task-independent feature learning,
- 3) with graphs!
Similar in Network \(\approx\) Similar in Embedding Space

With these embeddings, we can solve many tasks!
- ex) node classification, link prediction….

Node Embedding algorithms
- ex) Deep Walk
2) Encoder & Decoder
Notation
- graph \(G\)
- vertex set \(V\)
- adjacency matrix \(A\) ( binary )
Goal : encode nodes, such that….
“similarity in embedding space \(\approx\) similarity in graph”
( \(\operatorname{similarity}(u, v) \approx \mathbf{z}_{v}^{\mathrm{T}} \mathbf{z}_{u}\) )
key point : how do we DEFINE SIMILARITY??
Process
- step 1) node \(\rightarrow\) (encoder) \(\rightarrow\) embedding ( \(\operatorname{ENC}(v)=\mathbf{z}_{v}\) )
- step 2) define a “node similarity function”
- step 3) embeddings \(\rightarrow\) (decoder) \(\rightarrow\) similarity score
- step 4) optimization ( in a way that \(\operatorname{similarity}(u, v) \approx \mathbf{z}_{v}^{\mathrm{T}} \mathbf{z}_{u}\) )
3) Shallow Encoding
- simplest ( just “embedding-lookup table” )
- \(\operatorname{ENC}(v)=\mathbf{z}_{v}=\mathbf{Z} \cdot v\).
    - where \(\mathbf{Z} \in \mathbb{R}^{d \times \mid \mathcal{V} \mid }\) ( what we want to learn )
- \(v \in \mathbb{I}^{ \mid \mathcal{V} \mid }\) : indicator vector ( one-hot : all zeros, except a 1 at node \(v\)’s position ; see the sketch after this list )
 
- ex) Deep Walk, node2vec
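
Below is a minimal sketch of shallow encoding ( embedding lookup ) plus the dot-product decoder, assuming a toy embedding matrix; the sizes and values are made up for illustration.

```python
import numpy as np

num_nodes, d = 5, 3                 # |V| = 5 nodes, embedding dim d = 3 (toy sizes)
Z = np.random.randn(d, num_nodes)   # Z in R^{d x |V|} : the matrix we want to learn

def ENC(v: int) -> np.ndarray:
    """Shallow encoder: ENC(v) = Z . v, with v a one-hot indicator vector."""
    indicator = np.zeros(num_nodes)
    indicator[v] = 1.0
    return Z @ indicator            # same as the column lookup Z[:, v]

def DEC(z_u: np.ndarray, z_v: np.ndarray) -> float:
    """Decoder: similarity score = dot product z_v^T z_u."""
    return float(z_u @ z_v)

score = DEC(ENC(0), ENC(1))         # similarity score between nodes 0 and 1
```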

4) How to define “Node Similarity”?
candidates : do the two nodes…
- link to each other?
- share neighbors?
- have similar roles in the graph ( = structural roles )?
\(\rightarrow\) will learn “node similarity” that uses random walks!
Notes
- unsupervised / self-supervised tasks
    - node labels (X), node features (X)
 
- these embeddings are “task independent”
3-2. Random Walk
1) Introduction
Notation
- \(z_u\) : embedding of \(u\)
- \(P(v \mid z_u)\) : (predicted) probability of visiting \(v\), starting from \(u\) ( on random walks )
- \(R\) : random walk strategy
Random Walk
- start from an arbitrary node
- move to adjacent (neighbor) nodes RANDOMLY
- Random Walk : the “sequence of nodes visited” in this way ( see the sketch after this list )
- \(z^T_u z_v\) \(\approx\) probability that \(u\) & \(v\) co-occur on a random walk
- nearby nodes : via \(N_R(u)\)
    - neighborhood of \(u\), obtained by strategy \(R\)
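
A minimal sketch of a uniform random walk and of collecting \(N_R(u)\), assuming the graph is stored as an adjacency-list dict; the toy graph, walk length, and number of walks are illustrative.

```python
import random

# toy graph as an adjacency list (illustrative)
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}

def random_walk(start: int, length: int) -> list[int]:
    """Uniform random walk: at each step, jump to a uniformly chosen neighbor."""
    walk = [start]
    for _ in range(length):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# N_R(u) : multiset of nodes visited on short fixed-length walks starting from u
u = 0
N_R_u = [v for _ in range(10) for v in random_walk(u, length=5)[1:]]
```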
 
2) Random Walk Embeddings
Random Walk Embeddings
- step 1) estimate \(P_R(v \mid u)\)
- step 2) optimize embeddings, to encode these random walk statistics
    - similarity\((z_u, z_v)\) ( e.g. dot product or cosine ) \(\propto P_R(v \mid u)\)
 
Advantages of Random Walk
- 1) Expressivity : captures both local & higher-order neighborhood info
- 2) Efficiency : no need to consider ALL node pairs ( only consider the “PAIRS” that co-occur on random walks )
Feature Learning as Optimization
- goal :
    - learn \(f(u) = z_u\)
 
- objective function :
    - \(\max _{f} \sum_{u \in V} \log \mathrm{P}\left(N_{\mathrm{R}}(u) \mid \mathbf{z}_{u}\right)\).
 

3) Procedure
- step 1) run short fixed-length random walks, starting from node \(u\), with random walk strategy \(R\)
- step 2) collect \(N_R(u)\) ( = multiset of nodes visited )
- step 3) optimize the embeddings
    - \(\max _{f} \sum_{u \in V} \log \mathrm{P}\left(N_{\mathrm{R}}(u) \mid \mathbf{z}_{u}\right)\) ( equivalently, minimize \(\mathcal{L}=\sum_{u \in V} \sum_{v \in N_{R}(u)}-\log \left(P\left(v \mid \mathbf{z}_{u}\right)\right)\) )
    - \(P\left(v \mid \mathbf{z}_{u}\right)\) : parameterize as a softmax \(\rightarrow\) but TOO EXPENSIVE ( \(O(\mid V \mid^2)\) )… let’s approximate it via NEGATIVE SAMPLING
4) Negative Sampling
\(\log \left(\frac{\exp \left(\mathbf{z}_{u}^{\mathrm{T}} \mathbf{z}_{v}\right)}{\sum_{n \in V} \exp \left(\mathbf{z}_{u}^{\mathrm{T}} \mathbf{z}_{n}\right)}\right) \approx \log \left(\sigma\left(\mathbf{z}_{u}^{\mathrm{T}} \mathbf{z}_{v}\right)\right)-\sum_{i=1}^{k} \log \left(\sigma\left(\mathbf{z}_{u}^{\mathrm{T}} \mathbf{z}_{n_{i}}\right)\right), n_{i} \sim P_{V}\).
- do not use all nodes, only some “negative samples” \(n_i\) ( sampled not uniformly, but in a biased way )
    - biased way = “proportional to its degree”
- appropriate \(k\) ?
    - higher \(k\) : more robust & higher bias on negative events
    - in practice, \(k=5 \sim 20\)
- optimize using SGD ( see the sketch below )
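
A minimal sketch of this objective with SGD, assuming PyTorch; the graph size, degrees, and learning rate are toy values, and negatives are sampled proportional to degree.

```python
import torch

num_nodes, d, k = 100, 16, 5                        # toy sizes; k = # negative samples
Z = torch.randn(num_nodes, d, requires_grad=True)   # node embeddings (to be learned)
degrees = torch.ones(num_nodes)                     # plug in the real node degrees here
neg_dist = degrees / degrees.sum()                  # negatives sampled proportional to degree
opt = torch.optim.SGD([Z], lr=0.01)

def neg_sampling_loss(u: int, v: int) -> torch.Tensor:
    """-( log sigma(z_u^T z_v) - sum_i log sigma(z_u^T z_{n_i}) ), n_i ~ P_V (degree-biased)."""
    neg = torch.multinomial(neg_dist, k, replacement=True)   # k negative samples
    pos = torch.sigmoid(Z[u] @ Z[v]).log()
    negs = torch.sigmoid(Z[neg] @ Z[u]).log().sum()
    return -(pos - negs)

# one SGD step on a single (u, v) pair that co-occurred on a random walk
opt.zero_grad()
neg_sampling_loss(u=0, v=1).backward()
opt.step()
```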
 
5) Strategies of walking
- simplest strategy : fixed-length, unbiased random walks ( ex. DeepWalk )
- how to generalize?? \(\rightarrow\) “node2vec”
node2vec
- key point : biased 2nd order random walk to generate \(N_R(u)\) ( sketched below )
- https://seunghan96.github.io/ne/ppt/4.node2vec/
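
A minimal sketch of node2vec’s 2nd-order biased walk, assuming the usual return parameter \(p\) and in-out parameter \(q\); the toy graph and parameter values are illustrative ( see the linked post for details ).

```python
import random

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}   # toy adjacency list
p, q = 1.0, 0.5                                         # return / in-out parameters

def node2vec_walk(start: int, length: int) -> list[int]:
    """2nd-order walk: the next step is biased by where we came from."""
    walk = [start, random.choice(graph[start])]
    while len(walk) < length:
        prev, cur = walk[-2], walk[-1]
        neighbors = graph[cur]
        weights = []
        for x in neighbors:
            if x == prev:                 # return to the previous node
                weights.append(1.0 / p)
            elif x in graph[prev]:        # stay at distance 1 from prev ("BFS-like")
                weights.append(1.0)
            else:                         # move further away ("DFS-like")
                weights.append(1.0 / q)
        walk.append(random.choices(neighbors, weights=weights, k=1)[0])
    return walk
```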
3-3. Embedding Entire Graphs

ex) classify toxic/non-toxic molecules
1) Idea 1 (simple)
- run a standard node embedding technique on the (sub)graph
- then, sum ( or average ) the node embeddings in the graph ( see the sketch below )
- \(\boldsymbol{z}_{\boldsymbol{G}}=\sum_{v \in G} \boldsymbol{z}_{v}\).
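
A minimal sketch of Idea 1, assuming node embeddings are already learned and stacked row-wise in a toy matrix; shapes and values are illustrative.

```python
import numpy as np

node_embeddings = np.random.randn(4, 8)   # 4 nodes in graph G, d = 8 (toy values)
z_G = node_embeddings.sum(axis=0)         # z_G = sum over v in G of z_v  (or .mean(axis=0))
```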
2) Idea 2
- use “virtual node”
- this node will represent the entire graph!

3) Idea 3 ( Anonymous Walk Embeddings )

- do not consider which specific nodes were visited!
    - = agnostic to node identity
    - = anonymous
 
- # of anonymous walks grows exponentially with the walk length
- ex) \(l=3\)
    - 5 possible anonymous walks : 111, 112, 121, 122, 123
    - \(\rightarrow\) 5-dim representation
- \(Z_G[i]\) = prob of anonymous walk \(w_i\) in \(G\)
 
- sampling anonymous walks ( see the sketch after this list )
    - generate \(m\) random walks, and use the empirical distribution of their anonymous walks
    - appropriate \(m\) ?
        - \(m=\left[\frac{2}{\varepsilon^{2}}\left(\log \left(2^{\eta}-2\right)-\log (\delta)\right)\right]\).
            - \(\eta\) : # of anonymous walks of length \(l\)
            - \(\varepsilon\) : desired error bound, \(\delta\) : allowed failure probability
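
A minimal sketch of anonymizing walks and building \(Z_G\) as the empirical distribution over anonymous walks; the walks and helper names are illustrative.

```python
from collections import Counter

def anonymize(walk: list[int]) -> tuple[int, ...]:
    """Relabel nodes by order of first appearance: [5, 9, 5] -> (1, 2, 1)."""
    first_seen = {}
    return tuple(first_seen.setdefault(v, len(first_seen) + 1) for v in walk)

def graph_embedding(walks: list[list[int]]) -> dict[tuple[int, ...], float]:
    """Z_G[i] = empirical probability of anonymous walk w_i among the sampled walks."""
    counts = Counter(anonymize(w) for w in walks)
    return {w: c / len(walks) for w, c in counts.items()}

# ex) two sampled walks of length l = 3
print(graph_embedding([[5, 9, 5], [2, 7, 4]]))   # {(1, 2, 1): 0.5, (1, 2, 3): 0.5}
```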
            
 
Learn Embeddings
- learn \(z_i\) ( = embedding of anonymous walk \(w_i\) )
- learn \(Z_G\) ( = graph embedding ), together with \(Z=\left\{z_{i}: i=1 \ldots \eta\right\}\)
- how to embed walks?
    - learn to predict walks that co-occur in a \(\Delta\)-size window
    - \(\max \sum_{t=\Delta}^{T-\Delta} \log P\left(w_{t} \mid w_{t-\Delta}, \ldots, w_{t+\Delta}, z_{G}\right)\) \(\rightarrow\) sum the objective over ALL nodes
Process
- step 1) run \(T\) different random walks of length \(l\), starting from \(u\)
    - \(N_{R}(u)=\left\{w_{1}^{u}, w_{2}^{u} \ldots w_{T}^{u}\right\}\).
- step 2) optimize the objective below ( see the sketch after this list )
    - \(\max _{\mathrm{Z}, \mathrm{d}} \frac{1}{T} \sum_{t=\Delta}^{T-\Delta} \log P\left(w_{t} \mid\left\{w_{t-\Delta}, \ldots, w_{t+\Delta}, \mathbf{z}_{\boldsymbol{G}}\right\}\right)\).
        - \(P\left(w_{t} \mid\left\{w_{t-\Delta}, \ldots, w_{t+\Delta}, \mathbf{z}_{G}\right\}\right)=\frac{\exp \left(y\left(w_{t}\right)\right)}{\sum_{i=1}^{\eta} \exp \left(y\left(w_{i}\right)\right)}\).
        - \(y\left(w_{t}\right)=b+U \cdot\left(\operatorname{cat}\left(\frac{1}{\Delta} \sum_{i=1}^{\Delta} \mathbf{z}_{i}, \mathbf{z}_{G}\right)\right)\).
            - \(\operatorname{cat}(\cdot,\cdot)\) : avg of the walk embeddings in the window, concatenated with the graph embedding
    - just treat \(z_G\) like another \(z_i\) when optimizing!
- step 3) obtain the graph embedding \(Z_G\)
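
A rough sketch of the optimization in step 2, assuming PyTorch and a toy vocabulary of \(\eta\) anonymous walks; the sizes, initialization, and single-linear-layer softmax are illustrative simplifications.

```python
import torch

eta, d, Delta, T = 5, 8, 1, 100                    # toy sizes: walk vocab, dim, window, # walks
Z   = torch.randn(eta, d, requires_grad=True)      # z_i : anonymous walk embeddings
z_G = torch.randn(d, requires_grad=True)           # z_G : graph embedding (learned jointly)
U   = torch.randn(eta, 2 * d, requires_grad=True)  # linear layer of the softmax over walks
b   = torch.zeros(eta, requires_grad=True)
opt = torch.optim.SGD([Z, z_G, U, b], lr=0.01)

walk_ids = torch.randint(0, eta, (T,))             # the T sampled walks, as anonymous-walk indices

loss = torch.tensor(0.0)
for t in range(Delta, T - Delta):
    window = torch.cat([walk_ids[t - Delta:t], walk_ids[t + 1:t + Delta + 1]])
    ctx = Z[window].mean(dim=0)                    # average of walk embeddings in the window
    y = b + U @ torch.cat([ctx, z_G])              # y(w_i) for every candidate walk
    loss = loss - torch.log_softmax(y, dim=0)[walk_ids[t]]   # -log P(w_t | window, z_G)

opt.zero_grad()
(loss / T).backward()                              # z_G is optimized just like the z_i
opt.step()
```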

4) How to use embeddings
by using embeddings of nodes ( \(z_i\) )… we can solve…
- clustering / community detection
    - cluster nodes based on their embeddings
 
- node classification
    - predict labels of nodes
 
- link prediction
    - predict whether an edge exists between 2 nodes
    - how to combine the 2 node embeddings? ( see the sketch after this list )
        - 1) concatenate
        - 2) Hadamard product
        - 3) sum / average
        - 4) distance
 
 
- graph classification
    - get graph embedding \(Z_G\), by aggregating node embeddings
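
A minimal sketch of the four ways to combine two node embeddings into an edge feature for link prediction, using toy vectors; the names are illustrative.

```python
import numpy as np

z_u, z_v = np.random.randn(8), np.random.randn(8)        # toy node embeddings

edge_features = {
    "concat":   np.concatenate([z_u, z_v]),               # 1) concatenate
    "hadamard": z_u * z_v,                                 # 2) element-wise product
    "average":  (z_u + z_v) / 2,                           # 3) sum / average
    "distance": np.array([np.linalg.norm(z_u - z_v)]),     # 4) (L2) distance
}
```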
 
