node2vec: Scalable Feature Learning for Networks
1. Introduction
- semi-supervised algorithm for scalable feature learning in networks
- learns continuous, low-dimensional feature representations for nodes in a network
( classical approaches : PCA, Multi-Dimensional Scaling )
- uses a 2nd-order random walk approach to generate network neighborhoods
( contribution : defining a flexible notion of a node's network neighborhood )
- uses feature representations not only of nodes, but also of "edges"
 
[ KEY CONTRIBUTION ]
- efficient, scalable algorithm for feature learning in networks ( using SGD )
- provides flexibility in discovering representations
- extends node2vec from 'nodes' -> 'edges'
- evaluates node2vec on (1) multi-label classification & (2) link prediction

- (1) multi-label classification : which class(es) does each node belong to
- (2) link prediction : predict whether there is a connection between two nodes
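The edge extension combines two node embeddings with a binary operator to get an edge embedding for link prediction. A minimal sketch of the four operators evaluated in the paper ( average, Hadamard, L1, L2 ); the function name and toy vectors here are illustrative, not from the paper:

```python
import numpy as np

def edge_features(f_u, f_v, op="hadamard"):
    """Combine two node embeddings f(u), f(v) into one edge embedding.

    Operators follow the binary operators evaluated in the paper:
    average, Hadamard (element-wise product), L1 and L2 distance.
    """
    if op == "average":
        return (f_u + f_v) / 2.0
    if op == "hadamard":
        return f_u * f_v
    if op == "l1":
        return np.abs(f_u - f_v)
    if op == "l2":
        return (f_u - f_v) ** 2
    raise ValueError(f"unknown operator: {op}")

# toy embeddings for two endpoint nodes
f_u = np.array([1.0, 2.0])
f_v = np.array([3.0, 4.0])
print(edge_features(f_u, f_v, "hadamard"))  # [3. 8.]
```

The Hadamard product keeps the edge embedding the same dimension as the node embeddings, so any off-the-shelf classifier can be trained on the resulting edge features.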
 
 
2. Feature Learning Framework
- applicable to any (un)directed & (un)weighted network G = (V, E)
- extends the Skip-Gram architecture to networks
** skip-gram : https://github.com/seunghan96/datascience/blob/master/%5B%EC%97%B0%EA%B5%AC%EC%8B%A4%EC%9D%B8%ED%84%B4%5DComputing_Science_and_Engineering/1%EC%A3%BC%EC%B0%A8/DeepWalk_Online_Learning_of_Social_Representations.md
- proposes a randomized procedure that samples many different neighborhoods of a source node!
- mapping function ( from nodes -> feature representations ) : f : V -> R^d
( d : number of dimensions of the feature representation )
( f : matrix of size |V| x d )
- network neighborhood of node u ( generated by neighborhood sampling strategy S ) : N_S(u) ⊂ V

[ Objective Function ]
- maximize the log-probability of observing a network neighborhood N_S(u) for a node u, conditioned on its feature representation f(u) :

max_f Σ_{u∈V} log Pr( N_S(u) | f(u) )
 
[ Two Assumptions ]
1) Conditional Independence
- factorize the likelihood by assuming independence among the likelihoods of observing each neighborhood node :

Pr( N_S(u) | f(u) ) = Π_{n_i ∈ N_S(u)} Pr( n_i | f(u) )

2) Symmetry in feature space
- source node & neighborhood node -> have a symmetric effect on each other in feature space
- conditional likelihood of a source-neighborhood node pair ( as a softmax ) :

Pr( n_i | f(u) ) = exp( f(n_i) · f(u) ) / Σ_{v∈V} exp( f(v) · f(u) )

with these two assumptions, the objective function simplifies to

max_f Σ_{u∈V} [ -log Z_u + Σ_{n_i ∈ N_S(u)} f(n_i) · f(u) ]

( in the above, the per-node partition function Z_u = Σ_{v∈V} exp( f(u) · f(v) ) is too expensive to compute for large networks! use negative sampling )
 
** negative sampling : https://github.com/seunghan96/datascience/blob/master/%5B%EC%97%B0%EA%B5%AC%EC%8B%A4%EC%9D%B8%ED%84%B4%5DComputing_Science_and_Engineering/4%EC%A3%BC%EC%B0%A8/Negative_Sampling.md 
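To see why the partition function is the bottleneck, a minimal sketch of the exact softmax over a toy embedding matrix ( values and sizes here are hypothetical, just for illustration ):

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d = 5, 4
# toy embedding matrix f : one row per node ( hypothetical random values )
F = rng.normal(size=(num_nodes, d))

def softmax_prob(u, n):
    """Pr( n | f(u) ) = exp( f(n)·f(u) ) / Z_u, with Z_u summed over ALL nodes."""
    scores = F @ F[u]           # f(v)·f(u) for every node v
    scores -= scores.max()      # shift for numerical stability
    expd = np.exp(scores)
    return expd[n] / expd.sum()  # expd.sum() plays the role of Z_u

# the denominator touches every node, so one evaluation costs O(|V|);
# negative sampling approximates it with a few sampled "negative" nodes instead
print(round(sum(softmax_prob(0, n) for n in range(num_nodes)), 6))  # 1.0
```

On a million-node graph the full sum is infeasible inside an SGD loop, which is exactly why node2vec ( like word2vec ) trains with negative sampling.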
3-1. Classic search strategies
- sample neighbors of a source node as a form of local search
 
[ two kinds of similarity ]
1) homophily
- highly interconnected nodes -> should be embedded closely
- macro view
- infers communities based on distances
( ex. u & s1 : community A, s8 & s9 : community B )
2) structural equivalence
- nodes with similar structural roles -> embedded together
- micro view ( example with the image above : u & s6 )
- does not emphasize connectivity!
( ex. u & s6 : both act as hubs of their communities )
[ two search algorithms ]
- two sampling strategies for generating neighborhood sets N_S(u) of k nodes ( BFS & DFS )
 
(https://i.stack.imgur.com/vm0sn.png) 
a. Breadth-first Sampling (BFS)
- neighborhood : only immediate neighbors of the source node
 - micro-view
 - for ‘structural equivalence’ 
 
b. Depth-first Sampling (DFS)
- neighborhood : nodes sequentially sampled at increasing distance from the source node
 - macro-view
 - for ‘homophily’ 
** BFS & DFS implementation : (https://github.com/seunghan96/datascience/tree/master/Data_Structure/2.Algorithm/Graph_Algorithm)
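The two sampling strategies can be sketched on a toy adjacency-list graph ( the graph and the cap of k sampled nodes are illustrative assumptions, not the paper's exact procedure ):

```python
from collections import deque

# toy undirected graph as adjacency lists ( hypothetical example )
graph = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}

def bfs_sample(graph, source, k):
    """BFS : collect up to k nodes nearest to the source ( micro-view )."""
    seen, order, queue = {source}, [], deque([source])
    while queue and len(order) < k:
        for nb in graph[queue.popleft()]:
            if nb not in seen:
                seen.add(nb)
                order.append(nb)
                queue.append(nb)
    return order[:k]

def dfs_sample(graph, source, k):
    """DFS : collect up to k nodes at increasing distance ( macro-view )."""
    seen, order, stack = {source}, [], list(graph[source])
    while stack and len(order) < k:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(graph[node])
    return order

print(bfs_sample(graph, 0, 3))  # [1, 2, 3]
print(dfs_sample(graph, 0, 3))  # [2, 4, 1]
```

BFS exhausts the immediate neighbors 1 & 2 before moving outward, while DFS pushes straight away from the source, which is exactly the micro-view vs. macro-view contrast above.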
 
3-2. node2vec
- interpolates between BFS & DFS
- by developing a flexible biased random walk procedure ( that can explore neighborhoods in both a BFS & DFS fashion )
 
(1) Random Walks
- simulate a random walk of fixed length l from a source node u
- let c_i denote the i-th node in the walk ( c_0 = u ) ; transition probability :

P( c_i = x | c_{i-1} = v ) = π_vx / Z   if (v, x) ∈ E, else 0

( π_vx : unnormalized transition probability between v & x, Z : normalizing constant )
(2) Search bias
- a simple way to bias the random walk : sample based on edge weights -> but this cannot account for different types of network structure
- ( REMEMBER! homophily & structural equivalence are not mutually exclusive ; real networks exhibit a mix of both. So, how to achieve both? )
 
[ Second-order random walk with 2 parameters, p & q ]
Example
- the walk just traversed edge (t, v) ( previous state : node t & current state : node v )
- to decide the next step, evaluate the transition probabilities π_vx on edges (v, x) leaving v
- set the unnormalized transition probability to

π_vx = α_pq(t, x) · w_vx

where

α_pq(t, x) = 1/p   if d_tx = 0
           = 1     if d_tx = 1
           = 1/q   if d_tx = 2

( d_tx : shortest-path distance between node t & x -> minimum 0, maximum 2 )

(https://www.mdpi.com/algorithms/algorithms-12-00012/article_deploy/html/images/algorithms-12-00012-g004.png)
- parameters p & q : control how fast the random walk explores!
- these two parameters allow the search to interpolate between BFS & DFS
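Since the bias depends only on the shortest-path distance d_tx ∈ {0, 1, 2}, it can be sketched without any explicit distance computation ( toy unweighted adjacency-list graph; names are illustrative ):

```python
# minimal sketch of the search bias alpha_pq(t, x) for an unweighted graph
graph = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}

def alpha(p, q, t, x):
    """Return alpha_pq(t, x) for candidate next node x, previous node t."""
    if x == t:              # d_tx = 0 : stepping straight back to t
        return 1.0 / p
    if x in graph[t]:       # d_tx = 1 : x is also a neighbor of t
        return 1.0
    return 1.0 / q          # d_tx = 2 : x leads away from t

# walk came from t = 0 and now sits at v = 1 ; score each neighbor x of v
print([alpha(2.0, 0.5, 0, x) for x in graph[1]])  # [0.5, 2.0]
```

With p = 2 the backtrack to node 0 is down-weighted, and with q = 0.5 the outward node 3 is up-weighted, so this setting pushes the walk to explore ( DFS-like ).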
 
A. Return parameter, p
- controls the likelihood of "immediately revisiting a node in the walk"
- high p -> less likely to re-sample an already-visited node ( in the following two steps )
- low p -> keeps the walk "local", close to the starting node
 
B. In-out parameter, q
- allows the search to differentiate between 'inward' & 'outward' nodes
- q > 1 : biased towards nodes close to node 't' ( BFS-like )
- q < 1 : biased towards nodes further from node 't' => encourages exploration ( DFS-like )
 
(3) node2vec algorithm
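Putting the pieces together, the walk-generation step of node2vec can be sketched as a second-order biased walk ( unweighted toy graph, so w_vx = 1 and π_vx reduces to the bias α_pq(t, x) ; the seed and graph are illustrative assumptions ):

```python
import random

def node2vec_walk(graph, start, walk_length, p, q, rng=None):
    """One second-order biased random walk ( unweighted sketch )."""
    rng = rng or random.Random(42)
    walk = [start]
    while len(walk) < walk_length:
        nbrs = graph[walk[-1]]
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))         # first step : uniform
            continue
        prev = walk[-2]                           # previous node t
        weights = [1.0 / p if x == prev           # d_tx = 0
                   else 1.0 if x in graph[prev]   # d_tx = 1
                   else 1.0 / q                   # d_tx = 2
                   for x in nbrs]
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

graph = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}
print(node2vec_walk(graph, start=0, walk_length=6, p=1.0, q=2.0))
```

The full algorithm runs r such walks per node, then feeds the walks to Skip-Gram with negative sampling exactly as sentences are fed to word2vec; the paper additionally precomputes the transition probabilities with alias sampling for O(1) steps.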