[ 12. Frequent Subgraph Mining with GNNs ]

( Reference : CS224W: Machine Learning with Graphs )


Contents

  • 12-1. Subgraphs & Motifs
  • 12-2. Neural Subgraph Matching
  • 12-3. Mining Frequent Subgraphs


1. Subgraphs & Motifs

(1) Introduction

def ) “building blocks” of networks

  • can “characterize” & “discriminate” networks




(2) Subgraphs

2 ways to formalize subgraphs

  • 1) “NODE”-induced subgraph
  • 2) “EDGE”-induced subgraph


a) “NODE”-induced subgraph

  • subset of the NODES & all the edges of \(G\) among those nodes
  • called “induced subgraph”
  • example )
    • Chemistry ( functional groups )


\(G^{\prime}=\left(V^{\prime}, E^{\prime}\right)\) is a node-induced subgraph iff

  • \(V^{\prime} \subseteq V\)
  • \(E^{\prime}=\left\{(u, v) \in E \mid u, v \in V^{\prime}\right\}\)
  • \(G^{\prime}\) is the subgraph of \(G\) induced by \(V^{\prime}\)


b) “EDGE”-induced subgraph

  • subset of the EDGES & all corresponding nodes
  • called “non-induced subgraph” / “subgraph”
  • example )
    • Knowledge graphs (focus on edges representing logical relations)


\(G^{\prime}=\left(V^{\prime}, E^{\prime}\right)\) is an edge-induced subgraph iff

  • \(E^{\prime} \subseteq E\)
  • \(V^{\prime}=\left\{v \in V \mid(v, u) \in E^{\prime} \text { for some } u\right\}\)
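
To make the two definitions concrete, here is a minimal networkx sketch (the toy graph and the chosen node / edge subsets are made up for illustration):

```python
import networkx as nx

G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4)])

# NODE-induced : keep nodes {1, 2, 3} and ALL edges of G among them
G_node = G.subgraph([1, 2, 3])
print(sorted(G_node.edges()))   # [(1, 2), (1, 3), (2, 3)]

# EDGE-induced : keep edges {(1, 2), (2, 3)} and their endpoint nodes
G_edge = G.edge_subgraph([(1, 2), (2, 3)])
print(sorted(G_edge.nodes()))   # [1, 2, 3]
print(sorted(G_edge.edges()))   # [(1, 2), (2, 3)]
```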


c) Graph Isomorphism

check whether 2 graphs are identical!


[ Definition ]

  • \(G_{1}=\left(V_{1}, E_{1}\right)\) and \(G_{2}=\left(V_{2}, E_{2}\right)\) are isomorphic

    if there exists a bijection \(f: V_{1} \rightarrow V_{2}\) such that \((u, v) \in E_{1}\) iff \((f(u), f(v)) \in E_{2}\)

    ( + \(f\) is called the isomorphism )



d) Subgraph Isomorphism

[ Definition ]

  • \(G_2\) is subgraph isomorphic to \(G_1\) ….

    if some “subgraph of \(G_2\)” is isomorphic to \(G_1\)

    ( = \(G_1\) is a subgraph of \(G_2\) )

\(\rightarrow\) NP-hard problem
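
Exact checks are still feasible on small graphs; a minimal networkx sketch of both checks (the example graphs are arbitrary):

```python
import networkx as nx
from networkx.algorithms import isomorphism

# graph isomorphism : identical up to a relabeling bijection f
G1 = nx.path_graph(4)                          # 0-1-2-3
G2 = nx.relabel_nodes(G1, {0: 'a', 1: 'b', 2: 'c', 3: 'd'})
print(nx.is_isomorphic(G1, G2))                # True

# subgraph isomorphism : is some (node-induced) subgraph of G_T isomorphic to G_Q?
# ( NP-hard in general; the VF2 matcher works on small instances )
G_T, G_Q = nx.cycle_graph(5), nx.path_graph(3)
GM = isomorphism.GraphMatcher(G_T, G_Q)
print(GM.subgraph_is_isomorphic())             # True
```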



e) Examples

non-isomorphic + connected + undirected + size 4



non-isomorphic + connected + directed + size 3



(3) Network Motifs

“(1) recurring, (2) significant, (3) patterns of interconnections”


Definition

  • (1) Recurring : found many times ( = high frequency )

    \(\rightarrow\) “how to define frequency?”

  • (2) Significant : more frequent than expected in a random graph

    \(\rightarrow\) “how to define random graph?”

  • (3) Pattern : small (node-induced) subgraph


Why do we need motifs?

  • Help us understand how graphs work
  • Help us make predictions based on the presence or absence of motifs in a graph dataset


a) Subgraph FREQUENCY (SF)

Notation

  • \(G_T\) : Target graph ( = big graph )
  • \(G_Q\) : Query graph ( = small graph )


[ Graph-level SF ]

  • frequency of \(G_Q\) in \(G_T\)

    = # of unique subsets of nodes \(V_T\) of \(G_T\), for which the subgraph of \(G_T\) induced by \(V_T\) is isomorphic to \(G_Q\)


[ Node-level SF ]

  • frequency of \(G_Q\) in \(G_T\)

    = # of nodes \(u\) in \(G_T\), for which some subgraph of \(G_T\) is isomorphic to \(G_Q\) & the isomorphism maps \(u\) to the anchor \(v\) of \(G_Q\)

  • \((G_Q,v)\) : “node-anchored” subgraph
  • robust to outliers
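
A brute-force version of both counts, as a sketch for small graphs only (note that networkx's GraphMatcher enumerates node-induced matches; the example graphs are arbitrary):

```python
import networkx as nx
from networkx.algorithms import isomorphism

def graph_level_sf(G_T, G_Q):
    # of unique node subsets of G_T whose induced subgraph is isomorphic to G_Q
    GM = isomorphism.GraphMatcher(G_T, G_Q)
    return len({frozenset(m) for m in GM.subgraph_isomorphisms_iter()})

def node_level_sf(G_T, G_Q, anchor):
    # of nodes u in G_T such that some subgraph of G_T is isomorphic to G_Q,
    # with the isomorphism mapping u to the anchor node of G_Q
    GM = isomorphism.GraphMatcher(G_T, G_Q)
    return len({u for m in GM.subgraph_isomorphisms_iter()
                  for u, v in m.items() if v == anchor})

print(graph_level_sf(nx.cycle_graph(5), nx.path_graph(3)))    # 5
print(node_level_sf(nx.cycle_graph(5), nx.path_graph(3), 1))  # 5 ( anchor = center )
```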


Want to find the “frequency of subgraphs” across a whole dataset of graphs!

\(\rightarrow\) treat the dataset as one large graph \(G_T\) with “disconnected components” ( = the individual graphs )



b) Motif SIGNIFICANCE

more frequent than in a null model ( = random graph )

\(\rightarrow\) so, how to define “RANDOM graphs”?


[ Erdos-Renyi (ER) random graphs ]

  • \(G_{n,p}\) : undirected graph / \(n\) nodes / edge probability of \(p\)
  • can be disconnected



[ Configuration Model ]

  • goal : generate random graph…

    • with \(N\) nodes
    • with degree sequence \(k_1, \cdots, k_N\)
  • compare

    • (1) \(G^{\text{real}}\)
    • (2) \(G^{\text{random}}\)

    ( both have same degree sequence! )

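Both null models are available in networkx; a minimal sketch (the parameters are arbitrary, and the ER graph is reused as a stand-in for \(G^{\text{real}}\)):

```python
import networkx as nx

# Erdos-Renyi G(n, p) : n nodes, each edge present independently with prob. p
G_er = nx.gnp_random_graph(n=100, p=0.05, seed=0)   # may be disconnected

# Configuration model : random graph with a prescribed degree sequence
deg_seq = [d for _, d in G_er.degree()]             # degree sequence of the "real" graph
G_cfg = nx.configuration_model(deg_seq, seed=0)     # MultiGraph: self-loops / multi-edges
G_cfg = nx.Graph(G_cfg)                             # collapse parallel edges
G_cfg.remove_edges_from(nx.selfloop_edges(G_cfg))   # drop self-loops
```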


Overview

  • intuition : Motifs are overrepresented in a network when compared to random graphs

  • 3 steps

    • step 1) count motifs in \(G^{\text{real}}\)

    • step 2) generate random graph

      • with similar statistics

        ( ex. nodes / edges / degree sequences)

    • step 3) use statistical measures to evaluate “significance of each motif”

      • ex) Z-score


Z-score :

  • \(Z_i\) : captures “significance” of motif \(i\)
  • \(Z_{i}=\left(N_{i}^{\mathrm{real}}-\bar{N}_{i}^{\mathrm{rand}}\right) / \operatorname{std}\left(N_{i}^{\mathrm{rand}}\right)\).
    • \(N_{i}^{\text {real }}\) : # ( motif \(i\) ) in graph \(G^{\text {real }}\)
    • \(\bar{N}_{i}^{\text {rand }}\) : average # ( motif \(i\) ) over the random graphs
  • Network significance profile (SP) : normalized Z-scores


Network significance profile (SP) :

  • \(S P_{i}=Z_{i} / \sqrt{\sum_{j} Z_{j}^{2}}\).
  • vector of “normalized Z-scores”
  • dimension = # of types of motifs
  • meaning : relative significance of subgraphs
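
A small numpy sketch of both formulas (the count arrays below are hypothetical placeholders):

```python
import numpy as np

def z_scores(n_real, n_rand):
    # n_real : (M,)   motif counts in G_real
    # n_rand : (R, M) motif counts in R random graphs
    n_real, n_rand = np.asarray(n_real, float), np.asarray(n_rand, float)
    return (n_real - n_rand.mean(axis=0)) / n_rand.std(axis=0)

def significance_profile(z):
    # SP_i = Z_i / sqrt( sum_j Z_j^2 ) : vector of normalized Z-scores
    z = np.asarray(z, float)
    return z / np.sqrt(np.sum(z ** 2))

z = z_scores([120, 45, 8], [[100, 50, 10], [90, 55, 9], [110, 48, 12]])
print(significance_profile(z))   # relative significance of each motif type
```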


2. Neural Subgraph Matching

(1) Subgraph Matching

Question : is QUERY graph a subgraph of TARGET graph?

  • QUERY graph : connected
  • TARGET graph : can be disconnected



use GNN to “predict” subgraph isomorphism

  • binary prediction

  • use embeddings to decide if the neighborhood of \(u\) is isomorphic to a subgraph of the neighborhood of \(v\)

  • not only predict the existence of subgraph isomorphism,

    but also identify the corresponding nodes \(u\) & \(v\)



a) Decomposing graphs ( into neighborhoods )

  • 1) for each node in \(G_T\) ….
    • obtain a k-hop neighborhood around the anchor
  • 2) for each node in \(G_Q\) ….
    • same
  • 3) embed all those neighborhoods with a GNN
    • by computing the embedding of the anchor node
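
Steps 1-2 amount to extracting a k-hop ego network around each node; a minimal networkx sketch:

```python
import networkx as nx

def decompose(G, k=2):
    # one k-hop node-anchored neighborhood per node ( the node is the anchor )
    return {v: nx.ego_graph(G, v, radius=k) for v in G.nodes}

neighborhoods = decompose(nx.karate_club_graph(), k=2)
print(len(neighborhoods))   # 34 : one anchored neighborhood per node
```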


b) Order Embedding Space

  • mapping : graph \(A\) \(\rightarrow\) point \(Z_A\)
  • \(Z_A\) is non-negative in all dimensions
  • key point : Transitivity



Why “order embedding space”?

  • 1) transitivity

    • if \(G_1\) is a subgraph of \(G_2\) and \(G_2\) is a subgraph of \(G_3\)

      \(\rightarrow\) then \(G_1\) is a subgraph of \(G_3\)

  • 2) anti-symmetry

    • if \(G_1\) is a subgraph of \(G_2\) and \(G_2\) is a subgraph of \(G_1\)

      \(\rightarrow\) then \(G_1\) is isomorphic to \(G_2\)

  • 3) closure under intersection

    • the trivial graph ( = 1 node ) is a subgraph of any graph



c) Order constraint

  • What loss function should we use, so that the learned order embedding reflects the subgraph relationship?
  • design loss functions based on the order constraint

  • order constraint : \(z_{q}[i] \leq z_{t}[i] \; \forall i=1, \ldots, D\) iff \(G_q\) is a subgraph of \(G_t\)


d) Max-margin Loss

  • optimize by minimizing “Max-margin loss”

  • margin between \(G_q\) & \(G_t\)

    = \(E\left(G_{q}, G_{t}\right)=\sum_{i=1}^{D}\left(\max \left(0, z_{q}[i]-z_{t}[i]\right)\right)^{2}\)



\(E\left(G_{q}, G_{t}\right)=0\), when \(G_{q}\) is a subgraph of \(G_{t}\)

\(E\left(G_{q}, G_{t}\right)>0\), when \(G_{q}\) is not a subgraph of \(G_{t}\)


Trained in a way that…

  • 1) positive examples :
    • minimize \(E(G_q, G_t)\), when \(G_q\) is a subgraph of \(G_t\)
  • 2) negative examples :
    • minimize \(\text{max}(0,\alpha-E(G_q,G_t))\)
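
A hedged PyTorch sketch of the violation \(E\) and the resulting max-margin training loss (the tensor shapes and \(\alpha\) value are assumptions):

```python
import torch

def violation(z_q, z_t):
    # E(G_q, G_t) = sum_i max(0, z_q[i] - z_t[i])^2
    # E = 0 iff z_q <= z_t in every dimension ( order constraint holds )
    return torch.clamp(z_q - z_t, min=0).pow(2).sum(dim=-1)

def max_margin_loss(z_q, z_t, label, alpha=1.0):
    # label = 1 : positive pair ( G_q IS a subgraph of G_t )  -> minimize E
    # label = 0 : negative pair                               -> push E above alpha
    e = violation(z_q, z_t)
    return (label * e + (1 - label) * torch.clamp(alpha - e, min=0)).mean()

z_q, z_t = torch.rand(8, 64), torch.rand(8, 64)   # batch of 8 embedding pairs
label = torch.randint(0, 2, (8,)).float()
print(max_margin_loss(z_q, z_t, label))
```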


3. Mining Frequent Subgraphs

(1) Representation Learning

Goal : find most frequent size-\(k\) motifs!

Process

  • 1) enumerate all size-\(k\) connected subgraphs
  • 2) count the # of each subgraph type


But this is computationally hard….

\(\rightarrow\) use “REPRESENTATION learning”

how?

  • challenge 1) Combinatorial explosion

    solution 1) organize the search space

  • challenge 2) Subgraph isomorphism

    solution 2) prediction using GNN


Settings

  • \(G_T\) : target graph
  • \(k\) : size parameter
  • \(r\) : desired number of results ( = TOP \(r\) graphs )


Task

  • among all possible graphs of \(k\) nodes,

    identify top \(r\) graphs with highest frequency in \(G_T\)

  • use “node-level” definition



(2) SPMiner

a) overview

  • SPMiner = neural model to identify frequent motifs

  • procedure

    • step 1) decompose
      • decompose into subgraphs
      • overlapping node-anchored neighborhoods ( ex. 2-hop )
    • step 2) encoder
      • embed subgraphs
      • note : embed into “order embedding” space
        • can quickly find out the frequency of a given subgraph \(G_Q\)
    • step 3) search procedure
      • find frequent subgraphs, by “growing patterns”



b) Motif Frequency Estimation with SPMiner

\(G_{N_i}\) = set of node-anchored neighborhoods ( subgraphs ) of \(G_T\)


Task :

  • estimate the frequency of \(G_Q\) ( = query graph = motif )
  • by counting the number of neighborhoods \(G_{N_i}\)
  • such that their embeddings \(Z_{N_i}\) satisfy \(Z_Q \leq Z_{N_i}\)
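
Given the embeddings, the count reduces to an elementwise comparison; a minimal numpy sketch (the embedding arrays are placeholders for GNN outputs):

```python
import numpy as np

def estimate_frequency(z_q, z_neighborhoods):
    # count neighborhoods G_{N_i} whose embedding satisfies Z_Q <= Z_{N_i}
    # ( i.e., G_Q is predicted to be a subgraph of that neighborhood )
    return int(np.sum(np.all(z_q <= z_neighborhoods, axis=1)))

z_q = np.random.rand(64) * 0.3   # query / motif embedding
z_n = np.random.rand(1000, 64)   # embeddings of 1000 neighborhoods of G_T
print(estimate_frequency(z_q, z_n))
```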


[ Search Procedure ]

  • step 1) randomly select starting node \(u\) in \(G_T\)
    • set \(S=\{u\}\)
  • step 2) Grow a motif iteratively…
    • choose a neighbor of a node in \(S\) & add it to \(S\)
    • grow motifs to find larger motifs
  • step 3) Terminate
    • when the desired motif size is reached
    • result : “subgraph of target graph induced by \(S\)”




[ Choosing the next node ]

  • Total Violation of a subgraph \(G\)

    = # of neighborhoods that do not contain \(G\)

    = # of neighborhoods that do not satisfy \(Z_Q \leq Z_{N_i}\)

  • minimizing Total Violation

    = maximizing frequency

  • use “GREEDY strategy”

    • at every step, add node that gives “smallest total violation”
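
Putting the search procedure together, a hedged sketch (here `embed` stands in for the trained GNN encoder into order embedding space; it is a hypothetical function, not part of any published SPMiner code):

```python
import random
import numpy as np
import networkx as nx

def total_violation(z_s, z_neighborhoods):
    # of neighborhoods that do NOT satisfy z_s <= z_n ( do not contain the motif )
    return int(np.sum(~np.all(z_s <= z_neighborhoods, axis=1)))

def grow_motif(G_T, embed, z_neighborhoods, k):
    u = random.choice(list(G_T.nodes))           # step 1) random starting node
    S = {u}
    while len(S) < k:                            # step 2) grow iteratively
        candidates = {n for s in S for n in G_T.neighbors(s)} - S
        if not candidates:                       # dead end ( isolated component )
            break
        # greedy : add the node giving the smallest total violation
        S.add(min(candidates,
                  key=lambda c: total_violation(embed(G_T.subgraph(S | {c})),
                                                z_neighborhoods)))
    return G_T.subgraph(S)                       # step 3) induced subgraph of S
```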
