[ 10. Knowledge Graph Embeddings ]
( Reference : CS224W: Machine Learning with Graphs )
Contents
- 10-1. RGCN (Relational GCN)
 - 10-2. Knowledge Graphs
 - 10-3. Knowledge Graph Completion
 
Heterogeneous Graphs
extends graphs from a single edge type to multiple edge types
Heterogeneous graphs
- \(\boldsymbol{G}=(\boldsymbol{V}, \boldsymbol{E}, \boldsymbol{R}, \boldsymbol{T})\).
    
- 1) NODES with node types : \(v_{i} \in V\)
 - 2) EDGES with relation types : \(\left(v_{i}, r, v_{j}\right) \in E\)
 - 3) NODE TYPE : \(T\left(v_{i}\right)\)
 - 4) RELATION TYPE : \(r \in R\)
 
 
Models for heterogeneous graphs :
- 1) Relational GCNs
 - 2) Knowledge Graphs
 - 3) Embeddings for KG Completion
 
10-1. RGCN (Relational GCN)
Heterogeneous graphs
- multiple edge types
 - multiple relation types
 

Solution :
- use different NN weights for different relation types
 

Relational GCN :
\(\mathbf{h}_{v}^{(l+1)}=\sigma\left(\sum_{r \in R} \sum_{u \in N_{v}^{r}} \frac{1}{c_{v, r}} \mathbf{W}_{r}^{(l)} \mathbf{h}_{u}^{(l)}+\mathbf{W}_{0}^{(l)} \mathbf{h}_{v}^{(l)}\right)\).
- where \(c_{v, r}=\mid N_{v}^{r} \mid\)
 
decompose as “Message + Aggregation” ( a minimal sketch follows below )
- 1) Message :
    - 1-1) neighbor : \(\mathbf{m}_{u, r}^{(l)}=\frac{1}{c_{v, r}} \mathbf{W}_{r}^{(l)} \mathbf{h}_{u}^{(l)}\).
    - 1-2) self loop : \(\mathbf{m}_{v}^{(l)}=\mathbf{W}_{0}^{(l)} \mathbf{h}_{v}^{(l)}\).
- 2) Aggregation :
    - \(\mathbf{h}_{v}^{(l+1)}=\sigma\left(\operatorname{Sum}\left(\left\{\mathbf{m}_{u, r}^{(l)}, u \in N(v)\right\} \cup\left\{\mathbf{m}_{v}^{(l)}\right\}\right)\right)\).
 
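As a concrete illustration of the message + aggregation steps above, here is a minimal NumPy sketch of one RGCN layer (the names `rgcn_layer`, `W_rel`, `W_self` are illustrative, not from the lecture):

```python
import numpy as np

def rgcn_layer(h, edges, W_rel, W_self, sigma=np.tanh):
    """One RGCN layer (sketch).
    h      : (num_nodes, d_in) node embeddings h^{(l)}
    edges  : list of (u, r, v) triples, meaning u --r--> v
    W_rel  : dict r -> (d_out, d_in) matrix W_r^{(l)}
    W_self : (d_out, d_in) self-loop matrix W_0^{(l)}"""
    out = h @ W_self.T                       # self-loop message  W_0 h_v
    deg = {}                                 # c_{v,r} = |N_v^r|
    for u, r, v in edges:
        deg[(v, r)] = deg.get((v, r), 0) + 1
    for u, r, v in edges:                    # neighbor message (1/c_{v,r}) W_r h_u
        out[v] += (W_rel[r] @ h[u]) / deg[(v, r)]
    return sigma(out)                        # aggregate (sum) + nonlinearity

# toy usage : 4 nodes, 2 relation types, d_in = d_out = 3
h0 = np.random.randn(4, 3)
W_rel = {0: np.random.randn(3, 3), 1: np.random.randn(3, 3)}
W0 = np.random.randn(3, 3)
h1 = rgcn_layer(h0, [(0, 0, 1), (2, 1, 1), (3, 0, 1)], W_rel, W0)
```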
Scalability of RGCN
- number of layers : \(L\)
- each relation has \(L\) matrices : \(\mathbf{W}_{r}^{(1)}, \mathbf{W}_{r}^{(2)} \cdots \mathbf{W}_{r}^{(L)}\)
    - dimension of \(\mathbf{W}_{r}^{(l)}\) is \(d^{(l+1)} \times d^{(l)}\)
    - the number of parameters grows rapidly with the number of relations \(\rightarrow\) overfitting issue! how to solve?
- solution : regularize \(\mathbf{W}_{r}^{(l)}\)
    - 1) block diagonal matrices
    - 2) basis/dictionary learning
 
 
(1) Block Diagonal Matrices
Use block diagonal matrices for \(\mathbf{W}_{r}\) ( a sketch follows below )

if we use \(B\) low-dimensional blocks… the number of params (per relation, per layer) becomes
- before ) \(d^{(l+1)} \times d^{(l)}\)
- after ) \(B \times \frac{d^{(l+1)}}{B} \times \frac{d^{(l)}}{B}\)
 
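A minimal sketch of the block-diagonal idea (the sizes \(d=64\), \(B=8\) and the function name are only illustrative): each \(\mathbf{W}_r\) is assembled from \(B\) small blocks, so the free parameters drop from \(64 \times 64 = 4096\) to \(8 \times 8 \times 8 = 512\).

```python
import numpy as np

def block_diagonal(blocks):
    """Assemble a block-diagonal W_r from B small blocks (sketch)."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    W = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        W[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return W

B = 8
blocks = [np.random.randn(64 // B, 64 // B) for _ in range(B)]  # B blocks of size (d/B) x (d/B)
W_r = block_diagonal(blocks)   # 64 x 64 matrix, but only B * (64/B)^2 = 512 free parameters
```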
(2) Basis Learning
Share weights across different relations!
Matrix = “linear combination” of “basis” transformations ( a sketch follows below )
- \(\mathbf{W}_{r}=\sum_{b=1}^{B} a_{r b} \cdot \mathbf{V}_{b}\).
    - \(\mathbf{V}_{b}\) : basis matrices ( shared across all relations )
    - \(a_{r b}\) : importance weight of matrix \(\mathbf{V}_{b}\)
    - \(B\) : number of bases to use
- thus, each relation only needs to learn \(\left\{a_{r b}\right\}_{b=1}^{B}\)
 
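A minimal sketch of basis/dictionary learning (the sizes and the names `V`, `a`, `relation_matrix` are illustrative): the \(B\) basis matrices \(\mathbf{V}_b\) are shared, and each relation only owns its coefficients \(a_{rb}\).

```python
import numpy as np

num_rel, B, d = 5, 4, 16
V = np.random.randn(B, d, d)     # basis matrices V_b, shared across all relations
a = np.random.randn(num_rel, B)  # per-relation importance weights a_{rb}

def relation_matrix(r):
    # W_r = sum_b a_{rb} * V_b : only the a_{rb} are relation-specific
    return np.tensordot(a[r], V, axes=1)

W_r3 = relation_matrix(3)        # (d, d) weight matrix for relation 3
```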
(3) Example : Node Classification
- goal : predict the class of node \(A\), from \(k\) classes
- final layer (prediction head) :
  \(\mathbf{h}_{A}^{(L)} \in \mathbb{R}^{k}\) = probability of each class ( a small sketch follows below )
 

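A small sketch of the prediction head, assuming a softmax is applied on top of the \(k\)-dimensional final-layer embedding (names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())       # numerically stable softmax
    return e / e.sum()

k = 5
h_A_final = np.random.randn(k)    # h_A^{(L)} from the last RGCN layer
class_probs = softmax(h_A_final)  # probability of each of the k classes
pred_class = int(np.argmax(class_probs))
```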
(4) Example : Link Prediction
Stratified split, with 4 categories of edges :
- 1-1) [TRAIN] message edges
 - 1-2) [TRAIN] supervision edges
 - 2) [VALIDATION] edges
 - 3) [TEST] edges
 

Example )
- training message edges : all other edges
- training supervision edge : \((E, r_3, A)\)
  \(\rightarrow\) score this edge
    - final-layer embedding of \(E\) : \(\mathbf{h}_{E}^{(L)}\)
    - final-layer embedding of \(A\) : \(\mathbf{h}_{A}^{(L)}\)
    - relation-specific score function : \(f_{r}: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}\)
    - final score
      = \(f_{r_{3}}\left(\mathbf{h}_{E}, \mathbf{h}_{A}\right)=\mathbf{h}_{E}^{T} \mathbf{W}_{r_{3}} \mathbf{h}_{A}, \ \mathbf{W}_{r_{3}} \in \mathbb{R}^{d \times d}\)
 
 
Training Procedure :
step 1) use RGCN to score “training supervision edge”
- ex) \(\left(E, r_{3}, A\right)\)
 
step 2) create negative edge by perturbing supervision edge
- ex) \(\left(E, r_{3}, B\right), \left(E, r_{3}, D\right)\)
 - note that negative edges should also NOT BELONG to TRAINING MESSAGE EDGES!
 
step 3) use GNN model to score negative edges
step 4) optimize CE loss
- 
\[\ell=-\log \sigma\left(f_{r_{3}}\left(h_{E}, h_{A}\right)\right)-\log \left(1-\sigma\left(f_{r_{3}}\left(h_{E}, h_{B}\right)\right)\right)\]
    
- (1) maximize the score of the “training supervision edge”
 - (2) minimize the score of the “negative edge”
 - ( a sketch of steps 1-4 follows below )
 
 
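A minimal sketch of steps 1-4, assuming the final-layer embeddings \(\mathbf{h}_E, \mathbf{h}_A, \mathbf{h}_B\) and the relation matrix \(\mathbf{W}_{r_3}\) have already been produced by the RGCN (all names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(h_head, W_r, h_tail):
    return h_head @ W_r @ h_tail            # f_r(h, t) = h^T W_r t

d = 16
h_E, h_A, h_B = (np.random.randn(d) for _ in range(3))
W_r3 = np.random.randn(d, d)

pos = score(h_E, W_r3, h_A)                 # training supervision edge (E, r3, A)
neg = score(h_E, W_r3, h_B)                 # negative edge (E, r3, B)
loss = -np.log(sigmoid(pos)) - np.log(1.0 - sigmoid(neg))
```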
Evaluation Procedure
Example :
- validation edge : \((E,r_3,D)\)
- the score of the validation edge should be higher than the score of
  \((E,r_3,v)\) , where \((E,r_3,v)\) is NOT in the training message/supervision edges
 
Procedure :
step 1) calculate the score of \((E,r_3,D)\)
step 2) calculate score of all negative edges \(\left\{\left(E, r_{3}, v\right) \mid v \in\{B, F\}\right\}\)
step 3) obtain ranking (RK) of \((E,r_3,D)\)
step 4) calculate metrics ( a sketch follows below )
- 1) Hits@k = \(\mathbf{1}[RK \leq k]\)
 - 2) Reciprocal Rank = \(1/RK\)
 
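A minimal sketch of steps 3-4 : rank the validation edge against its negatives, then compute Hits@k and the reciprocal rank (the scores below are made-up numbers):

```python
def rank_and_metrics(pos_score, neg_scores, k=2):
    RK = 1 + sum(s >= pos_score for s in neg_scores)  # rank of the true edge
    hits_at_k = 1.0 if RK <= k else 0.0               # Hits@k = 1[RK <= k]
    reciprocal_rank = 1.0 / RK
    return RK, hits_at_k, reciprocal_rank

pos = 2.7                                 # score of (E, r3, D)
negs = [1.3, 3.1]                         # scores of (E, r3, B) and (E, r3, F)
print(rank_and_metrics(pos, negs, k=2))   # -> (2, 1.0, 0.5)
```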
10-2. Knowledge Graphs
Knowledge in graph form
- nodes : ENTITIES
 - label of nodes : TYPES
 - edges : RELATIONSHIPS
 
A KG (Knowledge Graph) is an example of a heterogeneous graph
Example )

10-3. Knowledge Graph Completion
Models
- 1) TransE
 - 2) TransR
 - 3) DistMult
 - 4) ComplEx
 
(1) KG Completion Task
task : for a given (HEAD, RELATION), predict “missing tails”
Example )

Notation
- edges are represented as triples \((h,r,t)\)
    
- \(h\) : head
 - \(r\) : relation
 - \(t\) : tail
 
 - given a true triple \((h,r,t)\), the embedding of \((h,r)\) should be close to the embedding of \(t\)
    
- Q1) how to embed?
 - Q2) how to define closeness?
 
 
(2) Connectivity Patterns in KG
- Symmetric (Antisymmetric) :
    - notation : \(r(h, t) \Rightarrow r(t, h) \quad (r(h, t) \Rightarrow \neg r(t, h)) \quad \forall h, t\)
    - example : roommate ( symmetric )
- Inverse :
    - notation : \(r_{2}(h, t) \Rightarrow r_{1}(t, h)\).
    - example : (advisor, advisee)
- Composition (Transitive) :
    - notation : \(r_{1}(x, y) \wedge r_{2}(y, z) \Rightarrow r_{3}(x, z) \quad \forall x, y, z\).
    - example : mother’s husband = father
- 1-to-N :
    - notation : \(r\left(h, t_{1}\right), r\left(h, t_{2}\right), \ldots, r\left(h, t_{n}\right)\) are all True
    - example : \(r\) = StudentsOf
 
(3-1) TransE
- idea : \(\mathbf{h}+\mathbf{r} \approx \mathbf{t}\)
- scoring function : \(f_{r}(h, t)=-\|\mathbf{h}+\mathbf{r}-\mathbf{t}\|\)
- loss : max margin loss ( a minimal sketch follows below )
 

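A minimal TransE sketch with one positive triple and one corrupted triple (names are illustrative):

```python
import numpy as np

def transe_score(h, r, t):
    return -np.linalg.norm(h + r - t)       # f_r(h, t) = -||h + r - t||

def margin_loss(pos_score, neg_score, margin=1.0):
    # push the positive score above the negative score by at least `margin`
    return max(0.0, margin - pos_score + neg_score)

d = 8
h, r, t, t_corrupt = (np.random.randn(d) for _ in range(4))
loss = margin_loss(transe_score(h, r, t), transe_score(h, r, t_corrupt))
```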
Relation Types
- Antisymmetric (O)
    - \(\mathbf{h}+\mathbf{r}=\mathbf{t}\), but \(\mathbf{t}+\mathbf{r} \neq \mathbf{h}\)
- Inverse (O)
    - if \(\mathbf{h} + \mathbf{r}_2 = \mathbf{t}\), we can set \(\mathbf{r}_1 = -\mathbf{r}_2\)
- Composition (O)
    - \(\mathbf{r}_3 = \mathbf{r}_1 + \mathbf{r}_2\)
- Symmetric (X)
    - only if \(\mathbf{r}=0\) and \(\mathbf{h}=\mathbf{t}\)
- 1-to-N (X)
    - \(\mathbf{t}_1\) and \(\mathbf{t}_2\) would map to the same vector, although they are different entities
    - contradictory!
        - \(\mathbf{t}_1 = \mathbf{h}+\mathbf{r} = \mathbf{t}_2\)
        - \(\mathbf{t}_1 \neq \mathbf{t}_2\)

(3-2) TransR
TransE :
- translation of any relation in the SAME embedding space

TransR :
- translation of each relation in a DIFFERENT (relation-specific) embedding space
- notation
    - model “entities” as vectors in the entity space \(\mathbb{R}^{d}\)
    - model “each relation” as a vector in the relation space \(\mathbf{r} \in \mathbb{R}^{k}\)
        - with \(\mathbf{M}_{r} \in \mathbb{R}^{k \times d}\) as projection matrix
- \(\mathbf{h}_{\perp}=\mathbf{M}_{r} \mathbf{h}, \quad \mathbf{t}_{\perp}=\mathbf{M}_{r} \mathbf{t}\)
- score function : \(f_{r}(h, t)=-\left\|\mathbf{h}_{\perp}+\mathbf{r}-\mathbf{t}_{\perp}\right\|\) ( a minimal sketch follows below )
 
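A minimal TransR sketch (here \(d\) = entity-space dim, \(k\) = relation-space dim; names are illustrative): entities are projected into the relation space with \(\mathbf{M}_r\) before translating.

```python
import numpy as np

def transr_score(h, r, t, M_r):
    h_proj = M_r @ h                          # h_perp = M_r h
    t_proj = M_r @ t                          # t_perp = M_r t
    return -np.linalg.norm(h_proj + r - t_proj)

d, k = 8, 4
h, t = np.random.randn(d), np.random.randn(d)   # entities live in R^d
r = np.random.randn(k)                           # relation lives in R^k
M_r = np.random.randn(k, d)                      # relation-specific projection matrix
print(transr_score(h, r, t, M_r))
```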
Relation Types
- Symmetric (O)
    - \(\mathbf{r}=0, \ \mathbf{h}_{\perp}=\mathbf{M}_{r} \mathbf{h}=\mathbf{M}_{r} \mathbf{t}=\mathbf{t}_{\perp}\)
- Antisymmetric (O)
    - \(\mathbf{r} \neq 0, \ \mathbf{M}_{r} \mathbf{h}+\mathbf{r}=\mathbf{M}_{r} \mathbf{t}\),
      then \(\mathbf{M}_{r} \mathbf{t}+\mathbf{r} \neq \mathbf{M}_{r} \mathbf{h}\)
- 1-to-N (O)
    - we can learn \(\mathbf{M}_{r}\) so that \(\mathbf{t}_{\perp}=\mathbf{M}_{r} \mathbf{t}_{1}=\mathbf{M}_{r} \mathbf{t}_{2}\)
- Inverse (O)
    - \(\mathbf{r}_{2} =-\mathbf{r}_{1}, \ \mathbf{M}_{r_{1}}=\mathbf{M}_{r_{2}}\),
      then \(\mathbf{M}_{r_{1}} \mathbf{t}+\mathbf{r}_{1} =\mathbf{M}_{r_{1}} \mathbf{h}\) and \(\mathbf{M}_{r_{2}} \mathbf{h}+\mathbf{r}_{2}=\mathbf{M}_{r_{2}} \mathbf{t}\)
- Composition (O)
 
(3-3) DistMult
- adopt BILINEAR modeling!
- score function : \(f_{r}(h, t)=\langle\mathbf{h}, \mathbf{r}, \mathbf{t}\rangle=\sum_{i} \mathbf{h}_{i} \cdot \mathbf{r}_{i} \cdot \mathbf{t}_{i}\). ( a minimal sketch follows below )
    - can be seen as a cosine similarity between \(\mathbf{h} \cdot \mathbf{r}\) (element-wise product) and \(\mathbf{t}\)
 
 
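A minimal DistMult sketch (names are illustrative). Note how swapping \(h\) and \(t\) leaves the score unchanged, which is exactly why DistMult can handle symmetric but not antisymmetric relations:

```python
import numpy as np

def distmult_score(h, r, t):
    return np.sum(h * r * t)            # f_r(h, t) = sum_i h_i * r_i * t_i

d = 8
h, r, t = (np.random.randn(d) for _ in range(3))
print(distmult_score(h, r, t))
print(distmult_score(t, r, h))          # identical value: the score is symmetric in h, t
```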

Relation Types
- 1-to-N (O)
    - \(\left\langle\mathbf{h}, \mathbf{r}, \mathbf{t}_{1}\right\rangle=\left\langle\mathbf{h}, \mathbf{r}, \mathbf{t}_{2}\right\rangle\).
- Symmetric (O)
    - \(f_{r}(h, t)=\langle\mathbf{h}, \mathbf{r}, \mathbf{t}\rangle=\sum_{i} \mathbf{h}_{i} \cdot \mathbf{r}_{i} \cdot \mathbf{t}_{i}=\langle\mathbf{t}, \mathbf{r}, \mathbf{h}\rangle=f_{r}(t, h)\).
- Antisymmetric (X)
    - \(f_{r}(h, t)=\langle\mathbf{h}, \mathbf{r}, \mathbf{t}\rangle=\langle\mathbf{t}, \mathbf{r}, \mathbf{h}\rangle=f_{r}(t, h)\).
      ( \(r(h,t)\) & \(r(t,h)\) will always have the same score )
- Inverse (X)
    - \(f_{r_{2}}(h, t)=\left\langle\mathbf{h}, \mathbf{r}_{2}, \mathbf{t}\right\rangle=\left\langle\mathbf{t}, \mathbf{r}_{1}, \mathbf{h}\right\rangle=f_{r_{1}}(t, h)\).
      ( this would force \(\mathbf{r}_{2}=\mathbf{r}_{1}\), i.e. two different relations would share the same embedding )
- Composition (X)
    - the union of the hyperplanes induced by multi-hop relations cannot be expressed with a single hyperplane!
 
(3-4) ComplEx
- based on DistMult
- embeds entities & relations in a COMPLEX vector space

- score function : \(f_{r}(h, t)=\operatorname{Re}\left(\sum_{i} \mathbf{h}_{i} \cdot \mathbf{r}_{i} \cdot \overline{\mathbf{t}}_{i}\right)\). ( a minimal sketch follows below )
 
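A minimal ComplEx sketch (names are illustrative): the embeddings are complex vectors and the score takes the real part after conjugating the tail, so swapping \(h\) and \(t\) generally changes the score.

```python
import numpy as np

def complex_score(h, r, t):
    # f_r(h, t) = Re( sum_i h_i * r_i * conj(t_i) )
    return np.real(np.sum(h * r * np.conj(t)))

d = 8
h, r, t = (np.random.randn(d) + 1j * np.random.randn(d) for _ in range(3))
print(complex_score(h, r, t))
print(complex_score(t, r, h))   # generally different: antisymmetry is possible
```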

Relation Types
- Antisymmetric (O)
    - can have high \(f_{r}(h, t)=\operatorname{Re}\left(\sum_{i} \mathbf{h}_{i} \cdot \mathbf{r}_{i} \cdot \overline{\mathbf{t}}_{i}\right)\)
    - and low \(f_{r}(t, h)=\operatorname{Re}\left(\sum_{i} \mathbf{t}_{i} \cdot \mathbf{r}_{i} \cdot \overline{\mathbf{h}}_{i}\right)\) at the same time
 
 - Symmetric (O)
    
- when \(Im(r)=0\)….
 - \(\begin{aligned} f_{r}(h, t)&=\operatorname{Re}\left(\sum_{i} \mathbf{h}_{i} \cdot \mathbf{r}_{i} \cdot \overline{\mathbf{t}}_{i}\right)=\sum_{i} \operatorname{Re}\left(\mathbf{r}_{i} \cdot \mathbf{h}_{i} \cdot \overline{\mathbf{t}}_{i}\right) \\ &=\sum_{i} \mathbf{r}_{i} \cdot \operatorname{Re}\left(\mathbf{h}_{i} \cdot \overline{\mathbf{t}}_{i}\right)=\sum_{i} \mathbf{r}_{i} \cdot \operatorname{Re}\left(\overline{\mathbf{h}}_{i} \cdot \mathbf{t}_{i}\right)\\&=\sum_{i} \operatorname{Re}\left(\mathbf{r}_{i} \cdot \overline{\mathbf{h}}_{i} \cdot \mathbf{t}_{i}\right)=f_{r}(t, h) \end{aligned}\).
 
 - Inverse (O)
    
- \(\mathbf{r}_{1}=\overline{\mathbf{r}}_{2}\).
 - Complex conjugate of \(\mathbf{r}_{2}=\underset{\mathbf{r}}{\operatorname{argmax}} \operatorname{Re}(<\mathbf{h}, \mathbf{r}, \overline{\mathbf{t}}>)\) is exactly \(\mathbf{r}_{1}=\underset{\mathbf{r}}{\operatorname{argmax}} \operatorname{Re}(<\mathbf{t}, \mathbf{r}, \overline{\mathbf{h}}>)\).
 
 - Composition (X)
 - 1-to-N (O)
 
(3-5) Summary
