[ 10. Knowledge Graph Embeddings ]
( Reference : CS224W: Machine Learning with Graphs )
Contents
- 10-1. RGCN (Relational GCN)
- 10-2. Knowledge Graphs
- 10-3. Knowledge Graph Completion
Heterogeneous Graphs
extends graphs with a single edge type to graphs with multiple edge types
Heterogeneous graphs
- \(\boldsymbol{G}=(\boldsymbol{V}, \boldsymbol{E}, \boldsymbol{R}, \boldsymbol{T})\).
- 1) NODES with node types : \(v_{i} \in V\)
- 2) EDGES with relation types : \(\left(v_{i}, r, v_{j}\right) \in E\)
- 3) NODE TYPE : \(T\left(v_{i}\right)\)
- 4) RELATION TYPE : \(r \in R\)
Heterogeneous graphs models :
- 1) Relational GCNs
- 2) Knowledge Graphs
- 3) Embeddings for KG Completion
10-1. RGCN (Relational GCN)
Heterogeneous graphs
- multiple edges
- multiple relation types
Solution :
- use different NN for different relation types
Relational GCN :
\(\mathbf{h}_{v}^{(l+1)}=\sigma\left(\sum_{r \in R} \sum_{u \in N_{v}^{r}} \frac{1}{c_{v, r}} \mathbf{W}_{r}^{(l)} \mathbf{h}_{u}^{(l)}+\mathbf{W}_{0}^{(l)} \mathbf{h}_{v}^{(l)}\right)\).
- where \(c_{v, r}=\left|N_{v}^{r}\right|\)
decompose as “Message + Aggregation”
- 1) Message :
- 1-1) neighbor : \(\mathbf{m}_{u, r}^{(l)}=\frac{1}{c_{v, r}} \mathbf{W}_{r}^{(l)} \mathbf{h}_{u}^{(l)}\).
- 1-2) self loop : \(\mathbf{m}_{v}^{(l)}=\mathbf{W}_{0}^{(l)} \mathbf{h}_{v}^{(l)}\).
- 2) Aggregation :
- \(\mathbf{h}_{v}^{(l+1)}=\sigma\left(\operatorname{Sum}\left(\left\{\mathbf{m}_{u, r}^{(l)}, u \in N(v)\right\} \cup\left\{\mathbf{m}_{v}^{(l)}\right\}\right)\right)\).
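To make the update rule concrete, below is a minimal PyTorch sketch of one RGCN layer (loop-based for readability, not vectorized); the class name `RGCNLayer` and the edge-list input format are illustrative assumptions, and libraries such as PyTorch Geometric provide a vectorized version (`RGCNConv`).

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    # One weight matrix per relation type, plus a self-loop transform W_0
    def __init__(self, in_dim, out_dim, num_rels):
        super().__init__()
        self.W_r = nn.Parameter(torch.randn(num_rels, out_dim, in_dim) * 0.01)
        self.W_0 = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, edges):
        # h : (N, in_dim) node features / edges : list of (u, r, v) triples
        out = self.W_0(h)                    # self-loop message m_v
        deg = {}                             # c_{v,r} = |N_v^r|
        for u, r, v in edges:
            deg[(v, r)] = deg.get((v, r), 0) + 1
        for u, r, v in edges:                # neighbor messages m_{u,r}
            out[v] = out[v] + (self.W_r[r] @ h[u]) / deg[(v, r)]
        return torch.relu(out)               # sigma = ReLU here
```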
Scalability of RGCN
- number of layers : \(L\)
- each relation has \(L\) matrices : \(\mathbf{W}_{r}^{(1)}, \mathbf{W}_{r}^{(2)}, \ldots, \mathbf{W}_{r}^{(L)}\)
- dimension of \(\mathbf{W}_{r}^{(l)}\) is \(d^{(l+1)} \times d^{(l)}\)
- the parameter count grows rapidly with the number of relations \(\rightarrow\) overfitting issue! how to solve?
- solution : regularize \(\mathbf{W}_{r}^{(l)}\)
- 1) block diagonal matrices
- 2) basis/dictionary learning
(1) Block Diagonal Matrices
Use block diagonal matrices for \(\mathbf{W}_{r}^{(l)}\)
if we use \(B\) low-dimensional blocks, the number of parameters becomes :
- before ) \(d^{(l+1)} \times d^{(l)}\)
- after ) \(B \times \frac{d^{(l+1)}}{B} \times \frac{d^{(l)}}{B}\)
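A tiny sketch of the saving, using `torch.block_diag`; the dimensions below are illustrative assumptions.

```python
import torch

# Block-diagonal parameterization of W_r : only B small blocks are learned.
B, d_in, d_out = 4, 64, 64
blocks = torch.randn(B, d_out // B, d_in // B)            # learnable parameters
W_r = torch.block_diag(*blocks)                           # assembled (64, 64) matrix
print(blocks.numel(), "params instead of", d_out * d_in)  # 1024 instead of 4096
```

( trade-off : entries outside the blocks are forced to zero, so dimensions in different blocks cannot interact )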
(2) Basis Learning
Share weights across different relations!
Matrix = “linear combination” of “basis” transformations
- \(\mathbf{W}_{r}=\sum_{b=1}^{B} a_{r b} \cdot \mathbf{V}_{b}\)
- \(\mathbf{V}_{b}\) : basis matrices ( shared across all relations )
- \(a_{r b}\) : importance weight of matrix \(\mathbf{V}_{b}\)
- \(B\) : number of bases to use
- thus, each relation only needs to learn \(\left\{a_{r b}\right\}_{b=1}^{B}\)
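A minimal sketch of the basis decomposition (the names `V`, `a` and the sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Basis (dictionary) decomposition : W_r = sum_b a_{rb} * V_b
num_rels, B, d_in, d_out = 100, 8, 64, 64
V = nn.Parameter(torch.randn(B, d_out, d_in))   # basis matrices, shared by all relations
a = nn.Parameter(torch.randn(num_rels, B))      # per-relation importance weights a_{rb}

def weight_for_relation(r: int) -> torch.Tensor:
    return torch.einsum("b,bij->ij", a[r], V)   # linear combination of the bases
```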
(3) Example : Node Classification
- goal : predict the class of node \(A\), from \(k\) classes
- final layer (prediction head) : \(\mathbf{h}_{A}^{(L)} \in \mathbb{R}^{k}\) = probability of each class
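In code, the head is just a softmax over the \(k\)-dimensional final embedding (a sketch; the tensor below is a stand-in for \(\mathbf{h}_{A}^{(L)}\)):

```python
import torch
import torch.nn.functional as F

k = 5
h_A_final = torch.randn(k)            # stand-in for h_A^(L), the final RGCN output
probs = F.softmax(h_A_final, dim=-1)  # per-class probabilities
```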
(4) Example : Link Prediction
Stratified split, with 4 categories of edges :
- 1-1) [TRAIN] message edges
- 1-2) [TRAIN] supervision edges
- 2) [VALIDATION] edges
- 3) [TEST] edges
Example )
- training message edges : all other edges
- training supervision edge : \((E, r_{3}, A)\) \(\rightarrow\) score this edge
- final layer of \(E\) : \(\mathbf{h}_{E}^{(L)}\)
- final layer of \(A\) : \(\mathbf{h}_{A}^{(L)}\)
- score function : \(f_{r}: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}\)
- final score = \(f_{r_{3}}\left(\mathbf{h}_{E}, \mathbf{h}_{A}\right)=\mathbf{h}_{E}^{T} \mathbf{W}_{r_{3}} \mathbf{h}_{A}\), where \(\mathbf{W}_{r_{3}} \in \mathbb{R}^{d \times d}\)
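A sketch of this relation-specific bilinear score, with one learnable \(d \times d\) matrix per relation (names and sizes are assumptions):

```python
import torch
import torch.nn as nn

d, num_rels = 32, 4
W = nn.Parameter(torch.randn(num_rels, d, d) * 0.1)  # one W_r per relation

def score(h_u: torch.Tensor, r: int, h_v: torch.Tensor) -> torch.Tensor:
    return h_u @ W[r] @ h_v   # f_r(h_u, h_v) = h_u^T W_r h_v
```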
Training Procedure :
step 1) use RGCN to score “training supervision edge”
- ex) \(\left(E, r_{3}, A\right)\)
step 2) create negative edge by perturbing supervision edge
- ex) \(\left(E, r_{3}, B\right), \left(E, r_{3}, D\right)\)
- note that negative edges should also NOT BELONG to the TRAINING MESSAGE EDGES or the TRAINING SUPERVISION EDGES!
step 3) use GNN model to score negative edges
step 4) optimize the cross-entropy (CE) loss
- \[\ell=-\log \sigma\left(f_{r_{3}}\left(h_{E}, h_{A}\right)\right)-\log \left(1-\sigma\left(f_{r_{3}}\left(h_{E}, h_{B}\right)\right)\right)\]
- (1) maximize the score of the “training supervision edge”
- (2) minimize the score of the “negative edge”
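The same loss in code (a sketch with stand-in scalar scores):

```python
import torch
import torch.nn.functional as F

pos_score = torch.tensor(2.3)   # stand-in for f_r3(h_E, h_A), the positive edge
neg_score = torch.tensor(-0.7)  # stand-in for f_r3(h_E, h_B), the negative edge
# maximize sigma(positive score), minimize sigma(negative score)
loss = -F.logsigmoid(pos_score) - torch.log(1 - torch.sigmoid(neg_score))
```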
Evaluation Procedure
Example :
- validation edge : \((E,r_3,D)\)
- the score of the “validation edge” should be higher than the score of every \((E,r_3,v)\) that is NOT in the training message/supervision edges
Procedure :
step 1) calculate the score of \((E,r_3,D)\)
step 2) calculate the scores of all negative edges \(\left\{\left(E, r_{3}, v\right) \mid v \in\{B, F\}\right\}\)
step 3) obtain the ranking \(RK\) of \((E,r_3,D)\)
step 4) calculate metrics
- 1) Hits@k = \(1[RK \leq k]\) ( 1 if the true edge ranks within the top \(k\) )
- 2) Reciprocal Rank = \(1/RK\)
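A sketch of both metrics; in practice they are averaged over all validation edges, giving Hits@k and MRR (the toy ranks below are assumptions):

```python
def hits_at_k(rank: int, k: int) -> float:
    return 1.0 if rank <= k else 0.0   # 1 if the true edge ranks within top-k

def reciprocal_rank(rank: int) -> float:
    return 1.0 / rank

ranks = [1, 3, 7]  # toy ranks of three validation edges
mrr = sum(reciprocal_rank(r) for r in ranks) / len(ranks)   # ~0.492
hits3 = sum(hits_at_k(r, 3) for r in ranks) / len(ranks)    # ~0.667
```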
10-2. Knowledge Graphs
Knowledge in graph form
- nodes : ENTITIES
- label of nodes : TYPES
- edges : RELATIONSHIPS
KG(Knowledge Graph) is an example of heterogeneous graph
10-3. Knowledge Graph Completion
Models
- 1) TransE
- 2) TransR
- 3) DistMult
- 4) ComplEx
(1) KG Completion Task
task : for a given (HEAD, RELATION), predict “missing tails”
Notation
- edges are represented as triples \((h,r,t)\)
- \(h\) : head
- \(r\) : relation
- \(t\) : tail
- given a true triple \((h,r,t)\), the embedding of \((h,r)\) should be close to the embedding of \(t\)
- Q1) how to embed?
- Q2) how to define closeness?
(2) Connectivity Patterns in KG
- Symmetric ( Antisymmetric ) :
- notation : \(r(h, t) \Rightarrow r(t, h)\) ( antisymmetric : \(r(h, t) \Rightarrow \neg r(t, h)\) ) \(\quad \forall h, t\)
- example : roommate ( antisymmetric : hypernym )
- Inverse :
- notation : \(r_{2}(h, t) \Rightarrow r_{1}(t, h)\).
- example : (advisor, advisee)
- Composition (Transitive)
- notation : \(r_{1}(x, y) \wedge r_{2}(y, z) \Rightarrow r_{3}(x, z) \quad \forall x, y, z\).
- example : (mother’s husband = father)
- 1-to-N
- notation : \(r\left(h, t_{1}\right), r\left(h, t_{2}\right), \ldots, r\left(h, t_{n}\right) \text { are all True. }\)
- example : r = StudentsOf
(3-1) TransE
- idea : \(\mathbf{h}+\mathbf{r} \approx \mathbf{t}\)
- scoring function : \(f_{r}(h, t)=-\|\mathbf{h}+\mathbf{r}-\mathbf{t}\|\)
- loss : max margin loss
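A minimal sketch of TransE scoring with a max-margin loss over one positive/negative pair (the random embeddings and `margin` value are illustrative assumptions):

```python
import torch

d = 16
h, r, t = torch.randn(3, d)            # head, relation, tail embeddings
t_neg = torch.randn(d)                 # corrupted (negative) tail

dist_pos = torch.norm(h + r - t)       # should be small for true triples
dist_neg = torch.norm(h + r - t_neg)   # should be large for corrupted triples
margin = 1.0
loss = torch.clamp(margin + dist_pos - dist_neg, min=0)  # max-margin loss
```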
Relation Types
- Antisymmetric (O)
- \(h+r=t\), but \(t+r \neq h\)
- Inverse (O)
- if \(h + r_2 = t\), we can set \(r_1 = -r_2\)
- Composition (O)
- \(r_3 = r_1 + r_2\)
- Symmetric (X)
- \(h+r=t\) and \(t+r=h\) can both hold only if \(r=0\) and \(h=t\)
- 1-to-N (X)
- \(t_1\) and \(t_2\) will map to the same vector, although they are different entities
- contradiction! \(t_1 = h+r = t_2\), but \(t_1 \neq t_2\)
(3-2) TransR
TransE :
- translation of any relation in the SAME embedding space
TransR :
- translation of relation in the DIFFERENT embedding space
- notation
- model “entities” as vectors in the entity space \(\mathbb{R}^{d}\)
- model “each relation” as vector in relation space \(\mathbf{r} \in \mathbb{R}^{k}\)
- with \(\mathbf{M}_{r} \in \mathbb{R}^{k \times d}\) as projection matrix
- \(\mathbf{h}_{\perp}=\mathbf{M}_{r} \mathbf{h}, \mathbf{t}_{\perp}=\mathbf{M}_{r} \mathbf{t}\).
- score function : \(f_{r}(h, t)=-\left\|\mathbf{h}_{\perp}+\mathbf{r}-\mathbf{t}_{\perp}\right\|\)
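A sketch of the TransR score; the dimensions \(d, k\) and random embeddings are illustrative assumptions.

```python
import torch

d, k = 16, 8
h, t = torch.randn(2, d)                 # entity embeddings in the entity space
r = torch.randn(k)                       # relation embedding in the relation space
M_r = torch.randn(k, d)                  # per-relation projection matrix

h_proj, t_proj = M_r @ h, M_r @ t        # project entities into the relation space
score = -torch.norm(h_proj + r - t_proj)
```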
Relation Types
- Symmetric (O)
- \(\mathbf{r}=0\), \(\mathbf{h}_{\perp}=\mathbf{M}_{r} \mathbf{h}=\mathbf{M}_{r} \mathbf{t}=\mathbf{t}_{\perp}\)
- Antisymmetric (O)
- \(\mathbf{r} \neq 0\), \(\mathbf{M}_{r} \mathbf{h}+\mathbf{r}=\mathbf{M}_{r} \mathbf{t}\). then, \(\mathbf{M}_{r} \mathbf{t}+\mathbf{r} \neq \mathbf{M}_{r} \mathbf{h}\)
- 1-to-N (O)
- we can learn \(\mathbf{M}_{r}\) so that \(\mathbf{t}_{\perp}=\mathbf{M}_{r} \mathbf{t}_{1}=\mathbf{M}_{r} \mathbf{t}_{2}\)
- Inverse (O)
- set \(\mathbf{r}_{2}=-\mathbf{r}_{1}\) and \(\mathbf{M}_{r_{1}}=\mathbf{M}_{r_{2}}\). then, \(\mathbf{M}_{r_{1}} \mathbf{t}+\mathbf{r}_{1}=\mathbf{M}_{r_{1}} \mathbf{h}\) and \(\mathbf{M}_{r_{2}} \mathbf{h}+\mathbf{r}_{2}=\mathbf{M}_{r_{2}} \mathbf{t}\)
- Composition (O)
(3-3) DistMult
- adopts BILINEAR modeling!
- score function : \(f_{r}(h, t)=\langle\mathbf{h}, \mathbf{r}, \mathbf{t}\rangle=\sum_{i} \mathbf{h}_{i} \cdot \mathbf{r}_{i} \cdot \mathbf{t}_{i}\)
- can be seen as the cosine similarity between \(\mathbf{h} \circ \mathbf{r}\) ( element-wise product ) and \(\mathbf{t}\)
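The trilinear score is a one-liner in code (a sketch with random stand-in embeddings):

```python
import torch

d = 16
h, r, t = torch.randn(3, d)
score = (h * r * t).sum()   # sum_i h_i * r_i * t_i
```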
Relation Types
- 1-to-N (O)
- can satisfy \(\left\langle\mathbf{h}, \mathbf{r}, \mathbf{t}_{1}\right\rangle=\left\langle\mathbf{h}, \mathbf{r}, \mathbf{t}_{2}\right\rangle\) with \(\mathbf{t}_{1} \neq \mathbf{t}_{2}\)
- Symmetric (O)
- \(f_{r}(h, t)=\langle\mathbf{h}, \mathbf{r}, \mathbf{t}\rangle=\sum_{i} \mathbf{h}_{i} \cdot \mathbf{r}_{i} \cdot \mathbf{t}_{i}=\langle\mathbf{t}, \mathbf{r}, \mathbf{h}\rangle=f_{r}(t, h)\)
- Antisymmetric (X)
- \(f_{r}(h, t)=\langle\mathbf{h}, \mathbf{r}, \mathbf{t}\rangle=\langle\mathbf{t}, \mathbf{r}, \mathbf{h}\rangle=f_{r}(t, h)\) ( \(r(h,t)\) & \(r(t,h)\) will always have the same score )
- Inverse (X)
- \(f_{r_{2}}(h, t)=\left\langle\mathbf{h}, \mathbf{r}_{2}, \mathbf{t}\right\rangle=\left\langle\mathbf{t}, \mathbf{r}_{1}, \mathbf{h}\right\rangle=f_{r_{1}}(t, h)\) would force \(\mathbf{r}_{2}=\mathbf{r}_{1}\), i.e., a relation and its inverse would share the same embedding
- Composition (X)
- the union of the hyperplanes induced by multi-hop relations cannot be expressed using a single hyperplane
(3-4) ComplEx
- based on DistMult
- embeds entities & relations in a COMPLEX vector space
- score function : \(f_{r}(h, t)=\operatorname{Re}\left(\sum_{i} \mathbf{h}_{i} \cdot \mathbf{r}_{i} \cdot \overline{\mathbf{t}}_{i}\right)\)
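A sketch using PyTorch's complex dtype (random stand-in embeddings):

```python
import torch

d = 16
h = torch.randn(d, dtype=torch.cfloat)             # complex-valued embeddings
r = torch.randn(d, dtype=torch.cfloat)
t = torch.randn(d, dtype=torch.cfloat)
score = torch.real((h * r * torch.conj(t)).sum())  # Re( sum_i h_i * r_i * conj(t_i) )
```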
Relation Types
- Antisymmetric (O)
- can model high \(f_{r}(h, t)=\operatorname{Re}\left(\sum_{i} \mathbf{h}_{i} \cdot \mathbf{r}_{i} \cdot \overline{\mathbf{t}}_{i}\right)\) and low \(f_{r}(t, h)=\operatorname{Re}\left(\sum_{i} \mathbf{t}_{i} \cdot \mathbf{r}_{i} \cdot \overline{\mathbf{h}}_{i}\right)\) simultaneously, thanks to the asymmetry of the complex conjugate
- Symmetric (O)
- when \(\operatorname{Im}(\mathbf{r})=0\) :
- \(\begin{aligned} f_{r}(h, t)&=\operatorname{Re}\left(\sum_{i} \mathbf{h}_{i} \cdot \mathbf{r}_{i} \cdot \overline{\mathbf{t}}_{i}\right)=\sum_{i} \operatorname{Re}\left(\mathbf{r}_{i} \cdot \mathbf{h}_{i} \cdot \overline{\mathbf{t}}_{i}\right) \\ &=\sum_{i} \mathbf{r}_{i} \cdot \operatorname{Re}\left(\mathbf{h}_{i} \cdot \overline{\mathbf{t}}_{i}\right)=\sum_{i} \mathbf{r}_{i} \cdot \operatorname{Re}\left(\overline{\mathbf{h}}_{i} \cdot \mathbf{t}_{i}\right)\\&=\sum_{i} \operatorname{Re}\left(\mathbf{r}_{i} \cdot \overline{\mathbf{h}}_{i} \cdot \mathbf{t}_{i}\right)=f_{r}(t, h) \end{aligned}\)
- Inverse (O)
- \(\mathbf{r}_{1}=\overline{\mathbf{r}}_{2}\)
- the complex conjugate of \(\mathbf{r}_{2}=\underset{\mathbf{r}}{\operatorname{argmax}} \operatorname{Re}(\langle\mathbf{h}, \mathbf{r}, \overline{\mathbf{t}}\rangle)\) is exactly \(\mathbf{r}_{1}=\underset{\mathbf{r}}{\operatorname{argmax}} \operatorname{Re}(\langle\mathbf{t}, \mathbf{r}, \overline{\mathbf{h}}\rangle)\)
- Composition (X)
- 1-to-N (O)