2. Distributed Representations of Words and Phrases and their Compositionality (2013)
Contents
- Abstract
- Introduction
- Skip-gram model
  - Hierarchical Softmax
  - Negative Sampling
  - Subsampling of Frequent Words
Abstract
Skip-gram model :
- (1) an “efficient” method for learning (2) “distributed vector representations” of words
- captures not only the co-occurrence statistics of a word, but also its meaning!
This paper suggests extensions that improve both
- 1) “quality of the vectors”
- 2) “training speed”
How? By..
- 1) SUBSAMPLING of frequent words
- 2) NEGATIVE SAMPLING ( alternative to hierarchical softmax )
Limitations of word representations?
- 1) indifference to word order
- 2) inability to represent idiomatic phrases
- ex) the airline “Air Canada” \(\neq\) “Air” + “Canada”
1. Introduction
Distributed representation of words, using the Skip-gram model
Skip-gram model
- Efficient estimation of word representations in vector space. (Mikolov et al, 2013)
- unlike NN-based language models, no dense matrix multiplications \(\rightarrow\) efficient
- can encode its meaning!
- ex) King - Man + Woman = Queen
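A minimal sketch of how such an analogy can be checked, assuming a small dictionary of word vectors (the toy 2-D values below are invented purely for illustration, not trained embeddings): compute vec(King) - vec(Man) + vec(Woman) and return the closest remaining word by cosine similarity.

```python
import numpy as np

# hypothetical toy embeddings (real models use 100-1000 dims trained on large corpora)
vecs = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([0.5, 0.5]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = vecs[a] - vecs[b] + vecs[c]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))

print(analogy("king", "man", "woman"))  # queen
```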
This paper introduces extensions of the original Skip-gram model.
- 1) Subsampling of frequent words
  - speeds up training (2x ~ 10x)
  - improves accuracy of the representations of less frequent words
- 2) Negative Sampling, a simplified variant of NCE (Noise Contrastive Estimation)
  - faster training
  - better vector representations ( compared to Hierarchical Softmax )
Limitation of word representations : “inability to represent idiomatic phrases”
- ex) the airline “Air Canada” \(\neq\) “Air” + “Canada”
- solution : use a vector to represent the “whole phrase” ( “Air Canada” as one token )
- How to extend from a “word-based” to a “phrase-based” model? SIMPLE! ( see the sketch below )
  - 1) identify a large number of phrases in the corpus
  - 2) treat those phrases as individual tokens during training
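A minimal sketch of the count-based phrase detection the paper describes: a bigram is scored by how often the pair co-occurs relative to the counts of its two words (with a discounting term \(\delta\)), and high-scoring bigrams are merged into single tokens. The `delta` and `threshold` values below are toy settings chosen only so the small example works.

```python
from collections import Counter

def detect_phrases(sentences, delta=1, threshold=0.1):
    """Merge frequent bigrams into single tokens, e.g. ('air', 'canada') -> 'air_canada'.

    A bigram (wi, wj) is merged when
        score = (count(wi wj) - delta) / (count(wi) * count(wj)) > threshold.
    delta prevents phrases made of very infrequent words from being formed.
    """
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(b for s in sentences for b in zip(s, s[1:]))
    phrases = {bg for bg, c in bigrams.items()
               if (c - delta) / (unigrams[bg[0]] * unigrams[bg[1]]) > threshold}

    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) in phrases:
                out.append(s[i] + "_" + s[i + 1])   # treat the phrase as one token
                i += 2
            else:
                out.append(s[i])
                i += 1
        merged.append(out)
    return merged

# toy usage
print(detect_phrases([["air", "canada", "flies"], ["i", "fly", "air", "canada"]]))
# [['air_canada', 'flies'], ['i', 'fly', 'air_canada']]
```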
2. The Skip-gram Model
CBOW vs Skip-gram
- CBOW : predicts the current word based on the surrounding context
- Skip-gram : predicts words within a certain range before and after the current word, given the current word
Training Skip-gram
- How? “Maximize the average log probability”
  \(\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p\left(w_{t+j} \mid w_{t}\right)\).
  - \(c\) : size of the training context (window)
  - Larger \(c\) ( = more training examples ) \(\rightarrow\) higher accuracy, but slower training
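A minimal sketch (with made-up tokens) of how the (center, context) training pairs behind this objective are generated: every word within distance \(c\) of position \(t\) becomes a prediction target for \(w_t\).

```python
def skipgram_pairs(tokens, c=2):
    """Generate (center, context) pairs: for each position t,
    every word within distance c of t (excluding t itself) is a target."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

# toy usage
print(skipgram_pairs(["the", "quick", "brown", "fox"], c=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```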
- softmax function
  \(p\left(w_{O} \mid w_{I}\right)=\frac{\exp \left({v'_{w_{O}}}^{\top} v_{w_{I}}\right)}{\sum_{w=1}^{W} \exp \left({v'_{w}}^{\top} v_{w_{I}}\right)}\).
  - \(v_w\) : “input” vector representation of word \(w\)
  - \(v_w'\) : “output” vector representation of word \(w\)
  - \(W\) : number of words in the vocabulary
  \(\rightarrow\) Impractical! Too many words! ( \(10^5 \sim 10^7\) terms )
  - computing \(\nabla \log p\left(w_{O} \mid w_{I}\right)\) costs time proportional to \(W\)
Why is using all \(W\) words inefficient?
- softmax function : \(\hat{y}_{i}=P(i \mid c)=\frac{\exp \left(u_{i}^{T} v_{c}\right)}{\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right)}\).
  - where \(u_i\) is the output (column) vector of word \(i\) and \(v_c\) is the input vector of the center word \(c\)
  - ( let \(U=\left[u_{1}, u_{2}, \ldots, u_{k}, \ldots u_{W}\right]\) be the matrix composed of the \(u_{k}\) column vectors )
- loss : Cross Entropy Loss : \(J=-\sum_{i=1}^{W} y_{i} \log \left(\hat{y}_{i}\right)\).
  - where \(y\) : one-hot encoded label vector ( \(y_k = 1\) for the target word \(k\) ) & \(\hat{y}\) : softmax prediction
\(\begin{aligned}J&=-\sum_{i=1}^{W} y_{i} \log \left(\frac{\exp \left(u_{i}^{T} v_{c}\right)}{\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right)}\right)\\ &=-\sum_{i=1}^{W} y_{i}\left[u_{i}^{T} v_{c}-\log \left(\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right)\right)\right]\\ &=-y_{k}\left[u_{k}^{T} v_{c}-\log \left(\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right)\right)\right]\end{aligned}\).
Thus…
\(\begin{aligned}\frac{\partial J}{\partial v_{c}}&=-\left[u_{k}-\frac{\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right) u_{w}}{\sum_{x=1}^{W} \exp \left(u_{x}^{T} v_{c}\right)}\right]\\ &=\sum_{w=1}^{W}\left(\frac{\exp \left(u_{w}^{T} v_{c}\right)}{\sum_{x=1}^{W} \exp \left(u_{x}^{T} v_{c}\right)} u_{w}\right)-u_{k}\\ &=\sum_{w=1}^{W}\left(\hat{y}_{w} u_{w}\right)-u_{k}\end{aligned}\).
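A minimal numerical sanity check of this gradient (a Python sketch with arbitrary small sizes; \(U\), \(v_c\), \(k\) follow the notation above): the analytic gradient \(\sum_w \hat{y}_w u_w - u_k\) is compared against a finite-difference estimate of \(\partial J / \partial v_c\).

```python
import numpy as np

rng = np.random.default_rng(0)
W, d = 8, 4                       # vocab size, embedding dim (arbitrary toy values)
U = rng.normal(size=(d, W))       # output vectors u_1..u_W as columns
v_c = rng.normal(size=d)          # input vector of the center word
k = 3                             # index of the target word (y_k = 1)

def loss(v):
    scores = U.T @ v                          # u_i^T v for every word i
    log_Z = np.log(np.sum(np.exp(scores)))
    return -(scores[k] - log_Z)               # J = -[u_k^T v_c - log sum_w exp(u_w^T v_c)]

# analytic gradient: sum_w y_hat_w * u_w - u_k
y_hat = np.exp(U.T @ v_c) / np.sum(np.exp(U.T @ v_c))
grad_analytic = U @ y_hat - U[:, k]

# central finite-difference estimate
eps = 1e-6
grad_numeric = np.array([
    (loss(v_c + eps * e) - loss(v_c - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```

Note that both the loss and the gradient still touch all \(W\) output vectors, which is exactly the \(O(W)\) cost that the two tricks below avoid.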
How to solve this?
- 1) Hierarchical Softmax
- 2) Negative Sampling
2-1. Hierarchical Softmax
( details : refer to https://seunghan96.github.io/ne/03.Hierarchical_Softmax/ )
Key points
- uses binary tree representation
- instead of evaluating \(W\) output nodes, only need to evaluate \(\log_2(W)\) nodes
- defines a random walk that assigns probabilities to words
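A minimal sketch of the path-product idea (not the full Huffman-tree construction the paper uses): each word corresponds to a leaf of a binary tree, and its probability is a product of sigmoids over the roughly \(\log_2(W)\) internal nodes on the root-to-leaf path. The node vectors, tree, and paths below are toy placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(v_in, path):
    """p(w | w_I) under hierarchical softmax.

    path : list of (node_vector, direction) pairs from the root to the leaf of w,
           direction = +1 for "go left", -1 for "go right" (the sign convention
           makes the probabilities over all leaves sum to 1).
    Only len(path) ~ log2(W) sigmoids are evaluated instead of W softmax terms.
    """
    p = 1.0
    for node_vec, direction in path:
        p *= sigmoid(direction * np.dot(node_vec, v_in))
    return p

# toy usage: a 4-word vocab needs a depth-2 binary tree (2 decisions per word)
rng = np.random.default_rng(0)
d = 5
v_in = rng.normal(size=d)
nodes = {name: rng.normal(size=d) for name in ["root", "left", "right"]}
paths = {                         # hypothetical tree over 4 words
    "w0": [(nodes["root"], +1), (nodes["left"], +1)],
    "w1": [(nodes["root"], +1), (nodes["left"], -1)],
    "w2": [(nodes["root"], -1), (nodes["right"], +1)],
    "w3": [(nodes["root"], -1), (nodes["right"], -1)],
}
print(sum(hs_probability(v_in, p) for p in paths.values()))  # ~1.0
```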
2-2. Negative Sampling
( details : refer to https://seunghan96.github.io/ne/04.Negative_Sampling/ )
Key points
- alternative to hierarchical softmax : NCE (Noise Contrastive Estimation)
  - a good model = one able to differentiate data from noise, using logistic regression
  - simplified NCE = Negative Sampling
- Negative Sampling : use \(k\) negative samples ( = wrong answers ) per positive pair
- (Standard) objective function : softmax
  \(p\left(w_{O} \mid w_{I}\right)=\frac{\exp \left({v'_{w_{O}}}^{\top} v_{w_{I}}\right)}{\sum_{w=1}^{W} \exp \left({v'_{w}}^{\top} v_{w_{I}}\right)}\).
  \(\log p\left(w_{O} \mid w_{I}\right)=\left({v'_{w_{O}}}^{\top} v_{w_{I}}\right)-\log \sum_{w=1}^{W} \exp \left({v'_{w}}^{\top} v_{w_{I}}\right)\).
- (Proposed) replace every \(\log p\left(w_{O} \mid w_{I}\right)\) term in the Skip-gram objective with
  \(\log \sigma\left({v'_{w_{O}}}^{\top} v_{w_{I}}\right)+\sum_{i=1}^{k} \mathbb{E}_{w_{i} \sim P_{n}(w)}\left[\log \sigma\left(-{v'_{w_{i}}}^{\top} v_{w_{I}}\right)\right]\).
- i.e., distinguish the target word \(w_O\) from draws from the noise distribution \(P_n(w)\), using logistic regression ( sigmoid function )
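A minimal sketch of the proposed term for a single \((w_I, w_O)\) pair, assuming the \(k\) noise words have already been drawn from \(P_n(w)\) ( the paper uses the unigram distribution raised to the 3/4 power ); the dimensions and random vectors below are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_in, v_out_pos, v_out_negs):
    """log sigma(v'_wO . v_wI) + sum_i log sigma(-v'_wi . v_wI)

    v_in       : input vector of the center word w_I
    v_out_pos  : output vector of the true context word w_O
    v_out_negs : output vectors of k noise words sampled from P_n(w)
    """
    positive = np.log(sigmoid(np.dot(v_out_pos, v_in)))
    negative = sum(np.log(sigmoid(-np.dot(v_neg, v_in))) for v_neg in v_out_negs)
    return positive + negative   # maximize this: only k+1 dot products, not W

# toy usage
rng = np.random.default_rng(0)
d, k = 10, 5
print(neg_sampling_objective(rng.normal(size=d),
                             rng.normal(size=d),
                             rng.normal(size=(k, d))))
```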
2-3. Subsampling of Frequent Words
The most frequent words ( e.g. “in”, “the”, “a” ) occur hundreds of millions of times, but carry less information value than rare words
\(\rightarrow\) so, do not give every word an equal chance of being sampled!
- the more frequently a word occurs, the lower its probability of being kept
- each word \(w_i\) in the training set is discarded with probability
  \(P\left(w_{i}\right)=1-\sqrt{\frac{t}{f\left(w_{i}\right)}}\).
  - where \(f(w_i)\) : frequency of word \(w_i\) & \(t\) : chosen threshold ( around \(10^{-5}\) in the paper )
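A minimal sketch of this discard rule, assuming the relative word frequencies \(f(w)\) have already been computed; the threshold follows the order of magnitude suggested in the paper, and the toy frequencies are made up.

```python
import random

random.seed(0)  # reproducibility of the toy example

def subsample(tokens, freqs, t=1e-5):
    """Randomly discard each token w with probability P(w) = 1 - sqrt(t / f(w)).

    tokens : the training corpus as a list of words
    freqs  : dict mapping word -> its relative frequency f(w) in the corpus
    """
    kept = []
    for w in tokens:
        p_discard = max(0.0, 1.0 - (t / freqs[w]) ** 0.5)
        if random.random() >= p_discard:
            kept.append(w)
    return kept

# toy usage: "the" (very frequent) is mostly dropped, "canada" (rare) is always kept
freqs = {"the": 0.05, "canada": 1e-5}
tokens = ["the"] * 20 + ["canada"]
print(subsample(tokens, freqs))
```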