2. Distributed Representations of Words and Phrases and their Compositionality (2013)
Table of Contents
- Abstract
- Introduction
- The Skip-gram Model
- Hierarchical Softmax
- Negative Sampling
- Subsampling of Frequent Words
 
Abstract
Skip-gram model :
- an “efficient” method for learning “distributed vector representations” of words
- captures not only the co-occurrence of words, but also their meaning!

This paper suggests extensions that improve both
- 1) “quality of the vectors”
- 2) “training speed”

How? By..
- 1) SUBSAMPLING of frequent words
- 2) NEGATIVE SAMPLING ( alternative to hierarchical softmax )

Limitations of word representations?
- 1) indifference to word order
- 2) inability to represent idiomatic phrases
    - ex) the airline “Air Canada” \(\neq\) Air + Canada
 
1. Introduction
Distributed representations of words, using the Skip-gram Model
Skip-gram model
- Efficient estimation of word representations in vector space. (Mikolov et al., 2013)
- unlike NN-based models, no dense matrix multiplications \(\rightarrow\) efficient
- can encode its meaning!
    - ex) King - Man + Woman = Queen
 
This paper introduces extensions of the original Skip-gram model.
- 1) Subsampling of frequent words
    - speeds up training ( 2x ~ 10x )
    - improves accuracy of the representations of less frequent words
- 2) a variant of NCE (Noise Contrastive Estimation)
    - faster training
    - better vector representations ( compared to Hierarchical Softmax )
Limitation of word representations : “Inability to represent idiomatic phrases”
- ex) the airline “Air Canada” \(\neq\) Air + Canada
- solution : use vectors to represent the “whole phrase” ( treat “Air Canada” as one phrase )
- How to extend from a “word-based” to a “phrase-based” model? SIMPLE! ( see the sketch below )
    - 1) identify a large number of phrases
    - 2) treat the phrases as individual tokens
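A minimal sketch of steps 1) and 2), in the spirit of the paper's data-driven phrase detection: score adjacent word pairs by how much more often they co-occur than their individual counts would suggest, then re-tokenize so each detected phrase becomes a single token. The constants `delta` and `threshold` below are illustrative, not the paper's settings.

```python
from collections import Counter

def detect_phrases(sentences, delta=5, threshold=1e-4):
    """Merge frequently co-occurring bigrams into single tokens,
    e.g. ["air", "canada"] -> "air_canada". (Illustrative sketch.)"""
    unigram = Counter(w for sent in sentences for w in sent)
    bigram = Counter(b for sent in sentences for b in zip(sent, sent[1:]))

    # bigrams whose parts appear together far more often than by chance
    # ( delta discounts pairs made of very rare words )
    phrases = {
        (a, b)
        for (a, b), n_ab in bigram.items()
        if (n_ab - delta) / (unigram[a] * unigram[b]) > threshold
    }

    # re-tokenize : replace each detected bigram with a single "a_b" token
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in phrases:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged

# toy usage : "air canada" becomes the single token "air_canada"
sents = [["i", "flew", "air", "canada", "to", "toronto"],
         ["air", "canada", "delayed", "my", "flight"]]
print(detect_phrases(sents, delta=1, threshold=0.1))
```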
 
2. The Skip-gram Model

CBOW vs Skip-gram
- CBOW : predicts the current word based on the context
- Skip-gram : predicts words within a certain range before and after the current word, given the current word
Training Skip-gram
- How? “Maximize the average log probability”
    \(\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p\left(w_{t+j} \mid w_{t}\right)\)
    - \(c\) : size of the training context
    - Larger \(c\) ( = more training examples ) \(\rightarrow\) higher accuracy, slower speed ( see the sketch below )
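As a concrete illustration of what the objective sums over, a tiny sketch (illustrative, not the original implementation) that enumerates the \((w_t, w_{t+j})\) pairs for a window size \(c\):

```python
def skipgram_pairs(words, c=2):
    """Enumerate the (center, context) pairs (w_t, w_{t+j}), -c <= j <= c, j != 0,
    that the average-log-probability objective sums over."""
    pairs = []
    for t, center in enumerate(words):
        for j in range(-c, c + 1):
            if j == 0 or not (0 <= t + j < len(words)):
                continue
            pairs.append((center, words[t + j]))
    return pairs

# larger c -> more (center, context) training pairs, but slower training
print(len(skipgram_pairs("the quick brown fox jumps".split(), c=1)))  # 8
print(len(skipgram_pairs("the quick brown fox jumps".split(), c=2)))  # 14
```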
 
- softmax function :
    \(p\left(w_{O} \mid w_{I}\right)=\frac{\exp \left(v_{w_{O}}^{\prime \top} v_{w_{I}}\right)}{\sum_{w=1}^{W} \exp \left(v_{w}^{\prime \top} v_{w_{I}}\right)}\)
    - \(v_w\) : input vector representation of word \(w\)
    - \(v_w'\) : output vector representation of word \(w\)
    - \(W\) : number of words in the vocabulary
- \(\rightarrow\) Impractical! Too many words ( \(10^5 \sim 10^7\) terms ) : computing \(\nabla \log p\left(w_{O} \mid w_{I}\right)\) is proportional to \(W\) ( see the sketch below )
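A tiny NumPy sketch (illustrative sizes and random vectors) of why the full softmax is impractical: computing a single probability \(p(w_O \mid w_I)\) already requires a dot product with every one of the \(W\) output vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
W, d = 30_000, 100                            # vocab size, embedding dim (illustrative)
V_in = rng.normal(scale=0.01, size=(W, d))    # input vectors  v_w
V_out = rng.normal(scale=0.01, size=(W, d))   # output vectors v'_w

def full_softmax_prob(w_I, w_O):
    """p(w_O | w_I) with the full softmax: O(W * d) work for ONE word pair."""
    scores = V_out @ V_in[w_I]                # dot products with ALL W output vectors
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[w_O]

print(full_softmax_prob(w_I=0, w_O=42))
```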
Why is using all the \(W\) words inefficient?
- softmax function : \(\hat{y}_{i}=P(i \mid c)=\frac{\exp \left(u_{i}^{T} v_{c}\right)}{\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right)}\)
    - where \(u_i\) is a column vector of the output embedding matrix \(U=\left[u_{1}, u_{2}, \ldots, u_{k}, \ldots, u_{W}\right]\) and \(v_c\) is the embedding of the center word \(c\)
- loss : Cross Entropy Loss : \(J=-\sum_{i=1}^{W} y_{i} \log \left(\hat{y}_{i}\right)\)
    - where \(y\) : one-hot encoded vector ( with a 1 at the true word index \(k\) ) & \(\hat{y}\) : softmax prediction
- plugging in the softmax :
    \(\begin{aligned}J&=-\sum_{i=1}^{W} y_{i} \log \left(\frac{\exp \left(u_{i}^{T} v_{c}\right)}{\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right)}\right)\\ &=-\sum_{i=1}^{W} y_{i}\left[u_{i}^{T} v_{c}-\log \left(\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right)\right)\right]\\ &=-y_{k}\left[u_{k}^{T} v_{c}-\log \left(\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right)\right)\right]\end{aligned}\)
- Thus…
    \(\begin{aligned}\frac{\partial J}{\partial v_{c}}&=-\left[u_{k}-\frac{\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right) u_{w}}{\sum_{x=1}^{W} \exp \left(u_{x}^{T} v_{c}\right)}\right]\\ &=\sum_{w=1}^{W}\left(\frac{\exp \left(u_{w}^{T} v_{c}\right)}{\sum_{x=1}^{W} \exp \left(u_{x}^{T} v_{c}\right)} u_{w}\right)-u_{k}\\ &=\sum_{w=1}^{W}\left(\hat{y}_{w} u_{w}\right)-u_{k}\end{aligned}\)
    - \(\rightarrow\) every gradient step touches all \(W\) output vectors ( verified numerically in the sketch below )
- How to solve this?
    - 1) Hierarchical Softmax
    - 2) Negative Sampling
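A quick numerical check (an illustrative NumPy sketch with toy sizes) that the derived gradient \(\frac{\partial J}{\partial v_{c}}=\sum_{w}\hat{y}_{w} u_{w}-u_{k}\) matches finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
W, d, k = 50, 8, 3                       # tiny vocab, embedding dim, true word index
U = rng.normal(size=(W, d))              # rows play the role of the output vectors u_w
v_c = rng.normal(size=d)                 # center-word vector

def loss(v):
    scores = U @ v
    # J = -[ u_k^T v_c - log sum_w exp(u_w^T v_c) ]
    return -(scores[k] - np.log(np.exp(scores).sum()))

# analytic gradient : sum_w y_hat_w * u_w - u_k
y_hat = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
grad_analytic = y_hat @ U - U[k]

# numerical gradient via central differences
eps = 1e-6
grad_numeric = np.array([(loss(v_c + eps * e) - loss(v_c - eps * e)) / (2 * eps)
                         for e in np.eye(d)])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # -> True
```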
 
2-1. Hierarchical Softmax
( details : refer to https://seunghan96.github.io/ne/03.Hierarchical_Softmax/ )
Key points
- uses binary tree representation
- instead of evaluating \(W\) output nodes, only need to evaluate \(\log_2(W)\) nodes
- defines a random walk that assigns probabilities to words ( see the sketch below )
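A minimal sketch (illustrative; the paper uses a binary Huffman tree, which is not built here) of the key computation: a word's probability is a product of sigmoid decisions along its path from the root, so only about \(\log_2(W)\) inner products are needed instead of \(W\).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(path, v_in, node_vectors):
    """p(w | w_I) = product over the word's tree path of sigmoid(+/- node . v_in).

    path: list of (node_id, sign) pairs, sign = +1 for 'go left', -1 for 'go right'
    node_vectors: one vector per inner node of the binary tree
    """
    p = 1.0
    for node_id, sign in path:                 # ~log2(W) steps, not W
        p *= sigmoid(sign * (node_vectors[node_id] @ v_in))
    return p

# toy example : 3 inner nodes, embedding dim 4 (illustrative numbers)
rng = np.random.default_rng(0)
nodes = rng.normal(size=(3, 4))
v_in = rng.normal(size=4)
print(hs_probability([(0, +1), (2, -1)], v_in, nodes))
```

Since \(\sigma(x) + \sigma(-x) = 1\), the probabilities of all leaves (words) sum to 1 by construction.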
2-2. Negative Sampling
( details : refer to https://seunghan96.github.io/ne/04.Negative_Sampling/ )
Key points
- alternative to hierarchical softmax : NCE (Noise Contrastive Estimation)
    - idea : a good model should be able to differentiate data from noise, using logistic regression
- simplified NCE = Negative Sampling
- Negative Sampling : use \(k\) negative samples ( = wrong answers ) per positive example
- (Standard) objective function
    \(p\left(w_{O} \mid w_{I}\right)=\frac{\exp \left(v_{w_{O}}^{\prime \top} v_{w_{I}}\right)}{\sum_{w=1}^{W} \exp \left(v_{w}^{\prime \top} v_{w_{I}}\right)}\)
    \(\log p\left(w_{O} \mid w_{I}\right)=v_{w_{O}}^{\prime \top} v_{w_{I}} - \log \sum_{w=1}^{W} \exp \left(v_{w}^{\prime \top} v_{w_{I}}\right)\)
- (Proposed) replace every \(\log p\left(w_{O} \mid w_{I}\right)\) term with
    \(\log \sigma\left(v_{w_{O}}^{\prime \top} v_{w_{I}}\right)+\sum_{i=1}^{k} \mathbb{E}_{w_{i} \sim P_{n}(w)}\left[\log \sigma\left(-v_{w_{i}}^{\prime \top} v_{w_{I}}\right)\right]\)
- i.e., distinguish the target word \(w_O\) from draws from the noise distribution \(P_n(w)\), using logistic regression (sigmoid function) ( see the sketch below )
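A minimal NumPy sketch (illustrative names and random vectors) of the negative-sampling objective for one (input, target) pair with \(k\) sampled negatives:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_objective(v_in, v_out_pos, v_out_negs):
    """log sigma(v'_O . v_I) + sum_i log sigma(-v'_{w_i} . v_I)
    Only 1 + k output vectors are touched per update, instead of all W."""
    pos_term = np.log(sigmoid(v_out_pos @ v_in))
    neg_term = np.log(sigmoid(-(v_out_negs @ v_in))).sum()
    return pos_term + neg_term          # maximize this ( or minimize its negation )

# toy example : embedding dim 5, k = 3 negatives drawn from the noise distribution P_n(w)
rng = np.random.default_rng(0)
d, k = 5, 3
v_in = rng.normal(size=d)               # v_{w_I}
v_pos = rng.normal(size=d)              # v'_{w_O}  ( true context word )
v_negs = rng.normal(size=(k, d))        # v'_{w_i}  ( k sampled "wrong answer" words )
print(negative_sampling_objective(v_in, v_pos, v_negs))
```

In the paper, the noise distribution \(P_n(w)\) used to draw the negatives is the unigram distribution raised to the 3/4 power, which was found to work best empirically.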
2-3. Subsampling of Frequent Words
most frequent words ( e.g. “in”, “the”, “a” ) occur hundreds of millions of times, but carry little information!
\(\rightarrow\) so, do not sample every word with equal probability!
- the more frequently a word occurs, the more likely it is to be discarded ( = the less likely it is to be kept )
- each word \(w_i\) in the training set is discarded with probability
    \(P\left(w_{i}\right)=1-\sqrt{\frac{t}{f\left(w_{i}\right)}}\)
    - where \(f(w_i)\) : frequency of word \(w_i\) & \(t\) : chosen threshold ( typically around \(10^{-5}\) )
    - ( see the sketch below )
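A minimal sketch (illustrative corpus and threshold) of the subsampling rule: each occurrence of word \(w_i\) is kept with probability \(\sqrt{t / f(w_i)}\), i.e. discarded with probability \(P(w_i) = 1 - \sqrt{t / f(w_i)}\).

```python
import random
from collections import Counter

def subsample(words, t=1e-5, seed=0):
    """Randomly discard frequent words: keep each occurrence of w with
    probability sqrt(t / f(w)), i.e. drop it with P(w) = 1 - sqrt(t / f(w))."""
    rng = random.Random(seed)
    counts = Counter(words)
    total = len(words)
    kept = []
    for w in words:
        f = counts[w] / total                 # relative frequency of w
        keep_prob = min(1.0, (t / f) ** 0.5)  # frequent words -> small keep probability
        if rng.random() < keep_prob:
            kept.append(w)
    return kept

# toy corpus : most occurrences of the very frequent "the" are discarded
corpus = ["the"] * 1000 + ["skipgram"] * 3
print(Counter(subsample(corpus, t=1e-3)))
```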
