2. Distributed Representations of Words and Phrases and their Compositionality (2013)

Table of Contents

  1. Abstract

  2. Introduction

  3. Skip-gram model

    1. Hierarchical Softmax
    2. Negative Sampling
    3. Subsampling of Frequent words


Abstract

Skip-gram model :

  • (1) “efficient” method for learning (2) “distributed vector representation”

  • Captures not just the co-occurrence of words,

    but also their meaning!

This paper suggests extensions that improve both

  • 1) “quality of the vectors”
  • 2) “training speed”

How? By..

  • 1) SUBSAMPLING of frequent words
  • 2) NEGATIVE SAMPLING ( alternative to hierarchical softmax )


Limitation of word representation?

  • 1) indifference to word order
  • 2) inability to represent idiomatic phrases
    • ex) airline “Air Canada” \(\neq\) Air + Canada


1. Introduction

Distributed representation of words, using the Skip-gram model

Skip-gram model

  • Efficient estimation of word representations in vector space. (Mikolov et al, 2013)
  • unlike previous NN architectures, no dense matrix multiplications \(\rightarrow\) efficient
  • can encode its meaning!
    • ex) King - Man + Woman \(\approx\) Queen


This paper introduces extensions of the original Skip-gram model.

  • 1) Subsampling of frequent words
    • speed up (2x ~ 10x)
    • improves accuracy of the representations of less frequent words
  • 2) variant of NCE (Noise Contrastive Estimation)

    • faster training
    • better vector representations

    ( compared to Hierarchical Softmax )


Limitation of word representations : “Inability to represent idiomatic phrases”

  • ex) airline “Air Canada” \(\neq\) Air + Canada
  • using vectors to represent the “whole phrase” ( “Air Canada” as one phrase )

  • How to extend from “word-based” to “phrase-based” model? SIMPLE!
    • 1) identify a large number of phrases
    • 2) treat the phrases as individual tokens
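
A minimal sketch of this two-step idea, assuming the count-based bigram score with discounting coefficient \(\delta\) from the paper; the function name `find_phrases` and the default `threshold` value are illustrative choices, not the authors' code :

```python
from collections import Counter

def find_phrases(tokens, delta=5, threshold=1e-4):
    """Merge high-scoring bigrams into single tokens, using
    score(wi, wj) = ( count(wi wj) - delta ) / ( count(wi) * count(wj) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    out, i = [], 0
    while i < len(tokens) - 1:
        wi, wj = tokens[i], tokens[i + 1]
        score = (bigrams[(wi, wj)] - delta) / (unigrams[wi] * unigrams[wj])
        if score > threshold:
            out.append(wi + "_" + wj)   # treat the phrase as one token
            i += 2                      # skip both words
        else:
            out.append(wi)
            i += 1
    if i == len(tokens) - 1:            # last word was not merged
        out.append(tokens[-1])
    return out
```

( In the paper, 2~4 such passes are run over the data with decreasing threshold, so that longer phrases can form. )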


2. The Skip-gram Model

( figure : CBOW vs Skip-gram model architectures )

CBOW vs Skip-gram

  • CBOW : predicting the current word based on the context

  • Skip-gram : given the current word, predicting words within a certain range before and after it

Training Skip-gram

  • How? “Maximize the average log probability”

    \(\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p\left(w_{t+j} \mid w_{t}\right)\).

    • \(c\) : size of the training context
    • Larger \(c\) ( = more training examples ) \(\rightarrow\) higher accuracy, slower speed
  • softmax function

    \(p\left(w_{O} \mid w_{I}\right)=\frac{\exp \left(v_{w_{O}}^{\prime}{ }^{\top} v_{w_{I}}\right)}{\sum_{w=1}^{W} \exp \left(v_{w}^{\prime}{ }^{\top} v_{w_{I}}\right)}\).

    • \(v_w\) : “input” vector representation of word \(w\)
    • \(v_w'\) : “output” vector representation of word \(w\)
    • \(W\) : number of words in the vocabulary

    \(\rightarrow\) Impractical ! Too many words! ( \(10^5 \sim 10^7\) terms )

    The cost of computing \(\nabla \log p\left(w_{O} \mid w_{I}\right)\) is proportional to \(W\)
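
A minimal sketch of how the ( center word, context word ) training pairs behind the objective above can be generated, assuming a tokenized corpus and a fixed window size `c` ( the function name and example are illustrative ) :

```python
def skipgram_pairs(tokens, c=2):
    """Yield ( center, context ) pairs : every word within c positions
    of w_t becomes a prediction target for w_t."""
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield center, tokens[t + j]

# usage
print(list(skipgram_pairs(["the", "quick", "brown", "fox"], c=1)))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ...]
```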

Why is using all \(W\) words inefficient?

  • softmax function : \(\hat{y}_{i}=P(i \mid c)=\frac{\exp \left(u_{i}^{T} v_{c}\right)}{\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right)}\).

    where \(u_i\) and \(v_c\) are column vectors of the output and input embedding matrices, respectively

    ( let \(U=\left[u_{1}, u_{2}, \ldots, u_{k}, \ldots u_{W}\right]\) be a matrix composed of \(u_{k}\) column vectors )

  • loss : Cross Entropy Loss : \(J=-\sum_{i=1}^{W} y_{i} \log \left(\hat{y}_{i}\right)\).

    where \(y\) : one-hot encoded vector & \(\hat{y}\) : softmax prediction


    \(\begin{aligned}J&=-\sum_{i=1}^{W} y_{i} \log \left(\frac{\exp \left(u_{i}^{T} v_{c}\right)}{\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right)}\right)\\ &=-\sum_{i=1}^{W} y_{i}\left[u_{i}^{T} v_{c}-\log \left(\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right)\right)\right]\\ &=-y_{k}\left[u_{k}^{T} v_{c}-\log \left(\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right)\right)\right]\end{aligned}\).


    Thus…

    \(\begin{aligned}\frac{\partial J}{\partial v_{c}}&=-\left[u_{k}-\frac{\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{c}\right) u_{w}}{\sum_{x=1}^{W} \exp \left(u_{x}^{T} v_{c}\right)}\right]\\ &=\sum_{w=1}^{W}\left(\frac{\exp \left(u_{w}^{T} v_{c}\right)}{\sum_{x=1}^{W} \exp \left(u_{x}^{T} v_{c}\right)} u_{w}\right)-u_{k}\\ &=\sum_{w=1}^{W}\left(\hat{y}_{w} u_{w}\right)-u_{k}\end{aligned}\).


    How to solve this?

    • 1) Hierarchical Softmax
    • 2) Negative Sampling
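
Before looking at these two solutions, here is a minimal NumPy sketch of the full-softmax loss and the gradient derived above ( toy sizes; `U`, `v_c`, and `k` follow the notation above ) :

```python
import numpy as np

W, d = 1000, 50                       # vocab size, embedding dim (toy)
rng = np.random.default_rng(0)
U = rng.normal(size=(d, W))           # output vectors u_1 .. u_W as columns
v_c = rng.normal(size=d)              # input vector of the center word c
k = 42                                # index of the true context word

logits = U.T @ v_c                    # u_i^T v_c for every word i -> O(W) work
y_hat = np.exp(logits - logits.max())
y_hat /= y_hat.sum()                  # softmax over the whole vocabulary

J = -np.log(y_hat[k])                 # cross-entropy loss (one-hot target at k)
grad_v_c = U @ y_hat - U[:, k]        # dJ/dv_c = sum_w y_hat_w * u_w  -  u_k
```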


2-1. Hierarchical Softmax

( details : refer to https://seunghan96.github.io/ne/03.Hierarchical_Softmax/ )

Key points

  • uses a binary tree representation of the output layer
  • instead of evaluating \(W\) output nodes, only needs to evaluate about \(\log_2(W)\) nodes
  • defines a random walk that assigns probabilities to words
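
A minimal sketch of the path-based probability, assuming each word's path from the root is given as ( inner-node vector, \(\pm 1\) direction ) pairs; this illustrates the idea, not the authors' implementation :

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(path, v_in):
    """p(w | w_I) = prod_j sigma( sign_j * v'_{n_j}^T v_{w_I} ),
    where sign_j is +1 if the path turns one way at inner node n_j, else -1.
    Only ~log2(W) inner nodes are visited instead of all W output words."""
    p = 1.0
    for node_vec, sign in path:
        p *= sigmoid(sign * (node_vec @ v_in))
    return p

# usage (toy) : a word sitting 3 levels deep in the tree
d, rng = 50, np.random.default_rng(0)
path = [(rng.normal(size=d), +1), (rng.normal(size=d), -1), (rng.normal(size=d), +1)]
print(hs_probability(path, rng.normal(size=d)))
```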


2-2. Negative Sampling

( details : refer to https://seunghan96.github.io/ne/04.Negative_Sampling/ )

Key points

  • alternative to hierarchical softmax : NCE (Noise Contrastive Estimation)

    • good model = able to differentiate data from noise, using logistic regression
    • simplified NCE = Negative Sampling
  • Negative sampling : use \(k\) negative samples ( = wrong answers )

  • (Standard) Objective function

    \(p\left(w_{O} \mid w_{I}\right)=\frac{\exp \left(v_{w_{O}}^{\prime}{ }^{\top} v_{w_{I}}\right)}{\sum_{w=1}^{W} \exp \left(v_{w}^{\prime}{ }^{\top} v_{w_{I}}\right)}\).

    \(\log p\left(w_{O} \mid w_{I}\right)=\left(v_{w_{O}}^{\prime}{ }^{\top} v_{w_{I}}\right) - \log \sum_{w=1}^{W} \exp \left(v_{w}^{\prime}{ }^{\top} v_{w_{I}}\right)\).


  • (Proposed) instead of \(\log p\left(w_{O} \mid w_{I}\right)\)…

    \(\log \sigma\left(v_{w_{O}}^{\prime}{ }^{\top} v_{w_{I}}\right)+\sum_{i=1}^{k} \mathbb{E}_{w_{i} \sim P_{n}(w)}\left[\log \sigma\left(-v_{w_{i}}^{\prime}{ }^{\top} v_{w_{I}}\right)\right]\).

  • Distinguish the target word \(w_O\) from draws from the noise distribution \(P_n(w)\) ,

    using logistic regression ( sigmoid function )
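
A minimal NumPy sketch of this objective for a single \((w_I, w_O)\) pair, assuming the \(k\) negative indices have already been drawn from \(P_n(w)\) ( variable names are illustrative ) :

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_in, out_vecs, w_o, neg_ids):
    """log sigma(v'_{w_O}^T v_{w_I}) + sum_i log sigma(-v'_{w_i}^T v_{w_I}) :
    pull the true context word towards w_I, push the k noise words away."""
    pos = np.log(sigmoid(out_vecs[w_o] @ v_in))
    neg = sum(np.log(sigmoid(-out_vecs[i] @ v_in)) for i in neg_ids)
    return pos + neg        # maximized during training

# usage (toy) : k = 5 negatives ; the paper draws them from the
# unigram distribution raised to the 3/4 power
W, d = 1000, 50
rng = np.random.default_rng(0)
out_vecs = 0.01 * rng.normal(size=(W, d))
neg_ids = rng.integers(0, W, size=5)
print(neg_sampling_objective(rng.normal(size=d), out_vecs, w_o=7, neg_ids=neg_ids))
```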


2-3. Subsampling of Frequent Words

most frequent words ( ex. “in”, “the”, “a” ) : occur hundreds of millions of times!

\(\rightarrow\) do not give every word an equal chance of being kept!

  • the more frequently a word occurs, the higher its probability of being discarded

  • each word \(w_i\) in the training set is discarded with probability

    \(P\left(w_{i}\right)=1-\sqrt{\frac{t}{f\left(w_{i}\right)}}\).

    where \(f(w_i)\) : frequency of word \(w_i\) & \(t\) : chosen threshold ( around \(10^{-5}\) )
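
A minimal sketch of this subsampling rule, assuming relative frequencies computed from the corpus and \(t = 10^{-5}\) ( the function name is illustrative ) :

```python
import random
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Discard each occurrence of word w with probability
    P(w) = 1 - sqrt( t / f(w) ), where f(w) is w's relative frequency."""
    random.seed(seed)
    counts, total = Counter(tokens), len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_discard = max(0.0, 1.0 - (t / f) ** 0.5)   # negative -> never discard
        if random.random() >= p_discard:
            kept.append(w)
    return kept
```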