1. Efficient Estimation of Word Representations in Vector Space (2013)
Table of Contents
- Abstract
- Introduction
- Model Architectures
- NNLM ( Feedforward NNLM )
- RNNLM ( Recurrent NNLM )
- New Log-Linear Models
- CBOW
- Skip-gram
Abstract
Word2Vec : (1) Skip-gram & (2) CBOW
- able to capture the similarity of words
- improvement in accuracy & speed
1. Introduction
Previous models : N-gram … treat words as atomic units / indices (O), no notion of similarity between words (X)
Proposes “distributed representations of words”
( \(\leftrightarrow\) previous ones were “sparse” representations … one-hot encoded vectors )
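A toy illustration (not from the paper; the vocabulary, dimensions, and values are made up) of the difference between a sparse one-hot vector and a dense distributed vector:

```python
import numpy as np

vocab = ["king", "queen", "apple", "fruit"]   # hypothetical toy vocabulary
V, D = len(vocab), 3                          # V = vocab size, D = embedding dim

# Sparse representation: one-hot vector of length V with a single 1 (no notion of similarity)
one_hot = np.zeros(V)
one_hot[vocab.index("queen")] = 1.0           # -> [0., 1., 0., 0.]

# Distributed representation: dense D-dimensional vector looked up from an embedding matrix
rng = np.random.default_rng(0)
embeddings = rng.normal(scale=0.1, size=(V, D))   # learned in practice, random here
dense = embeddings[vocab.index("queen")]          # length-D real-valued vector

print(one_hot)
print(dense)
```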
2. Model Architectures
This paper focuses on “distributed representations” of words, learned by NN
Training Complexity : \(O=E \times T \times Q\)
- \(E\) : number of training epochs
- \(T\) : number of words in the training set
- \(Q\) : (defined further for each model architecture)
trained with SGD & backpropagation (BP)
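A back-of-the-envelope sketch of how \(O = E \times T \times Q\) scales, using made-up values for \(E\), \(T\), and \(Q\) (not numbers reported in the paper):

```python
E = 3              # training epochs (hypothetical)
T = 1_000_000_000  # words in the training corpus (hypothetical)
Q = 50_000         # per-word cost; defined per architecture below (hypothetical)

O = E * T * Q      # total training complexity
print(f"O = {O:.2e} operations")  # reducing Q directly reduces the total cost
```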
2-1. NNLM (Feedforward NNLM)
consists of 1) input, 2) projection, 3) hidden, 4) output layers
1) Input layer
- \(N\) previous words are one-hot encoded ( = 1-of-V coding )
2) Projection layer
- dimension : \(N \times D\)
- shared projection matrix
3) Hidden layer
- projection \(\rightarrow\) hidden layer : heavy computation
( since values in the projection layer are dense )
- computes a “probability distribution” over all the words in the vocabulary
4) Output layer
- dimension : \(V\)
Time complexity : \(Q=(N\times D) + (N\times D \times H) + (H \times V)\)
- last term \((H \times V)\) is the dominating term
\(\rightarrow\) use Negative Sampling, Hierarchical softmax to reduce this!
\(\rightarrow\) from \(V\) to \(\log_2V\)
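A minimal NumPy sketch of one NNLM forward pass, assuming made-up sizes and randomly initialized weights; a full softmax over \(V\) is shown so the \(H \times V\) output step (the term hierarchical softmax would replace) is visible:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, V = 4, 8, 16, 100          # context size, embed dim, hidden dim, vocab size (made up)

P  = rng.normal(scale=0.1, size=(V, D))       # shared projection (embedding) matrix
W1 = rng.normal(scale=0.1, size=(N * D, H))   # projection -> hidden weights
W2 = rng.normal(scale=0.1, size=(H, V))       # hidden -> output weights (the H x V bottleneck)

context = [3, 17, 42, 7]                       # indices of the N previous words (made up)

x = P[context].reshape(-1)                     # projection layer: concat of N embeddings (N*D values)
h = np.tanh(x @ W1)                            # hidden layer (N*D*H multiplications)
logits = h @ W2                                # output layer (H*V multiplications, dominant cost)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax: distribution over all V words
print(probs.shape, probs.sum())                # (100,) 1.0
```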
2-2. RNNLM (Recurrent NNLM)
overcomes a limitation of NNLM ( = the need to specify the context length \(N\) in advance )
RNN :
- input, hidden, output layers (no projection layer)
- information from the past can be represented by the “hidden layer state” \(h_t\)
( updated from \(x_t\) and \(h_{t-1}\) )
Time complexity : \(Q=(H\times H) + (H \times V)\)
- last term \((H \times V)\) is the dominating term
\(\rightarrow\) use Negative Sampling, Hierarchical softmax to reduce this!
\(\rightarrow\) from \(V\) to \(\log_2V\)
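A rough sketch of the recurrent update \(h_t = f(x_t, h_{t-1})\) with made-up sizes (word vectors share the hidden dimensionality, matching the \(H \times H\) term) and a full softmax output in place of hierarchical softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 16, 100                                 # hidden size (= word-vector size here), vocab size (made up)

W_x = rng.normal(scale=0.1, size=(H, H))       # input  -> hidden
W_h = rng.normal(scale=0.1, size=(H, H))       # hidden -> hidden  (the H x H term in Q)
W_o = rng.normal(scale=0.1, size=(H, V))       # hidden -> output  (the H x V term in Q)

h = np.zeros(H)                                # initial hidden state h_0
for x_t in rng.normal(size=(5, H)):            # 5 dummy input word vectors
    h = np.tanh(x_t @ W_x + h @ W_h)           # h_t is updated from x_t and h_{t-1}
    logits = h @ W_o                           # scores over all V words (dominant cost)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # distribution over the vocabulary
print(probs.shape)
```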
3. New Log-Linear Models
propose 2 models, (1) CBOW & (2) Skip-gram
3-1. CBOW
- predicts the current (middle) word from \(N\) surrounding context words ; their vectors are projected to the same position and averaged ( word order is ignored \(\rightarrow\) “bag-of-words” )
- \(Q = (N \times D) + (D \times \log_2(V))\)
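A minimal CBOW forward-pass sketch with made-up sizes; it averages the context word vectors and, for simplicity, scores against a full softmax instead of the hierarchical softmax that gives the \(\log_2(V)\) term:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 100, 8                                   # vocab size, embedding dim (made up)
W_in  = rng.normal(scale=0.1, size=(V, D))      # input (context) embeddings
W_out = rng.normal(scale=0.1, size=(D, V))      # output weights

context = [3, 17, 42, 7]                        # indices of surrounding words (made up)
v = W_in[context].mean(axis=0)                  # average context vectors: bag of words, order ignored
logits = v @ W_out                              # score every word as the current (middle) word
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.argmax())                           # index of the most likely center word
```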
3-2. Skip-gram
- predicts the surrounding words ( within maximum distance \(C\) ) from the current word
- \(Q = C \times (D + D \times \log_2(V))\)
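And a matching Skip-gram sketch (same made-up sizes, full softmax in place of hierarchical softmax): the current word's vector is used to predict each of the surrounding words within range \(C\).

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 100, 8                                   # vocab size, embedding dim (made up)
W_in  = rng.normal(scale=0.1, size=(V, D))      # input (center-word) embeddings
W_out = rng.normal(scale=0.1, size=(D, V))      # output weights

center = 42                                     # index of the current word (made up)
context = [3, 17, 7, 55]                        # indices of the C surrounding words (made up)

v = W_in[center]                                # single projection of the center word (cost D per word)
logits = v @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -sum(np.log(probs[c]) for c in context)  # predict each of the C context words
print(loss)
```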