Bengio, et al., 2003

Summary

  • associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in $\mathbb{R}^m$)
  • express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence
  • learn simultaneously the word feature vectors and the parameters of that probability function

For discrete random variables, learning a joint probability distribution is hard because swapping one word for another can change the value of the function to be estimated arbitrarily; there is no smoothness to exploit. Mapping the discrete words into a continuous vector space $\mathbb{R}^m$ instead allows the use of smooth approximators such as neural networks or Gaussian mixture models, so probability mass learned from a training sentence generalizes to similar sentences. It also gives a well-defined notion of a “nearby” word: one whose feature vector is close in the continuous space.
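
As a toy illustration (not from the paper) of what “nearby” means once words live in a continuous space, one can rank words by cosine similarity between their rows of the learned feature matrix. Everything here is a stand-in: the random matrix and the `nearest_words` helper are hypothetical, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(10_000, 60))              # stand-in for a learned |V| x m feature matrix

def nearest_words(word_idx, k=5):
    v = C[word_idx]
    sims = C @ v / (np.linalg.norm(C, axis=1) * np.linalg.norm(v))   # cosine similarity to every word
    return np.argsort(-sims)[1:k + 1]          # indices of the k most similar words, skipping the word itself

print(nearest_words(42))
```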

Words are mapped through a matrix $C$ of size $|V| \times m$, for vocabulary size $|V|$ and embedding dimension $m$: word $i$'s feature vector is the $i$-th row $C(i)$. These feature vectors are learned simultaneously with the parameters of the neural network. The input to the feed-forward network is the concatenation of the feature vectors of the $n-1$ preceding words, and the output is a probability distribution over the next word. Training maximizes the (regularized) log-likelihood of the training corpus, with generalization measured by the perplexity of out-of-sample text. In spirit, this learned matrix $C$ is the embedding step that the encoder in Neural Machine Translation later builds on.
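
A minimal sketch of this architecture, assuming PyTorch and illustrative hyperparameters (the class name `NPLM`, the sizes, and the batch of random indices are mine, not the paper's); the direct input-to-output connections used in some of the paper's configurations are omitted.

```python
import torch
import torch.nn as nn

class NPLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, context_size):
        super().__init__()
        self.C = nn.Embedding(vocab_size, embed_dim)            # the |V| x m matrix C
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)         # the expensive |V|-way output layer

    def forward(self, context):                                 # context: (batch, n-1) word indices
        x = self.C(context).flatten(1)                          # look up and concatenate n-1 feature vectors
        h = torch.tanh(self.hidden(x))                          # tanh hidden layer
        return self.output(h)                                   # logits over the next word

# Training minimizes cross-entropy, i.e. maximizes the log-likelihood of the next word.
model = NPLM(vocab_size=10_000, embed_dim=60, hidden_dim=50, context_size=4)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

contexts = torch.randint(0, 10_000, (32, 4))   # a batch of 4-word contexts (random stand-in data)
targets = torch.randint(0, 10_000, (32,))      # the word that actually follows each context
opt.zero_grad()
loss = loss_fn(model(contexts), targets)
loss.backward()
opt.step()
```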

Things of interest

  • The curse of dimensionality is a major stumbling block in modeling the joint probability of sequences of words in a language
  • One can reduce the difficulty by using the fact that temporally closer words in the word sequence are statistically more dependent $\rightarrow$ n-gram models
  • The authors note that the n-gram models and the neural model make different kinds of “errors”, so averaging the two models improved results overall (a sketch of this interpolation follows the list)
  • The authors also suggest training multiple smaller networks on partitions of the training data to speed things up
  • Learning the word embeddings is highly parallelizable; specifically, the output layer of the neural model was found to take up roughly 99.7% of the computation, since the softmax must be normalized over the entire vocabulary (a rough operation count is sketched after this list)
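
A minimal sketch of the averaging idea mentioned above, assuming the two models already expose next-word probability distributions over the same vocabulary; the names `p_neural` and `p_ngram` and the fixed weight `alpha` are illustrative, and how the weight is actually chosen (e.g. on validation data) is a separate matter.

```python
import numpy as np

def mixture_prob(p_neural, p_ngram, alpha=0.5):
    """Weighted average of two next-word distributions; alpha is a tunable mixture weight."""
    return alpha * p_neural + (1.0 - alpha) * p_ngram

V = 10_000
rng = np.random.default_rng(0)
p_neural = rng.dirichlet(np.ones(V))   # stand-in next-word distribution from the neural model
p_ngram = rng.dirichlet(np.ones(V))    # stand-in next-word distribution from the n-gram model
p_mix = mixture_prob(p_neural, p_ngram)
assert np.isclose(p_mix.sum(), 1.0)    # a convex combination of distributions is still a distribution
```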
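
And a back-of-the-envelope count of why the output layer dominates (illustrative sizes, not the paper's exact ones): it needs about $h \times |V|$ multiplies per prediction, versus $(n-1) \times m \times h$ for the hidden layer, and $|V|$ is far larger than $(n-1) \times m$.

```python
# Rough per-prediction multiply counts for the feed-forward model sketched earlier.
V, m, h, n_minus_1 = 18_000, 60, 50, 4           # illustrative sizes

hidden_mults = (n_minus_1 * m) * h               # input-to-hidden layer (embedding lookup costs no multiplies)
output_mults = h * V                             # hidden-to-output layer (softmax over |V| words)

total = hidden_mults + output_mults
print(f"output layer share: {output_mults / total:.1%}")   # the |V|-way output dominates
```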