What’s up with word embedding?

Word embedding is one of the interesting areas of research in Natural Language Processing. There are huge amount of materials with a lot of interesting ideas. I have been studying some of them lately and in this post, I’d like to create a brief account of the ideas I have found most interesting so far.

Word2Vec (Yet another explanation)


Let’s start with the intuitive distributional hypothesis

A word is characterized by the company it keeps or linguistic items with similar distributions have similar meanings.

In other words, we expect to see words used in similar contexts have similar meaning (semantic relations).

The goal is to build a mathematical model such that given a word (token) w, we can assign real-valued vector v_w (in some space), so that interesting properties of words such as semantic relations, are preserved  for example, by using their inner product or cosine similarity of vectors as metric.

Mikolov et. al. constructed such vector representations of words from natural language text that preserve semantic relations via simple quantities of vectors such as their inner product. Their set of models is called Word2Vec. Intuitively, there are two basic ideas; for a series of words (tokens) w_1, w_2, \cdots, w_T in a corpus/Text data, and a choice of context C i.e. a window (set) of words, either find the probability of some word w_i given the some word in the context c \in C that is, p(w_i | c) or find the probability of observing a context given a word w_i, that is, p(c|w_i). The first model is called Continuous Bag of Words (CBOW) and the second model is called SkipGram. (If you’ve never heard of these terms before, I suggest reading this note.)

One of well-known tasks that Word2Vec performed well is the analogy task. Word2Vec captures the following type of relations between words:

Formally, let v_{\text{w}} be a vector representation (embedding) of a word w in some Euclidean vector space. Then

 v_{\text{man}} - v_{\text{woman}} \simeq v_{\text{king}} - v_{\text{queen}}

and one such similarity measure \simeq can be cosine similarity of vectors, for example.

SkipGram model

Given the notations above and the conditional probabilities p(c|w), the goal is to find some parameters \theta of the parametrized p(c|w; \theta) in order to maximize the corpus probability, i.e.

\text{argmax}_{\theta} \prod_{w \in W} \left(\prod_{c \in C(w)} p(c | w; \theta) \right) \;\;\;\; (\star)

There is an independent assumption (reflected in \prod_{c \in C(w)} p(c | w; \theta) ) that given a word then observing different words in a context are independent from each other and those events for each word themselves are independent (reflected in the outer \prod_{w \in \text{W}}). Note also that contexts are words and SkipGram models each context given a word independently.

One approach to model the conditional probabilities p(c | w; \theta) so as to connect the word-vector representation idea, is via softmax function

p(c | w; \theta) = \dfrac{\exp(\langle v_c, v_w \rangle)}{\sum_{c' \in C(w)}\exp(\langle v_c', v_w \rangle)} \;\;\;\; (\star \star)

where v_c, v_w are the desired vector representations for c, w, and \langle v_c, v_w \rangle is the (Euclidean) inner product. Note that \theta is the set of all v_w, v_c for all w \in W and c \in C, so there are |W| \times |C| \times d number of parameters where d is the embedding dimension.

This representation can be considered as a shallow neural network with softmax output.

To address some of the difficulties on the training such as finding the denominator in the softmax there’re two proposed solutions; 1) hierarchical softmax 2) negative sampling.  In practice negative sampling is more favorable. To get to know more I recommend reading Word2Vec explained paper and more expanded version in Word2Vec parameter learning explained.

Point-wise mutual information matrix (PMI)

Recall that if two random outcomes x, y are independent, then p(x,y) = p(x) p(y). That is, their join distribution factorizes into their individual distributions. To have a general measure of this phenomenon, given any two (not necessarily independent) random outcomes, we can define their Point-wise Mutual Information as \text{pmi}(x, y) = \log \left(\dfrac{p(x,y)}{p(x)p(y)} \right). In case of word-context pairs, we can define the PMI matrix whose entries are \text{pmi}(w, c) which is |W| \times |C| matrix.

One can use Singular Value Decomposition on the PMI matrix to get lower dimensional representations of words. Let U \Sigma V^T be the SVD of the PMI matrix, then for example, the symmetric factorization W = U \Sigma^{\frac 12} and C = V \Sigma^{\frac 12} provide word and context representations, respectively. However, SVD provides the best rank d approximation wrt L_2 matrix norm, and in practice, this is not enough!

What’s the relation between SkipGram and PMI?

SkipGram with negative sampling (SGNS) vs. PMI

Levy-Goldberg showed that if you take the word embedding W and the context embedding C obtained in SGNS, then WC^T is in fact, a factorization of (shifted) PMI matrix. In other words, \langle v_w, v_c \rangle  \simeq \text{PMI}(w, c) - \log k where k is the number of negative samples. This result bridges the neural network method with the traditional vector space model of semantics. Another important advantage is that PMI approach suffers from scalability but SGNS is very scalable, indeed.

Variants of SGNS

i) NCE:

Minh et. al used noise contrastive estimation (NCE) to model the probability of a word-context coming from correct sample vs. incorrect one. Levy-Goldberg also showed that this approach is equivalent to factorizing the word-context matrix whose entries are shifted \log p(w|c).

ii) GloVe:

Another approach is described in GloVe: Global vectors for word representations .  The idea is to related inner product of desired vectors to the ratio of the context-word conditional probabilities. The optimization function that is

\sum_{i,j} f(X_{ij}) (v_{w_i}^T v_{c_j} - \log X_{ij} + b_{w} + b_c)

where X_{ij} is the element (i, j) of the matrix of word-word co-occurence count X, b_w, b_c are biases and f is some function (found empirically) with some desired properties. Despite GloVe gives more parameters to the model, its performance is quite similar to SGNS. In fact, by fixing the biases to be the logarithm of word and context counts, GloVe also factorizes a shifted version of PMI matrix.

SGNS as weighted logistic PCA and its generalization

Levy-Goldberg have also provided a much more clear description of the SGNS model, which briefly goes like this:

After taking \log from (\star) and using the softmax (\star \star) it  becomes equivalent to

\text{argmax}_{\theta} \sum_{(w, c) \in D} \log p(c | w; \theta) = \sum_{(w, c) \in D} (v_c^T v_w - \log \sum_{c'} \exp(v_{c'}^Tv_w))

where D is the set of all word-context pairs from \text{Text}.

The role of Negative Sampling is to approximation the softmax log of probabilities. Basically, to approximate the above objective, we change it to a classification task of p(D=1|w,c) = \dfrac{1}{1 + \exp(v_c^Tv_w)}, which is given a word-context (w, c) whether it is coming from our \text{Text} or not p(D=0 | w,c). So we need to gather some noise word-contexts D'. Mikolov et. al, did it by randomly sampling from smoothed unigram distribution (i.e. to power 0.75), k context noises c_{\text{Noise}} (in short c_N) for each word w.

Note that without negative samples, the above objective can be maximized when all word and context vectors become equal  v_w = v_c with big enough inner product. Therefore, with negative sampling, the approximation goes like this;

\text{argmax}_{\theta} \prod_{(w, c) \in D} p(D=1|c,w;\theta)\prod_{(w, c) \in D'} p(D=0|c,w ; \theta)

After some standard simplifications (see Word2Vec Explained),  the above objective becomes

\text{argmax}_{\theta} \sum_{(w,c) \in D} \log \sigma(v_c^Tv_w) + \sum_{(w, c) \in D'} \log \sigma (-v_c^Tv_w) \;\;\;\; (\star \star \star)

where \sigma is the Sigmoid function.

In the presence of negative samples c_N, for each (w, c) \in D, we are computing

\log \sigma(v_c^Tv_w) + \sum_{(w, c) \in D'} \log \sigma (-v_c^Tv_w) =\log \sigma(v_c^Tv_w) + k \mathbb{E}_{c_N} [\log \sigma (-v_c^Tv_w)]

For a given (w, c) let n_{w,c} be the number of times they appear in D. (this about D like matrix, n_{w,c} is the entry in row w and column c). So the SGNS objective (equation (\star \star \star)) summed with multiplicities becomes

\sum_{(w, c)} n_{w, c} (\log \sigma(v_c^Tv_w) + k \mathbb{E}_{c_N} [\log \sigma (-v_c^Tv_w)])

Landgrad-Bellay has recently provided another interpretation of the above SGNS objective and they showed that it is equivalent to the weighted logistic PCA. The generalization of this fact is captured through exponential family PCA.

Enriching the word embedding

Another interesting idea is to do with the recent work of Avraham-Goldberg to include morphological information of words with Part of Speech (POS) tagging with preprocessing of the text and consider the pair (w, \text{POS}) instead of the word w alone. The result is having different vector representations for cases like \text{plant}_{\text{NOUN}} and \text{plant}_{\text{VERB}}.

Geometry of the word embedding

To understand geometric structures of data, one can look into Topological Data Analysis and its methods such as Persistent Homology. Zhu has an introduction of such approach for Natural Language Processing. In basic algebraic topology (with enough assumption), the dimension of zeroth homology group (zeroth Betti number) of a topological space is the number of connected components and its first Betti number counts the number of holes. Given some data points, the idea of persistent homology is to track homological classes along increasing neighborhoods of (simplicial complexes) data points.

Recently Michel et. al using persistent homological approach concluded that such method doesn’t have positive impact on document classification and clustering tasks. They have used Gromov-Haussdorff distance (which is insensitive to isometries) and defined two documents have the same “geometry” if their GH distance if zero. However, it could be argued that this definition of “geometry” is very limiting and doesn’t capture all existing structures in document data!

Generalization to graph and manifold embedding

Arora et. al. in their work, RAND-WALK: A latent variable model approach to word embeddings with generative modelling approach, provided more justifications for relations between PMI, Word2Vec and GloVe.

Hashimoto-Alvarez-Melis considered the task of word embedding as metric recovery. That is, given a word embedding v_{w_1}, \cdots, v_{w_m} over a document with s sentences with total number m words and n vocabulary (unique words), one can view p(c_j|w_i) (where context is a word as well and there’s no separate context embedding vectors, such as SGNS throwing away the context vectors) as a Markov (Gaussian) random walk X_1, \cdots, X_m with transition function

p(X_t = v_{w_j} | X_{t-1}=v_{w_i}) = \dfrac{\exp(- \|v_{w_i} - v_{w_j} \|_2^2)}{\sum_{k=1}^n \exp(- \|v_{w_i} - v_{w_k} \|_2^2)}

then there exists a sequence a_i^m, b_j^m such that in probability, as m \to \infty,

-\log(C_{ij}) - a_i^m \stackrel{p}{\longrightarrow} \|v_{w_i} - v_{w_j} \|_2^2 + b_j^m

where C = [C_{ij}] is the co-occurence matrix over our document. (matrix version of earlier D with contexts as words). This described log-linear relation between co-occurences and distance.

The metric recovery holds in more general setting for random walk over unweighted directed graphs and data manifold. Intuitively,

-\log(\text{co-occurence}) \stackrel{\text{converges}}{\longrightarrow} \text{geodesic}(v_{w_i}, v_{w_j})^2

for some meaning of convergence and distance/geodesic.

Hyperbolic embedding

We can view words as symbolic data and try to represent their relations with graphs. To learn a representation of symbolic data with hierarchical relations (with existence of power-law distribution), one well-known approach is embedding the data in a non-Euclidean space, such as d-dimensional Poincaré ball \mathbb{B}^d = \{x \in \mathbb{R}^d | \|x\|_2 < 1\} (or complex upper half-plane \mathbb{H}) which is the Euclidean d dimensional ball equipped with (non-Euclidean) Riemannian metric. In two closely published papers Chamberlain et. al studied hyperbolic embedding for 2-dimensional Poincaré ball and Nickel-Kiela for a general d-dimensional Poincaré ball. They examined such embeddings for different types of symbolic data such as text and network graphs and showed improvements as well as the ability of capturing more information in lower dimension because hyperbolic spaces are richer than flat Euclidean spaces.


Another line of ideas is related to encode-decoder (seq2seq) approach where either encoder or decoder can be any of Convolutional NN or Recurrent NN such as LSTM or GRU. The basic idea is to encode a sequence of data (sentences) and try to reconstruct them back with decoder. In the meantime, compact representations are constructed. One such successful approach has been introduced by Kiros et. al. in their Skip-Thought Vectors with GRU as encoder and decoder. Given a tuple of sentences (s_{i-1}, s_i, s_{i+1}), let w_i^t be the t-th word for sentence s_i and let v_{w_i^t} be its embedding.  The objective is to maximize the log-probabilities for the forward and backward sentences conditioned on the encoder representation:

\sum_t \log p(w_{i+1}^t | w_{i+1}^{< t}, h_i) +\sum_t \log p(w_{i-1}^t | w_{i-1}^{< t}, h_i)

where w_{i+1}^{< t} is the sequence of words in sentence s_i coming before the t-th words and h_i is the hidden state of the encoder GRU.


Thanks for reading this post!  I’m still learning and will try to update this post if I find more interesting ideas. If you have any thoughts, please comment below.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

About Me

I’m a Machine Learning Engineer. If you want to know more about me and my work, please check out my GitHub and Linkedin.


%d bloggers like this: