Word embedding is one of the interesting areas of research in Natural Language Processing. There are huge amount of materials with a lot of interesting ideas. I have been studying some of them lately and in this post, I’d like to create a brief account of the ideas I have found most interesting so far.
Word2Vec (Yet another explanation)
Let’s start with the intuitive distributional hypothesis
A word is characterized by the company it keeps or linguistic items with similar distributions have similar meanings.
In other words, we expect to see words used in similar contexts have similar meaning (semantic relations).
The goal is to build a mathematical model such that given a word (token) we can assign real-valued vector (in some space), so that interesting properties of words such as semantic relations, are preserved for example, by using their inner product or cosine similarity of vectors as metric.
Mikolov et. al. constructed such vector representations of words from natural language text that preserve semantic relations via simple quantities of vectors such as their inner product. Their set of models is called Word2Vec. Intuitively, there are two basic ideas; for a series of words (tokens) in a corpus/Text data, and a choice of context i.e. a window (set) of words, either find the probability of some word given the some word in the context that is, or find the probability of observing a context given a word that is, The first model is called Continuous Bag of Words (CBOW) and the second model is called SkipGram. (If you’ve never heard of these terms before, I suggest reading this note.)
One of well-known tasks that Word2Vec performed well is the analogy task. Word2Vec captures the following type of relations between words:
Formally, let be a vector representation (embedding) of a word in some Euclidean vector space. Then
and one such similarity measure can be cosine similarity of vectors, for example.
Given the notations above and the conditional probabilities the goal is to find some parameters of the parametrized in order to maximize the corpus probability, i.e.
There is an independent assumption (reflected in ) that given a word then observing different words in a context are independent from each other and those events for each word themselves are independent (reflected in the outer ). Note also that contexts are words and SkipGram models each context given a word independently.
One approach to model the conditional probabilities so as to connect the word-vector representation idea, is via softmax function
where are the desired vector representations for and is the (Euclidean) inner product. Note that is the set of all for all and so there are number of parameters where is the embedding dimension.
This representation can be considered as a shallow neural network with softmax output.
To address some of the difficulties on the training such as finding the denominator in the softmax there’re two proposed solutions; 1) hierarchical softmax 2) negative sampling. In practice negative sampling is more favorable. To get to know more I recommend reading Word2Vec explained paper and more expanded version in Word2Vec parameter learning explained.
Point-wise mutual information matrix (PMI)
Recall that if two random outcomes are independent, then That is, their join distribution factorizes into their individual distributions. To have a general measure of this phenomenon, given any two (not necessarily independent) random outcomes, we can define their Point-wise Mutual Information as In case of word-context pairs, we can define the PMI matrix whose entries are which is matrix.
One can use Singular Value Decomposition on the PMI matrix to get lower dimensional representations of words. Let be the SVD of the PMI matrix, then for example, the symmetric factorization and provide word and context representations, respectively. However, SVD provides the best rank approximation wrt matrix norm, and in practice, this is not enough!
What’s the relation between SkipGram and PMI?
SkipGram with negative sampling (SGNS) vs. PMI
Levy-Goldberg showed that if you take the word embedding and the context embedding obtained in SGNS, then is in fact, a factorization of (shifted) PMI matrix. In other words, where is the number of negative samples. This result bridges the neural network method with the traditional vector space model of semantics. Another important advantage is that PMI approach suffers from scalability but SGNS is very scalable, indeed.
Variants of SGNS
Minh et. al used noise contrastive estimation (NCE) to model the probability of a word-context coming from correct sample vs. incorrect one. Levy-Goldberg also showed that this approach is equivalent to factorizing the word-context matrix whose entries are shifted
Another approach is described in GloVe: Global vectors for word representations . The idea is to related inner product of desired vectors to the ratio of the context-word conditional probabilities. The optimization function that is
where is the element of the matrix of word-word co-occurence count are biases and is some function (found empirically) with some desired properties. Despite GloVe gives more parameters to the model, its performance is quite similar to SGNS. In fact, by fixing the biases to be the logarithm of word and context counts, GloVe also factorizes a shifted version of PMI matrix.
SGNS as weighted logistic PCA and its generalization
Levy-Goldberg have also provided a much more clear description of the SGNS model, which briefly goes like this:
After taking from and using the softmax it becomes equivalent to
where is the set of all word-context pairs from
The role of Negative Sampling is to approximation the softmax log of probabilities. Basically, to approximate the above objective, we change it to a classification task of which is given a word-context whether it is coming from our or not So we need to gather some noise word-contexts . Mikolov et. al, did it by randomly sampling from smoothed unigram distribution (i.e. to power ), context noises (in short ) for each word
Note that without negative samples, the above objective can be maximized when all word and context vectors become equal with big enough inner product. Therefore, with negative sampling, the approximation goes like this;
After some standard simplifications (see Word2Vec Explained), the above objective becomes
where is the Sigmoid function.
In the presence of negative samples , for each we are computing
For a given let be the number of times they appear in (this about like matrix, is the entry in row and column ). So the SGNS objective (equation ) summed with multiplicities becomes
Landgrad-Bellay has recently provided another interpretation of the above SGNS objective and they showed that it is equivalent to the weighted logistic PCA. The generalization of this fact is captured through exponential family PCA.
Enriching the word embedding
Another interesting idea is to do with the recent work of Avraham-Goldberg to include morphological information of words with Part of Speech (POS) tagging with preprocessing of the text and consider the pair instead of the word alone. The result is having different vector representations for cases like and
Geometry of the word embedding
To understand geometric structures of data, one can look into Topological Data Analysis and its methods such as Persistent Homology. Zhu has an introduction of such approach for Natural Language Processing. In basic algebraic topology (with enough assumption), the dimension of zeroth homology group (zeroth Betti number) of a topological space is the number of connected components and its first Betti number counts the number of holes. Given some data points, the idea of persistent homology is to track homological classes along increasing neighborhoods of (simplicial complexes) data points.
Recently Michel et. al using persistent homological approach concluded that such method doesn’t have positive impact on document classification and clustering tasks. They have used Gromov-Haussdorff distance (which is insensitive to isometries) and defined two documents have the same “geometry” if their GH distance if zero. However, it could be argued that this definition of “geometry” is very limiting and doesn’t capture all existing structures in document data!
Generalization to graph and manifold embedding
Arora et. al. in their work, RAND-WALK: A latent variable model approach to word embeddings with generative modelling approach, provided more justifications for relations between PMI, Word2Vec and GloVe.
Hashimoto-Alvarez-Melis considered the task of word embedding as metric recovery. That is, given a word embedding over a document with sentences with total number words and vocabulary (unique words), one can view (where context is a word as well and there’s no separate context embedding vectors, such as SGNS throwing away the context vectors) as a Markov (Gaussian) random walk with transition function
then there exists a sequence such that in probability, as
where is the co-occurence matrix over our document. (matrix version of earlier with contexts as words). This described log-linear relation between co-occurences and distance.
The metric recovery holds in more general setting for random walk over unweighted directed graphs and data manifold. Intuitively,
for some meaning of convergence and distance/geodesic.
We can view words as symbolic data and try to represent their relations with graphs. To learn a representation of symbolic data with hierarchical relations (with existence of power-law distribution), one well-known approach is embedding the data in a non-Euclidean space, such as -dimensional Poincaré ball (or complex upper half-plane ) which is the Euclidean dimensional ball equipped with (non-Euclidean) Riemannian metric. In two closely published papers Chamberlain et. al studied hyperbolic embedding for -dimensional Poincaré ball and Nickel-Kiela for a general -dimensional Poincaré ball. They examined such embeddings for different types of symbolic data such as text and network graphs and showed improvements as well as the ability of capturing more information in lower dimension because hyperbolic spaces are richer than flat Euclidean spaces.
Another line of ideas is related to encode-decoder (seq2seq) approach where either encoder or decoder can be any of Convolutional NN or Recurrent NN such as LSTM or GRU. The basic idea is to encode a sequence of data (sentences) and try to reconstruct them back with decoder. In the meantime, compact representations are constructed. One such successful approach has been introduced by Kiros et. al. in their Skip-Thought Vectors with GRU as encoder and decoder. Given a tuple of sentences let be the -th word for sentence and let be its embedding. The objective is to maximize the log-probabilities for the forward and backward sentences conditioned on the encoder representation:
where is the sequence of words in sentence coming before the -th words and is the hidden state of the encoder GRU.
Thanks for reading this post! I’m still learning and will try to update this post if I find more interesting ideas. If you have any thoughts, please comment below.