Word embedding Projector

From Algolit

Type: Algoliterary exploration
Dataset(s): GloVe
Technique: word embeddings
Developed by: Google TensorFlow

The projector from Google's TensorFlow package makes it possible to visualize a multidimensional space by projecting it into a 2- or 3-dimensional space. This allows us to peek into the wordspace formed by the word embeddings from the datasets we use (in this example the glove.42B dataset). The projection does not show the whole dataset, but a selection of 10,000 words (or fewer).
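As a rough illustration of how such a selection could be prepared, the sketch below reads the first entries of a GloVe-style text file and turns them into two tab-separated tables, one with the vectors and one with the word labels. It assumes that each line holds a word followed by space-separated numbers, and that the selection is simply the first lines of the file; the page does not say how the 10,000 words were actually chosen. The function names and the tiny sample are made up for this example.

```python
import io

def read_glove(lines, limit=10000):
    """Parse at most `limit` entries from GloVe-style text lines.

    Assumption: each line is a word followed by space-separated floats.
    """
    words, vectors = [], []
    for line in lines:
        if len(words) >= limit:
            break
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        vectors.append([float(value) for value in parts[1:]])
    return words, vectors

def to_projector_tsv(words, vectors):
    """Return (vectors_tsv, metadata_tsv): one row per word in each."""
    vector_rows = ["\t".join(repr(v) for v in vec) for vec in vectors]
    return "\n".join(vector_rows) + "\n", "\n".join(words) + "\n"

# Tiny made-up stand-in for a GloVe file (the real vectors are much longer).
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\ndog 0.7 0.8 0.9\n")
words, vectors = read_glove(sample, limit=2)
vectors_tsv, metadata_tsv = to_projector_tsv(words, vectors)
```

With a real file, `read_glove(open(path, encoding="utf-8"))` would be used in place of the in-memory sample, where `path` stands for wherever the dataset is stored.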

Spaces with so many dimensions are impossible for a human to perceive visually. Some mathematical techniques exist to project such a space into a lower-dimensional one, in analogy to the use of perspective to depict a 3-dimensional space on a 2-dimensional plane.

The TensorFlow projector uses Principal Component Analysis (PCA) to create a projection onto the 2 or 3 dimensions in which the greatest variance of the dataset can be expressed. PCA does not change the word embeddings; it only changes the point of view by rotating the axes of the space so that the first dimensions show the largest variance (= the largest differences between the words). These first 2 or 3 dimensions are then shown on the screen. The left panel indicates how much of the variance is expressed in this projection.
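The idea of rotating the axes and keeping the first few can be sketched in a few lines of NumPy. This is a minimal textbook version of PCA via the singular value decomposition, not the projector's own code; the function name and the random stand-in data are assumptions made for the example.

```python
import numpy as np

def pca_project(vectors, n_components=3):
    """Project row vectors onto their first principal components.

    Returns the projected points and the fraction of the total
    variance that the kept components express.
    """
    # Move the origin to the centre of the point cloud.
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the new axes, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    variance = singular_values ** 2
    explained = float(variance[:n_components].sum() / variance.sum())
    return projected, explained

# Random stand-in for word vectors: 50 "words" in 10 dimensions.
rng = np.random.default_rng(0)
stand_in = rng.normal(size=(50, 10))
points_3d, explained_3d = pca_project(stand_in, n_components=3)
```

Keeping all components instead of three would leave every distance between points unchanged, which is what it means for PCA to be only a change of viewpoint; the information is lost at the moment the remaining axes are dropped.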

The TensorFlow projector also provides a t-SNE projection. t-distributed stochastic neighbor embedding (t-SNE) does not show the original wordspace; instead it builds a layout in 2 or 3 dimensions from a probability distribution that expresses how similar words are. Words that are similar, i.e. near each other in the word embedding space, will be shown near each other in the projection, while dissimilar words tend to be shown apart. In other words, the t-SNE projection tries to preserve the neighborhoods of the words in the 300-dimensional word embedding space in the 2D or 3D projection; distances between groups that are far apart are less reliable.
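To make the "probability distribution of words being similar" more concrete, the sketch below computes the two sets of probabilities that t-SNE compares, and the mismatch between them that it tries to reduce. It is a simplification: real t-SNE tunes a separate bandwidth for every point from a perplexity setting and then moves the low-dimensional points step by step, whereas here the bandwidth `sigma` is fixed and two hand-made layouts are only scored. All names and numbers are made up for illustration.

```python
import numpy as np

def squared_distances(points):
    """Matrix of squared Euclidean distances between all rows."""
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_affinities(x, sigma=1.0):
    """Neighbor probabilities in the original space (Gaussian kernel)."""
    logits = -squared_distances(x) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)  # a point is not its own neighbor
    cond = np.exp(logits - logits.max(axis=1, keepdims=True))
    cond /= cond.sum(axis=1, keepdims=True)
    return (cond + cond.T) / (2.0 * len(x))  # symmetric, sums to 1

def low_dim_affinities(y):
    """Neighbor probabilities in the layout (Student-t kernel)."""
    inv = 1.0 / (1.0 + squared_distances(y))
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()

def kl_divergence(p, q):
    """Mismatch between the two distributions; lower is better."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two tight pairs of points, far away from each other.
x = np.array([[0.0, 0, 0], [0.5, 0, 0], [6.0, 0, 0], [6.5, 0, 0]])
good_layout = np.array([[0.0, 0], [0.1, 0], [5.0, 0], [5.1, 0]])
bad_layout = np.array([[0.0, 0], [5.0, 0], [0.1, 0], [5.1, 0]])

p = high_dim_affinities(x)
good_score = kl_divergence(p, low_dim_affinities(good_layout))
bad_score = kl_divergence(p, low_dim_affinities(bad_layout))
```

The layout that keeps each pair together gets the lower score, which is the sense in which t-SNE favors keeping neighbors close.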

Both projections give us a peek into what language means when it is perceived by the computer through algorithms that create word embeddings (like GloVe or word2vec). (Dis)similarity between words is expressed by the distance between them. Associations between words that are present in the original texts as co-occurrences will be reflected in the distances in the word embedding space. They can be explored visually through these projections, or mathematically by calculating the distances in the word embedding space.
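Calculating such distances directly could look like the sketch below. It uses two common measures, cosine distance and Euclidean distance, on a handful of invented 3-dimensional vectors; the words, the numbers, and the function names are assumptions for the example and are not taken from the GloVe dataset.

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

def nearest_words(query, embeddings, k=2):
    """The k words whose vectors have the smallest cosine distance to `query`."""
    others = [word for word in embeddings if word != query]
    others.sort(key=lambda word: cosine_distance(embeddings[query], embeddings[word]))
    return others[:k]

# Invented toy vectors; real embeddings would come from the dataset.
toy = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
    "road": np.array([0.2, 0.1, 0.8]),
}
closest_to_cat = nearest_words("cat", toy, k=1)
```

Cosine distance ignores the length of the vectors and looks only at their direction, while Euclidean distance takes both into account, so the two measures can rank neighbors differently.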