Embeddings are a cornerstone of natural language processing (NLP), allowing words, sentences or even documents to be represented as vectors in a continuous space. Their development is based on the distributional hypothesis, which states that “the meaning of a word is found in its use”. This idea has given rise to different vector representation methods, ranging from classical approaches like TF-IDF and PMI to advanced models such as Word2Vec, GloVe and BERT.
Generally speaking, thinking about embeddings means thinking in terms of matrices. An embedding is of course a vector, but it is first and foremost a "slice" of a matrix, which imposes a fixed number of dimensions.
The distributional hypothesis
The distributional hypothesis, proposed by Zellig Harris in 1954, states that “words that appear in similar contexts have similar meanings.” This idea is crucial for the construction of embeddings, as it implies that we can infer the meaning of words by observing the contexts in which they appear. The idea is found in a quote from Firth a few years later: “You shall know a word by the company it keeps” (Firth, J. R. 1957:11)
This idea deserves to be explored and translated into concrete terms. Indeed, the notion of context is highly variable: is it the previous word, the words of the sentence, the paragraph, the document? Do we "capture" the same semantics regardless of the context chosen?
Another question arises: how do we represent the context? The simplest way, for a somewhat old-fashioned linguist, is to mentally associate a term with the words commonly found around it. We could therefore characterize dog and cat in the following way: intuitively, cat is associated with { meowing, eating, drinking, sleeping, hunting, petting, purring }, while dog would be associated with { growling, barking, eating, drinking, sleeping, hunting, petting }. In this "thought experiment" representation, the words are freely associated, and the sets of words associated with cat and dog share common points, which we would find for many domestic animals, as well as differences. We can also note that terms more specialized than dog and cat, such as spaniel, poodle, angora, etc., will probably be strongly associated with the same terms, but we can assume that they will also be associated with other words, such as breed, breeding or hair, which characterize these specialized terms more specifically.
If, rather than relying on a personal thought experiment to list the words evoked by a term, we rely on a corpus analysis, we can describe the neighboring words more precisely by associating each of them with a number: the number of times that word has been encountered in the context of the term to be characterized. We then end up with sets of (word, count) pairs, and from there it is only one step to represent each term by a sparse (i.e. non-dense) vector on which we can define a distance, for example:
cat: {(meow, 14),(eat, 24),….}
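A minimal sketch of this counting step (assuming a toy tokenized corpus and a symmetric window of two words; both are arbitrary choices for illustration):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears in the context of each other word."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

corpus = [
    ["the", "cat", "meows", "and", "purrs"],
    ["the", "dog", "barks", "and", "growls"],
    ["the", "cat", "eats", "and", "sleeps"],
]
print(dict(cooccurrence_counts(corpus)["cat"]))
# e.g. {'the': 2, 'meows': 1, 'and': 2, 'eats': 1}
```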
Frequency-based vector representations: TF-IDF and PMI
TF-IDF (Term Frequency – Inverse Document Frequency)
One of the first truly exploitable vector representation models is TF-IDF, introduced by Karen Spärck Jones in 1972 and revived in 1984 by Gerard Salton. The TF-IDF method, first used in information retrieval, weights words according to their frequency of appearance in a document while reducing the importance of common words that appear in many documents.
- TF (Term Frequency): The frequency of a word in a document.
- IDF (Inverse Document Frequency): The inverse of the number of documents containing the word, which reduces the importance of words that are frequent across the entire corpus.
One of the key points of the method is that the IDF formula introduces a nonlinearity:

$$\mathrm{idf}(t) = \log\frac{N}{|\{d \in D : t \in d\}|} \qquad\text{and}\qquad \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\times\mathrm{idf}(t)$$

where $N$ is the total number of documents in the corpus $D$. The use of the logarithm was justified a posteriori by the natural distribution of words, which follows Zipf's law.
While TF-IDF allows for efficient representation of documents, it does not directly capture the semantics of words or the relationships between them. The resulting vectors are often very sparse and large. To be comparable, TF-IDF weight vectors linked to a document must use the same vocabulary and the same “dictionary” associating a word with a position in the matrix.
It must be kept in mind that TF-IDF makes sense mainly at the document level. The documents × words matrix can certainly also be read along the word axis, but the distribution of documents within a particular corpus will significantly bias the vector associated with a word. In addition, documents will have a greater or lesser weight depending on their size.
In short, TF-IDF is a limited weighting scheme, but it starts to give interesting results, especially when using a cosine distance between document vectors, a distance that "naturally" normalizes the vector norm and thus cancels out the document weight.
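As a minimal illustration (assuming scikit-learn is available and using a toy corpus), TF-IDF vectors and the cosine similarities between documents can be computed in a few lines:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat meows and purrs",
    "the dog barks and growls",
    "the cat eats and sleeps",
]

# Sparse documents x words matrix of TF-IDF weights
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Cosine similarity between document vectors
print(cosine_similarity(X))
```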
PMI (Pointwise Mutual Information)
PMI is another statistical approach that measures the co-occurrence between two words x and y:

$$\mathrm{PMI}(x, y) = \log\frac{P(x, y)}{P(x)\,P(y)}$$
It makes it possible to detect strong associations between words. Unlike the TF-IDF matrices used to represent documents, PMI matrices are symmetrical and can only be used to characterize words. The cosine distance used on the rows or columns of these matrices gives interesting results on the proximity of words and allows us to find similar words. Depending on the co-occurrence window used, we will obtain more or less grammatical (same role), semantic (same meaning) or thematic (relating to the same subject) similarity effects.
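As a rough sketch (reusing hypothetical co-occurrence counts like the ones above and assuming NumPy), a PMI matrix can be derived from a words × words co-occurrence matrix as follows:

```python
import numpy as np

def pmi_matrix(cooc, positive=True):
    """Compute (positive) PMI from a words x words co-occurrence count matrix."""
    total = cooc.sum()
    p_xy = cooc / total                              # joint probabilities
    p_x = cooc.sum(axis=1, keepdims=True) / total    # row marginals
    p_y = cooc.sum(axis=0, keepdims=True) / total    # column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0                     # zero out log(0) cells
    return np.maximum(pmi, 0.0) if positive else pmi

# Toy symmetric co-occurrence matrix for a 3-word vocabulary
cooc = np.array([[0, 4, 1],
                 [4, 0, 2],
                 [1, 2, 0]], dtype=float)
print(pmi_matrix(cooc))
```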
Whether with TF-IDF or PMI, we obtain for the moment large sparse vectors that look like this:
cat: {(meow, 0.85), (eat, 0.21), (purr, 0.66),…}
We can't really talk about embeddings yet, because we remain in the "natural" description space: each dimension can be fully explained. Almost by definition, an embedding must express a reality embedded in another representation space. These large sparse vectors that represent documents or words do allow us to express distances, but they do not allow us to manipulate the vectors easily, because they are sparse and because their values do not follow a Gaussian distribution.
LSA/LSI: Dimension reduction and semantic analysis
Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), is based on the singular value decomposition (SVD) of a co-occurrence matrix. It allows dimensionality to be reduced and latent structures to be extracted, thus capturing certain semantic relationships.
The general principle of SVD is to find a complete matrix factorization, more specifically a specific form of factorization in which a matrix (almost any, mathematicians will forgive me the approximation) is factored in the form $U\Sigma V^T$, where U and V are "rotations" (i.e. matrices with orthonormal columns) and Σ is a diagonal rescaling matrix.
In reality, LSA is most often only interested in the U or V component, or sometimes UΣ, and above all in a small part of U or V, truncated according to the largest singular values of Σ. This is also referred to as a partial SVD, or "low-rank approximation". In other words, we don't want a total and perfect factorization, but an approximation, which has the advantage of cleaning up the noise.
The SVD thus makes it quite easy to select the "intrinsic" dimensions of the data that express the maximum variance, and it also makes it easy to regularize this variance, which has the effect of automatically regularizing the result.
The vectors produced by LSA are much smaller than classical representations such as TF-IDF and better capture similarities between words. These vectors can be used for information retrieval or document classification.
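A minimal sketch of this truncated decomposition on a hypothetical count matrix, using scikit-learn's TruncatedSVD (the matrix and the number of dimensions are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse documents x words (or words x words) count matrix
X = np.random.poisson(0.3, size=(1000, 5000)).astype(float)

# Keep only the 100 "intrinsic" dimensions carrying the most variance
svd = TruncatedSVD(n_components=100, random_state=0)
embeddings = svd.fit_transform(X)   # roughly the U * Sigma part

print(embeddings.shape)  # (1000, 100): dense, fixed-size vectors
```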
LSA/LSI was initially successful before being largely rejected by both the machine learning community and the linguistics community. We can hypothesize that the patent filed in 1988 by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman and Thomas Landauer, the "information retrieval" orientation of the idea, or perhaps its strong "psychological science" component, did not help its popularity. However, it can be considered the first method to have produced embeddings. At the time, we spoke of semantic vectors and semantic vector spaces.
However, the notion of "matrix factorization" has been pursued by other means. In 2006, the Netflix Prize shone the spotlight on a biased matrix factorization method, SVD++ (actually a gradient descent variant), whose unbiased version was already being used to perform a partial SVD without going through an algebraic approach. Indeed, SVD is an algebraic approach whose result minimizes the reconstruction error (measured, for example, by the RMSE), while gradient descent is an optimization approach that minimizes the same error. The main difference between the two is that gradient descent can also be applied to nonlinear problems, with some modifications to the loss function.
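A minimal sketch of this optimization view: a plain, unbiased low-rank factorization trained by gradient descent with NumPy (rank, learning rate and epoch count are arbitrary choices, not the SVD++ recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 300)).astype(float)  # matrix to factorize

k, lr, epochs = 20, 1e-3, 500
U = 0.1 * rng.standard_normal((X.shape[0], k))
V = 0.1 * rng.standard_normal((X.shape[1], k))

for _ in range(epochs):
    E = X - U @ V.T          # reconstruction error
    U_grad = -E @ V          # gradient of 0.5 * ||E||^2 with respect to U
    V_grad = -E.T @ U        # gradient with respect to V
    U -= lr * U_grad
    V -= lr * V_grad

rmse = np.sqrt(np.mean((X - U @ V.T) ** 2))
print(f"RMSE after training: {rmse:.3f}")
```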
In any case, we ultimately obtain dense vectors (each dimension is used for all words), of fixed and reduced size (we are typically talking about a few tens to a few hundred, or even thousands of dimensions). Most often, we will also normalize the vectors, in order to simplify the calculation of cosine distances, which can then be implemented as a simple dot product.
cat: [0.0452219, -0.1109330, 0.201193, ….]
It should be noted in passing that, without even factorizing, it is quite possible to transform a sparse vector into a small dense vector by a random projection. However, a random projection will not add anything in terms of "quality": rather than removing noise, it will add some.
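For illustration, a minimal random projection of sparse count vectors into a dense space, using scikit-learn's GaussianRandomProjection (all dimensions here are arbitrary):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# Hypothetical sparse word x context count matrix (mostly zeros)
rng = np.random.default_rng(0)
X = rng.poisson(0.01, size=(500, 20000)).astype(float)

# Project every 20,000-dimensional sparse vector down to 128 dense dimensions
projector = GaussianRandomProjection(n_components=128, random_state=0)
dense = projector.fit_transform(X)

print(dense.shape)  # (500, 128)
```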
Word2Vec: capturing semantic relationships
Developed by Google in 2013, Word2Vec relies on shallow neural networks to learn word embeddings. It offers two architectures:
- CBOW (Continuous Bag-of-Words): Predicts a target word from its context.
- Skip-gram: Predicts the context of a given target word.
Word2Vec vectors are typically dense and fixed-sized (e.g., 300 dimensions). They make it possible to carry out operations such as searching for the nearest neighbours (e.g. identifying synonyms) or analogy between words (“king – man + woman ≈ queen”).
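A minimal sketch with the gensim library (the toy corpus makes the resulting vectors meaningless; real use requires a large corpus):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "meows", "and", "purrs"],
    ["the", "dog", "barks", "and", "growls"],
    ["the", "cat", "eats", "and", "sleeps"],
]

# sg=1 selects the Skip-gram architecture (sg=0 would be CBOW)
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)            # dense, fixed-size vector: (100,)
print(model.wv.most_similar("cat"))     # nearest neighbours by cosine similarity
# Analogies are queried the same way, e.g. on a properly trained model:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])
```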
It should be borne in mind that since the early 2000s, semantic algebra had already been evoked for information retrieval in the framework of LSI, the idea being to express a negation or an addition (the OR operator) when searching for the nearest neighbors of LSI semantic vectors. The exceptional success of Word2Vec embeddings at semantic algebra played a major role in its popularity, offering both a striking pedagogical illustration and performance well above the competition, even if, let's be honest, no real-world application really uses word semantic algebra. Choosing a task where the method excels is always a good strategy in scientific communication, but Word2Vec did not directly offer a very interesting variant for characterizing a document or even a sentence, mainly because combining word embeddings basically still requires weighting words roughly inversely to their frequency (in Word2Vec, the natural norm of a word vector is proportional to its frequency).
Autoencoders: Unsupervised learning of representations
Autoencoders are neural networks designed to encode input into a low-dimensional latent space before reconstructing it. In NLP, they allow embeddings to be generated by learning efficient compression of textual data, facilitating tasks such as dimension reduction and learning semantic representations. They are a bit out of fashion, yet they have been at the origin of very significant advances in the field of unsupervised learning.
To summarize the basic idea perhaps more memorably: an autoencoder is a neural network whose training target is the same as its input, but which is forced to pass through a smaller middle layer. In other words, we train a neural network to forget and then recover its input. In doing so, the network is forced to create a more compact and "synthetic" representation of its inputs.
The resulting vectors can be used for anomaly detection or text classification, like all embeddings.
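A minimal sketch of this idea in PyTorch; the input is a generic fixed-size vector (e.g. a TF-IDF or bag-of-words representation), and all sizes are arbitrary choices for illustration:

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=2000, latent_dim=64):
        super().__init__()
        # Encoder: compress the input into a small latent vector
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # Decoder: reconstruct the input from the latent vector
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)              # the embedding
        return self.decoder(z), z

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(32, 2000)                 # a batch of hypothetical document vectors
for _ in range(100):
    reconstruction, embedding = model(x)
    loss = loss_fn(reconstruction, x)    # the target is the source itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(embedding.shape)  # (32, 64): compact representations of the inputs
```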
GloVe: a Co-occurrence matrices approach
GloVe (Global Vectors for Word Representation) combines the advantages of Word2Vec and co-occurrence matrices:
- It uses a matrix of co-occurrence of words over a large corpus.
- It optimizes an objective function based on the logarithm of co-occurrence probabilities (see the objective sketched after this list).
- It produces embeddings capturing analogical relationships and semantic structures.
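For reference, the GloVe objective (Pennington et al., 2014) is a weighted least-squares fit of dot products to log co-occurrence counts:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2,
\qquad
f(x) = \begin{cases}(x/x_{\max})^{\alpha} & \text{if } x < x_{\max}\\ 1 & \text{otherwise}\end{cases}$$

where $X_{ij}$ is the number of times word $j$ appears in the context of word $i$.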
In reality, apart from the probabilistic foundations of GloVe and gradient descent learning more rooted in a well-formed theoretical justification, the approach is very similar to LSA on co-occurrence matrices, or its “probabilistic” version, pLSI (Probabilistic LSI).
GloVe vectors are fixed and pre-trained on large corpora, allowing them to be used directly for tasks such as text classification or lexical similarity.
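A quick sketch of this direct use, assuming the gensim downloader and one of its pre-trained GloVe models:

```python
import gensim.downloader as api

# Downloads the pre-trained GloVe vectors on first use
glove = api.load("glove-wiki-gigaword-100")

print(glove["cat"].shape)                 # (100,)
print(glove.most_similar("cat", topn=5))  # nearest neighbours by cosine similarity
```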
BERT and the Transformers: beyond embeddings
The advent of Transformer-based models, including BERT (Bidirectional Encoder Representations from Transformers), revolutionized word representation, or more accurately, gave this line of work an actual use for a task that happens to have real-world applications: predicting the next word in order to produce human-like text.
The Transformers Mechanic
Introduced in the article Attention is All You Need (Vaswani et al., 2017), Transformers are based on the attention mechanism, which allows each word to focus on the other relevant words in the sentence. Unlike Word2Vec, Transformers take the two-way context into account, thus offering richer, contextualized representations. The details of the mechanism are beyond the scope of this article, but the key idea is to be able to take a long preceding text as input, adding a kind of "summary" operator that looks at interesting things in the past, so that predicting the next word can be much smarter than with a short context or with a recurrent network, which has a long but brittle memory.
BERT: A contextual representation of words
BERT is pre-trained on large amounts of text using two tasks:
- Masked Language Model (MLM): Predict hidden words in a sentence.
- Next Sentence Prediction (NSP): Determine whether one sentence follows another.
This model produces context-dependent embeddings, which significantly improves performance on many NLP tasks. BERT-generated vectors are used in a variety of applications such as sentiment analysis, text classification, and question answering.
Leaving aside the disruptive elements of transformers (attention mechanism, positional encoding), the embeddings themselves are only moderately better than those of approaches like GloVe or even some variants of LSA. Indeed, to extract a text embedding from a transformer, we must first consider 1) the tokenization mechanism and 2) the mechanism by which the embedding is intercepted inside the network.
Whereas in LSA or GloVe tokens are words and contexts are co-occurrence windows, in BERT and other transformers the "task" for which embeddings are optimized is almost exclusively to predict the next (or a masked) token. The information embedded in a sentence representation is therefore very rich and mixes many aspects that are not necessarily useful for classification or nearest-neighbor search. Literally, two transformer embeddings can be identical if the text that can be expected to follow is the same, even if the beginnings of the texts differ. Word embeddings taken from LLMs are therefore of limited interest for such tasks, because they mix information about spelling, grammar, semantics, theme, etc., whereas embeddings are very often used with only one of these aspects in mind.
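As an illustration of this "interception", one common (though not the only) recipe is to mean-pool the last hidden states of a BERT model via the Hugging Face transformers library:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The cat meows and purrs.", "The dog barks and growls."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the contextual token embeddings, ignoring padding tokens
mask = inputs["attention_mask"].unsqueeze(-1)           # (batch, seq, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
embeddings = summed / mask.sum(dim=1)

print(embeddings.shape)  # (2, 768)
```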
What to do with embeddings?
First of all, embeddings can be used to find similar elements. An example is a comparison made a few years ago between neighbors according to LSA, Word2Vec and NCISC (Babbar's in-house algorithm); the table below lists the nearest neighbors of the word iffy for each method, together with each neighbor's corpus frequency.
| LSA: Distance | LSA: Word | LSA: Word frequency | Word2Vec: Distance | Word2Vec: Word | Word2Vec: Word frequency | NCISC: Distance | NCISC: Word | NCISC: Word frequency |
|---|---|---|---|---|---|---|---|---|
| 1.000 | iffy | 279 | 1.000 | iffy | 279 | 1.000 | iffy | 279 |
| 0.809 | nitpicky | 125 | 0.971 | nitpicky | 125 | 0.840 | debatable | 711 |
| 0.807 | miffed | 69 | 0.969 |  | 265 | 0.835 | keepable | 67 |
| 0.805 | Shaky | 934 | 0.968 | rediculous | 104 | 0.830 | contestable | 179 |
| 0.804 | sketchy | 527 | 0.967 | far-fetched | 262 | 0.822 | sketchy | 527 |
| 0.798 | Clunky | 372 | 0.965 | counter-intuitive | 185 | 0.814 | unsatisfactory | 1176 |
| 0.795 | Dodgy | 797 | 0.964 | presumptuous | 126 | 0.812 | unclear | 4445 |
| 0.792 | Fishy | 494 | 0.964 | contestable | 179 | 0.805 | nitpicky | 125 |
| 0.788 | listy | 211 | 0.964 | usefull | 183 | 0.804 | us-centric | 170 |
| 0.787 | Picky | 397 | 0.962 | Clunky | 372 | 0.802 | Dodgy | 797 |
| 0.785 | underexposed | 73 | 0.962 | counterintuitive | 203 | 0.798 | salvagable | 118 |
| 0.784 | unsharp | 130 | 0.962 | un-encyclopedic | 101 | 0.798 | Shaky | 934 |
| 0.775 | choppy | 691 | 0.961 | worrisome | 213 | 0.797 | counter-intuitive | 185 |
| 0.770 | nit-picky | 90 | 0.961 | self-explanatory | 156 | 0.796 | Ambiguous | 41 |
| 0.763 | Fiddly | 43 | 0.961 | unecessary | 143 | 0.791 | Offputting | 33 |
| 0.762 | muddled | 369 | 0.960 | nit-picky | 80 | 0.790 | questionable | 6541 |
| 0.762 | Wonky | 196 | 0.959 | wordy | 413 | 0.790 | notable | 78 |
| 0.762 | disconcerting | 226 | 0.958 | disconcerting | 226 | 0.786 | unconvincing | 567 |
| 0.760 | Neater | 121 | 0.958 | disingenuous | 534 | 0.781 | wrong | 12041 |
| 0.760 | dissapointed | 32 | 0.958 | off-putting | 104 | 0.780 | Clunky | 372 |
The objective of this comparison was to measure the extent to which the different algorithms encode word frequency in their embeddings. We found that LSA proposed neighbors with a frequency varying from 0.1 to 3 times the frequency of the queried word (iffy in this case), that Word2Vec gave neighbors with a frequency between 0.3 and 2 times that frequency, and that NCISC gave neighbors with a frequency between 0.1 and 43 times the frequency of the queried word. These results showed that LSA and Word2Vec embeddings encoded something other than the meaning of words, in particular their frequency, and that this was all the more true for Word2Vec.
In addition, embeddings can be used for classification, for clustering (which consists of grouping similar elements together), and of course for creating embeddings of combined elements, for example sentences or documents. Creating embeddings of combined elements remains a problem of its own: the "bag of words" approach, which consists of taking a weighted sum of the embeddings of the words in a sentence or document, has certain limitations, even though it works extremely well for documents beyond a certain size.
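A minimal sketch of such a weighted combination (assuming word vectors and relative word frequencies are available; the smooth inverse-frequency weighting is one simple choice among others, following the earlier remark on down-weighting frequent words):

```python
import numpy as np

def document_embedding(tokens, word_vectors, word_frequency, a=1e-3):
    """Weighted average of word vectors, down-weighting frequent words."""
    vectors, weights = [], []
    for token in tokens:
        if token in word_vectors:
            vectors.append(word_vectors[token])
            # Smooth inverse-frequency weight: frequent words count less
            weights.append(a / (a + word_frequency.get(token, 0.0)))
    if not vectors:
        return None
    embedding = np.average(np.array(vectors), axis=0, weights=weights)
    return embedding / np.linalg.norm(embedding)  # normalize so cosine = dot product

# Hypothetical inputs: word_vectors could come from Word2Vec or GloVe,
# word_frequency from relative corpus counts.
word_vectors = {"cat": np.random.rand(100), "meows": np.random.rand(100)}
word_frequency = {"cat": 1e-4, "meows": 1e-6}
print(document_embedding(["the", "cat", "meows"], word_vectors, word_frequency).shape)
```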
Conclusion
Embeddings have evolved from simple statistical approaches to complex neural models with billions of parameters. With the rise of GPT- and BERT-like models, vector representations of words have never been more powerful, making NLP applications more capable and sophisticated than ever. The vectors obtained from these models enable advanced applications such as named entity recognition, text generation, and automatic summarization.