
What are Embeddings? How Do They Help AI Understand the Human World?

Posted on Apr 9, 2019

The term “embedding” has become common in descriptions of AI systems only in the last few years. It first appeared in the work of specialists in Natural Language Processing (NLP), where it denotes a process (or, more often, the result of that process) of transforming a language entity — a word, sentence, paragraph, or whole text — into a set of numbers: a numerical vector. In the Russian-language literature, likewise, embeddings are numerical vectors derived from words or other language entities. A numerical vector of dimension k is a list of k numbers in which the order of the numbers is strictly defined. For example, (2.3, 1.0, 7.35) is a three-dimensional vector, and (1, 0, 0, 2, 0.1, 0, 0, 7.9) is an eight-dimensional one.

In its most primitive form, a word embedding is created by simply numbering the words of some rather large dictionary and setting a value of 1 at the corresponding position of a vector whose dimension equals the number of words in the dictionary. For example, let's take Ushakov's Dictionary and number all its words from first to last. The word "abacus" then becomes, say, number 5, "lampshade" number 7, and so on. The dictionary contains 85,289 words in total, so the embedding of "abacus" is an 85,289-dimensional vector with zeros in every position except the 5th, where it is 1; the embedding of "lampshade" has zeros everywhere except the 7th position. This method of building embeddings is called unitary coding or, in the modern English literature, one-hot encoding. Any sentence in Russian can then be assigned a sequence (more correctly, from a mathematical point of view, a tuple) of such 85,289-dimensional vectors, and actions on words become actions on numerical vectors, which is what a computer does natively. However, it is not that simple. The first problem you will encounter with such embeddings is the absence from the chosen dictionary of the word whose embedding you need: look in Ushakov's Dictionary mentioned above and you will not find a word as common as "computer". The likelihood of this problem can be reduced significantly by numbering the words not of a special dictionary but of an arbitrary large collection of texts, for example Wikipedia or the Great Russian Encyclopedia. Today, special collections called text corpora are created for this purpose.
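The one-hot scheme described above can be sketched in a few lines of Python. The toy vocabulary here is invented for illustration, not taken from Ushakov's Dictionary:

```python
# A minimal sketch of one-hot encoding over a toy five-word vocabulary.
vocabulary = ["abacus", "apple", "book", "computer", "lampshade"]

def one_hot(word, vocab):
    """Return a vector of len(vocab) zeros with a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1   # raises ValueError if the word is missing
    return vec

print(one_hot("book", vocabulary))  # [0, 0, 1, 0, 0]

# The out-of-vocabulary problem described above:
try:
    one_hot("telephone", vocabulary)
except ValueError:
    print("'telephone' is not in the dictionary")
```

The `ValueError` branch is exactly the dictionary-coverage problem the text mentions: a word absent from the vocabulary simply has no vector.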


What actions on the numerical equivalents of words would we like to perform, and why? Presumably, we want the computer to act on the content of a text by itself, without human intervention. The use of corpora, however, does not by itself help derive any benefit from turning a text into a tuple of numbers. A text in a natural language is not merely a collection of words; it carries semantics and meaning, and the task of training a computer system to somehow understand that meaning, to extract semantic information from the text, is unsolvable with such primitive embeddings. The next step in NLP was therefore to take into account how often each word of a language (each term) occurs in a corpus and how important its appearance is in a specific text. Thus frequency embedding emerged, in which each word, at the position corresponding to its number, is assigned a number: TF (Term Frequency), or rather the corrected value TF-IDF. The first concept is straightforward: for each word in a text, its number of occurrences is counted and divided by the total number of words. The second is subtler. IDF, Inverse Document Frequency, is the inverse of the frequency with which a given word appears across the documents of the corpus. This factor reduces the weight of the most widely used words (prepositions, conjunctions, common terms and concepts). For each term, a given corpus provides a single IDF value. The TF-IDF score is high when a word occurs frequently in a specific text but rarely in other documents. Using embeddings in the form of such vectors, it became possible for the first time to carry out automatic semantic analysis of texts, identifying the topics in a corpus and classifying texts by their main topics.
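The two quantities just defined can be computed directly. This is a minimal sketch over an invented toy corpus, using one common variant of the formulas (raw term frequency, idf = log(N / document frequency)):

```python
import math

# A toy corpus of three "documents", each a list of words.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

def tf(term, doc):
    """Term frequency: occurrences of the term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log of (corpus size / docs containing term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" appears in most documents, so its idf (and hence tf-idf) is low;
# "mat" is rare across the corpus, so it scores higher within its document.
print(tf_idf("the", corpus[0], corpus))
print(tf_idf("mat", corpus[0], corpus))
```

The comparison at the end shows the down-weighting effect the text describes: a ubiquitous word scores lower than a rare, document-specific one even though it occurs more often.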

There are several successfully used algorithms for such analysis, among them the classic LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), and BTM (Biterm Topic Model). Such models made it possible, for example, to sort giant flows of email by subject and route them according to prescribed rules. At this stage a powerful set of technologies began to form within NLP, called NLU (Natural Language Understanding). The groundbreaking 2013 work of Tomas Mikolov and his colleagues proposed relying on the distributional hypothesis: “words with similar meanings occur in the same environments” (Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, Lake Tahoe, Nevada, United States, pages 3111–3119, 2013). Proximity here is understood broadly: only words that fit together can stand next to each other. The phrase "a wind-up alarm clock" is ordinary to us, but we cannot say "a wind-up ocean", because these words do not collocate. To obtain such properties, word embeddings must be built in a vector space whose dimension is high but independent of the number of words in the vocabulary. Each word is matched with a set of 200 to 500 numbers, and these sets satisfy the properties of a mathematical vector space: they can be added and multiplied by scalars, distances between them can be computed, and each such action on number vectors makes sense as some action on words. Most interestingly, in this multidimensional space many semantic relations between words transfer to relations between the corresponding vectors.
From a mathematical point of view, one can speak of a homomorphism between natural language and a multidimensional vector space. Virtually every publication and lecture on embeddings today is illustrated by a famous image describing this.


We can see that the semantic relation MAN ~ WOMAN is reduced, for the embeddings of these words, to a certain difference vector between them, and that this vector is surprisingly preserved for the equivalent semantic relations UNCLE ~ AUNT and KING ~ QUEEN. This allows writing down a simple mathematical relationship: WOMAN - MAN = QUEEN - KING. A simple transformation gives WOMAN - QUEEN = MAN - KING, which reads fairly: a woman without the title of queen is the same as a man without the title of king. The second image shows that embeddings also retain the "one" ~ "many" relationship. Mikolov called the method of obtaining such embeddings word2vec. It is based on a probabilistic estimate of the joint occurrence of groups of words and on a neural network trained on a corpus of texts. The idea proved fruitful, and soon even more sophisticated models appeared for embeddings of individual words, sentences, and whole documents: the GloVe model developed at Stanford, fastText developed by Facebook, and doc2vec, a model that maps a whole document to a numerical vector. In recent years, embeddings have been obtained with very complex deep learning models in order to preserve ever subtler natural language relations in the properties of the vectors. The results are so impressive that experts have hailed the appearance of models such as ELMo and BERT as a new era of embeddings (Jay Alammar, The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)).
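The KING - MAN + WOMAN ≈ QUEEN arithmetic can be demonstrated directly. The 3-dimensional vectors below are invented purely for illustration; real word2vec embeddings have hundreds of dimensions and are learned from a corpus:

```python
import math

# Hand-made toy vectors: the second coordinate loosely encodes "female",
# the third loosely encodes "royalty". Not trained embeddings.
vectors = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king":  [0.2, 0.0, 0.9],
    "queen": [0.2, 1.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Compute king - man + woman coordinate-wise.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# The nearest word to the result, by cosine similarity, is "queen".
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```

With these toy coordinates the analogy holds exactly; in a trained model it holds only approximately, which is why nearest-neighbor search rather than exact equality is used.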

Bearing in mind the complexity of models at this level, I would like to describe how embeddings are built today in the popular BERT model, developed by Google AI Language in 2018.

It is based on the neural architecture called Transformer, whose attention mechanism learns contextual relations between words (or sub-words) in a text. Each word is encoded with a unique token, and the sequence of tokens is fed to the network, which produces for each of them a multidimensional numerical vector: an embedding. BERT does not use the full Transformer architecture, only its input half, called the encoder. The image below shows the BERT architecture.
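The attention mechanism at the heart of the Transformer encoder can be sketched in a few lines. Q, K, and V here are tiny hand-made matrices; in BERT they are learned projections of the token embeddings, and there are many attention heads and layers rather than this single step:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Each output row is a weighted mix of the rows of V.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

Each query row attends most to the key it aligns with, so the first output row leans toward the first row of V and the second toward the second: this is the sense in which attention mixes contextual information across positions.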



State of the Art Language Model for NLP by Rani Horev


Before word sequences are fed into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original values of the masked words from the context provided by the other, non-masked words in the sequence. From a technical point of view, predicting the output words requires:

  • Adding a classification layer on top of the encoder output
  • Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension
  • Calculating the probability of each word in the vocabulary with softmax, the function that normalizes the activation values of the output layer of the neural network
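The three steps above can be sketched for a single masked position. All matrices here are tiny invented examples, not trained BERT weights, and the real model works with hidden sizes of 768 or more and vocabularies of tens of thousands of tokens:

```python
import math

def matvec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

hidden = [0.5, -0.2, 0.8]             # encoder output for the [MASK] position
W_cls = [[1.0, 0.0, 0.0],             # classification layer (identity here)
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
embedding_matrix = [[0.9, 0.1, 0.0],  # one row per vocabulary word (4 words)
                    [0.0, 0.8, 0.2],
                    [0.1, 0.0, 0.9],
                    [0.3, 0.3, 0.3]]

transformed = matvec(W_cls, hidden)              # step 1: classification layer
logits = matvec(embedding_matrix, transformed)   # step 2: project to vocab dimension
probs = softmax(logits)                          # step 3: probability of each word
print(probs)
```

The output is a probability distribution over the vocabulary; training pushes the probability of the true masked word toward 1.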

BERT can predict not only words but also whole sentences. During training, the BERT model receives pairs of sentences as input and learns to predict whether the second sentence of a pair follows the first in the original document. In 50% of the training inputs, the second sentence really is the subsequent sentence of the original document; in the remaining 50%, a random sentence from the corpus is substituted, on the assumption that a random sentence will be disconnected from the first.
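The construction of such training pairs can be sketched as follows. The sentences and the 50/50 split are illustrative; BERT's actual preprocessing also adds [CLS] and [SEP] tokens and segment embeddings, which are omitted here:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

document = ["The cat sat down.", "It began to purr.", "Then it fell asleep."]
corpus_sentences = ["Stocks fell on Monday.", "The recipe needs flour.",
                    "Rain is expected."]

def make_pairs(doc, corpus):
    """For each adjacent sentence pair, keep the true next sentence half the
    time (label IsNext) and substitute a random one otherwise (NotNext)."""
    pairs = []
    for first, true_next in zip(doc, doc[1:]):
        if random.random() < 0.5:
            pairs.append((first, true_next, "IsNext"))
        else:
            pairs.append((first, random.choice(corpus), "NotNext"))
    return pairs

for pair in make_pairs(document, corpus_sentences):
    print(pair)
```

The model then sees only the two sentences and must recover the label, which forces it to learn inter-sentence coherence.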

BERT can be used for a wide variety of language tasks by adding only a small additional layer of neurons to the core model.

  • Classification tasks such as sentiment analysis are performed similarly to the Next Sentence classification, adding a classification layer on top of the Transformer output for the [CLS] token.
  • In Question Answering tasks, the software receives a question regarding a text sequence and should mark the answer in the sequence. Using BERT, a Q&A model can be trained by learning two additional vectors that mark the beginning and the end of the answer.
  • In Named Entity Recognition (NER), the model receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc.) that appear in the text. Using BERT, the NER model can be trained by feeding the output vector of each token into the classification layer that predicts the NER label – geographical name, name, company name, etc.

The results of applying BERT embeddings are impressive. Beyond the usual assessment of a text's tone, positive or negative, computers have begun to detect the presence of sarcasm in text, and even lies and fear: deep features of human psychology that turn out to be expressible as algebraic relations between embeddings.


Embeddings have also opened up the possibility of operating in several natural languages at once. After all, if we construct embedding spaces for the sentences and words of English and of Russian, the same embeddings should correspond to the same semantic concepts. Such an alignment can be carried out while training a neural translator: translating a new English text then reduces to computing its embedding and decoding that embedding into the words of the target Russian text. There are search engines that accept a query in one language and search for information in any language using a reverse index built on embeddings.

Artificial intelligence (AI) faces a mass of tasks: not only understanding what a person says and choosing among possible solutions declared in advance, but also constructing solutions. These goals are pursued in AI systems with architectures built from many neural networks, genetic algorithms, decision trees, and other components, and all of them, as a rule, work efficiently when data is represented as numerical vectors. This means that all data for artificial intelligence should be represented as embeddings. The experience of word embedding in NLU described above suggests that similar homomorphic transformations should be performed on the other entities AI operates with, preserving the basic relations that objectively exist among them. Recently, several papers have appeared on embeddings of entities outside linguistics. One may doubt, though, whether either artificial or natural intelligence needs entities other than those expressed by means of a language, natural or artificial, as perceived by a human. In the end, relations between entities are described by means of language, so they can be treated on a par with relations between words, sentences, and texts. This suggests that the embedding path is promising and correct for any entities AI must operate with.

Let’s have a look at a few examples. The social platform Pinterest has created and uses 128-dimensional embeddings for the entities called Pins (pages or images from the Internet) and for Pinner users. A method similar to word2vec, the so-called Pin2Vec, was developed and used to capture the context of the relations between each user and each Pin.

The author of this article has conducted research on using embeddings to represent the legal space: articles of the criminal, civil, and labor codes, and court decisions, together with narratives (narrative texts) describing certain facts. Already today it has been possible to build a high-quality AI that can stand in for the court in qualifying case materials at the stage of drafting a decision: determining which normative acts are violated by the facts presented in a narrative. An interactive visualization offers an acquaintance with a three-dimensional space whose points correspond both to normative acts and to random textual narratives. If the link is for any reason inoperable, have a look at the following image.


The number of publications on the use of embeddings in the development of AI systems keeps growing. In general, it is already possible to say that a fairly universal approach is to build a textual description of any state of the world that an AI perceives and then construct a numerical vector image of that text, an embedding in the usual sense. This approach rests on the idea that AI must “think” in words, in linguistic form. Another idea assumes that states of the world can be transformed into embeddings bypassing verbal description: images or audio recordings, for example, can be transformed directly into multidimensional vectors. If such an embedding model is trained jointly with texts, AI will be able to operate uniformly on pictures, words, and sounds. Recently Dan Gillick of Google, in a lecture at Berkeley, proposed building AI for information retrieval by placing all objects and entities, regardless of whether they are composed of text, images, video, or audio, in the same vector space. On this principle, AI will be able to answer questions posed in various languages, in illustrations, in sound recordings, in writing or orally. What embedding dimensions such universal descriptions will require, and whether the structure and capacity of a multidimensional vector space suffice to hold all the necessary complexity and diversity of the world in which AI is to work, is a matter of current and future research.