Embeddings in NLP (Word Vectors, Sentence Vectors)

Mahesh Patel
8 min read · Oct 2, 2020


# I have written this blog to help you decide which word embedding to use in a given situation


When you enter the world of text processing, embeddings are one of the most important topics you must know. Text classification, similarity between two texts, dimensionality reduction of text data, NER extraction, etc. are some of the popular applications of embeddings. Here I am going to walk you through all the types of embeddings, their pros and cons, and when to use which.

What is an Embedding?

It is a type of representation where we map text data to vectors of real numbers, e.g. representing "man" as a vector of length 4 (man -> [.5, .4, .8, .1]).

Why Embeddings?

  • When we use vectorization methods like TF-IDF or Bag of Words, they convert your sentences into vectors whose size equals the number of unique words in the whole vocabulary.
  • Using embeddings, we can limit the length of these vectors (dimensionality reduction).
  • When you try to model your text data, you cannot feed raw text to a model; a model needs numbers, so you have to convert text to numbers.
  • So we need embeddings to better model our text data.
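
To make the size difference concrete, here is a minimal sketch (my own illustration, not code from this post) comparing a Bag-of-Words vector, whose length grows with the vocabulary, to a short dense embedding; the embedding values are made up purely for illustration.

```python
# Minimal sketch: vocabulary-sized Bag-of-Words vectors vs. a fixed-size dense embedding.
# Assumes scikit-learn and numpy are installed; the embedding values are illustrative only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the king rules the land", "the queen rules the castle"]

# Bag of Words: one dimension per unique word in the vocabulary, mostly zeros.
bow = CountVectorizer()
X = bow.fit_transform(corpus)
print(X.shape)                       # (2, number_of_unique_words) -- grows with the vocabulary

# Embedding: each word mapped to a short, dense vector of real numbers.
embedding = {"king":  np.array([0.5, 0.4, 0.8, 0.1]),
             "queen": np.array([0.5, 0.4, 0.7, 0.9])}   # made-up values
print(embedding["king"].shape)       # (4,) regardless of vocabulary size
```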

Types of embedding

There are fundamentally two types of embedding.

  1. Word embedding.
  2. Sentence embedding.

We Will Start With Word Embeddings

  • It is a type of representation where we map the words of the text data to vectors of real numbers, e.g. representing "man" as a vector of length 4 (man -> [.5, .4, .8, .1]).

Why Word Embeddings

  • When it comes to language modelling, most of the time we start with the Bag of Words model.
  • The Bag of Words technique is old and not that effective for context-based vectors,
  • because it gives the same importance to every word (either 0 or 1).
  • So we needed new techniques that capture the context of words, and that can be achieved by word vectors.
  • It is also a form of dimensionality reduction.

Types of word Embedding

There are four major types of word embeddings:

  • Word2Vec (Google 2013)
  • GloVe (Stanford 2014)
  • FastText (Facebook 2015)
  • ELMo (2018)

1. Word2Vec

  • In 2013, Google came up with the Word2Vec model.
  • It uses a shallow neural network to perform vectorization.
  • It uses two training methods to do that: CBOW and Skip-gram (a small gensim sketch follows this list).
  • It works on the concept of predicting a word from its surrounding context (CBOW) or the surrounding context from a word (Skip-gram).
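
As a rough sketch of the two modes (my own gensim example, not this post's code; the toy sentences and hyperparameters are arbitrary), `sg=0` trains CBOW and `sg=1` trains Skip-gram:

```python
# Toy gensim sketch of Word2Vec's two training modes: CBOW (sg=0) and Skip-gram (sg=1).
from gensim.models import Word2Vec

sentences = [
    ["i", "like", "to", "sit", "near", "the", "bank", "of", "the", "river"],
    ["i", "went", "to", "the", "bank", "yesterday"],
]

cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # Skip-gram

print(cbow.wv["bank"][:5])       # first 5 dimensions of the CBOW vector for "bank"
print(skipgram.wv["bank"][:5])   # first 5 dimensions of the Skip-gram vector for "bank"
```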

What Does It Do?

  • It maps each word to a dense vector of a given dimension; during training, a softmax layer turns these vectors into probabilities (numbers between 0 and 1) for predicting context words.
  • It works on the principle that words appearing in similar contexts should have similar vectors.
  • For example, king and queen share royal properties, but their other properties differ.
  • Man and woman have different properties.
  • So if we do something like:
  • Vector("King") - Vector("Man") + Vector("Woman") ≈ Vector("Queen")
    - king = royal + man
    - queen = royal + woman
    - king - man + woman = (royal + man - man + woman) = queen
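
A hedged sketch of this analogy using pretrained vectors from gensim's downloader (the model name, download size, and exact similarity score are assumptions on my part, not from this post):

```python
# Sketch of the king - man + woman analogy with pretrained Word2Vec vectors.
# "word2vec-google-news-300" is one pretrained set available via gensim's downloader;
# it is a large download (roughly 1.6 GB) on first use.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # typically something like [('queen', 0.7...)]
```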

Use case

Here I implemented one of many use cases (Twitter hate speech classification). Please refer to the link below: link to github.

Problems with Word2Vec

  • If a word appears in two different places, Word2Vec gives the same vector in both.
  • Since the context is different, the vectors should also be different.
  • Example: "I like to sit near the bank of the river." vs. "I went to the bank yesterday."
  • The word 'bank' has a different context in each sentence.
  • Out-of-vocabulary (OOV) words are not handled by Word2Vec.
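
A small sketch of both limitations with a toy gensim model (my own illustration; the tiny corpus is only there to make the point):

```python
# Toy demonstration of Word2Vec's two limitations: context-free vectors and no OOV handling.
import numpy as np
from gensim.models import Word2Vec

sentences = [["i", "like", "to", "sit", "near", "the", "bank", "of", "the", "river"],
             ["i", "went", "to", "the", "bank", "yesterday"]]
model = Word2Vec(sentences, vector_size=50, min_count=1)

# The lookup table has exactly one vector per word type, so "bank" gets the
# same vector no matter which sentence (context) it came from.
print(np.array_equal(model.wv["bank"], model.wv["bank"]))   # True: context cannot change the vector

# Words never seen in training have no vector at all.
try:
    model.wv["riverbank"]
except KeyError:
    print("'riverbank' is out of vocabulary, so Word2Vec has no vector for it")
```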

2. GloVe Vectors(Global Vectors for word representation)

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

GloVe is based on matrix factorization of the word-context matrix. It first constructs a large matrix of (words x contexts) co-occurrence information, i.e. for each "word" (the rows), you count how frequently that word appears in some "context" (the columns) in a large corpus. The number of "contexts" is of course large, since it is essentially combinatorial in size. We then factorize this matrix to yield a lower-dimensional (words x features) matrix, where each row is the vector representation of a word. In general, this is done by minimizing a "reconstruction loss", which tries to find the lower-dimensional representations that explain most of the variance in the high-dimensional data.
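
As a toy illustration of the factorization idea only (GloVe itself minimizes a weighted least-squares loss over log co-occurrence counts, not a plain SVD; the counts below are made up):

```python
# Toy illustration: factorize a small word-context co-occurrence matrix into
# low-dimensional word vectors with truncated SVD. This shows the general
# matrix-factorization idea, not GloVe's actual weighted least-squares objective.
import numpy as np
from sklearn.decomposition import TruncatedSVD

words = ["king", "queen", "man", "woman"]
cooccurrence = np.array([        # made-up co-occurrence counts (rows = words, columns = contexts)
    [0, 8, 5, 1],
    [8, 0, 1, 5],
    [5, 1, 0, 6],
    [1, 5, 6, 0],
], dtype=float)

svd = TruncatedSVD(n_components=2)
word_vectors = svd.fit_transform(np.log1p(cooccurrence))   # (4 words x 2 features)
for word, vec in zip(words, word_vectors.round(2)):
    print(word, vec)
```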

Is GloVe Better than Word2Vec?

Both CBOW and Skip-gram are "predictive" models, in that they only take local contexts into account; Word2Vec does not take advantage of global co-occurrence statistics. GloVe, by contrast, builds on the same intuition as the co-occurrence matrix behind distributional embeddings, but factorizes that matrix into more expressive, dense word vectors. While GloVe vectors are faster to train, neither GloVe nor Word2Vec has been shown to give definitively better results; they should both be evaluated for a given dataset.

Refer to this link, which gives you a step-by-step guide to using GloVe.
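
One common way to load the pretrained Stanford GloVe vectors in Python is via gensim; the file name and the `no_header` option below are assumptions about your download and gensim version (roughly gensim 4.x):

```python
# Load pretrained GloVe text vectors (e.g. glove.6B.100d.txt from the Stanford
# GloVe download page) into gensim. `no_header=True` is needed because the GloVe
# text format has no word2vec-style header line (option available in gensim 4.x).
from gensim.models import KeyedVectors

glove = KeyedVectors.load_word2vec_format("glove.6B.100d.txt",
                                          binary=False, no_header=True)
print(glove.most_similar("river", topn=3))
```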

3. Fasttext

It has one advantage over the other two: it handles out-of-vocabulary words, which is a problem with Word2Vec and GloVe.

FastText builds on Word2Vec by learning vector representations for each word and for the character n-grams found within each word. The representations of a word and its n-grams are averaged into one vector at each training step. While this adds a lot of additional computation to training, it enables word embeddings to encode sub-word information. FastText vectors have been shown to be more accurate than Word2Vec vectors by a number of different measures.

Is FastText better than GloVe and Word2Vec? (Yes, it is)

It generates better word embeddings for rare words (even if a word is rare, its character n-grams are still shared with other words, so its embedding can still be good).

Out-of-vocabulary words: FastText can construct a vector for a word from its character n-grams even if the word doesn't appear in the training corpus; neither Word2Vec nor GloVe can do this.
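
A minimal gensim FastText sketch of this (my own toy example; the corpus and hyperparameters are arbitrary):

```python
# Toy gensim FastText sketch: sub-word (character n-gram) vectors allow a vector
# to be built even for a word never seen during training.
from gensim.models import FastText

sentences = [["i", "like", "to", "sit", "near", "the", "bank", "of", "the", "river"],
             ["i", "went", "to", "the", "bank", "yesterday"]]
model = FastText(sentences, vector_size=50, min_count=1, min_n=3, max_n=5)

# "riverbank" never appears in training, but its character n-grams
# ("riv", "ive", "ver", ..., "ban", "ank") overlap with words that do,
# so FastText can still construct a vector for it.
print(model.wv["riverbank"][:5])
print(model.wv.similarity("river", "riverbank"))
```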

For a FastText implementation, refer to this Analytics Vidhya blog.

4. ELMo

Why do we need ELMo?

All of the above give the same word vector for a word at every possible place in a sentence (e.g. "bank of the river" and "money bank" will get the same vector for "bank", even though the contexts differ, so the vectors should differ too).

ELMo gives a different word vector for each different context of the same word.

It also handles out-of-vocabulary words robustly.

All of these scenarios are explained in detail below.

ELMo is a novel way to represent words in vectors or embeddings. These word embeddings are helpful in achieving state-of-the-art (SOTA) results in several NLP tasks.

ELMo is a model that generates embeddings for a word based on the context in which it appears, thus producing slightly different embeddings for each occurrence.

How does ELMo work?

ELMo word vectors are computed on top of a two-layer bidirectional language model (biLM). The biLM has two layers stacked together, and each layer makes two passes: a forward pass and a backward pass.

  • A character-level convolutional neural network (CNN) represents the words of a text string as raw word vectors.
  • These raw word vectors act as inputs to the first layer of biLM.
  • The forward pass contains information about a certain word and the context (other words) before that word.
  • The backward pass contains information about the word and the context after it.
  • This pair of information, from the forward and backward pass, forms the intermediate word vectors.
  • These intermediate word vectors are fed into the next layer of biLM.
  • The final representation (ELMo) is the weighted sum of the raw word vectors and the 2 intermediate word vectors.
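
A toy numpy sketch of just this combination step (the shapes and weights are illustrative, not taken from the ELMo implementation):

```python
# Toy sketch of the final ELMo combination: a task-specific weighted sum of the
# raw (character-CNN) word vectors and the two biLM layer outputs.
# All numbers here are random/illustrative.
import numpy as np

num_tokens, dim = 5, 1024
raw_word_vectors = np.random.randn(num_tokens, dim)   # character-CNN outputs
bilm_layer1 = np.random.randn(num_tokens, dim)        # first biLM layer (forward + backward)
bilm_layer2 = np.random.randn(num_tokens, dim)        # second biLM layer

s = np.array([0.2, 0.3, 0.5])   # softmax-normalized layer weights, learned per downstream task
gamma = 1.0                     # task-specific scale, also learned

elmo = gamma * (s[0] * raw_word_vectors + s[1] * bilm_layer1 + s[2] * bilm_layer2)
print(elmo.shape)               # (5, 1024): one contextual vector per token
```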

As the input to the biLM is computed from characters rather than words, it captures the inner structure of the word. For example, the biLM will be able to figure out that terms like beauty and beautiful are related at some level without even looking at the context they often appear in. Sounds incredible!

How is ELMo different from other word embeddings?

Suppose we have a couple of sentences:

  1. I read the book yesterday.
  2. Can you read the letter now?
  • Take a moment to ponder the difference between these two sentences. The verb "read" in the first sentence is in the past tense, while the same verb is in the present tense in the second sentence. This is a case of polysemy, where a word can have multiple meanings or senses.
  • Traditional word embeddings come up with the same vector for the word "read" in both sentences, so the system would fail to distinguish between the polysemous senses. These word embeddings just cannot grasp the context in which the word is used.
  • ELMo word vectors successfully address this issue. ELMo representations take the entire input sentence into account when calculating the word embeddings, so the word "read" gets different ELMo vectors in different contexts (a small sketch follows this list).
  • Contextual: The representation for each word depends on the entire context in which it is used.
  • Deep: The word representations combine all layers of a deep pre-trained neural network.
  • Character based: ELMo representations are purely character based, allowing the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training.
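
A hedged sketch of the "read" example using AllenNLP's `ElmoEmbedder`; this assumes an AllenNLP version that still ships it (around 0.9.x), and it downloads the pretrained biLM weights on first use:

```python
# Sketch: the same word "read" gets different ELMo vectors in different sentences.
# Assumes allennlp ~0.9.x with ElmoEmbedder available; weights download on first use.
from allennlp.commands.elmo import ElmoEmbedder
from scipy.spatial.distance import cosine

elmo = ElmoEmbedder()

sent1 = ["I", "read", "the", "book", "yesterday"]
sent2 = ["Can", "you", "read", "the", "letter", "now"]

# embed_sentence returns an array of shape (3 layers, num_tokens, 1024);
# averaging the layers gives one vector per token.
read1 = elmo.embed_sentence(sent1).mean(axis=0)[1]   # "read" is token index 1 in sent1
read2 = elmo.embed_sentence(sent2).mean(axis=0)[2]   # "read" is token index 2 in sent2

print(1 - cosine(read1, read2))   # high but not 1.0: the context changes the vector
```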

For a practical implementation, refer to this link.

Sentence Embedding

The easiest approach is to add up the word vectors in a sentence to form a sentence vector, or to average them; a minimal sketch follows.
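
This averaging baseline is my own illustration, using a tiny Word2Vec model purely for demonstration:

```python
# Averaging word vectors into a sentence vector (a simple baseline).
import numpy as np
from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "land"],
             ["the", "queen", "rules", "the", "castle"]]
model = Word2Vec(sentences, vector_size=50, min_count=1)

def sentence_vector(tokens, wv):
    """Average the vectors of the in-vocabulary tokens of a sentence."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

print(sentence_vector(["the", "queen", "rules"], model.wv).shape)   # (50,)
```

Beyond this simple averaging baseline, several learned sentence encoders exist: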

  • Skip-thought vectors (2015): similar to the Skip-gram model, but here we predict the surrounding sentences given the current sentence. It uses an RNN-based encoder-decoder for training.
  • InferSent (2017): a classification-based approach. The target sentence and the input sentence are both encoded with the same encoder. It uses a bi-directional LSTM.
  • Quick-thought vectors (2018): uses a classification approach to predict the next sentence; the decoder is replaced by a classifier. It is faster than skip-thought vectors.
  • Multi-task learning (2018): also a classification-based approach, in which the encoder is a Transformer network.

Please see this link for a complete understanding of sentence embeddings.

Conclusion

You first have to identify your problem statement and whether it really depends on context, and if so, how much. If your problem is highly context-dependent, you should use ELMo. If your problem is less dependent on context, then Word2Vec or GloVe can work.

Next in Line :

  • Embedding using pretrained BERT
  • Details on different Sentence embeddings


# Any suggestions will be helpful …

