Word2Vec: Understanding Word Vectors

Mahesh Patel, Data Science Researcher

Jan 26, 2021

Aim of this article: to get familiar with Word2Vec and its types, and to implement it in Python using Gensim.

What is an Embedding?

An embedding is a representation that maps text to vectors of real numbers, e.g. representing "man" as a vector of length 4 (man -> [0.5, 0.4, 0.8, 0.1]). Word2Vec was introduced by Google in 2013.

Why Word2Vec Embedding?

  • Vectorization methods like TF-IDF and Bag of Words convert each sentence into a vector whose size equals the number of unique words in the whole vocabulary.
  • With Word2Vec embeddings we can limit the length of the vectors (dimensionality reduction).
  • A model cannot be fed raw text; it needs numbers, so the text has to be converted to numbers first.
  • Embeddings give us a compact numeric representation that models our text data better (see the short sketch after this list).
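
Here is a quick sketch of that size difference, assuming scikit-learn is installed alongside Gensim (the toy corpus and the vector size of 10 are arbitrary choices for illustration):

# Quick sketch: compare the vector size produced by Bag of Words with
# the fixed, chosen size of a Word2Vec embedding.
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec

corpus = ['the king rules the land', 'the queen rules the land too']

# Bag of Words: one dimension per unique word in the whole vocabulary
bow = CountVectorizer().fit_transform(corpus)
print(bow.shape)  # (2, number_of_unique_words)

# Word2Vec: every word gets a dense vector of a fixed, chosen length
tokenized = [sentence.split() for sentence in corpus]
w2v = Word2Vec(tokenized, size=10, min_count=1)  # size is vector_size in Gensim 4+
print(w2v.wv['king'].shape)  # (10,)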

What Does It Do?

Graphical Representation of Word2Vec
  • It converts each word into a dense vector of a chosen dimension; during training, a softmax layer turns scores into probabilities (numbers between 0 and 1) for predicting context words.
  • It works on the principle that words appearing in the same context in a sentence should get similar vectors.
  • For example, king and queen share "royal" properties, while their other properties differ.
  • Man and woman have different properties.
  • So we can do something like the following (see the sketch after this list):
  • Vector("King") - Vector("Man") + Vector("Woman") ≈ Vector("Queen")
    > king = royal + man
    > queen = royal + woman
    > king - man + woman = (royal + man - man + woman) = queen
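
The classic analogy above can be checked with pretrained vectors. The snippet below is only an illustrative sketch: it assumes internet access and the gensim.downloader module, and the pretrained 'word2vec-google-news-300' vectors are a large download, so treat it as optional.

# Illustrative sketch: reproduce king - man + woman ≈ queen with
# pretrained vectors (large download; requires internet access).
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')  # pretrained Google News vectors
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# expected to rank 'queen' at or near the top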
CBOW and SKIP-GRAM

Types of Word2Vec embedding

  1. CBOW (Continuous Bag of Words)
  2. Skip-gram

CBOW:

The CBOW model learns to predict a target word from the words in its neighborhood. The sum of the context word vectors is used to predict the target word. The neighboring words taken into consideration are determined by a pre-defined window size surrounding the target word.

Skip-gram:

The Skip-gram model, on the other hand, learns to predict a word based on a neighboring word. To put it simply, given a word, it learns to predict another word in its context.
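
To make the difference concrete, here is a minimal sketch that trains both variants on the same toy data; only the sg flag changes (Gensim 3.x parameter names are assumed, use vector_size instead of size in Gensim 4+):

# Minimal sketch: the same toy data trained with CBOW (sg=0) and Skip-gram (sg=1).
from gensim.models import Word2Vec

sentences = [['the', 'king', 'rules', 'the', 'land'],
             ['the', 'queen', 'rules', 'the', 'land', 'too']]

cbow_model = Word2Vec(sentences, size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram_model = Word2Vec(sentences, size=50, window=2, min_count=1, sg=1)  # Skip-gram

print(cbow_model.wv['king'][:5])
print(skipgram_model.wv['king'][:5])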

Implementation using Gensim:

First install Gensim: pip install --upgrade gensim

Gensim provides the Word2Vec class for working with a Word2Vec model.

Learning a word embedding from text involves loading and organizing the text into sentences and providing them to the constructor of a new Word2Vec() instance. For example:

sentences = …
model = Word2Vec(sentences)

Specifically, each sentence must be tokenized, meaning divided into words and prepared (e.g. pre-filtered and converted to a consistent case).

The sentences can be text already loaded into memory, or an iterator that progressively loads text, which is required for very large text corpora (see the sketch below).
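
For very large corpora, a common pattern is a small iterable class that streams one tokenized sentence at a time from a file instead of holding everything in memory. The sketch below is illustrative: the file name 'corpus.txt' and the simple whitespace tokenization are assumptions.

# Illustrative sketch: stream one tokenized sentence per line of a file,
# so the whole corpus never has to fit in memory. Because __iter__ reopens
# the file, the stream can be iterated multiple times, which Word2Vec needs.
from gensim.models import Word2Vec

class SentenceStream:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.lower().split()

model = Word2Vec(SentenceStream('corpus.txt'), min_count=5)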

There are many parameters on this constructor; a few noteworthy arguments you may wish to configure are:

  • size: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector representing each token (word). (Renamed vector_size in Gensim 4 and later.)
  • window: (default 5) The maximum distance between a target word and the words around it.
  • min_count: (default 5) The minimum count of words to consider when training the model; words with fewer occurrences than this will be ignored.
  • workers: (default 3) The number of threads to use while training.
  • sg: (default 0, i.e. CBOW) The training algorithm: CBOW (0) or Skip-gram (1).

The defaults are often good enough when just getting started. If you have a lot of cores, as most modern computers do, I strongly encourage you to increase workers to match the number of cores (e.g. 8).
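
To make those arguments concrete, here is a minimal sketch that passes each of them explicitly with its default value (Gensim 3.x names; in Gensim 4+ size becomes vector_size, and the placeholder corpus is just for illustration):

# Minimal sketch: the constructor arguments listed above, passed explicitly
# with their default values (Gensim 3.x; use vector_size in Gensim 4+).
from gensim.models import Word2Vec

sentences = [['a', 'tiny', 'placeholder', 'corpus']] * 10  # stands in for your tokenized text

model = Word2Vec(
    sentences,
    size=100,     # dimensionality of each word vector
    window=5,     # max distance between the target word and its context words
    min_count=5,  # ignore words that occur fewer than 5 times
    workers=3,    # training threads; raise this to match your CPU cores
    sg=0,         # 0 = CBOW, 1 = Skip-gram
)
print(model)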

After the model is trained, the learned word vectors are accessible via the "wv" attribute. This is the actual word vector model (a KeyedVectors object) against which queries can be made.

For example, you can print the learned vocabulary of tokens (words) as follows:

words = list(model.wv.vocab)  # Gensim 3.x; in Gensim 4+ use list(model.wv.key_to_index)
print(words)

You can review the embedded vector for a specific token as follows:

print(model.wv['word'])  # model['word'] also works in Gensim 3.x but is deprecated

The learned word vectors can be saved by calling the save_word2vec_format() function on the word vector model (the wv attribute).

Setting binary=True saves the vectors in a compact binary format to save space. For example:

model.wv.save_word2vec_format('model.bin', binary=True)

When getting started, you can save the learned vectors in plain-text (ASCII) format and review the contents.

You can do this by setting binary=False (which is actually the default) when calling the save_word2vec_format() function, for example:

model.wv.save_word2vec_format('model.txt', binary=False)

Note that save_word2vec_format() exports only the word vectors; those files are loaded back with KeyedVectors.load_word2vec_format(). A full model saved with model.save() (as in the complete example below) is loaded again by calling the Word2Vec.load() function. For example:

model = Word2Vec.load('model.bin')
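
For completeness, here is a minimal sketch of loading the exported vector file from the example above with KeyedVectors.load_word2vec_format() (the file name simply reuses the one from above):

# Minimal sketch: reload the exported word vectors (not the full model).
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('model.txt', binary=False)
# for the binary export above, pass binary=True instead
print(wv['word'])  # same lookup API as model.wv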

Let's put it all together:

from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary (Gensim 3.x; in Gensim 4+ use list(model.wv.key_to_index))
words = list(model.wv.vocab)
print(words)
# access the vector for one word
print(model.wv['sentence'])
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

Running the above code produces output like the following (the exact vector values will differ from run to run, since the weights are initialized randomly):

Word2Vec(vocab=14, size=100, alpha=0.025)
[‘second’, ‘sentence’, ‘and’, ‘this’, ‘final’, ‘word2vec’, ‘for’, ‘another’, ‘one’, ‘first’, ‘more’, ‘the’, ‘yet’, ‘is’]
[ -4.61881841e-03 -4.88735968e-03 -3.19508743e-03 4.08568839e-03
-3.38211656e-03 1.93076557e-03 3.90265253e-03 -1.04349572e-03
4.14286414e-03 1.55219622e-03 3.85653134e-03 2.22428422e-03
-3.52565176e-03 2.82056746e-03 -2.11121864e-03 -1.38054823e-03
-1.12888147e-03 -2.87318649e-03 -7.99703528e-04 3.67874932e-03
2.68940022e-03 6.31021452e-04 -4.36326629e-03 2.38655557e-04
-1.94210222e-03 4.87691024e-03 -4.04118607e-03 -3.17813386e-03
4.94802603e-03 3.43150692e-03 -1.44031656e-03 4.25637932e-03
-1.15106850e-04 -3.73274647e-03 2.50349124e-03 4.28692997e-03
-3.57313151e-03 -7.24728088e-05 -3.46099050e-03 -3.39612062e-03
3.54845310e-03 1.56780297e-03 4.58260969e-04 2.52689526e-04
3.06256465e-03 2.37558200e-03 4.06933809e-03 2.94650183e-03
-2.96231941e-03 -4.47433954e-03 2.89590308e-03 -2.16034567e-03
-2.58548348e-03 -2.06163677e-04 1.72605237e-03 -2.27384618e-04
-3.70194600e-03 2.11557443e-03 2.03793868e-03 3.09839356e-03
-4.71800892e-03 2.32995977e-03 -6.70911541e-05 1.39375112e-03
-3.84263694e-03 -1.03898917e-03 4.13251948e-03 1.06330717e-03
1.38514000e-03 -1.18144893e-03 -2.60811858e-03 1.54952740e-03
2.49916781e-03 -1.95435272e-03 8.86975031e-05 1.89820060e-03
-3.41996481e-03 -4.08187555e-03 5.88635216e-04 4.13103355e-03
-3.25899688e-03 1.02130906e-03 -3.61028523e-03 4.17646067e-03
4.65870230e-03 3.64110398e-04 4.95479070e-03 -1.29743712e-03
-5.03367570e-04 -2.52546836e-03 3.31060472e-03 -3.12870182e-03
-1.14580349e-03 -4.34387522e-03 -4.62882593e-03 3.19007039e-03
2.88707414e-03 1.62976081e-04 -6.05802808e-04 -1.06368808e-03]
Word2Vec(vocab=14, size=100, alpha=0.025)

What's Next:

  1. Now that you have word embeddings, you can use them to compute the similarity between two words with the cosine similarity formula.
  2. Similar words will have a score close to 1 and dissimilar words will have a score close to 0.
  3. We can also build sentence embeddings from word embeddings by taking the sum (or average) of the word vectors in each sentence.
  4. After getting sentence embeddings, we can find similar sentences using cosine similarity (see the sketch after this list).
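
Here is a rough sketch of those steps, reusing the toy model trained in the complete example above (on such a tiny corpus the scores are not meaningful; this only shows the mechanics):

# Rough sketch: word similarity and simple summed sentence embeddings,
# reusing the toy model trained above.
import numpy as np

# cosine similarity between two words (Gensim computes this for you)
print(model.wv.similarity('first', 'second'))

def sentence_embedding(tokens, wv):
    # sum the vectors of the words the model knows; skip everything else
    vectors = [wv[token] for token in tokens if token in wv]
    return np.sum(vectors, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_embedding(['this', 'is', 'the', 'first', 'sentence'], model.wv)
s2 = sentence_embedding(['this', 'is', 'the', 'second', 'sentence'], model.wv)
print(cosine(s1, s2))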

Advantages:

  1. The uses described above are the main advantages of the Word2Vec model.
  2. In short: context similarity, dimensionality reduction, and a numeric representation of words and sentences.

Disadvantages:

  1. It gives the same vector representation to a word used in different contexts (e.g. "bank" in "river bank" and "money bank" gets the same vector).
  2. Out-of-vocabulary words are not handled by the Word2Vec model (see the sketch below).
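
To illustrate the second point, asking the toy model from above for a word it never saw during training raises a KeyError:

# Minimal sketch: Word2Vec has no vector for words it never saw in training.
try:
    print(model.wv['unseen_word'])  # not in the toy training data above
except KeyError as err:
    print('Out-of-vocabulary word:', err)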

Conclusion:

When we need context similarity or semantic similarity, we use word vectors; here that is Word2Vec. It is a good place to start, but it does not scale well to every use case because of its limitations: out-of-vocabulary words are not handled, and a word gets the same vector regardless of context.

Please do suggest improvements.

Next In Line:

GloVe, FastText, and ELMo embeddings.
