Word2Vec

Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. 

The core principle of Word2Vec is that words with similar meanings should have similar vector representations. Word2Vec offers two architectures:

CBOW (Continuous Bag of Words): CBOW predicts a target word (the word in the center) based on the surrounding context words within a given window size.

How CBOW Works:

  1. Context and Target Word:
    • In CBOW, the context consists of a set of words surrounding a particular target word. The model uses these context words to predict the target word.
    • For example, in the sentence “The cat sat on the mat,” if we use “sat” as the target word, the context words could be [“The”, “cat”, “on”, “the”, “mat”].
  2. Input Representation:
    • The input to the CBOW model is a one-hot encoded vector for each context word. A one-hot vector is a binary vector where only one element is “1,” representing the presence of a specific word, while all other elements are “0.”
  3. Averaging Embeddings:
    • The model averages the word embeddings (dense vectors) of the context words to create a single averaged context vector.
  4. Prediction:
    • This averaged context vector is then passed through the neural network, which predicts the probability of each word in the vocabulary being the target word.
    • The word with the highest probability is chosen as the predicted target word.
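
To make steps 2–4 concrete, here is a minimal NumPy sketch of the CBOW forward pass. The vocabulary, embedding dimension, and random weight matrices below are made up for illustration; a real model learns these weights during training.

import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 4                      # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, D))            # input embedding matrix (one row per word)
W_out = rng.normal(size=(D, V))           # output projection back to the vocabulary

context = ["quick", "brown", "jumps", "over"]        # window of 2 around "fox"
context_ids = [word_to_idx[w] for w in context]

# Steps 2-3: look up each context word's embedding (equivalent to one-hot x W_in)
# and average them into a single context vector.
h = W_in[context_ids].mean(axis=0)

# Step 4: score every vocabulary word and apply softmax to get probabilities.
scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print("Predicted word:", vocab[int(np.argmax(probs))])   # arbitrary here, since the weights are untrained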

Example:

Consider the following sentence:

“The quick brown fox jumps over the lazy dog.”

Let’s say we’re using a window size of 2, meaning we’ll consider two words on either side of the target word.

  1. Target Word: “fox”
  2. Context Words: [“quick”, “brown”, “jumps”, “over”]

The model will take the context words “quick,” “brown,” “jumps,” and “over” and try to predict the target word “fox.”

  • One-Hot Encoding:
    • “The,” “quick,” “jumps,” and “over” are converted into one-hot vectors.
    • Each word has its corresponding position in a vocabulary of all unique words.
  • Averaging:
    • The one-hot vectors are transformed into dense embeddings (through the embedding matrix) and averaged.
  • Prediction:
    • The model predicts the target word based on this averaged vector. Ideally, it will predict “fox” as the most probable word.
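
The sliding-window step above can be reproduced with a few lines of plain Python. This is only a toy sketch of how (context, target) training pairs are extracted, not Gensim's internal implementation:

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

for i, target in enumerate(sentence):
    # Context = up to `window` words on each side of the target word.
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(context, "->", target)

# For the target "fox" this prints: ['quick', 'brown', 'jumps', 'over'] -> fox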

Mathematical Representation:

Given the sentence “The quick brown fox jumps over the lazy dog,” if the context window size is 2, the training example for predicting “brown” would look like this:

  • Context: [“The”, “quick”, “fox”, “jumps”]
  • Target: “brown”

The CBOW model aims to learn a function f such that:

f(The, quick, fox, jumps) ≈ brown

Training:

During training, the model adjusts its parameters to minimize the difference between the predicted word and the actual target word (using loss functions like cross-entropy). Over time, the model learns to predict target words more accurately based on their contexts.
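
As a rough illustration, with a one-hot target the cross-entropy loss collapses to the negative log of the probability the model assigns to the true word. The numbers below are invented, not real model output:

import numpy as np

# Toy predicted distribution over a 5-word vocabulary (pretend softmax output).
predicted = np.array([0.10, 0.05, 0.70, 0.10, 0.05])
target_index = 2                     # position of the true target word

# Cross-entropy with a one-hot target reduces to -log(probability of the true word).
loss = -np.log(predicted[target_index])
print(loss)                          # ~0.357; decreases as the model puts more mass on the target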

Advantages of CBOW:

  • Efficiency: CBOW is faster to train than the Skip-Gram model, especially on large datasets.
  • Generalization: Because each prediction averages over several context words, CBOW produces smooth representations and performs well on frequent words.

Limitations:

  • Equal Weighting: CBOW averages the context words without considering their positions relative to the target word, which might lose some contextual information.
  • Infrequent Words: It may not capture the nuances of rare or infrequent words as effectively as Skip-Gram.
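
The short Gensim example below trains a CBOW model on a toy one-sentence corpus (note the sg=0 flag):
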
from gensim.models import Word2Vec

# Example sentence corpus
sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

# Training the model using CBOW (sg=0 indicates CBOW, sg=1 indicates Skip-Gram)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Predicting similar words to 'fox'
similar_words = model.wv.most_similar('fox')
print(similar_words)
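
Note that with a single-sentence corpus the similarity scores returned by most_similar are essentially noise; meaningful neighbours only emerge on a realistically sized corpus. The learned 50-dimensional vector for a word can also be read directly, e.g. model.wv['fox'].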

Skip-Gram:

Unlike CBOW, which predicts the target word based on its context words, Skip-Gram does the reverse: it predicts the context words given a target word.

How Skip-Gram Works:

  1. Target and Context Words:
    • The Skip-Gram model takes a target word as input and tries to predict the surrounding context words within a specified window size.
    • For example, in the sentence “The cat sat on the mat,” if “sat” is the target word, Skip-Gram tries to predict the words “The,” “cat,” “on,” “the,” and “mat.”
  2. Input Representation:
    • The input to the Skip-Gram model is a one-hot encoded vector representing the target word.
    • A one-hot vector is a binary vector with a “1” at the index corresponding to the word in the vocabulary and “0”s elsewhere.
  3. Prediction:
    • The model uses the target word’s embedding to predict each of the context words individually.
    • It outputs a probability distribution over the entire vocabulary, with the most likely context words having higher probabilities.
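
As with CBOW, the idea can be sketched in a few lines of NumPy. The vocabulary, dimensions, and random weights below are illustrative only; in a trained model, the dot product between a target word's input embedding and a true context word's output embedding yields a high probability:

import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 4

rng = np.random.default_rng(1)
W_in = rng.normal(size=(V, D))     # input (target-word) embeddings
W_out = rng.normal(size=(V, D))    # output (context-word) embeddings

def context_probs(target):
    # Probability distribution over the vocabulary for a context slot of `target`.
    v = W_in[word_to_idx[target]]          # embedding of the target word
    scores = W_out @ v                     # dot product with every output embedding
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()   # softmax over the whole vocabulary

probs = context_probs("fox")
for w in ["quick", "brown", "jumps", "over"]:
    print(w, probs[word_to_idx[w]])        # training pushes these probabilities up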

Example:

Consider the sentence:

“The quick brown fox jumps over the lazy dog.”

Let’s say the target word is “fox,” and we have a context window size of 2. Skip-Gram will try to predict the words around “fox.”

  1. Target Word: “fox”
  2. Context Words: [“quick”, “brown”, “jumps”, “over”]

In the Skip-Gram model:

  • Input: The one-hot encoded vector for “fox.”
  • Output: Predictions for the context words [“quick”, “brown”, “jumps”, “over”].

For each context word, the model will output a probability distribution across the entire vocabulary. The model is trained to maximize the probability of the actual context words appearing around the target word.

Mathematical Representation:

For a given word “fox” in the sentence, Skip-Gram tries to predict the surrounding words “quick,” “brown,” “jumps,” and “over.”

Let w_t be the target word, and let w_{t+j} (for −c ≤ j ≤ c, j ≠ 0) be the context words within the window. The Skip-Gram model aims to maximize the following probability:

∏_{−c ≤ j ≤ c, j ≠ 0} P(w_{t+j} | w_t)

where c is the window size.
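
In the basic (full-softmax) formulation, each factor is computed from two sets of embeddings, an input vector v_w for the target word and an output vector u_w for each candidate context word (notation introduced here for illustration):

P(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w=1}^{V} exp(u_w · v_{w_t})

where V is the vocabulary size.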

Training:

During training, the Skip-Gram model updates its parameters (word embeddings) to maximize the probability of the correct context words given the target word. It does this by minimizing a loss function, usually the negative log-likelihood of the correct context words.

Advantages of Skip-Gram:

  1. Effective for Rare Words:
    • Skip-Gram works well with smaller datasets and is particularly effective in representing infrequent words or phrases.
  2. Captures Semantic Relationships:
    • It captures subtle semantic relationships between words, especially when the window size is large.

Limitations:

  1. Computational Complexity:
    • Skip-Gram is generally slower to train than CBOW, especially when the vocabulary is large, because it must predict multiple context words for each target word.
  2. Increased Data Sparsity:
    • Since Skip-Gram focuses on individual target-context pairs, it can suffer from data sparsity issues when dealing with less frequent words.
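
The Gensim example below mirrors the CBOW snippet, switching to Skip-Gram via sg=1:
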
from gensim.models import Word2Vec

# Example sentence corpus
sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

# Training the model using Skip-Gram (sg=1 indicates Skip-Gram, sg=0 indicates CBOW)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Predicting words similar to 'fox'
similar_words = model.wv.most_similar('fox')
print(similar_words)
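
The only change from the CBOW example is sg=1. As discussed above, Skip-Gram typically yields better vectors for rare words, at the cost of slower training; on this toy one-sentence corpus the output is again only illustrative.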
