Bigram Models

A bigram model is a language model that estimates the probability of a sequence of words by predicting each word from only the single word that immediately precedes it. The probability of a whole sentence then factorizes as P(w1, …, wn) ≈ P(w1) × P(w2 | w1) × … × P(wn | wn−1).

Example:

Consider the following sentence:

  • “I love machine learning.”

To build a bigram model from this sentence, we break it down into bigrams (pairs of adjacent words), as sketched in the short code example after the list:

  • (“I”, “love”)
  • (“love”, “machine”)
  • (“machine”, “learning”)
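
A minimal Python sketch of this step (the lowercase, strip-the-period tokenizer is a simplifying assumption for illustration; any tokenizer would do):

    def bigrams(sentence):
        # Naive tokenization: lowercase, drop the trailing period, split on spaces.
        tokens = sentence.lower().rstrip(".").split()
        # Pair each token with the token that follows it.
        return list(zip(tokens, tokens[1:]))

    print(bigrams("I love machine learning."))
    # [('i', 'love'), ('love', 'machine'), ('machine', 'learning')]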

How the Bigram Model Works:

  1. Training:
    • Suppose you have a corpus of text and you want to train a bigram model. You would count how often each bigram appears in the text.
    • For instance, in a large corpus, you might find that “I love” appears 100 times, “love machine” appears 50 times, and so on.
  2. Probability Calculation:
    • The bigram model calculates the probability of a word given the previous word.
    • For example, the probability of “love” given “I” would be:
    • P(love | I) = Count(“I love”) / Count(“I”)
    • If “I love” appears 100 times and “I” appears 200 times in the corpus, then:
    • P(love | I) = 100 / 200 = 0.5
  3. Sentence Generation:
    • To generate a new sentence, the model starts with a seed word and repeatedly uses the bigram probabilities to pick each next word, as shown in the sketch after this list.
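
Putting the three steps together, here is a minimal sketch in Python. The three-sentence toy corpus and the sampling loop are illustrative assumptions, not a prescribed recipe:

    import random
    from collections import Counter

    # Step 1, training: count unigrams and bigrams in a (toy, assumed) corpus.
    corpus = ["the cat sat", "the cat ran", "the dog sat"]
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    # Step 2, probability: P(next | prev) = Count(prev next) / Count(prev).
    def prob(next_word, prev_word):
        return bigram_counts[(prev_word, next_word)] / unigram_counts[prev_word]

    print(prob("cat", "the"))  # 2 / 3 ≈ 0.67

    # Step 3, generation: start from a seed word and repeatedly sample the
    # next word in proportion to its bigram count.
    def generate(seed, max_len=8):
        words = [seed]
        while len(words) < max_len:
            followers = [(nxt, c) for (prev, nxt), c in bigram_counts.items()
                         if prev == words[-1]]
            if not followers:  # no known continuation, stop here
                break
            choices, weights = zip(*followers)
            words.append(random.choices(choices, weights=weights)[0])
        return " ".join(words)

    print(generate("the"))  # e.g. "the cat sat"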

Practical Example:

Given a small corpus:

  • “I love machine learning.”
  • “I love coding.”
  • “Coding is fun.”

Bigrams and their counts:

  • (“I”, “love”): 2
  • (“love”, “machine”): 1
  • (“machine”, “learning”): 1
  • (“love”, “coding”): 1
  • (“coding”, “is”): 1
  • (“is”, “fun”): 1

Using these counts, you can calculate the probability of each bigram, which the model uses to predict or generate text. For example, P(love | I) = 2/2 = 1.0, while P(machine | love) = 1/2 = 0.5, because “love” occurs twice but is followed by “machine” only once.
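
As a check, the same counting scheme from the earlier sketch reproduces these counts and prints every conditional probability (punctuation is dropped and tokens are lowercased, so “I” becomes “i”):

    from collections import Counter

    corpus = ["I love machine learning", "I love coding", "Coding is fun"]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    # P(next | prev) = Count(prev next) / Count(prev)
    for (prev, nxt), count in sorted(bigram_counts.items()):
        print(f"P({nxt} | {prev}) = {count}/{unigram_counts[prev]}"
              f" = {count / unigram_counts[prev]:.2f}")

Note that P(is | coding) = 1/2 rather than 1/1: “coding” occurs twice in the corpus but is followed by “is” only once.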
