Unigram Models

A Unigram model is a type of language model that treats each token as independent of the tokens before it. It’s the simplest language model, in the sense that the probability of token X given the previous context is just the marginal probability of token X. So, if we used a Unigram language model to generate text greedily, we would always predict the most common token in the corpus.

Features of Unigram Models:

  1. Independence Assumption:
    • The Unigram model assumes that each word in a sentence or document is independent of the others. This means the presence of a word does not influence the presence of another word.
  2. Word Probability:
    • The probability of a word wᵢ in a corpus is estimated by the frequency of that word in the corpus. The more frequently a word appears, the higher its probability.
    P(wᵢ) = Count(wᵢ) / Total Number of Words in Corpus
  3. Sequence Probability:
    • The probability of a sequence of words (a sentence or document) is the product of the probabilities of the individual words.
    P(w₁, w₂, …, wₙ) = P(w₁) × P(w₂) × … × P(wₙ)

Example:

Consider a simple corpus: "I love AI. AI is amazing."

  1. Word Frequencies:
    • “I” appears 1 time.
    • “love” appears 1 time.
    • “AI” appears 2 times.
    • “is” appears 1 time.
    • “amazing” appears 1 time.
  2. Total Number of Words: 6
  3. Word Probabilities:
    • P(“I”) = 1/6
    • P(“love”) = 1/6
    • P(“AI”) = 2/6 = 1/3
    • P(“is”) = 1/6
    • P(“amazing”) = 1/6
  4. Sentence Probability:
    • For the sentence "I love AI", the probability under the Unigram model is:
    P(“I love AI”) = P(“I”) × P(“love”) × P(“AI”) = 1/6 × 1/6 × 1/3 = 1/108

Advantages:

  • Simplicity: The Unigram model is very simple to implement and understand. It requires only the computation of word frequencies.
  • Baseline Model: It often serves as a baseline in more complex language modeling tasks.

Limitations:

  • No Context: The Unigram model ignores the order and context of words, which is crucial in understanding natural language. For example, it treats “I love AI” and “AI love I” as having the same probability.
  • Poor Performance in Practice: Due to its lack of consideration for context, the Unigram model often performs poorly on tasks like speech recognition, machine translation, and text generation compared to more sophisticated models like bigram, trigram, or neural network-based models.

Applications:

  • Text Classification: In some cases, Unigram models can be used for text classification tasks where the presence or absence of words is more important than their order.
  • Information Retrieval: Unigram models can be used in information retrieval systems, where documents are ranked based on the occurrence of keywords.
  • Baseline Comparisons: In language modeling and NLP, Unigram models are often used as a simple baseline to compare against more advanced models.
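To make the text-classification use concrete, here is a minimal sketch of a Naive Bayes-style classifier built on per-class unigram counts. The two-class training data is a made-up toy example, and class priors are assumed uniform:

```python
import math
from collections import Counter

# Toy training data (hypothetical labels and texts, for illustration only)
train = [
    ("spam", "win money now"),
    ("spam", "win a free prize now"),
    ("ham", "meeting at noon"),
    ("ham", "lunch meeting tomorrow"),
]

# Build per-class unigram counts
class_counts = {}
for label, text in train:
    class_counts.setdefault(label, Counter()).update(text.lower().split())

def classify(text):
    """Pick the class with the highest smoothed unigram log-likelihood."""
    words = text.lower().split()
    vocab = set().union(*class_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in class_counts.items():
        total = sum(counts.values())
        # Add-one smoothing keeps unseen words from zeroing out the score
        score = sum(math.log((counts.get(w, 0) + 1) / (total + len(vocab)))
                    for w in words)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("free money"))  # "spam" for this toy data
```

Because word order is ignored, only the bag of unigram counts matters, which is exactly why this application suits the Unigram model.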
Python Implementation:

import string
from collections import Counter

# Example corpus (strip punctuation so "AI." and "AI" count as the same word)
text = "I love AI. AI is amazing."
corpus = text.lower().translate(str.maketrans("", "", string.punctuation)).split()

# Calculate word frequencies
word_freq = Counter(corpus)
total_words = sum(word_freq.values())

# Calculate word probabilities (maximum-likelihood estimates)
unigram_probs = {word: freq / total_words for word, freq in word_freq.items()}

# Example: calculate the probability of a sentence
sentence = "I love AI".lower().split()
sentence_prob = 1.0
for word in sentence:
    sentence_prob *= unigram_probs.get(word, 0)  # an unseen word zeroes the product

print(f"Sentence Probability: {sentence_prob}")  # 1/6 × 1/6 × 1/3 = 1/108 ≈ 0.00926
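One practical wrinkle with the example above: any word absent from the corpus zeroes out the entire product, and long sentences underflow floating point. A common fix, sketched below rather than part of the original example, is add-one (Laplace) smoothing combined with summing log probabilities:

```python
import math
import string
from collections import Counter

corpus = "I love AI. AI is amazing.".lower().translate(
    str.maketrans("", "", string.punctuation)).split()
word_freq = Counter(corpus)
total_words = sum(word_freq.values())
vocab_size = len(word_freq)

def smoothed_log_prob(sentence):
    """Unigram log-probability with add-one (Laplace) smoothing."""
    log_p = 0.0
    for word in sentence.lower().split():
        # Every count is inflated by 1, so unseen words get a small
        # nonzero probability instead of zeroing out the product.
        p = (word_freq.get(word, 0) + 1) / (total_words + vocab_size)
        log_p += math.log(p)
    return log_p

print(smoothed_log_prob("I love AI"))      # finite log-probability
print(smoothed_log_prob("I love robots"))  # unseen word, still finite
```

Summing logs instead of multiplying raw probabilities is the standard trick for keeping long products numerically stable.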
