A Unigram model is a type of language model that considers each token to be independent of the tokens before it. It’s the simplest language model, in the sense that the probability of token X given the previous context is just the probability of token X. So, if we used a Unigram language model to generate text greedily (always picking the most likely next token), we would predict the most common token at every step.
Features of Unigram Models:
- Independence Assumption: The Unigram model assumes that each word in a sentence or document is independent of the others, so the presence of one word does not influence the presence of any other.
- Word Probability: The probability of a word w_i in a corpus is estimated by the frequency of that word in the corpus. The more frequently a word appears, the higher its probability.
- Sequence Probability: The probability of a sequence of words (a sentence or document) is the product of the probabilities of the individual words, as written out below.
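In symbols (the standard maximum-likelihood formulation, where count(w_i) is the number of times w_i occurs and N is the total number of words in the corpus):

P(w_i) = count(w_i) / N

P(w_1 w_2 … w_n) = P(w_1) × P(w_2) × … × P(w_n)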
Example:
Consider a simple corpus: "I love AI. AI is amazing."
- Word Frequencies:
- “I” appears 1 time.
- “love” appears 1 time.
- “AI” appears 2 times.
- “is” appears 1 time.
- “amazing” appears 1 time.
- Total Number of Words: 6
- Word Probabilities:
- P(“I”) = 1/6
- P(“love”) = 1/6
- P(“AI”) = 2/6 = 1/3
- P(“is”) = 1/6
- P(“amazing”) = 1/6
- Sentence Probability:
- For the sentence “I love AI”, the probability under the Unigram model is:
- P(“I love AI”) = P(“I”) × P(“love”) × P(“AI”) = 1/6 × 1/6 × 1/3 = 1/108 ≈ 0.0093
Advantages:
- Simplicity: The Unigram model is very simple to implement and understand. It requires only the computation of word frequencies.
- Baseline Model: It often serves as a baseline in more complex language modeling tasks.
Limitations:
- No Context: The Unigram model ignores the order and context of words, which is crucial in understanding natural language. For example, it treats “I love AI” and “AI love I” as having the same probability (demonstrated in the code check at the end of this section).
- Poor Performance in Practice: Due to its lack of consideration for context, the Unigram model often performs poorly on tasks like speech recognition, machine translation, and text generation compared to more sophisticated models like bigram, trigram, or neural network-based models.
Applications:
- Text Classification: In some cases, Unigram models can be used for text classification tasks where the presence or absence of words is more important than their order.
- Information Retrieval: Unigram models can be used in information retrieval systems, where documents are ranked based on the occurrence of keywords.
- Baseline Comparisons: In language modeling and NLP, Unigram models are often used as a simple baseline to compare against more advanced models.
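The Python sketch below implements the model described above on the toy corpus: it counts word frequencies, converts them to probabilities, and multiplies per-word probabilities to score a sentence. Punctuation is stripped during tokenization so that “AI.” and “AI” count as the same word.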
from collections import Counter
import re

# Example corpus: lowercase, then keep only alphabetic tokens so
# "AI." and "AI" count as the same word
corpus = re.findall(r"[a-z]+", "I love AI. AI is amazing.".lower())

# Calculate word frequencies
word_freq = Counter(corpus)
total_words = sum(word_freq.values())

# Calculate word probabilities: P(w) = count(w) / N
unigram_probs = {word: freq / total_words for word, freq in word_freq.items()}

# Example: calculate the probability of a sentence as the product
# of the probabilities of its words
sentence = "I love AI".lower().split()
sentence_prob = 1.0
for word in sentence:
    # Unseen words get probability 0, which zeroes out the whole sentence
    sentence_prob *= unigram_probs.get(word, 0)

print(f"Sentence Probability: {sentence_prob}")  # 1/6 * 1/6 * 1/3 ≈ 0.00926
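As a quick check of the no-context limitation noted above, a small helper (reusing the unigram_probs dictionary from the sketch) shows that scrambling the word order leaves the probability unchanged:

def unigram_score(text):
    # Sentence probability = product of its word probabilities
    prob = 1.0
    for word in text.lower().split():
        prob *= unigram_probs.get(word, 0)
    return prob

print(unigram_score("I love AI"))  # ≈ 0.00926
print(unigram_score("AI love I"))  # same value: word order is ignored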