Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is a natural language processing (NLP) technique used to evaluate how important a word is to a document within a collection of documents (a corpus). It’s useful in text classification and for converting text into numerical features that a machine learning model can work with.

Here’s a step-by-step explanation of how TF-IDF works:

Components of TF-IDF

  1. Term Frequency (TF):
    • Definition: Measures the frequency of a term in a document.
    • Formula: $TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$
  2. Inverse Document Frequency (IDF):
    • Definition: Measures how important a term is across all documents in the corpus.
    • Formula: $IDF(t, D) = \log \left( \frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t} \right)$
    • If a term appears in many documents, its IDF value will be low.
  3. TF-IDF:
    • Definition: Combines TF and IDF to give a weight to each term in a document, highlighting terms that are important to the document but not too common across all documents.
    • Formula: $TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)$
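
To make these three formulas concrete, here is a minimal Python sketch of them. It assumes whitespace tokenization, lowercasing, and the base-10 logarithm used in the worked example below; the function names (tf, idf, tf_idf) are illustrative, and production implementations such as scikit-learn’s TfidfVectorizer use smoothed IDF variants, so their values will differ slightly.

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of `term` in `doc`, normalized by document length."""
    tokens = doc.lower().split()
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Inverse document frequency, using the base-10 log to match the worked example.
    Assumes `term` appears in at least one document (otherwise division by zero)."""
    doc_count = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log10(len(corpus) / doc_count)

def tf_idf(term, doc, corpus):
    """TF-IDF weight of `term` in `doc` relative to `corpus`."""
    return tf(term, doc) * idf(term, corpus)
```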

Example

Consider a corpus of three documents:

  1. “The cat sat on the mat”
  2. “The cat is my friend”
  3. “The dog is my friend”

Let’s calculate TF-IDF for the term “cat”.

  1. TF Calculation:
    • Document 1: “The cat sat on the mat” (6 terms)
      • TF(cat, Document 1) = 1/6
    • Document 2: “The cat is my friend” (5 terms)
      • TF(cat, Document 2) = 1/5
    • Document 3: “The dog is my friend” (5 terms)
      • TF(cat, Document 3) = 0/5 = 0
  2. IDF Calculation:
    • Total number of documents (D) = 3
    • Number of documents containing “cat” = 2, so (using the base-10 logarithm):
      $IDF(cat, D) = \log \left( \frac{3}{2} \right) = \log(1.5) \approx 0.176$
  3. TF-IDF Calculation:
    • Document 1: $TF\text{-}IDF(cat, \text{Document 1}) = \frac{1}{6} \times 0.176 \approx 0.029$
    • Document 2: $TF\text{-}IDF(cat, \text{Document 2}) = \frac{1}{5} \times 0.176 \approx 0.035$
    • Document 3: $TF\text{-}IDF(cat, \text{Document 3}) = 0 \times 0.176 = 0$
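
These numbers can be reproduced with a short, self-contained script, under the same assumptions as the sketch above (whitespace tokenization, lowercasing, base-10 log):

```python
import math

# The three-document corpus from the example above
corpus = [
    "The cat sat on the mat",
    "The cat is my friend",
    "The dog is my friend",
]
term = "cat"

# IDF: log10(total documents / documents containing the term) = log10(3/2) ≈ 0.176
doc_count = sum(1 for doc in corpus if term in doc.lower().split())
idf = math.log10(len(corpus) / doc_count)

for i, doc in enumerate(corpus, start=1):
    tokens = doc.lower().split()
    tf = tokens.count(term) / len(tokens)  # 1/6, 1/5, 0/5
    print(f"TF-IDF({term!r}, Document {i}) = {tf * idf:.3f}")

# Prints:
# TF-IDF('cat', Document 1) = 0.029
# TF-IDF('cat', Document 2) = 0.035
# TF-IDF('cat', Document 3) = 0.000
```

As expected, “cat” gets a higher weight in Document 2 than in Document 1 (it makes up a larger share of a shorter document), and a weight of zero in Document 3, where it never appears.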
