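TF-IDF (term frequency-inverse document frequency) is a natural language processing (NLP) technique used to evaluate how important a word is to a document relative to a collection of documents (a corpus). It’s useful in text classification and for turning text into numeric features that a machine learning model can work with.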
Here’s a detailed explanation and a step-by-step guide on how TF-IDF works:
Components of TF-IDF
- Term Frequency (TF):
- Definition: Measures the frequency of a term in a document.
- Formula: $TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$
- Inverse Document Frequency (IDF):
- Definition: Measures how rare, and therefore how informative, a term is across all documents in the corpus.
- Formula: $IDF(t, D) = \log \left( \frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t} \right)$
- If a term appears in many documents, its IDF value will be low.
- TF-IDF:
- Definition: Combines TF and IDF to give a weight to each term in a document, highlighting terms that are important to the document but not too common across all documents.
- Formula: $TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)$
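The three formulas translate fairly directly into code. The following is a minimal sketch rather than a definitive implementation: the function names are hypothetical, documents are assumed to be already split into lists of tokens, the logarithm is assumed to be base 10, and returning 0 for a term that appears in no document is an assumption made here to avoid dividing by zero.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    """TF(t, d): occurrences of `term` divided by the number of tokens in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def inverse_document_frequency(term, corpus_tokens):
    """IDF(t, D): log10(N / df), where df is the number of documents containing `term`."""
    df = sum(1 for doc in corpus_tokens if term in doc)
    if df == 0:
        # Assumption: the plain formula is undefined for unseen terms, so return 0.0.
        return 0.0
    return math.log10(len(corpus_tokens) / df)


def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)
```

Note that a term present in every document gets an IDF of log(1) = 0, so its TF-IDF weight is 0 no matter how often it occurs.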
Example
Consider a corpus of three documents:
- “The cat sat on the mat”
- “The cat is my friend”
- “The dog is my friend”
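Before any frequencies can be computed, each document has to be split into terms. The sketch below assumes the simplest possible tokenization (lowercasing and splitting on whitespace, with no stop-word removal); the `tokenize` helper is a hypothetical name used only for illustration. It shows how many terms each document contributes to the TF denominator.

```python
corpus = [
    "The cat sat on the mat",
    "The cat is my friend",
    "The dog is my friend",
]


def tokenize(text):
    # Assumption: lowercase and split on whitespace; no punctuation handling.
    return text.lower().split()


tokenized = [tokenize(doc) for doc in corpus]
lengths = [len(tokens) for tokens in tokenized]
print(lengths)  # [6, 5, 5]
```

Under this tokenization the first document has six terms, while the second and third have five each.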
Let’s calculate TF-IDF for the term “cat”.
- TF Calculation:
- Document 1: “The cat sat on the mat” (6 terms)
- TF(cat, Document 1) = 1/6
- Document 2: “The cat is my friend” (5 terms)
- TF(cat, Document 2) = 1/5
- Document 3: “The dog is my friend” (5 terms)
- TF(cat, Document 3) = 0/5 = 0
- IDF Calculation:
- Total number of documents in the corpus D = 3
- Number of documents containing “cat” = 2, so, using the base-10 logarithm, $IDF(\text{cat}, D) = \log_{10} \left( \frac{3}{2} \right) = \log_{10}(1.5) \approx 0.176$
- TF-IDF Calculation:
- Document 1: $TF\text{-}IDF(\text{cat}, \text{Document 1}) = \frac{1}{6} \times 0.176 \approx 0.029$
- Document 2: $TF\text{-}IDF(\text{cat}, \text{Document 2}) = \frac{1}{5} \times 0.176 \approx 0.035$ (Document 2 has five terms, so its TF is 1/5)
- Document 3: $TF\text{-}IDF(\text{cat}, \text{Document 3}) = 0 \times 0.176 = 0$
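The whole calculation can be checked with a few lines of Python. This is a sketch under the same assumptions as the example: lowercase whitespace tokenization and a base-10 logarithm. Term counts are taken directly from the text of each document, so the six-term first document and the five-term second and third documents each get their own TF denominator.

```python
import math

corpus = [
    "The cat sat on the mat",
    "The cat is my friend",
    "The dog is my friend",
]
term = "cat"

# Assumption: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in corpus]

# IDF: log10(total documents / documents containing the term).
df = sum(1 for tokens in tokenized if term in tokens)
idf = math.log10(len(tokenized) / df)

# TF-IDF per document: (term count / document length) * IDF.
scores = [tokens.count(term) / len(tokens) * idf for tokens in tokenized]

print(f"IDF({term}) = {idf:.3f}")  # IDF(cat) = 0.176
for i, score in enumerate(scores, start=1):
    print(f"Document {i}: {score:.3f}")
# Document 1: 0.029
# Document 2: 0.035
# Document 3: 0.000
```

Libraries that implement TF-IDF often apply smoothing, a different logarithm base, or normalization, so their numbers will generally not match this hand calculation exactly.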