Count Vectorization is a technique used in Natural Language Processing (NLP) to convert text into numerical feature vectors. It represents text data as a matrix where each row corresponds to a document, and each column represents a unique word from the corpus. The values in the matrix indicate the count of each word in the respective document.
How it works
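- Tokenizes each document, splitting the text into individual tokens (typically words)
- Builds a vocabulary of the unique tokens found across all documents
- Counts how many times each vocabulary token appears in each document
- Creates a matrix with one row per document and one column per token, where each value is that token's count in that document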
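The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `tokenize` and `count_vectorize`, the whitespace-and-lowercase tokenizer, and the sample documents are all assumptions made for the example.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def count_vectorize(documents):
    """Return (vocabulary, matrix) for a list of document strings."""
    tokenized = [tokenize(doc) for doc in documents]

    # Build the vocabulary in order of first appearance across the corpus.
    vocabulary = []
    seen = set()
    for tokens in tokenized:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)

    # One row per document, one column per vocabulary token.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[token] for token in vocabulary])
    return vocabulary, matrix


# Hypothetical sample documents, for illustration only.
docs = ["the cat sat", "the cat saw the other cat"]
vocabulary, matrix = count_vectorize(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Real tokenizers usually also handle punctuation and may order the columns differently (for example alphabetically), so the column layout here is only one possible choice.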
Example
Consider the following two documents:
- Doc 1: “AI is amazing”
- Doc 2: “AI is powerful and amazing”
After applying Count Vectorization, we get:
Document | AI | is | amazing | powerful | and |
---|---|---|---|---|---|
Doc 1 | 1 | 1 | 1 | 0 | 0 |
Doc 2 | 1 | 1 | 1 | 1 | 1 |
What it’s used for
- Counting how often each unique word occurs across a collection of texts
- Extracting and representing features from text data
- Providing numerical input to machine learning algorithms
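As one hedged illustration of count vectors feeding a learning algorithm, the sketch below classifies a text by finding the most similar training document under cosine similarity (a 1-nearest-neighbour rule). The training texts, labels, and the helper names `vectorize`, `cosine`, and `predict` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def vectorize(text, vocabulary):
    """Map a text to its count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[token] for token in vocabulary]


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical labelled training data, for illustration only.
train = [
    ("great movie loved it", "positive"),
    ("terrible movie hated it", "negative"),
]
vocabulary = sorted({tok for text, _ in train for tok in text.lower().split()})
train_vectors = [(vectorize(text, vocabulary), label) for text, label in train]


def predict(text):
    """Return the label of the training document most similar to `text`."""
    query = vectorize(text, vocabulary)
    _, best_label = max(train_vectors, key=lambda pair: cosine(query, pair[0]))
    return best_label


print(predict("loved it"))
```

Words that never appeared in the training texts are simply ignored, because the vocabulary is fixed when the training vectors are built.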
Features
- Stop word removal
- Word count thresholds
- Vocabulary size limits
- N-gram creation
- Custom preprocessing
- Custom tokenization
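A rough sketch of how these options might fit together is shown below. The function `build_counts` and its parameter names (`stop_words`, `min_count`, `max_vocab`, `ngram_range`, `preprocess`, `tokenize`) are hypothetical, and the threshold here is applied to total corpus counts, which is an assumption; libraries may define thresholds differently, for instance by the number of documents containing a term.

```python
from collections import Counter


def ngrams(tokens, n_min, n_max):
    """Yield space-joined n-grams for every n in [n_min, n_max]."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_counts(documents, stop_words=(), min_count=1, max_vocab=None,
                 ngram_range=(1, 1), preprocess=str.lower, tokenize=str.split):
    """Count vectorization with optional filtering (illustrative only)."""
    stop = set(stop_words)
    per_doc = []
    for doc in documents:
        # Custom preprocessing and tokenization, then stop word removal.
        tokens = [t for t in tokenize(preprocess(doc)) if t not in stop]
        per_doc.append(Counter(ngrams(tokens, *ngram_range)))

    # Corpus-wide totals drive both the count threshold and the vocabulary cap.
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    kept = [term for term, c in totals.items() if c >= min_count]
    # Most frequent first; ties broken alphabetically for a stable order.
    kept.sort(key=lambda term: (-totals[term], term))
    if max_vocab is not None:
        kept = kept[:max_vocab]

    vocabulary = sorted(kept)
    matrix = [[counts[term] for term in vocabulary] for counts in per_doc]
    return vocabulary, matrix


docs = ["AI is amazing", "AI is powerful and amazing"]

# Unigrams and bigrams, without stop words, keeping terms seen at least twice.
vocabulary, matrix = build_counts(
    docs, stop_words={"is", "and"}, ngram_range=(1, 2), min_count=2
)
print(vocabulary, matrix)

# Bigrams only, capped at the single most frequent term.
top_vocab, top_matrix = build_counts(docs, ngram_range=(2, 2), max_vocab=1)
print(top_vocab, top_matrix)
```

In this sketch stop words are removed before n-grams are formed, so a bigram can join two words that were not adjacent in the original text.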
Limitations
- Treats every word as independent of the others, so word order and context are lost
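A small check makes this limitation concrete. The two sentences below are illustrative examples: they mean different things, yet they contain the same words with the same counts, so their count representations are identical.

```python
from collections import Counter


def bag_of_words(text):
    """Count tokens after lowercasing and splitting on whitespace."""
    return Counter(text.lower().split())


a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Same words, same counts: the two representations cannot be told apart.
print(a == b)  # True
```

Including n-grams as features recovers some local word order, at the cost of a larger vocabulary.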