Stemming is a text preprocessing technique used in natural language processing (NLP) to reduce words to their root or base form. The goal of stemming is to simplify and standardize words, which helps improve the performance of information retrieval, text classification, and other NLP tasks. By transforming words to their stems, NLP models can treat different forms of the same word as a single entity, reducing the complexity of the text data.
Common Stemming Algorithms:
Porter Stemmer
- One of the most widely used stemming algorithms.
- Uses a series of rules to iteratively remove suffixes from words.
- Example: “running” becomes “run”, “happiness” becomes “happi”.
The main applications of the Porter Stemmer are data mining and information retrieval. However, it is limited to English words, many different word forms are conflated onto the same stem, and the output stem is not necessarily a meaningful word. The algorithm is fairly lengthy and, dating from 1980, is one of the oldest stemmers still in wide use.
Example: the rule EED -> EE means “if the word has at least one vowel and consonant before the EED ending, change the ending to EE,” so ‘agreed’ becomes ‘agree.’
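NLTK ships a ready-made Porter implementation, so a minimal usage sketch (the sample words are our own) looks like this:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "running", "happiness"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress, ponies -> poni, running -> run, happiness -> happi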
Lovins Stemmer
Julie Beth Lovins proposed this algorithm in 1968. It removes the longest matching suffix from a word in a single pass, then recodes the resulting stem to turn it into a valid word.
Example: sitting -> sitt -> sit (the suffix ‘ing’ is removed, then the doubled ‘t’ is recoded)
Characteristics of the Lovins Stemmer
- Suffix List: The stemmer uses a predefined list of suffixes, which are matched and removed from words.
- Rules for Removal: It includes rules to determine when a suffix should be removed and how to handle the resulting stem.
- Exception Handling: There are rules for dealing with exceptions and irregular forms to avoid incorrect stemming.
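Below is a simplified, illustrative sketch of the longest-match idea in Python. It uses a small hand-picked suffix table rather than Lovins's full list of 294 endings, and it omits the recoding step: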
class LovinsStemmer:
    def __init__(self):
        # Simplified suffix table for demonstration. Suffixes are ordered
        # roughly longest-first so that longer matches win (dicts preserve
        # insertion order in Python 3.7+).
        self.suffixes = {
            'izing': '', 'ization': '', 'ational': '', 'fulness': '',
            'ousness': '', 'iveness': '', 'tional': '', 'biliti': 'ble',
            'lessli': 'less', 'entli': 'ent', 'ation': '', 'alism': '',
            'aliti': '', 'arism': '', 'iviti': '', 'ical': '',
            'ness': '', 'ance': '', 'ence': '', 'able': '',
            'ible': '', 'ment': '', 'ent': '', 'ism': '',
            'ate': '', 'iti': '', 'ous': '', 'ive': '',
            'ize': '', 'al': '', 'er': '', 'ic': ''
        }

    def stem(self, word):
        # Strip the first matching suffix and append its replacement, if any.
        for suffix, replacement in self.suffixes.items():
            if word.endswith(suffix):
                return word[:-len(suffix)] + replacement
        return word

# Example usage
lovins_stemmer = LovinsStemmer()

# Sample words
words = ["running", "happiness", "nationalization", "organization", "reliable", "joyful"]

# Stem words using the simplified Lovins stemmer
stemmed_words = [lovins_stemmer.stem(word) for word in words]
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
Output Explanation
For the given list of words, the simplified stemmer above produces:
- Original Words: [‘running’, ‘happiness’, ‘nationalization’, ‘organization’, ‘reliable’, ‘joyful’]
- Stemmed Words: [‘running’, ‘happi’, ‘national’, ‘organ’, ‘reli’, ‘joyful’]
Note that ‘running’ and ‘joyful’ pass through unchanged (the demonstration table has no ‘ing’ or ‘ful’ rule), and stems such as ‘happi’ and ‘reli’ are not valid words; the full Lovins algorithm applies a recoding step to repair them.
Strengths and Weaknesses of the Lovins Stemmer
Strengths:
- Early Influence: One of the pioneering algorithms in the field of stemming, influencing later stemmers.
- Rule-Based Approach: The predefined rules can be quite effective for many common English words.
Weaknesses:
- Complexity: The large number of rules can make it more complex to implement and maintain.
- Over-Stemming and Under-Stemming: As with many rule-based stemmers, it can sometimes produce stems that are either too short (over-stemming) or fail to stem adequately (under-stemming).
- Lack of Context Sensitivity: It does not consider the context in which words appear, which can lead to incorrect stems.
Dawson Stemmer
The Dawson Stemmer is an extension of the Lovins stemmer. In the Dawson Stemmer, suffixes are stored in reversed form, indexed by their length and last letter.
Characteristics of the Dawson Stemmer
- Extensive Suffix List: The Dawson Stemmer includes a very comprehensive list of suffixes, far more extensive than the Lovins Stemmer.
- Rules and Conditions: It applies a set of rules and conditions to determine the appropriate removal of suffixes.
- Dictionary Lookup: Often involves looking up words in a dictionary to confirm the validity of stems.
- Hierarchical Removal: Suffixes are removed in a hierarchical manner, ensuring that more common suffixes are handled before rarer ones.
Implementing the Dawson Stemmer
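The sketch below is a heavily simplified illustration of Dawson-style suffix replacement; the real algorithm works from a reverse-indexed table of about 1,200 suffixes rather than the short demonstration table used here: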
class DawsonStemmer:
    def __init__(self):
        # A simplified list of suffixes for demonstration purposes, ordered
        # roughly longest-first so that longer matches win.
        self.suffixes = {
            'ization': 'ize', 'ational': 'ate', 'fulness': 'ful',
            'ousness': 'ous', 'iveness': 'ive', 'tional': 'tion',
            'biliti': 'ble', 'lessli': 'less', 'entli': 'ent',
            'ation': 'ate', 'alism': 'al', 'aliti': 'al',
            'arism': 'ar', 'iviti': 'ive', 'ical': 'ic',
            'ness': '', 'ance': '', 'ence': '', 'able': '',
            'ible': '', 'ment': '', 'ent': '', 'ism': '',
            'ate': '', 'iti': '', 'ous': '', 'ive': '',
            'ize': '', 'al': '', 'er': '', 'ic': ''
        }

    def stem(self, word):
        # Replace the first matching suffix with its mapped stem ending.
        for suffix, replacement in self.suffixes.items():
            if word.endswith(suffix):
                return word[:-len(suffix)] + replacement
        return word

# Example usage
dawson_stemmer = DawsonStemmer()

# Sample words
words = ["organization", "nationalization", "happiness", "running", "joyfulness", "activation"]

# Stem words using the simplified Dawson stemmer
stemmed_words = [dawson_stemmer.stem(word) for word in words]
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
Output Explanation
For the given list of words, the stemmer above produces:
- Original Words: [‘organization’, ‘nationalization’, ‘happiness’, ‘running’, ‘joyfulness’, ‘activation’]
- Stemmed Words: [‘organize’, ‘nationalize’, ‘happi’, ‘running’, ‘joyful’, ‘activate’]
(‘running’ is untouched because the simplified table has no ‘ing’ rule.)
Strengths and Weaknesses of the Dawson Stemmer
Strengths:
- Comprehensive: The extensive suffix list and hierarchical approach provide more accurate stemming.
- Detailed Rules: The detailed rules can help in achieving better precision in various contexts.
Weaknesses:
- Complexity: The large number of rules and suffixes make it more complex to implement and maintain.
- Performance: The extensive processing can impact performance, especially on large datasets.
- Over-Stemming and Under-Stemming: Despite the comprehensive rules, there can still be cases of over-stemming (e.g., reducing a word too much) or under-stemming (e.g., not reducing a word enough).
Krovetz Stemmer
This stemming algorithm was proposed in 1993 by Robert Krovetz. Here are the steps that the Krovetz Stemmer follows:
- Convert the plural form of a word to its singular form.
- Convert the past tense of a word to its present tense and remove the suffix ‘ing.’
Example: ‘children’ -> ‘child’
Characteristics of the Krovetz Stemmer
- Hybrid Approach: Combines rule-based stemming with dictionary lookups to ensure valid lemmas.
- Context Sensitivity: More sensitive to context than traditional stemmers, aiming to produce meaningful stems.
- Linguistically Accurate: Focuses on maintaining the linguistic integrity of words, reducing over-stemming and under-stemming.
- Performance: Efficient in terms of both speed and accuracy, making it suitable for information retrieval tasks.
Key Features
- Dictionary Lookup: The stemmer uses a dictionary to verify the validity of stems.
- Rule-Based Reduction: If a word is not found in the dictionary, it applies a series of transformation rules.
- Iterative Process: Continuously checks and refines the stem until a valid form is found.
Implementation:
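The following is a minimal sketch of the dictionary-validated approach, assuming a small in-memory word set. The three recoding steps (undoing a doubled consonant, restoring a dropped ‘e’, and turning a final ‘i’ back into ‘y’) are simplified stand-ins for Krovetz's real transformation rules: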
class KrovetzStemmer:
    def __init__(self, dictionary):
        self.dictionary = dictionary
        # Suffixes tried in order; 'ful' is included so 'joyful' is handled.
        self.suffixes = ['ing', 'ed', 'ness', 'ful', 'ly', 's', 'es', 'er',
                         'ation', 'ment', 'able', 'ible', 'al', 'ic']

    def is_valid_word(self, word):
        return word in self.dictionary

    def stem_word(self, word):
        # Words already in the dictionary are returned unchanged.
        if self.is_valid_word(word):
            return word
        for suffix in self.suffixes:
            if word.endswith(suffix):
                candidate = word[:-len(suffix)]
                if self.is_valid_word(candidate):
                    return candidate
                # Recoding 1: undo consonant doubling ('runn' -> 'run').
                if len(candidate) > 1 and candidate[-1] == candidate[-2]:
                    if self.is_valid_word(candidate[:-1]):
                        return candidate[:-1]
                # Recoding 2: restore a dropped final 'e' ('organiz' -> 'organize').
                if self.is_valid_word(candidate + 'e'):
                    return candidate + 'e'
                # Recoding 3: turn a final 'i' back into 'y' ('happi' -> 'happy').
                if candidate.endswith('i') and self.is_valid_word(candidate[:-1] + 'y'):
                    return candidate[:-1] + 'y'
        return word

# Example usage
dictionary = {"run", "happy", "organize", "reliable", "joy", "activate"}
krovetz_stemmer = KrovetzStemmer(dictionary)

# Sample words
words = ["running", "happiness", "nationalization", "organization", "reliable", "joyful"]

# Stem words using the simplified Krovetz stemmer
stemmed_words = [krovetz_stemmer.stem_word(word) for word in words]
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
Output Explanation
For the given list of words, the stemmer above produces:
- Original Words: [‘running’, ‘happiness’, ‘nationalization’, ‘organization’, ‘reliable’, ‘joyful’]
- Stemmed Words: [‘run’, ‘happy’, ‘nationalization’, ‘organize’, ‘reliable’, ‘joy’]
‘nationalization’ passes through unchanged because its candidate stem, ‘nationalize’, is not in the tiny sample dictionary; this is a direct illustration of the dictionary dependency noted under the weaknesses below.
Strengths and Weaknesses of the Krovetz Stemmer
Strengths:
- Linguistic Accuracy: Produces more linguistically accurate stems compared to traditional rule-based stemmers.
- Reduced Over-Stemming: Avoids the problem of over-stemming by verifying stems against a dictionary.
- Context Sensitivity: Takes the context into account, ensuring that the stems are meaningful.
Weaknesses:
- Dictionary Dependency: Relies heavily on the quality and comprehensiveness of the dictionary used.
- Complexity: More complex to implement compared to simpler stemmers like Porter or Snowball.
- Limited Language Support: Primarily designed for English, with limited support for other languages.
N-Gram Stemmer
An N-gram is a contiguous sequence of N items from a given text. In the context of an N-Gram Stemmer, these items are typically characters. For instance:
- A 2-gram (bigram) of “running” would be: “ru”, “un”, “nn”, “ni”, “in”, “ng”.
- A 3-gram (trigram) of “running” would be: “run”, “unn”, “nni”, “nin”, “ing”.
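A small helper makes this sliding-window generation concrete (the function name is ours, for illustration):

def char_ngrams(word, n):
    # Every contiguous slice of length n in the word.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("running", 2))  # ['ru', 'un', 'nn', 'ni', 'in', 'ng']
print(char_ngrams("running", 3))  # ['run', 'unn', 'nni', 'nin', 'ing']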
N-Gram Counts
To understand the stems of words, we analyze the frequency of these N-grams in a corpus of text. Here’s how we can proceed:
- Generate N-Grams: Generate all possible N-grams for each word in the corpus.
- Count N-Grams: Count the frequency of each N-gram across the entire corpus.
- Identify Common N-Grams: Identify the most frequent N-grams which likely indicate common substrings or stems.
- Stem Words: Use these common N-grams to identify and extract stems from words.
Characteristics of an N-Gram Stemmer
- Based on N-Grams: Uses sequences of characters (usually) to determine the stem of a word.
- Contextual: Can be more context-aware than simple rule-based stemmers by considering common substrings across different words.
- Statistical: Often relies on statistical methods to determine the most likely stem based on frequency and distribution of N-grams.
How it Works
- Generate N-Grams: Create N-grams from the words in a corpus.
- Frequency Analysis: Analyze the frequency of these N-grams to find common substrings that appear in multiple words.
- Identify Stems: Use these common substrings to identify and extract stems from words.
Implementation:
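Below is a toy sketch of this idea: words that share an n-gram are clustered, and each cluster's longest common prefix, when it is long enough to be useful, is taken as a shared stem. Real n-gram stemmers typically use more robust similarity statistics (for example, Dice's coefficient over the sets of shared n-grams):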
from collections import defaultdict

class NGramStemmer:
    def __init__(self, n=3):
        self.n = n
        self.ngrams = defaultdict(set)   # n-gram -> set of words containing it
        self.stems = {}                  # word -> learned stem

    def generate_ngrams(self, word):
        # Every contiguous slice of length n in the word.
        return [word[i:i + self.n] for i in range(len(word) - self.n + 1)]

    def fit(self, corpus):
        for word in corpus:
            for ngram in self.generate_ngrams(word):
                self.ngrams[ngram].add(word)
        # Words sharing an n-gram are candidate relatives; take their longest
        # common prefix as a shared stem, keeping the shortest useful prefix
        # found for each word.
        for ngram, words in self.ngrams.items():
            if len(words) > 1:
                common_prefix = self.longest_common_prefix(words)
                if len(common_prefix) < self.n:
                    continue  # ignore degenerate (empty or very short) prefixes
                for word in words:
                    if word not in self.stems or len(common_prefix) < len(self.stems[word]):
                        self.stems[word] = common_prefix

    def longest_common_prefix(self, words):
        if not words:
            return ""
        shortest_word = min(words, key=len)
        for i, char in enumerate(shortest_word):
            for other in words:
                if other[i] != char:
                    return shortest_word[:i]
        return shortest_word

    def stem(self, word):
        # Fall back to the word itself when no stem was learned for it.
        return self.stems.get(word, word)

# Example usage
corpus = ["running", "runner", "ran", "runs", "happiness", "happy", "unhappily", "organization", "organize"]
ngram_stemmer = NGramStemmer(n=3)
ngram_stemmer.fit(corpus)

# Sample words
words = ["running", "happiness", "organization", "unhappily"]

# Stem words using the N-Gram stemmer
stemmed_words = [ngram_stemmer.stem(word) for word in words]
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
Output Explanation
For the given corpus and list of words, the stemmer above produces:
- Original Words: [‘running’, ‘happiness’, ‘organization’, ‘unhappily’]
- Stemmed Words: [‘run’, ‘happiness’, ‘organiz’, ‘unhappily’]
‘happiness’ and ‘unhappily’ share n-grams such as ‘hap’ and ‘app’, but the ‘un-’ prefix leaves them with no common prefix, so they are returned unchanged; this is an example of the under-stemming weakness listed below. ‘organiz’ also shows that n-gram stems need not be valid words.
Strengths and Weaknesses of N-Gram Stemmer
Strengths:
- Context-Aware: By considering common substrings, it can be more context-aware than simple rule-based stemmers.
- Flexible: Can handle a variety of word forms by focusing on statistical patterns in the data.
- Language-Independent: Does not rely on language-specific rules, making it adaptable to multiple languages.
Weaknesses:
- Complexity: Requires sufficient data and computational resources to generate and analyze N-grams.
- Over-Stemming: May sometimes over-stem by focusing on overly common substrings.
- Under-Stemming: May fail to find a common stem if the substrings are not sufficiently similar.
Snowball Stemmer
Compared with the Porter Stemmer, the Snowball Stemmer can also handle non-English words; because it supports several languages, it is often described as a multilingual stemmer. Snowball stemmers can likewise be imported from the NLTK package.
The Snowball Stemmer is based on a small string-processing language, also called Snowball, designed by Martin Porter, and it is among the most widely used stemmers. The English Snowball stemmer is also referred to as the Porter2 stemmer: it is somewhat more aggressive than the original Porter Stemmer, and the improvements it adds also give it greater computational speed.
Characteristics of the Snowball Stemmer
- Improved Algorithm: Enhances the rules and procedures of the original Porter Stemmer to improve accuracy.
- Language Support: Designed to be easily adaptable to multiple languages.
- Rule-Based: Utilizes a set of predefined rules for suffix removal.
- Efficient: Provides a good balance between computational efficiency and stemming accuracy.
How it Works
The Snowball Stemmer works by applying a series of transformations to a word, typically involving the removal of common suffixes. The process involves multiple steps, with specific rules for each step to ensure that the word is correctly reduced to its stem.
Implementation
pip install nltk
import nltk
from nltk.stem.snowball import SnowballStemmer
# Initialize the Snowball Stemmer for English
stemmer = SnowballStemmer("english")
# Sample words
words = ["running", "happiness", "organization", "unhappily", "activate", "relational", "conditional"]
# Stem words using Snowball Stemmer
stemmed_words = [stemmer.stem(word) for word in words]
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
Output Explanation:
For the given list of words, the output might be:
- Original Words: [‘running’, ‘happiness’, ‘organization’, ‘unhappily’, ‘activate’, ‘relational’, ‘conditional’]
- Stemmed Words: [‘run’, ‘happi’, ‘organ’, ‘unhappili’, ‘activ’, ‘relat’, ‘condit’]
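Because Snowball is a family of stemmers rather than a single algorithm, the same NLTK class serves other languages. The snippet below lists the available languages and stems a German plural (the output comment is what we would expect from NLTK's German stemmer):

from nltk.stem.snowball import SnowballStemmer

# Snowball ships stemmers for many languages; NLTK lists them by name.
print(SnowballStemmer.languages)

# The same API stems German text.
german_stemmer = SnowballStemmer("german")
print(german_stemmer.stem("katzen"))  # expected: 'katz'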
Key Features of the Snowball Stemmer
- Stepwise Reduction: Words are reduced in multiple steps, each designed to handle specific types of suffixes.
- Portability: The algorithm is designed to be easily adapted to different languages by modifying the set of rules.
- Efficiency: Strikes a balance between the complexity of the rules and the computational efficiency.
Strengths and Weaknesses of the Snowball Stemmer
Strengths:
- Improved Accuracy: More accurate than the original Porter Stemmer due to refined rules.
- Language Support: Supports multiple languages, making it versatile for different text corpora.
- Efficiency: Efficient in terms of computational resources, making it suitable for large datasets.
Weaknesses:
- Rule Complexity: The complexity of the rules can make it difficult to understand and modify.
- Over-Stemming and Under-Stemming: Like any rule-based stemmer, it can still produce over-stemming or under-stemming in some cases.
Lancaster Stemmer
The Lancaster stemmer is more aggressive than the Porter and Snowball stemmers. It is fast, but its aggressiveness means short words can be mangled, and its stems are often less readable than those produced by the Snowball stemmer. The Lancaster stemmer stores its rules externally and applies them with an iterative algorithm.
Characteristics of the Lancaster Stemmer
- Iterative Process: Applies rules repeatedly until no more rules can be applied.
- Aggressive Stemming: Often produces shorter stems, which can lead to over-stemming.
- Rule-Based: Utilizes a predefined set of rules for suffix stripping.
- Fast and Efficient: Known for its speed and simplicity.
How it Works
The Lancaster Stemmer works by applying a series of transformation rules to a word, with each rule specifying a suffix to remove and sometimes a replacement string. These rules are applied iteratively until no more rules can be applied to the word. This iterative process continues to shorten the word step-by-step.
Implementation:
Using NLTK
NLTK library installation:
pip install nltk
import nltk
from nltk.stem import LancasterStemmer
# Initialize the Lancaster Stemmer
stemmer = LancasterStemmer()
# Sample words
words = ["running", "happiness", "organization", "unhappily", "activate", "relational", "conditional"]
# Stem words using Lancaster Stemmer
stemmed_words = [stemmer.stem(word) for word in words]
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
Output Explanation
For the given list of words, the output might be:
- Original Words: [‘running’, ‘happiness’, ‘organization’, ‘unhappily’, ‘activate’, ‘relational’, ‘conditional’]
- Stemmed Words: [‘run’, ‘happy’, ‘org’, ‘unhappy’, ‘activ’, ‘relat’, ‘condit’]
Key Features of the Lancaster Stemmer
- Rule Application: Applies a comprehensive set of rules to strip suffixes from words.
- Aggressive Stemming: Known for its aggressive approach, which can lead to over-stemming where words are reduced more than necessary.
- Iterative Nature: Continues to apply rules iteratively until no more rules can be applied, ensuring thorough stemming.
Example Rules
Here are some illustrative rules in the spirit of the Lancaster rule table (the real rules are stored in a compact reversed-suffix notation and may recode the stem as well as remove a suffix); a toy sketch implementing rules of this flavor follows the list:
- Remove “ing” if the resulting stem is long enough, then undo the doubled consonant: running -> run
- Remove “ness” and recode a final “i” to “y”: happiness -> happy
- Remove “ization”: organization -> organ
- Remove “ily” and recode to a final “y”: unhappily -> unhappy
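As a toy illustration of the iterative mechanism, the sketch below applies rules of this flavor until none fires. The rule set is invented for the example and is not Lancaster's real, reverse-coded table; note that it stops at ‘organ’, whereas NLTK's Lancaster stemmer reaches ‘org’:

# (suffix, replacement, minimum stem length) - invented rules for illustration.
RULES = [
    ("ization", "", 3),   # organization -> organ
    ("ness", "", 3),      # happiness -> happi (recoded below)
    ("ing", "", 3),       # running -> runn (recoded below)
    ("ily", "y", 3),      # unhappily -> unhappy
    ("i", "y", 4),        # happi -> happy
    ("nn", "n", 2),       # runn -> run (undo consonant doubling)
]

def toy_lancaster(word):
    # Apply rules repeatedly until no rule fires, as the Lancaster
    # algorithm does with its real table.
    changed = True
    while changed:
        changed = False
        for suffix, replacement, min_stem in RULES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[:-len(suffix)] + replacement
                changed = True
                break
    return word

for w in ["running", "happiness", "organization", "unhappily"]:
    print(w, "->", toy_lancaster(w))
# running -> run, happiness -> happy, organization -> organ, unhappily -> unhappy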
Strengths and Weaknesses of the Lancaster Stemmer
Strengths:
- Efficiency: The iterative and rule-based approach makes it computationally efficient.
- Simplicity: Easy to implement and understand due to its straightforward rule application.
- Aggressive Stemming: Can be beneficial for certain applications where aggressive reduction of words is desired.
Weaknesses:
- Over-Stemming: The aggressive nature can lead to over-stemming, where words are reduced too much, potentially losing meaningful distinctions.
- Less Linguistic Accuracy: Compared to stemmers like Porter or Snowball, the Lancaster Stemmer may produce stems that are less linguistically accurate.
- Limited Language Support: Primarily designed for English and may not be as easily adapted to other languages.