Named-entity recognition (NER) is a subtask of information extraction that locates named entities mentioned in unstructured text and classifies them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. Once an NER model has been trained on annotated text and a set of entity types, it can analyze new unstructured text automatically, identifying named entities and assigning them to categories based on what it learned in training.
Key concepts of NER
Named entities are not the only concept to understand in NER. The following terms also come up frequently.
POS tagging. Standing for “part-of-speech tagging,” this process assigns labels to words in a text corresponding to their specific part of speech, such as adjectives, verbs, or nouns.
Corpus. This is a collection of texts used for linguistic analysis and training NER models. A corpus can range from a set of news articles to academic journals or even social media posts.
Chunking. This is an NLP technique that groups individual words or phrases into “chunks” based on their syntactic roles, creating meaningful clusters like noun phrases or verb phrases.
Word embeddings. These are dense vector representations of words, capturing their semantic meanings. Word embeddings translate words or phrases into numerical vectors of fixed size, making it easier for machine learning models to process. Tools like Word2Vec and GloVe are popular for generating such embeddings, and they help in understanding the context and relationships between words in a text.
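To make the idea of dense vectors concrete, the sketch below compares words by cosine similarity. The three-dimensional vectors are made-up toy values, not output from Word2Vec or GloVe (real embeddings typically have hundreds of dimensions), and `cosine_similarity` is a helper defined here purely for illustration.

```python
import math

# Toy 3-dimensional embeddings, hand-written for illustration only.
# Real Word2Vec or GloVe vectors are learned from a corpus.
toy_embeddings = {
    "london": [0.9, 0.1, 0.2],
    "paris": [0.8, 0.2, 0.1],
    "banana": [0.1, 0.9, 0.7],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


city_sim = cosine_similarity(toy_embeddings["london"], toy_embeddings["paris"])
fruit_sim = cosine_similarity(toy_embeddings["london"], toy_embeddings["banana"])
print(f"london vs paris:  {city_sim:.3f}")
print(f"london vs banana: {fruit_sim:.3f}")
```

With these toy values the two city names score much closer to each other than a city and a fruit do, which is the kind of relationship a model can exploit when deciding whether a word is likely to be a location.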
Examples
Input:
Red Bull Racing Honda, the four-time Formula-1 World
Champion team, has chosen Oracle Cloud Infrastructure
(OCI) as their infrastructure partner.
Output:
Red Bull Racing Honda [ORG] 1.0000
four-time [QUANTITY/NUMBER] 1.0000
Formula-1 World [EVENT] 0.9705
Oracle Cloud Infrastructure (OCI [ORG] 0.9811
Input:
Tesla’s battery innovations in 2023
Output:
“Tesla” is a company, and “2023” is a specific year.
Applications:
Information retrieval
Information retrieval is the process of obtaining information, often from large databases, which is relevant to a specific query or need.
Let’s take the realm of search engines, for example, where NER is used to elevate the precision of search results. For instance, if a researcher inputs “Tesla’s battery innovations in 2023” into a database or search engine, NER discerns that “Tesla” is a company, and “2023” is a specific year. Such a distinction ensures the results fetched are directly about Tesla’s battery-related advancements in 2023, omitting unrelated articles about Tesla or generic battery innovations.
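A minimal sketch of this filtering idea follows. It assumes an NER system has already extracted entities from both the documents and the query; the document titles, entity sets, and the `filter_by_entities` helper are all hypothetical and exist only to illustrate the principle.

```python
# Hypothetical mini-index: each document stores the entities an NER
# system is assumed to have extracted from it beforehand.
documents = [
    {"title": "Tesla unveils new battery cell design",
     "entities": {("Tesla", "ORG"), ("2023", "DATE")}},
    {"title": "Nikola Tesla's early experiments",
     "entities": {("Nikola Tesla", "PERSON"), ("1891", "DATE")}},
    {"title": "Battery recycling trends",
     "entities": {("2023", "DATE")}},
]


def filter_by_entities(docs, query_entities):
    """Keep only documents whose entity set contains every query entity."""
    return [d for d in docs if query_entities <= d["entities"]]


# Entities assumed to have been recognized in the query
# "Tesla's battery innovations in 2023".
query_entities = {("Tesla", "ORG"), ("2023", "DATE")}
results = filter_by_entities(documents, query_entities)
for doc in results:
    print(doc["title"])
```

A real search engine would combine such an entity filter with keyword or semantic ranking rather than use it on its own.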
Content recommendation
Recommender systems suggest relevant content to users based on their behavior, preferences, and interaction history.
Modern content platforms, from news websites to streaming services like Netflix, harness NER to fine-tune their recommendation algorithms. Suppose a user reads several articles about “sustainable travel trends.” NER identifies “sustainable travel” as a distinct topic, prompting the recommendation engine to suggest more articles or documentaries in that niche, providing a tailored content experience for the user.
Automated data entry
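Robotic Process Automation (RPA) refers to software programs that replicate human actions to perform routine business tasks. These programs have nothing to do with hardware robots; they carry out the kind of repetitive work that office staff would otherwise do by hand. NER supports automated data entry by pulling entities such as names, dates, and amounts out of unstructured documents so that they can be written into structured fields.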
Sentiment analysis enhancement
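Sentiment analysis is a technique that combines statistics, NLP, and machine learning to detect and extract subjective content from text. This could include a reviewer’s emotions, opinions, or evaluations regarding a specific topic, event, or the actions of a company. NER enhances sentiment analysis by identifying which entity an opinion is directed at, so that sentiment can be attributed to a specific product, company, or person rather than to the text as a whole.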
How NER works
Fundamentally, NER revolves around two primary steps:
- identifying entities within the text and
- categorizing these entities into distinct groups.
Entity detection
Entity detection, often called mention detection or named entity identification, is the initial and fundamental phase in the NER process. It involves systematically scanning and identifying chunks of text that potentially represent meaningful entities.
Tokenization. At its most basic, a document or sentence is just a long string of characters. Tokenization is the process of breaking this string into meaningful pieces called tokens. In English, tokens are often equivalent to words but can also represent punctuation or other symbols. That kind of segmentation simplifies the subsequent analytical steps by converting the text into manageable units.
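As a rough illustration, a tokenizer for English can be sketched with a single regular expression. This is a simplified stand-in for the tokenizers that NLP libraries ship; the `TOKEN_PATTERN` and `tokenize` names are illustrative, and the pattern ignores many edge cases such as URLs and abbreviations.

```python
import re

# Two alternatives: a run of word characters (optionally joined by
# internal apostrophes or hyphens), or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Elon Musk, the CEO of SpaceX, tweeted.")
print(tokens)
```

Each word and each punctuation mark comes back as a separate token, which gives the later steps discrete units to analyze.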
Feature extraction. Simply splitting the text isn’t enough. The next challenge is to understand the significance of these tokens. This is where feature extraction comes into play.
It involves analyzing the properties of tokens, such as:
- morphological features that deal with the form of words, like their root forms or prefixes;
- syntactic features that focus on the arrangement and relationships of words in sentences; and
- semantic features that capture the inherent meaning of words and can sometimes tap into broader world knowledge or context to better understand a token’s role.
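The sketch below shows what a hand-written feature extractor for a single token might look like. The feature names and the `token_features` helper are illustrative assumptions, not a standard API. Semantic features are left out because they usually come from embeddings or external knowledge sources rather than from the token string itself.

```python
def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    return {
        # Morphological: properties of the word form itself.
        "lower": token.lower(),
        "prefix3": token[:3],
        "suffix3": token[-3:],
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        # Syntactic (a very rough proxy): position and neighbouring tokens.
        "is_first": i == 0,
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


sentence = ["Summer", "played", "amazing", "basketball"]
features = token_features(sentence, 0)
print(features)
```

Dictionaries like this one are a typical input format for classical sequence models, which learn how strongly each feature indicates an entity.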
Once entities have been detected, the next phase of the named entity recognition process is entity classification.
Entity classification
Entity classification involves assigning the identified entities to specific categories or classes based on their semantic significance and context. These categories can range from person and organization to location, date, and myriad other labels depending on the application’s requirements.
A nuanced process, entity classification demands a keen understanding of the context in which entities appear. This classification leverages linguistic, statistical, and sometimes domain-specific knowledge. For example, while “Apple” in a tech sphere might refer to the technology company, in a culinary context, it’s more likely to mean the fruit.
Another example is the sentence “Summer played amazing basketball,” where “Summer” would be classified as a person due to the contextual clue provided by “basketball.” However, with no such clues present, “Summer” might also signify the season. Such ambiguities in natural language often necessitate linguistic analysis or advanced NER models trained on extensive datasets to differentiate between possible meanings.
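The role of context can be illustrated with a deliberately naive classifier that counts cue words around an ambiguous mention. The cue lists, the `FOOD` label, and the `classify_mention` helper are hypothetical; trained NER models learn such associations from data instead of relying on hand-written lists.

```python
# Hypothetical cue words for each reading of an ambiguous mention.
CONTEXT_CUES = {
    "Apple": {
        "ORG": {"iphone", "startup", "ceo", "shares", "software"},
        "FOOD": {"pie", "juice", "recipe", "ripe", "orchard"},
    },
    "Summer": {
        "PERSON": {"played", "said", "basketball", "scored"},
        "DATE": {"last", "next", "during", "vacation", "hot"},
    },
}


def classify_mention(mention, sentence):
    """Pick the label whose cue words overlap most with the sentence."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    scores = {
        label: len(cues & words)
        for label, cues in CONTEXT_CUES[mention].items()
    }
    best_label = max(scores, key=scores.get)
    # No overlapping cue word means the context gave no evidence.
    return best_label if scores[best_label] > 0 else "UNKNOWN"


print(classify_mention("Summer", "Summer played amazing basketball."))
print(classify_mention("Apple", "She baked an apple pie from a family recipe."))
```

When no cue word is present, the function returns `UNKNOWN`, which mirrors the point that a mention with no contextual clues stays ambiguous.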
Approaches to NER
Named entity recognition has evolved significantly over the years, with multiple approaches being developed to tackle its challenges. Here are the common NER methods.
The rule-based method of NER
Rule-based approaches to NER rely on a set of predefined rules or patterns to identify and classify named entities.
These rules are often derived from linguistic insights and are codified into the system using the following techniques.
Regular expressions. This is pattern matching to detect entities based on known structures, such as phone numbers or email addresses.
Dictionary lookups. This means leveraging predefined lists or databases of named entities to find matches in the text. Say a system has a dictionary containing names of famous authors like “Jane Austen,” “Ernest Hemingway,” and “George Orwell.” When it sees the sentence “I recently read a novel by George Orwell,” it can quickly recognize “George Orwell” as a named entity referring to an author based on its lookup list.
Pattern-based rules. This involves reliance on specific linguistic structures to infer entities. A capitalized word in the middle of a sentence might hint at a proper noun. For instance, in the sentence “If you ever want to go to London, make sure to visit the British Museum to immerse yourself in centuries of art and history,” the capitalized word “London” suggests it’s a proper noun referring to a location.
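The three techniques above can be combined in a small rule-based recognizer, sketched below. The regular expressions are simplified and will not match every valid email address or phone format, and the labels (`EMAIL`, `PHONE`, `AUTHOR`, `PROPER_NOUN`) and the `rule_based_ner` helper are names chosen for this illustration.

```python
import re

# 1. Regular expressions for entities with a known surface structure.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

# 2. Dictionary (gazetteer) of known names.
AUTHORS = ["Jane Austen", "Ernest Hemingway", "George Orwell"]

# 3. Pattern-based rule: a capitalized word that directly follows a
#    lowercase letter or comma plus a space, so it is not sentence-initial.
MID_SENTENCE_CAP_RE = re.compile(r"(?<=[a-z,] )[A-Z][a-z]+")


def rule_based_ner(text):
    """Return (entity_text, label) pairs found by the three rule types."""
    entities = []
    entities += [(m.group(), "EMAIL") for m in EMAIL_RE.finditer(text)]
    entities += [(m.group(), "PHONE") for m in PHONE_RE.finditer(text)]
    for name in AUTHORS:
        if name in text:
            entities.append((name, "AUTHOR"))
    # Skip capitalized words already covered by a more specific rule.
    known_words = {word for name, _ in entities for word in name.split()}
    for m in MID_SENTENCE_CAP_RE.finditer(text):
        if m.group() not in known_words:
            entities.append((m.group(), "PROPER_NOUN"))
    return entities


sample = ("I recently read a novel by George Orwell. If you ever want to go "
          "to London, write to info@example.com or call 555-123-4567.")
for entity, label in rule_based_ner(sample):
    print(label, entity)
```

Note that the capitalization rule can only say that "London" looks like a proper noun; deciding that it is a location still requires a dictionary entry or further context.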
Implementation:
Example Using SpaCy
pip install spacy
python -m spacy download en_core_web_sm
import spacy
# Load the SpaCy model
nlp = spacy.load("en_core_web_sm")
# Sample text
text = """
Apple is looking at buying U.K. startup for $1 billion.
Elon Musk, the CEO of SpaceX, tweeted about the launch of the new rocket from Florida.
The Amazon rainforest is located in South America.
"""
# Process the text with SpaCy
doc = nlp(text)
# Print each named entity and its label
for ent in doc.ents:
    print(ent.text, ent.label_)
# Print each entity again, this time with its character offsets in the text
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}, Start: {ent.start_char}, End: {ent.end_char}")
Output Explanation
- Apple – ORG (Organization)
- U.K. – GPE (Geopolitical Entity)
- $1 billion – MONEY (Monetary Value)
- Elon Musk – PERSON (Person)
- SpaceX – ORG (Organization)
- Florida – GPE (Geopolitical Entity)
- Amazon rainforest – LOC (Location)
- South America – LOC (Location)
Example Using NLTK
NLTK doesn’t have a built-in named entity recognizer as sophisticated as SpaCy’s, but it can still perform basic NER using its chunking capabilities.
Install NLTK
You need to install NLTK and download the necessary resources:
pip install nltk
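import nltk
# Download the tokenizer, POS tagger, NE chunker, and word list
# (resource names can differ between NLTK versions)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')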
from nltk import word_tokenize, pos_tag, ne_chunk
# Sample text
text = """
Apple is looking at buying U.K. startup for $1 billion.
Elon Musk, the CEO of SpaceX, tweeted about the launch of the new rocket from Florida.
The Amazon rainforest is located in South America.
"""
# Tokenize the text
tokens = word_tokenize(text)
# Apply part of speech tagging
tagged = pos_tag(tokens)
# Perform named entity recognition
entities = ne_chunk(tagged)
# Print named entities
for chunk in entities:
    # Named-entity chunks are subtrees with a label; plain tokens are tuples
    if hasattr(chunk, 'label'):
        print(chunk.label(), ' '.join(c[0] for c in chunk))
Output Explanation
GPE U.K.
PERSON Elon Musk
ORGANIZATION SpaceX
GPE Florida
GPE South America
Applications of NER
i) Information Extraction:
- Automatically extracting important entities from large corpora of text.
- Example: Extracting company names and financial figures from news articles.
ii) Customer Support:
- Identifying and categorizing entities in customer support tickets.
- Example: Recognizing product names, issue types, and user details.
iii) Search Engines:
- Enhancing search algorithms by identifying entities within queries.
- Example: Understanding that “Apple” in a query is likely referring to the company rather than the fruit.
iv) Social Media Analysis:
- Tracking mentions of brands, places, and public figures on social media platforms.
- Example: Analyzing tweets to identify and categorize references to political figures or locations.