Data cleaning is the process of preparing data for analysis by removing or fixing records that are incorrect, incomplete, irrelevant, or duplicated within a dataset. It is one of the most important stages of a machine learning project and plays a significant part in building a reliable model.
Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values within a dataset. It's like preparing your ingredients before cooking: you want everything in order before you start analyzing or visualizing.
1. Handling Missing Data
Data often contains missing values that need to be addressed for accurate analysis. Pandas provides several methods to handle missing data.
- Detect Missing Data: Use `.isnull()` or `.notnull()` to detect missing values.
- Example:

```python
df.isnull()  # Returns a DataFrame with True where values are missing
```
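As a quick illustration on a small made-up DataFrame, the detection methods combine nicely with `.sum()` to count missing values per column:

```python
import pandas as pd
import numpy as np

# Illustrative DataFrame with a couple of missing values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Dana'],
    'Age': [25, np.nan, 31, 28],
})

print(df.isnull())         # True where a value is missing
print(df.isnull().sum())   # Count of missing values per column (Name: 1, Age: 1)
print(df.notnull().all())  # Whether each column is fully populated
```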
2. Removing Duplicates
Duplicate data can lead to inaccurate results. Pandas allows you to easily detect and remove duplicate rows.
- Detect Duplicates: Use `.duplicated()` to identify duplicates.
- Example:

```python
df.duplicated()  # Returns True for duplicated rows
```

- Remove Duplicates: Use `.drop_duplicates()` to remove duplicate rows.
- Example:

```python
df = df.drop_duplicates()  # Removes duplicate rows
```
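A minimal sketch with made-up data: by default `.drop_duplicates()` compares all columns, but the `subset` and `keep` parameters let you deduplicate on a key such as a customer ID instead.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': [1, 1, 2, 3, 3],
    'Purchase': [100, 100, 250, 80, 90],
})

# Exact duplicate rows (all columns identical)
print(df.duplicated())  # True only for the second (1, 100) row

# Drop exact duplicates vs. deduplicate on a key column
exact = df.drop_duplicates()                     # 4 rows remain
by_id = df.drop_duplicates(subset='Customer_ID',
                           keep='first')         # 3 rows: first row per customer
```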
3. Data Transformation
Converting data into the correct format is crucial for analysis. Pandas provides functions for type conversion and other transformations.
- Change Data Type: Use `.astype()` to change the data type of a column.
- Example:

```python
df['Age'] = df['Age'].astype(int)  # Converts Age to integer type
```

- String Operations: Use the `.str` accessor for string transformations such as uppercasing, lowercasing, or replacing substrings.
- Example:

```python
df['Name'] = df['Name'].str.upper()  # Converts all names to uppercase
```
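A small illustrative sketch that combines both kinds of transformation: numbers stored as strings are cast to integers, and messy name strings are trimmed and normalized.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': ['25', '31', '28'],               # numbers stored as strings
    'Name': ['  alice ', 'BOB', 'Cara'],     # inconsistent casing/whitespace
})

df['Age'] = df['Age'].astype(int)                 # string -> int
df['Name'] = df['Name'].str.strip().str.title()   # trim whitespace, Title Case

print(df['Age'].dtype)      # an integer dtype
print(df['Name'].tolist())  # ['Alice', 'Bob', 'Cara']
```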
4. Handling Outliers
Outliers can distort data analysis results. Pandas can help in detecting and removing outliers.
- Detect Outliers: Use statistical methods or filters to identify outliers.
- Example:

```python
df[df['Salary'] > df['Salary'].quantile(0.95)]  # Selects values above the 95th percentile
```

- Remove Outliers: Filter out extreme values based on domain knowledge.
- Example:

```python
df = df[df['Salary'] < 100000]  # Keeps only rows where Salary is below 100,000
```
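Another common statistical filter, sketched here on made-up salaries, is the interquartile range (IQR) rule: flag values outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`.

```python
import pandas as pd

df = pd.DataFrame({'Salary': [42000, 48000, 51000, 55000, 60000, 250000]})

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
q1, q3 = df['Salary'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]  # the 250,000 row
cleaned = df[df['Salary'].between(lower, upper)]
```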
5. Renaming Columns
Columns may need to be renamed for consistency or clarity. Use `.rename()` to rename columns.
Example:

```python
df.rename(columns={'old_name': 'new_name'}, inplace=True)
```
6. Data Binning
Binning groups continuous data into discrete intervals or categories. This is useful for segmenting data.
Example:

```python
bins = [0, 18, 35, 60, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
```
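Run end to end on a small made-up column, the same binning looks like this. Note that `pd.cut` intervals are right-inclusive by default, so a value exactly on a bin edge (such as 18) falls into the lower bin:

```python
import pandas as pd

df = pd.DataFrame({'Age': [5, 17, 18, 30, 45, 70]})

bins = [0, 18, 35, 60, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
# Intervals are (0, 18], (18, 35], (35, 60], (60, 100] by default,
# so Age 18 is labeled 'Child'
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
print(df)
```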
7. Feature Engineering
Feature engineering involves creating new features from existing data. Pandas allows you to create new columns based on conditions or calculations.
Example:

```python
df['Income_per_Age'] = df['Income'] / df['Age']
```
8. Scaling and Normalization
Scaling and normalization adjust data to a common scale or range, which can be important for certain models. Pandas can normalize data by applying mathematical transformations.
Example:

```python
df['Salary'] = (df['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())
```
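A minimal sketch of min-max normalization on made-up salaries, written to a new column so the original values are kept for inspection:

```python
import pandas as pd

df = pd.DataFrame({'Salary': [30000, 50000, 70000]})

# Min-max normalization maps the column onto the [0, 1] range
s = df['Salary']
df['Salary_Scaled'] = (s - s.min()) / (s.max() - s.min())
print(df['Salary_Scaled'].tolist())  # [0.0, 0.5, 1.0]
```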
9. Concatenating and Merging Data
Combining multiple datasets into a single DataFrame is a common task. Use `pd.concat()` and `pd.merge()` for this purpose.
- Concatenating: Combine datasets along a particular axis.
- Example:

```python
df_combined = pd.concat([df1, df2], axis=0)  # Concatenates rows
```

- Merging: Merge two DataFrames based on common columns.
- Example:

```python
df_merged = pd.merge(df1, df2, on='ID')  # Merges based on the 'ID' column
```
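A small illustration with made-up frames: `pd.merge` defaults to an inner join (only keys present in both frames survive), while `how='left'` keeps every row of the left frame and fills unmatched columns with NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [2, 3], 'City': ['Paris', 'Oslo']})

# Inner merge (default): only ID 2 appears in both frames
inner = pd.merge(df1, df2, on='ID')             # 1 row
# Left merge: keeps both rows of df1; Alice's City becomes NaN
left = pd.merge(df1, df2, on='ID', how='left')  # 2 rows

# Concatenation simply stacks frames along an axis
stacked = pd.concat([df1, df1], axis=0, ignore_index=True)  # 4 rows
```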
Problem Statement: A retail company wants to analyze customer purchase data to understand buying patterns, but the raw data is messy, containing missing values, duplicates, and inconsistent data formats. The company needs to clean and prepare the data for accurate insights.
Dataset Description:
- Columns: `Customer_ID`, `Name`, `Age`, `Gender`, `Purchase_Amount`, `Product_Category`, `Purchase_Date`
- Issues:
  - Missing values in `Age` and `Purchase_Amount`
  - Duplicate entries
  - Inconsistent formatting in `Gender` (e.g., 'M', 'Male', 'F', 'Female')
  - Outliers in `Purchase_Amount` (values significantly higher than the norm)
Steps for Data Cleaning and Preparation:
1. Loading Data into Pandas
The first step is to load the data into a Pandas DataFrame for further analysis.
```python
import pandas as pd

# Load dataset
df = pd.read_csv('customer_data.csv')
```
2. Handling Missing Data
- Missing values in the `Age` and `Purchase_Amount` columns are handled by filling `Age` with the median age and `Purchase_Amount` with the mean value.

```python
# Fill missing Age with the median value
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing Purchase_Amount with the mean value
df['Purchase_Amount'] = df['Purchase_Amount'].fillna(df['Purchase_Amount'].mean())
```
3. Removing Duplicates
- Some rows have duplicate customer entries. Removing them ensures data integrity.

```python
# Remove duplicate rows
df = df.drop_duplicates()
```
4. Standardizing Categorical Data
- The `Gender` column has inconsistent values ('M', 'Male', 'F', 'Female'). Standardizing it ensures consistent analysis.

```python
# Standardize the Gender column
df['Gender'] = df['Gender'].replace({'M': 'Male', 'F': 'Female'})
```
5. Handling Outliers
- Outliers in the `Purchase_Amount` column (values far above the average) are identified and filtered out.

```python
# Keep only values at or below the 99th percentile of Purchase_Amount
upper_limit = df['Purchase_Amount'].quantile(0.99)
df = df[df['Purchase_Amount'] <= upper_limit]
```
6. Feature Engineering
- The company is interested in how much each customer spends per month, so a `Monthly_Expenditure` column is created by extracting the month from `Purchase_Date` and aggregating the purchase amounts.

```python
# Convert Purchase_Date to datetime format
df['Purchase_Date'] = pd.to_datetime(df['Purchase_Date'])

# Create a new column for the month of purchase
df['Purchase_Month'] = df['Purchase_Date'].dt.to_period('M')

# Calculate monthly expenditure per customer
monthly_expenditure = df.groupby(['Customer_ID', 'Purchase_Month'])['Purchase_Amount'].sum().reset_index()
monthly_expenditure.rename(columns={'Purchase_Amount': 'Monthly_Expenditure'}, inplace=True)
```
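On a tiny made-up purchase log, the same aggregation step behaves like this: two January purchases by customer 1 are summed into a single monthly row.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': [1, 1, 2],
    'Purchase_Date': ['2024-01-05', '2024-01-20', '2024-02-03'],
    'Purchase_Amount': [100.0, 50.0, 80.0],
})

df['Purchase_Date'] = pd.to_datetime(df['Purchase_Date'])
df['Purchase_Month'] = df['Purchase_Date'].dt.to_period('M')

monthly = (df.groupby(['Customer_ID', 'Purchase_Month'])['Purchase_Amount']
             .sum()
             .reset_index()
             .rename(columns={'Purchase_Amount': 'Monthly_Expenditure'}))
# Customer 1 spent 150.0 in 2024-01; customer 2 spent 80.0 in 2024-02
print(monthly)
```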
7. Exporting the Cleaned Data
- After cleaning and preparing the data, the final DataFrame is exported for further analysis or machine learning models.

```python
# Export the cleaned dataset to CSV
df.to_csv('cleaned_customer_data.csv', index=False)
```
Results and Insights:
After cleaning the data, the retail company was able to generate useful insights, such as:
- Average Monthly Expenditure: They identified average spending per customer per month.
- Customer Segmentation: They segmented customers by gender and age using accurate, consistent data.
- Outlier Analysis: Removing outlier transactions led to more accurate averages and customer behavior analysis.