Data Cleaning and Preparation with Pandas

Data cleaning is the process of preparing data for analysis by removing or fixing records that are incorrect, incomplete, irrelevant, or duplicated within a dataset. It is one of the most important stages of a machine learning workflow and plays a significant part in building a reliable model.

Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values within a dataset. It’s like preparing your ingredients before cooking; you want everything in order to get the perfect analysis or visualization.

1. Handling Missing Data

Data often contains missing values that need to be addressed for accurate analysis. Pandas provides several methods to handle missing data.

  • Detect Missing Data: Use .isnull() or .notnull() to detect missing values.
  • Example:
df.isnull()  # Returns a DataFrame with True where values are missing
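
Beyond detection, the missing values themselves can either be dropped or filled in. A minimal sketch, assuming a DataFrame df with a numeric Age column:

df_dropped = df.dropna()  # Option 1: drop every row that contains a missing value
df['Age'] = df['Age'].fillna(df['Age'].median())  # Option 2: fill missing Age values with the column median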

2. Removing Duplicates

Duplicate data can lead to inaccurate results. Pandas allows you to easily detect and remove duplicate rows.

  • Detect Duplicates: Use .duplicated() to identify duplicates.
  • Example:
df.duplicated()  # Returns True for duplicated rows

  • Remove Duplicates: Use .drop_duplicates() to remove duplicate rows.
  • Example:
df = df.drop_duplicates()  # Removes duplicate rows
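
When only certain columns should define a duplicate, .drop_duplicates() also accepts subset and keep parameters. A short sketch, assuming a hypothetical Customer_ID column:

df = df.drop_duplicates(subset=['Customer_ID'], keep='first')  # Keeps the first row for each Customer_ID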

3. Data Transformation

Converting data into the correct format is crucial for analysis. Pandas provides functions for type conversion and other transformations.

  • Change Data Type: Use .astype() to change the data type of a column.
  • Example:
df['Age'] = df['Age'].astype(int)  # Converts Age to integer type
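
If a column contains stray non-numeric strings, .astype(int) will raise an error. pd.to_numeric() with errors='coerce' converts what it can and turns the rest into missing values; a sketch on the same hypothetical Age column:

df['Age'] = pd.to_numeric(df['Age'], errors='coerce')  # Invalid entries become NaN instead of raising an error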

  • String Operations: Use the .str accessor for string transformations like uppercasing, lowercasing, or replacing substrings.
  • Example:
df['Name'] = df['Name'].str.upper()  # Converts all names to uppercase
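
The same .str accessor covers other routine clean-up steps. A brief sketch, assuming the Name column may carry stray whitespace:

df['Name'] = df['Name'].str.strip()  # Removes leading and trailing whitespace
df['Name'] = df['Name'].str.title()  # Capitalizes each word, e.g. 'jane doe' -> 'Jane Doe'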

4. Handling Outliers

Outliers can distort data analysis results. Pandas can help in detecting and removing outliers.

  • Detect Outliers: Use statistical methods or filters to identify outliers.
  • Example:
df[df['Salary'] > df['Salary'].quantile(0.95)]  # Selects values above the 95th percentile

  • Remove Outliers: Filter out extreme values based on domain knowledge.
  • Example:
df = df[df['Salary'] < 100000]  # Keeps only rows where Salary is below 100,000
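
A common statistical alternative is the interquartile range (IQR) rule; a sketch on the same Salary column:

q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
df = df[(df['Salary'] >= q1 - 1.5 * iqr) & (df['Salary'] <= q3 + 1.5 * iqr)]  # Keeps values within 1.5 * IQR of the quartiles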

5. Renaming Columns

Columns may need to be renamed for consistency or clarity. Use .rename() to rename columns.

Example:

df.rename(columns={'old_name': 'new_name'}, inplace=True)
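
To enforce a naming convention across every column at once, the column index itself can be transformed; a minimal sketch:

df.columns = df.columns.str.strip().str.lower()  # Strips whitespace and lowercases all column names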

6. Data Binning

Binning groups continuous data into discrete intervals or categories. This is useful for segmenting data.

Example:

bins = [0, 18, 35, 60, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
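
A quick way to check the result is to count how many rows fall into each bin:

df['Age_Group'].value_counts()  # Shows the number of rows per age group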

7. Feature Engineering

Feature engineering involves creating new features from existing data. Pandas allows you to create new columns based on conditions or calculations.

Example:

df['Income_per_Age'] = df['Income'] / df['Age']
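
Condition-based features are just as common. A sketch using NumPy's where(), assuming a hypothetical 50,000 income threshold:

import numpy as np

df['High_Income'] = np.where(df['Income'] > 50000, 1, 0)  # 1 if Income exceeds 50,000, otherwise 0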

8. Scaling and Normalization

Scaling and normalization adjust data to a common scale or range, which can be important for certain models. Pandas can normalize data by applying mathematical transformations.

Example:

df['Salary'] = (df['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())
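
An alternative to this min-max normalization is z-score standardization, which centers a column at 0 with unit variance; a sketch on the same Salary column:

df['Salary'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()  # Z-score standardization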

9. Concatenating and Merging Data

Combining multiple datasets into a single DataFrame is a common task. Use pd.concat() and pd.merge() for this purpose.

  • Concatenating: Combine datasets along a particular axis.
  • Example:
df_combined = pd.concat([df1, df2], axis=0)  # Concatenates rows

  • Merging: Merge two DataFrames based on common columns.
  • Example:
df_merged = pd.merge(df1, df2, on='ID')  # Merges based on the 'ID' column
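
By default pd.merge() performs an inner join, keeping only IDs present in both DataFrames; the how parameter changes this. A sketch on the same df1 and df2:

df_merged = pd.merge(df1, df2, on='ID', how='left')  # Keeps every row of df1, filling unmatched columns with NaN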

Problem Statement: A retail company wants to analyze customer purchase data to understand buying patterns, but the raw data is messy, containing missing values, duplicates, and inconsistent data formats. The company needs to clean and prepare the data for accurate insights.

Dataset Description:

  • Columns: Customer_ID, Name, Age, Gender, Purchase_Amount, Product_Category, Purchase_Date
  • Issues:
    • Missing values in Age and Purchase_Amount
    • Duplicate entries
    • Inconsistent formatting in Gender (e.g., ‘M’, ‘Male’, ‘F’, ‘Female’)
    • Outliers in Purchase_Amount (values significantly higher than the norm)

Steps for Data Cleaning and Preparation:

1. Loading Data into Pandas

The first step is to load the data into a Pandas DataFrame for further analysis.

import pandas as pd

# Load dataset
df = pd.read_csv('customer_data.csv')

2. Handling Missing Data

  • Missing values in the Age and Purchase_Amount columns are handled by filling Age with the median age and Purchase_Amount with the mean value.
# Fill missing Age with the median value
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing Purchase_Amount with the mean value
df['Purchase_Amount'] = df['Purchase_Amount'].fillna(df['Purchase_Amount'].mean())

3. Removing Duplicates

  • Some rows have duplicate customer entries. Removing them ensures data integrity.
# Remove duplicate rows
df = df.drop_duplicates()

4. Standardizing Categorical Data

  • The Gender column has inconsistent values (‘M’, ‘Male’, ‘F’, ‘Female’). Standardizing it ensures consistent analysis.
# Standardize Gender column
df['Gender'] = df['Gender'].replace({'M': 'Male', 'F': 'Female'})
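
If the raw values also vary in case or carry stray whitespace (for example ' male ' or 'FEMALE'), normalizing the strings before mapping makes the step more robust; a hedged sketch:

df['Gender'] = df['Gender'].str.strip().str.capitalize()  # ' male ' -> 'Male', 'FEMALE' -> 'Female'
df['Gender'] = df['Gender'].replace({'M': 'Male', 'F': 'Female'})  # Expands the single-letter codes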

5. Handling Outliers

  • Outliers in the Purchase_Amount column (values far from the average) are identified and filtered.
# Identify outliers in Purchase_Amount (above 99th percentile)
upper_limit = df['Purchase_Amount'].quantile(0.99)
df = df[df['Purchase_Amount'] <= upper_limit]

6. Feature Engineering

  • The company is interested in how much each customer spends per month, so a Monthly_Expenditure summary is created by extracting the month from Purchase_Date and aggregating purchase amounts per customer and month.
# Convert Purchase_Date to datetime format
df['Purchase_Date'] = pd.to_datetime(df['Purchase_Date'])

# Create a new column for the month of purchase
df['Purchase_Month'] = df['Purchase_Date'].dt.to_period('M')

# Calculate Monthly Expenditure per customer
monthly_expenditure = df.groupby(['Customer_ID', 'Purchase_Month'])['Purchase_Amount'].sum().reset_index()
monthly_expenditure.rename(columns={'Purchase_Amount': 'Monthly_Expenditure'}, inplace=True)
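
From here, the average monthly spend per customer (mentioned in the results below) follows from one more groupby; a sketch on the monthly_expenditure table:

avg_monthly = monthly_expenditure.groupby('Customer_ID')['Monthly_Expenditure'].mean().reset_index()
avg_monthly.rename(columns={'Monthly_Expenditure': 'Avg_Monthly_Expenditure'}, inplace=True)  # Average spend per customer per month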

7. Exporting the Cleaned Data

  • After cleaning and preparing the data, the final DataFrame is exported for further analysis or machine learning models.
# Export cleaned dataset to CSV
df.to_csv('cleaned_customer_data.csv', index=False)

Results and Insights:
After cleaning the data, the retail company was able to generate useful insights, such as:

  • Average Monthly Expenditure: They identified the average spending per customer per month.
  • Customer Segmentation: They segmented customers based on gender and age using the now-consistent data.
  • Outlier Analysis: Outlier transactions were removed, leading to more accurate averages and customer behavior analysis.
