Grouping Data With GroupBy

The groupby() function in Pandas is one of the most powerful and flexible tools for aggregating and summarizing data. It allows you to group rows based on one or more columns and then apply aggregation functions (like sum, mean, count, etc.) on the grouped data. This is especially useful for summarizing large datasets.

How `groupby()` Works

The basic steps in using groupby() are:

Splitting the data into groups based on some criteria (group by one or more columns).
Applying a function to each group independently (e.g., aggregation or transformation).
Combining the results back into a DataFrame.

Syntax of groupby():

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True)

by: Column(s) to group by.

axis: Whether to group rows (axis=0) or columns (axis=1).

as_index: If True, the grouped columns become the index of the result.

Example Data

Let’s say you have the following sales data:

import pandas as pd

data = {
    'Product': ['A', 'B', 'A', 'B', 'C', 'A', 'C', 'B'],
    'Region': ['North', 'South', 'South', 'North', 'South', 'North', 'North', 'South'],
    'Sales': [250, 150, 200, 300, 100, 220, 180, 170]
}

df = pd.DataFrame(data)
print(df)

Output:

  Product Region  Sales
0       A  North    250
1       B  South    150
2       A  South    200
3       B  North    300
4       C  South    100
5       A  North    220
6       C  North    180
7       B  South    170

Basic groupby() with Aggregation

To aggregate the data and summarize it by Product:

grouped = df.groupby('Product').sum()
print(grouped)

Output:

         Sales
Product       
A          670
B          620
C          280

Here, the sales for each product have been summed up. You can also apply other aggregation functions like mean, min, max, count, etc.

Using Multiple Aggregation Functions:

You can apply multiple aggregation functions using agg():

grouped = df.groupby('Product').agg(['sum', 'mean'])
print(grouped)

Output:

         Sales      
           sum   mean
Product               
A          670  223.33
B          620  206.67
C          280  140.00

This shows both the total (sum) and average (mean) sales for each product.

Grouping by Multiple Columns

You can also group by more than one column. For example, to group by both Product and Region:

grouped = df.groupby(['Product', 'Region']).sum()
print(grouped)

Output:

                Sales
Product Region       
A       North     470
        South     200
B       North     300
        South     320
C       North     180
        South     100

This shows the total sales for each combination of Product and Region.

Resetting the Index:

When you group by multiple columns, the grouped columns become part of the DataFrame’s index. If you want to turn them back into regular columns, you can use reset_index():

grouped = df.groupby(['Product', 'Region']).sum().reset_index()
print(grouped)

Output:

  Product Region  Sales
0       A  North    470
1       A  South    200
2       B  North    300
3       B  South    320
4       C  North    180
5       C  South    100

Now Product and Region are regular columns again.

Using Custom Aggregation with `agg()`

The agg() function allows you to apply different aggregation functions to different columns.

grouped = df.groupby('Product').agg({
    'Sales': ['sum', 'mean', 'count']
})
print(grouped)

Output:

         Sales             
           sum   mean count
Product                      
A          670  223.33     3
B          620  206.67     3
C          280  140.00     2

Here, for each product, we’re showing the total sales (sum), average sales (mean), and the number of occurrences (count).

Grouping and Applying Filters

Sometimes, you may want to filter the groups after grouping. For instance, to find products with total sales greater than 500:

grouped = df.groupby('Product').sum()
filtered = grouped[grouped['Sales'] > 500]
print(filtered)

Output:

         Sales
Product       
A          670
B          620

This filters out products with total sales less than or equal to 500.

Group by and Apply Custom Functions

You can apply custom functions to grouped data using apply().

For example, you can define a custom function to calculate the range of sales (i.e., max – min) for each group:

def sales_range(group):
    return group['Sales'].max() - group['Sales'].min()

grouped = df.groupby('Product').apply(sales_range)
print(grouped)

Output:

Product
A    50
B    150
C     80
dtype: int64

Here, we calculate the range of sales for each product.

Iterating Over Groups

You can also iterate over groups created by groupby().

grouped = df.groupby('Product')

for name, group in grouped:
    print(f'Group: {name}')
    print(group)

Output:

Group: A
  Product Region  Sales
0       A  North    250
2       A  South    200
5       A  North    220
Group: B
  Product Region  Sales
1       B  South    150
3       B  North    300
7       B  South    170
Group: C
  Product Region  Sales
4       C  South    100
6       C  North    180

This prints each group separately.

sanodsolutions

Grouping Data With GroupBy

How `groupby()` Works

Example Data

Using Multiple Aggregation Functions:

Grouping by Multiple Columns

Resetting the Index:

Using Custom Aggregation with `agg()`

Grouping and Applying Filters

Leave a Reply Cancel reply

How groupby() Works

Example Data

Using Multiple Aggregation Functions:

Grouping by Multiple Columns

Resetting the Index:

Using Custom Aggregation with agg()

Grouping and Applying Filters

Related Posts

Machine Learning: Transformative Uses and Applications Shaping the Future

Supervised vs. Unsupervised Learning

Reshaping Data with Melt and Pivot

Pivot Tables and Cross-Tabulation

Mastering Time-Based Data Analysis in Pandas: Parsing Dates, Creating Time-Based Indices, and Time-Based Grouping

Leave a Reply Cancel reply

How `groupby()` Works

Using Custom Aggregation with `agg()`