The groupby()
function in Pandas is one of the most powerful and flexible tools for aggregating and summarizing data. It allows you to group rows based on one or more columns and then apply aggregation functions (like sum
, mean
, count
, etc.) on the grouped data. This is especially useful for summarizing large datasets.
How groupby()
Works
The basic steps in using groupby()
are:
- Splitting the data into groups based on some criteria (group by one or more columns).
- Applying a function to each group independently (e.g., aggregation or transformation).
- Combining the results back into a DataFrame.
Syntax of groupby()
:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True)
by
: Column(s) to group by.
axis
: Whether to group rows (axis=0
) or columns (axis=1
).
as_index
: If True
, the grouped columns become the index of the result.
Example Data
Let’s say you have the following sales data:
import pandas as pd
data = {
'Product': ['A', 'B', 'A', 'B', 'C', 'A', 'C', 'B'],
'Region': ['North', 'South', 'South', 'North', 'South', 'North', 'North', 'South'],
'Sales': [250, 150, 200, 300, 100, 220, 180, 170]
}
df = pd.DataFrame(data)
print(df)
Output:
Product Region Sales
0 A North 250
1 B South 150
2 A South 200
3 B North 300
4 C South 100
5 A North 220
6 C North 180
7 B South 170
Basic groupby()
with Aggregation
To aggregate the data and summarize it by Product
:
grouped = df.groupby('Product').sum()
print(grouped)
Output:
Sales
Product
A 670
B 620
C 280
Here, the sales for each product have been summed up. You can also apply other aggregation functions like mean
, min
, max
, count
, etc.
Using Multiple Aggregation Functions:
You can apply multiple aggregation functions using agg()
:
grouped = df.groupby('Product').agg(['sum', 'mean'])
print(grouped)
Output:
Sales
sum mean
Product
A 670 223.33
B 620 206.67
C 280 140.00
This shows both the total (sum
) and average (mean
) sales for each product.
Grouping by Multiple Columns
You can also group by more than one column. For example, to group by both Product
and Region
:
grouped = df.groupby(['Product', 'Region']).sum()
print(grouped)
Output:
Sales
Product Region
A North 470
South 200
B North 300
South 320
C North 180
South 100
This shows the total sales for each combination of Product
and Region
.
Resetting the Index:
When you group by multiple columns, the grouped columns become part of the DataFrame’s index. If you want to turn them back into regular columns, you can use reset_index()
:
grouped = df.groupby(['Product', 'Region']).sum().reset_index()
print(grouped)
Output:
Product Region Sales
0 A North 470
1 A South 200
2 B North 300
3 B South 320
4 C North 180
5 C South 100
Now Product
and Region
are regular columns again.
Using Custom Aggregation with agg()
The agg()
function allows you to apply different aggregation functions to different columns.
grouped = df.groupby('Product').agg({
'Sales': ['sum', 'mean', 'count']
})
print(grouped)
Output:
Sales
sum mean count
Product
A 670 223.33 3
B 620 206.67 3
C 280 140.00 2
Here, for each product, we’re showing the total sales (sum
), average sales (mean
), and the number of occurrences (count
).
Grouping and Applying Filters
Sometimes, you may want to filter the groups after grouping. For instance, to find products with total sales greater than 500:
grouped = df.groupby('Product').sum()
filtered = grouped[grouped['Sales'] > 500]
print(filtered)
Output:
Sales
Product
A 670
B 620
This filters out products with total sales less than or equal to 500.
Group by and Apply Custom Functions
You can apply custom functions to grouped data using apply()
.
For example, you can define a custom function to calculate the range of sales (i.e., max – min) for each group:
def sales_range(group):
return group['Sales'].max() - group['Sales'].min()
grouped = df.groupby('Product').apply(sales_range)
print(grouped)
Output:
Product
A 50
B 150
C 80
dtype: int64
Here, we calculate the range of sales for each product.
Iterating Over Groups
You can also iterate over groups created by groupby()
.
grouped = df.groupby('Product')
for name, group in grouped:
print(f'Group: {name}')
print(group)
Output:
Group: A
Product Region Sales
0 A North 250
2 A South 200
5 A North 220
Group: B
Product Region Sales
1 B South 150
3 B North 300
7 B South 170
Group: C
Product Region Sales
4 C South 100
6 C North 180
This prints each group separately.