pivot_table() Method: Pivot DataFrame with Aggregation Operation¶
Date published: 2018-11-23
Category: Data Analysis
Subcategory: Data Wrangling
Tags: pivot table, python, pandas
Pivot tables allow us to perform group-bys on columns and specify aggregate metrics for columns too. This data analysis technique is very popular in GUI spreadsheet applications and also works well in Python using the pandas package and the DataFrame pivot_table() method.
Import Modules¶
import pandas as pd
import seaborn as sns
import numpy as np
Example 1: Pivot Tables with Flights Dataset¶
Get Data¶
Let's get the flights
dataset included in the seaborn
library and assign it to the DataFrame df_flights
.
df_flights = sns.load_dataset('flights')
Preview the first few rows of df_flights
.
Each row represents a month's flight history details. The passengers
column represents that total number of passengers that flew that month.
df_flights.head()
year | month | passengers | |
---|---|---|---|
0 | 1949 | January | 112 |
1 | 1949 | February | 118 |
2 | 1949 | March | 132 |
3 | 1949 | April | 129 |
4 | 1949 | May | 121 |
Implement Pivot Tables¶
I want to know the sum of passengers that flew on planes for each year. So, from pandas, we'll call the pivot_table()
method and set the following arguments:
data
to be our DataFramedf_flights
index
to be'year'
since that's the column fromdf_flights
that we want to appear as a unique value in each rowvalues
as'passengers'
since that's the column we want to apply some aggregate operation onaggfunc
to'sum'
since we want to sum (aka total) up all values inpassengers
that belong to a unique year
pd.pivot_table(data=df_flights, index='year', values='passengers', aggfunc='sum')
year 1949 1520 1950 1676 1951 2042 1952 2364 1953 2700 1954 2867 1955 3408 1956 3939 1957 4421 1958 4572 1959 5140 1960 5714 Name: passengers, dtype: int64
We can see above that every year, the total number of passengers that flew increased each year. However, this pivot table is poorly documented. There's no column name or label for the values on the right. To fix this, we'll cast this outputted series to a DataFrame and rename the aggregated column to be clearer.
Use the reset_index() method to cast our series to a DataFrame with a new index and then chain the rename() method to rename the 'passengers'
column to 'total_passengers'
.
pd.pivot_table(data=df_flights, index='year', values='passengers', aggfunc='sum').reset_index().rename(columns={'passengers': 'total_passengers'})
year | total_passengers | |
---|---|---|
0 | 1949 | 1520 |
1 | 1950 | 1676 |
2 | 1951 | 2042 |
3 | 1952 | 2364 |
4 | 1953 | 2700 |
5 | 1954 | 2867 |
6 | 1955 | 3408 |
7 | 1956 | 3939 |
8 | 1957 | 4421 |
9 | 1958 | 4572 |
10 | 1959 | 5140 |
11 | 1960 | 5714 |
Now, I want to know the sum of passengers that flew per month in the dataset. So, from pandas, we'll call the the pivot_table()
method and include all of the same arguments above, except we'll set the index
to be month
since that's the column from df_flights
that we want to appear as a unique value in each row.
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='sum').reset_index().rename(columns={'passengers': 'total_passengers'})
month | total_passengers | |
---|---|---|
0 | January | 2901 |
1 | February | 2820 |
2 | March | 3242 |
3 | April | 3205 |
4 | May | 3262 |
5 | June | 3740 |
6 | July | 4216 |
7 | August | 4213 |
8 | September | 3629 |
9 | October | 3199 |
10 | November | 2794 |
11 | December | 3142 |
Our results indicate most people flew in the summer months of July and August.
Now, I want to know the average number of passengers that flew per month in the dataset. So, from pandas, we'll call the the pivot_table()
method and include all of the same arguments from the previous operation, except we'll set the aggfunc
to mean
since we want to find the mean (aka average) number of passengers that flew in each unique month.
I also rounded the average_number_of_passengers
column to 1 decimal place so it's easier to read.
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='mean').reset_index().rename(columns={'passengers': 'average_number_of_passengers'}).round(1)
month | average_number_of_passengers | |
---|---|---|
0 | January | 241.8 |
1 | February | 235.0 |
2 | March | 270.2 |
3 | April | 267.1 |
4 | May | 271.8 |
5 | June | 311.7 |
6 | July | 351.3 |
7 | August | 351.1 |
8 | September | 302.4 |
9 | October | 266.6 |
10 | November | 232.8 |
11 | December | 261.8 |
Now, I want to know the maximum number of passengers that flew per month in the dataset. So, from pandas, we'll call the the pivot_table()
method and include all of the same arguments from the previous operation, except we'll set the aggfunc
to 'max'
since we want to find the maximum (aka largest) number of passengers that flew in each unique month.
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='max').reset_index().rename(columns={'passengers': 'max_number_of_passengers'})
month | max_number_of_passengers | |
---|---|---|
0 | January | 417 |
1 | February | 391 |
2 | March | 419 |
3 | April | 461 |
4 | May | 472 |
5 | June | 535 |
6 | July | 622 |
7 | August | 606 |
8 | September | 508 |
9 | October | 461 |
10 | November | 390 |
11 | December | 432 |
Example 2: Pivot Tables with Tips Dataset¶
Get Data¶
Let's get the tips
dataset included in the seaborn
library and assign it to the DataFrame df_tips
.
df_tips = sns.load_dataset('tips')
Each row represents a unique meal for a party of people at a restaurant; the dataset includes the following fields:
column name | column description |
---|---|
total_bill |
financial amount of meal in U.S. dollars |
tip |
financial amount of the meal's tip in U.S. dollars |
sex |
gender of server |
smoker |
boolean to represent if server smokes or not |
day |
day of week |
time |
meal name (Lunch or Dinner) |
size |
count of people eating meal |
Preview the first 5 rows of df_tips
.
df_tips.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Implement Pivot Tables¶
For each day and meal type, I'm curious to find the median bill amount. So, from pandas, we'll call the pivot_table()
method and set the following arguments:
data
to be our DataFramedf_tips
index
to be['day', 'time']
since we want to aggregate by both of those columns so each row represents a unique type of meal for a dayvalues
as'total_bill'
since that's the column we want to apply some aggregate operation onaggfunc
to'median'
since we want to find the median value in ourtotal_bill
column for each day's meal type in our dataset
pd.pivot_table(data=df_tips, index=['day', 'time'], values='total_bill', aggfunc='median').reset_index().rename(columns={'total_bill': 'median_total_bill'})
day | time | median_total_bill | |
---|---|---|---|
0 | Thur | Lunch | 16.000 |
1 | Thur | Dinner | 18.780 |
2 | Fri | Lunch | 13.420 |
3 | Fri | Dinner | 18.665 |
4 | Sat | Lunch | NaN |
5 | Sat | Dinner | 18.240 |
6 | Sun | Lunch | NaN |
7 | Sun | Dinner | 19.630 |
Now, I'm curious for each day's meal type, what was the maximum tip value and median total bill. We can do this in one pivot table operation too! From pandas, we'll call the pivot_table()
method and set the following arguments:
data
to be our DataFramedf_tips
index
to be['day', 'time']
since we want to aggregate by both of those columns so each row represents a unique type of meal for a dayvalues
as['total_bill', 'tip']
since we want to perform a specific aggregate operation on each of those columnsaggfunc
to{'total_bill': 'median', 'tip': 'max'}
since we want to find the median value for thetotal_bill
column and the maximum value for thetip
column
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc={'total_bill': 'median', 'tip': 'max'}).reset_index().rename(columns={'tip': 'max_tip', 'total_bill': 'median_total_bill'})
day | time | max_tip | median_total_bill | |
---|---|---|---|---|
0 | Thur | Lunch | 6.70 | 16.000 |
1 | Thur | Dinner | 3.00 | 18.780 |
2 | Fri | Lunch | 3.48 | 13.420 |
3 | Fri | Dinner | 4.73 | 18.665 |
4 | Sat | Dinner | 10.00 | 18.240 |
5 | Sun | Dinner | 6.50 | 19.630 |
Now, I'm curious for each day's meal type, what was the total amount of tips and total bills. I'd also like a new row for the sum of the total_tip
and total_bill
columns. To implement this, from pandas, we'll call the pivot_table()
method and set the following arguments:
data
to be our DataFramedf_tips
index
to be['day', 'time']
since we want to aggregate by both of those columns so each row represents a unique type of meal for a dayvalues
as['total_bill', 'tip']
since we want to perform a specific aggregate operation on each of those columnsaggfunc
to'sum'
since we want to find the sum (aka total) for each column passed to thevalues
argumentmargins
toTrue
to add up the numerical columns
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True).reset_index().rename(columns={'tip': 'total_tip', 'total_bill': 'total_bill'})
day | time | total_tip | total_bill | |
---|---|---|---|---|
0 | Thur | Lunch | 168.83 | 1077.55 |
1 | Thur | Dinner | 3.00 | 18.78 |
2 | Fri | Lunch | 16.68 | 89.92 |
3 | Fri | Dinner | 35.28 | 235.96 |
4 | Sat | Lunch | NaN | NaN |
5 | Sat | Dinner | 260.40 | 1778.40 |
6 | Sun | Lunch | NaN | NaN |
7 | Sun | Dinner | 247.39 | 1627.16 |
8 | All | 731.58 | 4827.77 |
Instead of the final row saying All
, I'd prefer it say total
. We can perform this change in our pivot_table()
method. Utilize all the same code as above and we'll add a new argument called margins_name
and set it to 'total'
.
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True, margins_name='total').reset_index().rename(columns={'tip': 'total_tip', 'total_bill': 'total_bill'})
day | time | total_tip | total_bill | |
---|---|---|---|---|
0 | Thur | Lunch | 168.83 | 1077.55 |
1 | Thur | Dinner | 3.00 | 18.78 |
2 | Fri | Lunch | 16.68 | 89.92 |
3 | Fri | Dinner | 35.28 | 235.96 |
4 | Sat | Lunch | NaN | NaN |
5 | Sat | Dinner | 260.40 | 1778.40 |
6 | Sun | Lunch | NaN | NaN |
7 | Sun | Dinner | 247.39 | 1627.16 |
8 | total | 731.58 | 4827.77 |
I don't like how in the output above, we see NaN
values in our columns. Those essentially mean there were no numerical values to perform in the aggregate operations so we're left with a value of NaN
equivalent to not a number. We can fill in those NaN
values with 0
.
We can perform this change in our pivot_table()
method. Utilize all the same code as above and we'll add a new argument called fill_value
and set it to 0
.
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True, margins_name='total', fill_value=0).reset_index().rename(columns={'tip': 'total_tip', 'total_bill': 'total_bill'})
day | time | total_tip | total_bill | |
---|---|---|---|---|
0 | Thur | Lunch | 168.83 | 1077.55 |
1 | Thur | Dinner | 3.00 | 18.78 |
2 | Fri | Lunch | 16.68 | 89.92 |
3 | Fri | Dinner | 35.28 | 235.96 |
4 | Sat | Lunch | 0.00 | 0.00 |
5 | Sat | Dinner | 260.40 | 1778.40 |
6 | Sun | Lunch | 0.00 | 0.00 |
7 | Sun | Dinner | 247.39 | 1627.16 |
8 | total | 731.58 | 4827.77 |
Aggregate Operations¶
Other aggregate methods you could perform with a groupby()
method in pandas are:
method | description |
---|---|
sum() |
summation |
mean() |
average |
count() |
count of all values |
size() |
count of non-null values |
max() |
maximum value |
min() |
minimum value |
std() |
standard deviation |
median() |
median |