pivot_table() Method: Pivot DataFrame with Aggregation Operation

Date published: 2018-11-23

Category: Data Analysis

Subcategory: Data Wrangling

Tags: pivot table, python, pandas

Pivot tables allow us to perform group-bys on columns and specify aggregate metrics for columns too. This data analysis technique is very popular in GUI spreadsheet applications and also works well in Python using the pandas package and the DataFrame pivot_table() method.
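To make the link between a pivot table and a group-by concrete, here is a minimal sketch on a tiny made-up DataFrame (df_sales and its columns are illustrative assumptions, not part of this tutorial's datasets). It computes the same per-group sum both ways.

```python
import pandas as pd

# Made-up illustrative data: three sales records across two regions.
df_sales = pd.DataFrame({
    "region": ["east", "east", "west"],
    "units": [10, 5, 7],
})

# Pivot table: one row per unique region, with units summed within each region.
pivot_result = pd.pivot_table(data=df_sales, index="region", values="units", aggfunc="sum")

# The same aggregation written as an explicit group-by.
groupby_result = df_sales.groupby("region")[["units"]].sum()

print(pivot_result)
print(groupby_result)
```

Both results have one row per region and a single units column holding the per-region total.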

Import Modules

In [25]:

                
import pandas as pd
import seaborn as sns
import numpy as np

Example 1: Pivot Tables with Flights Dataset

Get Data

Let's get the flights dataset included in the seaborn library and assign it to the DataFrame df_flights.

In [26]:

                
df_flights = sns.load_dataset('flights')

Preview the first few rows of df_flights.

Each row holds one month of flight history. The passengers column is the total number of passengers who flew that month.

In [27]:

                
df_flights.head()

Out[27]:

	year	month	passengers
0	1949	January	112
1	1949	February	118
2	1949	March	132
3	1949	April	129
4	1949	May	121

Implement Pivot Tables

I want to know the total number of passengers who flew in each year. To get it, we'll call the pandas pivot_table() function and set the following arguments:

data to be our DataFrame df_flights
index to be 'year' since that's the column from df_flights that we want to appear as a unique value in each row
values as 'passengers' since that's the column we want to apply some aggregate operation on
aggfunc to 'sum' since we want to sum (aka total) up all values in passengers that belong to a unique year

In [28]:

                
pd.pivot_table(data=df_flights, index='year', values='passengers', aggfunc='sum')

Out[28]:

year
1949    1520
1950    1676
1951    2042
1952    2364
1953    2700
1954    2867
1955    3408
1956    3939
1957    4421
1958    4572
1959    5140
1960    5714
Name: passengers, dtype: int64

We can see above that the total number of passengers increased every year. However, this pivot table is poorly labeled: there's no column header for the values on the right. To fix this, we'll cast this outputted series to a DataFrame and rename the aggregated column to be clearer. (The output above is a Series; recent pandas versions return a DataFrame from pivot_table() here, and the steps below work the same way in either case.)

Use the reset_index() method to cast our series to a DataFrame with a new index and then chain the rename() method to rename the 'passengers' column to 'total_passengers'.

In [29]:

                
pd.pivot_table(data=df_flights, index='year', values='passengers', aggfunc='sum').reset_index().rename(columns={'passengers': 'total_passengers'})

Out[29]:

	year	total_passengers
0	1949	1520
1	1950	1676
2	1951	2042
3	1952	2364
4	1953	2700
5	1954	2867
6	1955	3408
7	1956	3939
8	1957	4421
9	1958	4572
10	1959	5140
11	1960	5714
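Since the pivot, reset_index(), and rename() chain recurs throughout this tutorial, it can be wrapped in a small function. The sketch below is one possible way to do that; pivot_summary is a hypothetical helper name and df_demo is made-up data shaped like df_flights, neither of which comes from pandas or seaborn.

```python
import pandas as pd

def pivot_summary(df, index, values, aggfunc, new_name):
    """Hypothetical helper: pivot, flatten the index, and relabel the aggregated column."""
    table = pd.pivot_table(data=df, index=index, values=values, aggfunc=aggfunc)
    return table.reset_index().rename(columns={values: new_name})

# Made-up illustrative data, shaped like df_flights.
df_demo = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["January", "February", "January", "February"],
    "passengers": [100, 110, 120, 130],
})

yearly_totals = pivot_summary(df_demo, "year", "passengers", "sum", "total_passengers")
print(yearly_totals)
```

The helper assumes values is a single column name, which matches how the flights examples use it.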

Now, I want to know the total number of passengers who flew in each calendar month across the dataset. We'll call pivot_table() with the same arguments as above, except we'll set index to 'month' since that's the column from df_flights that we want to appear as a unique value in each row.

In [30]:

                
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='sum').reset_index().rename(columns={'passengers': 'total_passengers'})

Out[30]:

	month	total_passengers
0	January	2901
1	February	2820
2	March	3242
3	April	3205
4	May	3262
5	June	3740
6	July	4216
7	August	4213
8	September	3629
9	October	3199
10	November	2794
11	December	3142

Our results indicate most people flew in the summer months of July and August.
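Reading the largest totals off an unsorted table gets harder as the number of rows grows. One option is to chain sort_values() onto the pivot so the busiest months appear first. The sketch below uses made-up numbers (df_demo is an illustrative assumption, not the flights data).

```python
import pandas as pd

# Made-up illustrative data: two observations for each of three months.
df_demo = pd.DataFrame({
    "month": ["January", "July", "August", "January", "July", "August"],
    "passengers": [100, 150, 140, 120, 170, 165],
})

monthly_totals = (
    pd.pivot_table(data=df_demo, index="month", values="passengers", aggfunc="sum")
    .reset_index()
    .rename(columns={"passengers": "total_passengers"})
    # Largest totals first, so the busiest months are at the top.
    .sort_values("total_passengers", ascending=False)
)
print(monthly_totals)
```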

Now, I want to know the average number of passengers who flew in each calendar month. We'll call pivot_table() with the same arguments as the previous operation, except we'll set aggfunc to 'mean' since we want to find the mean (aka average) number of passengers that flew in each unique month.

I also rounded the average_number_of_passengers column to 1 decimal place so it's easier to read.

In [31]:

                
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='mean').reset_index().rename(columns={'passengers': 'average_number_of_passengers'}).round(1)

Out[31]:

	month	average_number_of_passengers
0	January	241.8
1	February	235.0
2	March	270.2
3	April	267.1
4	May	271.8
5	June	311.7
6	July	351.3
7	August	351.1
8	September	302.4
9	October	266.6
10	November	232.8
11	December	261.8

Now, I want to know the maximum number of passengers who flew in each calendar month. We'll call pivot_table() with the same arguments as the previous operation, except we'll set aggfunc to 'max' since we want to find the maximum (aka largest) number of passengers that flew in each unique month.

In [32]:

                
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='max').reset_index().rename(columns={'passengers': 'max_number_of_passengers'})

Out[32]:

	month	max_number_of_passengers
0	January	417
1	February	391
2	March	419
3	April	461
4	May	472
5	June	535
6	July	622
7	August	606
8	September	508
9	October	461
10	November	390
11	December	432

Example 2: Pivot Tables with Tips Dataset

Get Data

Let's get the tips dataset included in the seaborn library and assign it to the DataFrame df_tips.

In [33]:

                
df_tips = sns.load_dataset('tips')

Each row represents a unique meal for a party of people at a restaurant; the dataset includes the following fields:

column name	column description
`total_bill`	financial amount of meal in U.S. dollars
`tip`	financial amount of the meal's tip in U.S. dollars
`sex`	sex of the person who paid the bill
`smoker`	whether the party included smokers (Yes or No)
`day`	day of week
`time`	meal name (Lunch or Dinner)
`size`	count of people eating meal

Preview the first 5 rows of df_tips.

In [34]:

                
df_tips.head()

Out[34]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

Implement Pivot Tables

For each day and meal type, I'm curious to find the median bill amount. So, from pandas, we'll call the pivot_table() method and set the following arguments:

data to be our DataFrame df_tips
index to be ['day', 'time'] since we want to aggregate by both of those columns so each row represents a unique type of meal for a day
values as 'total_bill' since that's the column we want to apply some aggregate operation on
aggfunc to 'median' since we want to find the median value in our total_bill column for each day's meal type in our dataset

In [35]:

                
pd.pivot_table(data=df_tips, index=['day', 'time'], values='total_bill', aggfunc='median').reset_index().rename(columns={'total_bill': 'median_total_bill'})

Out[35]:

	day	time	median_total_bill
0	Thur	Lunch	16.000
1	Thur	Dinner	18.780
2	Fri	Lunch	13.420
3	Fri	Dinner	18.665
4	Sat	Lunch	NaN
5	Sat	Dinner	18.240
6	Sun	Lunch	NaN
7	Sun	Dinner	19.630

Next, for each day's meal type, I want both the maximum tip and the median total bill. We can do this in one pivot table operation too! We'll call pivot_table() and set the following arguments:

data to be our DataFrame df_tips
index to be ['day', 'time'] since we want to aggregate by both of those columns so each row represents a unique type of meal for a day
values as ['total_bill', 'tip'] since we want to perform a specific aggregate operation on each of those columns
aggfunc to {'total_bill': 'median', 'tip': 'max'} since we want to find the median value for the total_bill column and the maximum value for the tip column

In [36]:

                
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc={'total_bill': 'median', 'tip': 'max'}).reset_index().rename(columns={'tip': 'max_tip', 'total_bill': 'median_total_bill'})

Out[36]:

	day	time	max_tip	median_total_bill
0	Thur	Lunch	6.70	16.000
1	Thur	Dinner	3.00	18.780
2	Fri	Lunch	3.48	13.420
3	Fri	Dinner	4.73	18.665
4	Sat	Dinner	10.00	18.240
5	Sun	Dinner	6.50	19.630
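The dictionary passed to aggfunc can also map a column to a list of functions, which yields one output column per function. The sketch below uses made-up data (df_demo is an illustrative assumption) and joins the resulting two-level column labels into flat names; the "_" naming scheme is just one possible convention.

```python
import pandas as pd

# Made-up illustrative data, shaped like df_tips.
df_demo = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri"],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
    "tip": [1.0, 3.0, 4.0, 6.0],
})

table = pd.pivot_table(
    data=df_demo,
    index=["day", "time"],
    values=["total_bill", "tip"],
    aggfunc={"total_bill": ["median", "max"], "tip": ["max"]},
)

# Columns are now two-level (column name, function name); join them into flat labels.
table.columns = ["_".join(levels) for levels in table.columns]
table = table.reset_index()
print(table)
```

Flattening the labels this way removes the need for a separate rename() call, because each column name already states which aggregation produced it.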

Next, for each day's meal type, I want the total amount of tips and the total of all bills. I'd also like an extra row holding the grand total of the total_tip and total_bill columns. To implement this, we'll call pivot_table() and set the following arguments:

data to be our DataFrame df_tips
index to be ['day', 'time'] since we want to aggregate by both of those columns so each row represents a unique type of meal for a day
values as ['total_bill', 'tip'] since we want to perform a specific aggregate operation on each of those columns
aggfunc to 'sum' since we want to find the sum (aka total) for each column passed to the values argument
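margins to True to append a final row (labeled All by default) that applies the same aggregation across all rows, which here gives the grand total of each value column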

In [37]:

                
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True).reset_index().rename(columns={'tip': 'total_tip'})

Out[37]:

	day	time	total_tip	total_bill
0	Thur	Lunch	168.83	1077.55
1	Thur	Dinner	3.00	18.78
2	Fri	Lunch	16.68	89.92
3	Fri	Dinner	35.28	235.96
4	Sat	Lunch	NaN	NaN
5	Sat	Dinner	260.40	1778.40
6	Sun	Lunch	NaN	NaN
7	Sun	Dinner	247.39	1627.16
8	All		731.58	4827.77

Instead of the final row saying All, I'd prefer it say total. We can make this change in the pivot_table() call itself: reuse the code above and add the margins_name argument, set to 'total'.

In [38]:

                
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True, margins_name='total').reset_index().rename(columns={'tip': 'total_tip'})

Out[38]:

	day	time	total_tip	total_bill
0	Thur	Lunch	168.83	1077.55
1	Thur	Dinner	3.00	18.78
2	Fri	Lunch	16.68	89.92
3	Fri	Dinner	35.28	235.96
4	Sat	Lunch	NaN	NaN
5	Sat	Dinner	260.40	1778.40
6	Sun	Lunch	NaN	NaN
7	Sun	Dinner	247.39	1627.16
8	total		731.58	4827.77

I don't like that the output above contains NaN values. NaN (not a number) appears where a day and time combination has no rows in the dataset, so there was nothing to aggregate. We can fill in those NaN values with 0.

We can make this change in the pivot_table() call itself: reuse the code above and add the fill_value argument, set to 0.

In [39]:

                
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True, margins_name='total', fill_value=0).reset_index().rename(columns={'tip': 'total_tip'})

Out[39]:

	day	time	total_tip	total_bill
0	Thur	Lunch	168.83	1077.55
1	Thur	Dinner	3.00	18.78
2	Fri	Lunch	16.68	89.92
3	Fri	Dinner	35.28	235.96
4	Sat	Lunch	0.00	0.00
5	Sat	Dinner	260.40	1778.40
6	Sun	Lunch	0.00	0.00
7	Sun	Dinner	247.39	1627.16
8	total		731.58	4827.77
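The margins row is computed by applying aggfunc to all of the rows at once, so it is only a sum when aggfunc is 'sum'. The sketch below, on made-up data (df_demo and the 'overall' label are illustrative assumptions), uses 'mean' to show that the margin is the overall mean rather than a total of the per-day means.

```python
import pandas as pd

# Made-up illustrative data: two Thursday bills and one Friday bill.
df_demo = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri"],
    "total_bill": [10.0, 20.0, 60.0],
})

table = pd.pivot_table(
    data=df_demo,
    index="day",
    values="total_bill",
    aggfunc="mean",
    margins=True,
    margins_name="overall",
)

# Per-day means are 15.0 (Thur) and 60.0 (Fri); the margin is the mean of all three bills.
print(table)
```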

Aggregate Operations

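Common aggregate methods available on a pandas groupby() object, many of which can also be passed by name to the aggfunc argument of pivot_table() (for example 'count' or 'std'), are: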

method	description
`sum()`	summation
`mean()`	average
`count()`	count of non-null values
`size()`	count of all rows, including those with null values
`max()`	maximum value
`min()`	minimum value
`std()`	standard deviation
`median()`	median
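The difference between count() and size() only shows up when a group contains missing values. Here is a minimal sketch on made-up data (df_demo is an illustrative assumption) in which one Thursday tip is missing.

```python
import numpy as np
import pandas as pd

# Made-up illustrative data: the second Thursday tip is missing.
df_demo = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri"],
    "tip": [2.0, np.nan, 3.0],
})

grouped = df_demo.groupby("day")["tip"]
non_null_counts = grouped.count()  # skips the missing tip
row_counts = grouped.size()        # counts every row in the group

# The same non-null count through a pivot table.
pivot_counts = pd.pivot_table(data=df_demo, index="day", values="tip", aggfunc="count")

print(non_null_counts)
print(row_counts)
print(pivot_counts)
```

For Thur, count() reports 1 while size() reports 2, because size() includes the row whose tip is missing.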