Data Analysis Data Wrangling Tutorial

# pivot_table() Method: Pivot DataFrame with Aggregation Operation

Pivot tables allow us to perform group-bys on columns and specify aggregate metrics for columns too. This data analysis technique is very popular in GUI spreadsheet applications and also works well in Python using the pandas package and the DataFrame pivot_table() method.

### Import Modules¶

In [25]:
import pandas as pd
import seaborn as sns
import numpy as np


### Example 1: Pivot Tables with Flights Dataset¶

#### Get Data¶

Let's get the flights dataset included in the seaborn library and assign it to the DataFrame df_flights.

In [26]:
df_flights = sns.load_dataset('flights')


Preview the first few rows of df_flights.

Each row represents a month's flight history details. The passengers column represents that total number of passengers that flew that month.

In [27]:
df_flights.head()

Out[27]:
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121

#### Implement Pivot Tables¶

I want to know the sum of passengers that flew on planes for each year. So, from pandas, we'll call the pivot_table() method and set the following arguments:

• data to be our DataFrame df_flights
• index to be 'year' since that's the column from df_flights that we want to appear as a unique value in each row
• values as 'passengers' since that's the column we want to apply some aggregate operation on
• aggfunc to 'sum' since we want to sum (aka total) up all values in passengers that belong to a unique year
In [28]:
pd.pivot_table(data=df_flights, index='year', values='passengers', aggfunc='sum')

Out[28]:
year
1949    1520
1950    1676
1951    2042
1952    2364
1953    2700
1954    2867
1955    3408
1956    3939
1957    4421
1958    4572
1959    5140
1960    5714
Name: passengers, dtype: int64

We can see above that every year, the total number of passengers that flew increased each year. However, this pivot table is poorly documented. There's no column name or label for the values on the right. To fix this, we'll cast this outputted series to a DataFrame and rename the aggregated column to be clearer.

Use the reset_index() method to cast our series to a DataFrame with a new index and then chain the rename() method to rename the 'passengers' column to 'total_passengers'.

In [29]:
pd.pivot_table(data=df_flights, index='year', values='passengers', aggfunc='sum').reset_index().rename(columns={'passengers': 'total_passengers'})

Out[29]:
year total_passengers
0 1949 1520
1 1950 1676
2 1951 2042
3 1952 2364
4 1953 2700
5 1954 2867
6 1955 3408
7 1956 3939
8 1957 4421
9 1958 4572
10 1959 5140
11 1960 5714

Now, I want to know the sum of passengers that flew per month in the dataset. So, from pandas, we'll call the the pivot_table() method and include all of the same arguments above, except we'll set the index to be month since that's the column from df_flights that we want to appear as a unique value in each row.

In [30]:
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='sum').reset_index().rename(columns={'passengers': 'total_passengers'})

Out[30]:
month total_passengers
0 January 2901
1 February 2820
2 March 3242
3 April 3205
4 May 3262
5 June 3740
6 July 4216
7 August 4213
8 September 3629
9 October 3199
10 November 2794
11 December 3142

Our results indicate most people flew in the summer months of July and August.

Now, I want to know the average number of passengers that flew per month in the dataset. So, from pandas, we'll call the the pivot_table() method and include all of the same arguments from the previous operation, except we'll set the aggfunc to mean since we want to find the mean (aka average) number of passengers that flew in each unique month.

I also rounded the average_number_of_passengers column to 1 decimal place so it's easier to read.

In [31]:
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='mean').reset_index().rename(columns={'passengers': 'average_number_of_passengers'}).round(1)

Out[31]:
month average_number_of_passengers
0 January 241.8
1 February 235.0
2 March 270.2
3 April 267.1
4 May 271.8
5 June 311.7
6 July 351.3
7 August 351.1
8 September 302.4
9 October 266.6
10 November 232.8
11 December 261.8

Now, I want to know the maximum number of passengers that flew per month in the dataset. So, from pandas, we'll call the the pivot_table() method and include all of the same arguments from the previous operation, except we'll set the aggfunc to 'max' since we want to find the maximum (aka largest) number of passengers that flew in each unique month.

In [32]:
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='max').reset_index().rename(columns={'passengers': 'max_number_of_passengers'})

Out[32]:
month max_number_of_passengers
0 January 417
1 February 391
2 March 419
3 April 461
4 May 472
5 June 535
6 July 622
7 August 606
8 September 508
9 October 461
10 November 390
11 December 432

### Example 2: Pivot Tables with Tips Dataset¶

#### Get Data¶

Let's get the tips dataset included in the seaborn library and assign it to the DataFrame df_tips.

In [33]:
df_tips = sns.load_dataset('tips')


Each row represents a unique meal for a party of people at a restaurant; the dataset includes the following fields:

column name column description
total_bill financial amount of meal in U.S. dollars
tip financial amount of the meal's tip in U.S. dollars
sex gender of server
smoker boolean to represent if server smokes or not
day day of week
time meal name (Lunch or Dinner)
size count of people eating meal

Preview the first 5 rows of df_tips.

In [34]:
df_tips.head()

Out[34]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

#### Implement Pivot Tables¶

For each day and meal type, I'm curious to find the median bill amount. So, from pandas, we'll call the pivot_table() method and set the following arguments:

• data to be our DataFrame df_tips
• index to be ['day', 'time'] since we want to aggregate by both of those columns so each row represents a unique type of meal for a day
• values as 'total_bill' since that's the column we want to apply some aggregate operation on
• aggfunc to 'median' since we want to find the median value in our total_bill column for each day's meal type in our dataset
In [35]:
pd.pivot_table(data=df_tips, index=['day', 'time'], values='total_bill', aggfunc='median').reset_index().rename(columns={'total_bill': 'median_total_bill'})

Out[35]:
day time median_total_bill
0 Thur Lunch 16.000
1 Thur Dinner 18.780
2 Fri Lunch 13.420
3 Fri Dinner 18.665
4 Sat Lunch NaN
5 Sat Dinner 18.240
6 Sun Lunch NaN
7 Sun Dinner 19.630

Now, I'm curious for each day's meal type, what was the maximum tip value and median total bill. We can do this in one pivot table operation too! From pandas, we'll call the pivot_table() method and set the following arguments:

• data to be our DataFrame df_tips
• index to be ['day', 'time'] since we want to aggregate by both of those columns so each row represents a unique type of meal for a day
• values as ['total_bill', 'tip'] since we want to perform a specific aggregate operation on each of those columns
• aggfunc to {'total_bill': 'median', 'tip': 'max'} since we want to find the median value for the total_bill column and the maximum value for the tip column
In [36]:
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc={'total_bill': 'median', 'tip': 'max'}).reset_index().rename(columns={'tip': 'max_tip', 'total_bill': 'median_total_bill'})

Out[36]:
day time max_tip median_total_bill
0 Thur Lunch 6.70 16.000
1 Thur Dinner 3.00 18.780
2 Fri Lunch 3.48 13.420
3 Fri Dinner 4.73 18.665
4 Sat Dinner 10.00 18.240
5 Sun Dinner 6.50 19.630

Now, I'm curious for each day's meal type, what was the total amount of tips and total bills. I'd also like a new row for the sum of the total_tip and total_bill columns. To implement this, from pandas, we'll call the pivot_table() method and set the following arguments:

• data to be our DataFrame df_tips
• index to be ['day', 'time'] since we want to aggregate by both of those columns so each row represents a unique type of meal for a day
• values as ['total_bill', 'tip'] since we want to perform a specific aggregate operation on each of those columns
• aggfunc to 'sum' since we want to find the sum (aka total) for each column passed to the values argument
• margins to True to add up the numerical columns
In [37]:
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True).reset_index().rename(columns={'tip': 'total_tip', 'total_bill': 'total_bill'})

Out[37]:
day time total_tip total_bill
0 Thur Lunch 168.83 1077.55
1 Thur Dinner 3.00 18.78
2 Fri Lunch 16.68 89.92
3 Fri Dinner 35.28 235.96
4 Sat Lunch NaN NaN
5 Sat Dinner 260.40 1778.40
6 Sun Lunch NaN NaN
7 Sun Dinner 247.39 1627.16
8 All 731.58 4827.77

Instead of the final row saying All, I'd prefer it say total. We can perform this change in our pivot_table() method. Utilize all the same code as above and we'll add a new argument called margins_name and set it to 'total'.

In [38]:
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True, margins_name='total').reset_index().rename(columns={'tip': 'total_tip', 'total_bill': 'total_bill'})

Out[38]:
day time total_tip total_bill
0 Thur Lunch 168.83 1077.55
1 Thur Dinner 3.00 18.78
2 Fri Lunch 16.68 89.92
3 Fri Dinner 35.28 235.96
4 Sat Lunch NaN NaN
5 Sat Dinner 260.40 1778.40
6 Sun Lunch NaN NaN
7 Sun Dinner 247.39 1627.16
8 total 731.58 4827.77

I don't like how in the output above, we see NaN values in our columns. Those essentially mean there were no numerical values to perform in the aggregate operations so we're left with a value of NaN equivalent to not a number. We can fill in those NaN values with 0.

We can perform this change in our pivot_table() method. Utilize all the same code as above and we'll add a new argument called fill_value and set it to 0.

In [39]:
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True, margins_name='total', fill_value=0).reset_index().rename(columns={'tip': 'total_tip', 'total_bill': 'total_bill'})

Out[39]:
day time total_tip total_bill
0 Thur Lunch 168.83 1077.55
1 Thur Dinner 3.00 18.78
2 Fri Lunch 16.68 89.92
3 Fri Dinner 35.28 235.96
4 Sat Lunch 0.00 0.00
5 Sat Dinner 260.40 1778.40
6 Sun Lunch 0.00 0.00
7 Sun Dinner 247.39 1627.16
8 total 731.58 4827.77

### Aggregate Operations¶

Other aggregate methods you could perform with a groupby() method in pandas are:

method description
sum() summation
mean() average
count() count of all values
size() count of non-null values
max() maximum value
min() minimum value
std() standard deviation
median() median