pivot_table() Method: Pivot DataFrame with Aggregation Operation¶
Date published: 2018-11-23
Category: Data Analysis
Subcategory: Data Wrangling
Tags: pivot table, python, pandas
Pivot tables allow us to perform group-bys on columns and specify aggregate metrics for columns too. This data analysis technique is very popular in GUI spreadsheet applications and also works well in Python using the pandas package and the DataFrame pivot_table() method.
Import Modules¶
import pandas as pd
import seaborn as sns
import numpy as np
Example 1: Pivot Tables with Flights Dataset¶
Get Data¶
Let's get the flights dataset included in the seaborn library and assign it to the DataFrame df_flights.
df_flights = sns.load_dataset('flights')
Preview the first few rows of df_flights.
Each row represents a month's flight history details. The passengers column represents that total number of passengers that flew that month.
df_flights.head()
| year | month | passengers | |
|---|---|---|---|
| 0 | 1949 | January | 112 |
| 1 | 1949 | February | 118 |
| 2 | 1949 | March | 132 |
| 3 | 1949 | April | 129 |
| 4 | 1949 | May | 121 |
Implement Pivot Tables¶
I want to know the sum of passengers that flew on planes for each year. So, from pandas, we'll call the pivot_table() method and set the following arguments:
datato be our DataFramedf_flightsindexto be'year'since that's the column fromdf_flightsthat we want to appear as a unique value in each rowvaluesas'passengers'since that's the column we want to apply some aggregate operation onaggfuncto'sum'since we want to sum (aka total) up all values inpassengersthat belong to a unique year
pd.pivot_table(data=df_flights, index='year', values='passengers', aggfunc='sum')
year 1949 1520 1950 1676 1951 2042 1952 2364 1953 2700 1954 2867 1955 3408 1956 3939 1957 4421 1958 4572 1959 5140 1960 5714 Name: passengers, dtype: int64
We can see above that every year, the total number of passengers that flew increased each year. However, this pivot table is poorly documented. There's no column name or label for the values on the right. To fix this, we'll cast this outputted series to a DataFrame and rename the aggregated column to be clearer.
Use the reset_index() method to cast our series to a DataFrame with a new index and then chain the rename() method to rename the 'passengers' column to 'total_passengers'.
pd.pivot_table(data=df_flights, index='year', values='passengers', aggfunc='sum').reset_index().rename(columns={'passengers': 'total_passengers'})
| year | total_passengers | |
|---|---|---|
| 0 | 1949 | 1520 |
| 1 | 1950 | 1676 |
| 2 | 1951 | 2042 |
| 3 | 1952 | 2364 |
| 4 | 1953 | 2700 |
| 5 | 1954 | 2867 |
| 6 | 1955 | 3408 |
| 7 | 1956 | 3939 |
| 8 | 1957 | 4421 |
| 9 | 1958 | 4572 |
| 10 | 1959 | 5140 |
| 11 | 1960 | 5714 |
Now, I want to know the sum of passengers that flew per month in the dataset. So, from pandas, we'll call the the pivot_table() method and include all of the same arguments above, except we'll set the index to be month since that's the column from df_flights that we want to appear as a unique value in each row.
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='sum').reset_index().rename(columns={'passengers': 'total_passengers'})
| month | total_passengers | |
|---|---|---|
| 0 | January | 2901 |
| 1 | February | 2820 |
| 2 | March | 3242 |
| 3 | April | 3205 |
| 4 | May | 3262 |
| 5 | June | 3740 |
| 6 | July | 4216 |
| 7 | August | 4213 |
| 8 | September | 3629 |
| 9 | October | 3199 |
| 10 | November | 2794 |
| 11 | December | 3142 |
Our results indicate most people flew in the summer months of July and August.
Now, I want to know the average number of passengers that flew per month in the dataset. So, from pandas, we'll call the the pivot_table() method and include all of the same arguments from the previous operation, except we'll set the aggfunc to mean since we want to find the mean (aka average) number of passengers that flew in each unique month.
I also rounded the average_number_of_passengers column to 1 decimal place so it's easier to read.
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='mean').reset_index().rename(columns={'passengers': 'average_number_of_passengers'}).round(1)
| month | average_number_of_passengers | |
|---|---|---|
| 0 | January | 241.8 |
| 1 | February | 235.0 |
| 2 | March | 270.2 |
| 3 | April | 267.1 |
| 4 | May | 271.8 |
| 5 | June | 311.7 |
| 6 | July | 351.3 |
| 7 | August | 351.1 |
| 8 | September | 302.4 |
| 9 | October | 266.6 |
| 10 | November | 232.8 |
| 11 | December | 261.8 |
Now, I want to know the maximum number of passengers that flew per month in the dataset. So, from pandas, we'll call the the pivot_table() method and include all of the same arguments from the previous operation, except we'll set the aggfunc to 'max' since we want to find the maximum (aka largest) number of passengers that flew in each unique month.
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='max').reset_index().rename(columns={'passengers': 'max_number_of_passengers'})
| month | max_number_of_passengers | |
|---|---|---|
| 0 | January | 417 |
| 1 | February | 391 |
| 2 | March | 419 |
| 3 | April | 461 |
| 4 | May | 472 |
| 5 | June | 535 |
| 6 | July | 622 |
| 7 | August | 606 |
| 8 | September | 508 |
| 9 | October | 461 |
| 10 | November | 390 |
| 11 | December | 432 |
Example 2: Pivot Tables with Tips Dataset¶
Get Data¶
Let's get the tips dataset included in the seaborn library and assign it to the DataFrame df_tips.
df_tips = sns.load_dataset('tips')
Each row represents a unique meal for a party of people at a restaurant; the dataset includes the following fields:
| column name | column description |
|---|---|
total_bill |
financial amount of meal in U.S. dollars |
tip |
financial amount of the meal's tip in U.S. dollars |
sex |
gender of server |
smoker |
boolean to represent if server smokes or not |
day |
day of week |
time |
meal name (Lunch or Dinner) |
size |
count of people eating meal |
Preview the first 5 rows of df_tips.
df_tips.head()
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Implement Pivot Tables¶
For each day and meal type, I'm curious to find the median bill amount. So, from pandas, we'll call the pivot_table() method and set the following arguments:
datato be our DataFramedf_tipsindexto be['day', 'time']since we want to aggregate by both of those columns so each row represents a unique type of meal for a dayvaluesas'total_bill'since that's the column we want to apply some aggregate operation onaggfuncto'median'since we want to find the median value in ourtotal_billcolumn for each day's meal type in our dataset
pd.pivot_table(data=df_tips, index=['day', 'time'], values='total_bill', aggfunc='median').reset_index().rename(columns={'total_bill': 'median_total_bill'})
| day | time | median_total_bill | |
|---|---|---|---|
| 0 | Thur | Lunch | 16.000 |
| 1 | Thur | Dinner | 18.780 |
| 2 | Fri | Lunch | 13.420 |
| 3 | Fri | Dinner | 18.665 |
| 4 | Sat | Lunch | NaN |
| 5 | Sat | Dinner | 18.240 |
| 6 | Sun | Lunch | NaN |
| 7 | Sun | Dinner | 19.630 |
Now, I'm curious for each day's meal type, what was the maximum tip value and median total bill. We can do this in one pivot table operation too! From pandas, we'll call the pivot_table() method and set the following arguments:
datato be our DataFramedf_tipsindexto be['day', 'time']since we want to aggregate by both of those columns so each row represents a unique type of meal for a dayvaluesas['total_bill', 'tip']since we want to perform a specific aggregate operation on each of those columnsaggfuncto{'total_bill': 'median', 'tip': 'max'}since we want to find the median value for thetotal_billcolumn and the maximum value for thetipcolumn
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc={'total_bill': 'median', 'tip': 'max'}).reset_index().rename(columns={'tip': 'max_tip', 'total_bill': 'median_total_bill'})
| day | time | max_tip | median_total_bill | |
|---|---|---|---|---|
| 0 | Thur | Lunch | 6.70 | 16.000 |
| 1 | Thur | Dinner | 3.00 | 18.780 |
| 2 | Fri | Lunch | 3.48 | 13.420 |
| 3 | Fri | Dinner | 4.73 | 18.665 |
| 4 | Sat | Dinner | 10.00 | 18.240 |
| 5 | Sun | Dinner | 6.50 | 19.630 |
Now, I'm curious for each day's meal type, what was the total amount of tips and total bills. I'd also like a new row for the sum of the total_tip and total_bill columns. To implement this, from pandas, we'll call the pivot_table() method and set the following arguments:
datato be our DataFramedf_tipsindexto be['day', 'time']since we want to aggregate by both of those columns so each row represents a unique type of meal for a dayvaluesas['total_bill', 'tip']since we want to perform a specific aggregate operation on each of those columnsaggfuncto'sum'since we want to find the sum (aka total) for each column passed to thevaluesargumentmarginstoTrueto add up the numerical columns
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True).reset_index().rename(columns={'tip': 'total_tip', 'total_bill': 'total_bill'})
| day | time | total_tip | total_bill | |
|---|---|---|---|---|
| 0 | Thur | Lunch | 168.83 | 1077.55 |
| 1 | Thur | Dinner | 3.00 | 18.78 |
| 2 | Fri | Lunch | 16.68 | 89.92 |
| 3 | Fri | Dinner | 35.28 | 235.96 |
| 4 | Sat | Lunch | NaN | NaN |
| 5 | Sat | Dinner | 260.40 | 1778.40 |
| 6 | Sun | Lunch | NaN | NaN |
| 7 | Sun | Dinner | 247.39 | 1627.16 |
| 8 | All | 731.58 | 4827.77 |
Instead of the final row saying All, I'd prefer it say total. We can perform this change in our pivot_table() method. Utilize all the same code as above and we'll add a new argument called margins_name and set it to 'total'.
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True, margins_name='total').reset_index().rename(columns={'tip': 'total_tip', 'total_bill': 'total_bill'})
| day | time | total_tip | total_bill | |
|---|---|---|---|---|
| 0 | Thur | Lunch | 168.83 | 1077.55 |
| 1 | Thur | Dinner | 3.00 | 18.78 |
| 2 | Fri | Lunch | 16.68 | 89.92 |
| 3 | Fri | Dinner | 35.28 | 235.96 |
| 4 | Sat | Lunch | NaN | NaN |
| 5 | Sat | Dinner | 260.40 | 1778.40 |
| 6 | Sun | Lunch | NaN | NaN |
| 7 | Sun | Dinner | 247.39 | 1627.16 |
| 8 | total | 731.58 | 4827.77 |
I don't like how in the output above, we see NaN values in our columns. Those essentially mean there were no numerical values to perform in the aggregate operations so we're left with a value of NaN equivalent to not a number. We can fill in those NaN values with 0.
We can perform this change in our pivot_table() method. Utilize all the same code as above and we'll add a new argument called fill_value and set it to 0.
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True, margins_name='total', fill_value=0).reset_index().rename(columns={'tip': 'total_tip', 'total_bill': 'total_bill'})
| day | time | total_tip | total_bill | |
|---|---|---|---|---|
| 0 | Thur | Lunch | 168.83 | 1077.55 |
| 1 | Thur | Dinner | 3.00 | 18.78 |
| 2 | Fri | Lunch | 16.68 | 89.92 |
| 3 | Fri | Dinner | 35.28 | 235.96 |
| 4 | Sat | Lunch | 0.00 | 0.00 |
| 5 | Sat | Dinner | 260.40 | 1778.40 |
| 6 | Sun | Lunch | 0.00 | 0.00 |
| 7 | Sun | Dinner | 247.39 | 1627.16 |
| 8 | total | 731.58 | 4827.77 |
Aggregate Operations¶
Other aggregate methods you could perform with a groupby() method in pandas are:
| method | description |
|---|---|
sum() |
summation |
mean() |
average |
count() |
count of all values |
size() |
count of non-null values |
max() |
maximum value |
min() |
minimum value |
std() |
standard deviation |
median() |
median |