# pivot_table() Method: Pivot DataFrame with Aggregation Operation

- Nov 23 • 13 min read
- Key Terms: pivot table, python, pandas

**Pivot tables** allow us to perform group-bys on columns and specify aggregate metrics for columns too. This data analysis technique is very popular in GUI spreadsheet applications and also works well in Python using the pandas package and the DataFrame pivot_table() method.

### Import Modules

```
import pandas as pd
import seaborn as sns
import numpy as np
```

### Example 1: Pivot Tables with Flights Dataset

#### Get Data

Let's get the `flights`

dataset included in the `seaborn`

library and assign it to the DataFrame `df_flights`

.

```
df_flights = sns.load_dataset('flights')
```

Preview the first few rows of `df_flights`

.

Each row represents a month's flight history details. The `passengers`

column represents that total number of passengers that flew that month.

```
df_flights.head()
```

year | month | passengers | |
---|---|---|---|

0 | 1949 | January | 112 |

1 | 1949 | February | 118 |

2 | 1949 | March | 132 |

3 | 1949 | April | 129 |

4 | 1949 | May | 121 |

#### Implement Pivot Tables

I want to know the sum of passengers that flew on planes for each year. So, from pandas, we'll call the `pivot_table()`

method and set the following arguments:

`data`

to be our DataFrame`df_flights`

`index`

to be`'year'`

since that's the column from`df_flights`

that we want to appear as a unique value in each row`values`

as`'passengers'`

since that's the column we want to apply some aggregate operation on`aggfunc`

to`'sum'`

since we want to sum (aka total) up all values in`passengers`

that belong to a unique year

```
pd.pivot_table(data=df_flights, index='year', values='passengers', aggfunc='sum')
```

```
year
1949 1520
1950 1676
1951 2042
1952 2364
1953 2700
1954 2867
1955 3408
1956 3939
1957 4421
1958 4572
1959 5140
1960 5714
Name: passengers, dtype: int64
```

We can see above that every year, the total number of passengers that flew increased each year. However, this pivot table is poorly documented. There's no column name or label for the values on the right. To fix this, we'll cast this outputted series to a DataFrame and rename the aggregated column to be clearer.

Use the `reset_index()`

method to cast our series to a DataFrame with a new index and then chain the `rename()`

method to rename the `'passengers'`

column to `'total_passengers'`

.

```
pd.pivot_table(data=df_flights, index='year', values='passengers', aggfunc='sum').reset_index().rename(columns={'passengers': 'total_passengers'})
```

year | total_passengers | |
---|---|---|

0 | 1949 | 1520 |

1 | 1950 | 1676 |

2 | 1951 | 2042 |

3 | 1952 | 2364 |

4 | 1953 | 2700 |

5 | 1954 | 2867 |

6 | 1955 | 3408 |

7 | 1956 | 3939 |

8 | 1957 | 4421 |

9 | 1958 | 4572 |

10 | 1959 | 5140 |

11 | 1960 | 5714 |

Now, I want to know the sum of passengers that flew per month in the dataset. So, from pandas, we'll call the the `pivot_table()`

method and include all of the same arguments above, except we'll set the `index`

to be `'month'`

since that's the column from `df_flights`

that we want to appear as a unique value in each row.

```
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='sum').reset_index().rename(columns={'passengers': 'total_passengers'})
```

month | total_passengers | |
---|---|---|

0 | January | 2901 |

1 | February | 2820 |

2 | March | 3242 |

3 | April | 3205 |

4 | May | 3262 |

5 | June | 3740 |

6 | July | 4216 |

7 | August | 4213 |

8 | September | 3629 |

9 | October | 3199 |

10 | November | 2794 |

11 | December | 3142 |

Our results indicate most people flew in the summer months of July and August.

Now, I want to know the average number of passengers that flew per month in the dataset. So, from pandas, we'll call the the `pivot_table()`

method and include all of the same arguments from the previous operation, except we'll set the `aggfunc`

to `'mean'`

since we want to find the mean (aka average) number of passengers that flew in each unique month.

I also rounded the `average_number_of_passengers`

column to 1 decimal place so it's easier to read.

```
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='mean').reset_index().rename(columns={'passengers': 'average_number_of_passengers'}).round(1)
```

month | average_number_of_passengers | |
---|---|---|

0 | January | 241.8 |

1 | February | 235.0 |

2 | March | 270.2 |

3 | April | 267.1 |

4 | May | 271.8 |

5 | June | 311.7 |

6 | July | 351.3 |

7 | August | 351.1 |

8 | September | 302.4 |

9 | October | 266.6 |

10 | November | 232.8 |

11 | December | 261.8 |

Now, I want to know the maximum number of passengers that flew per month in the dataset. So, from pandas, we'll call the the `pivot_table()`

method and include all of the same arguments from the previous operation, except we'll set the `aggfunc`

to `'max'`

since we want to find the maximum (aka largest) number of passengers that flew in each unique month.

```
pd.pivot_table(data=df_flights, index='month', values='passengers', aggfunc='max').reset_index().rename(columns={'passengers': 'max_number_of_passengers'})
```

month | max_number_of_passengers | |
---|---|---|

0 | January | 417 |

1 | February | 391 |

2 | March | 419 |

3 | April | 461 |

4 | May | 472 |

5 | June | 535 |

6 | July | 622 |

7 | August | 606 |

8 | September | 508 |

9 | October | 461 |

10 | November | 390 |

11 | December | 432 |

### Example 2: Pivot Tables with Tips Dataset

#### Get Data

Let's get the `tips`

dataset included in the `seaborn`

library and assign it to the DataFrame `df_tips`

.

```
df_tips = sns.load_dataset('tips')
```

Each row represents a unique meal for a party of people at a restaurant; the dataset includes the following fields:

column name | column description |
---|---|

`total_bill` |
financial amount of meal in U.S. dollars |

`tip` |
financial amount of the meal's tip in U.S. dollars |

`sex` |
gender of server |

`smoker` |
boolean to represent if server smokes or not |

`day` |
day of week |

`time` |
meal name (Lunch or Dinner) |

`size` |
count of people eating meal |

Preview the first 5 rows of `df_tips`

.

```
df_tips.head()
```

total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|

0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |

1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |

2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |

3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |

4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |

#### Implement Pivot Tables

For each day and meal type, I'm curious to find the median bill amount. So, from pandas, we'll call the `pivot_table()`

method and set the following arguments:

`data`

to be our DataFrame`df_tips`

`index`

to be`['day', 'time']`

since we want to aggregate by both of those columns so each row represents a unique type of meal for a day`values`

as`'total_bill'`

since that's the column we want to apply some aggregate operation on`aggfunc`

to`'median'`

since we want to find the median value in our`total_bill`

column for each day's meal type in our dataset

```
pd.pivot_table(data=df_tips, index=['day', 'time'], values='total_bill', aggfunc='median').reset_index().rename(columns={'total_bill': 'median_total_bill'})
```

day | time | median_total_bill | |
---|---|---|---|

0 | Thur | Lunch | 16.000 |

1 | Thur | Dinner | 18.780 |

2 | Fri | Lunch | 13.420 |

3 | Fri | Dinner | 18.665 |

4 | Sat | Lunch | NaN |

5 | Sat | Dinner | 18.240 |

6 | Sun | Lunch | NaN |

7 | Sun | Dinner | 19.630 |

Now, I'm curious for each day's meal type, what was the maximum tip value and median total bill. We can do this in one pivot table operation too! From pandas, we'll call the `pivot_table()`

method and set the following arguments:

`data`

to be our DataFrame`df_tips`

`index`

to be`['day', 'time']`

since we want to aggregate by both of those columns so each row represents a unique type of meal for a day`values`

as`['total_bill', 'tip']`

since we want to perform a specific aggregate operation on each of those columns`aggfunc`

to`{'total_bill': 'median', 'tip': 'max'}`

since we want to find the median value for the`total_bill`

column and the maximum value for the`tip`

column

```
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc={'total_bill': 'median', 'tip': 'max'}).reset_index().rename(columns={'tip': 'max_tip', 'total_bill': 'median_total_bill'})
```

day | time | max_tip | median_total_bill | |
---|---|---|---|---|

0 | Thur | Lunch | 6.70 | 16.000 |

1 | Thur | Dinner | 3.00 | 18.780 |

2 | Fri | Lunch | 3.48 | 13.420 |

3 | Fri | Dinner | 4.73 | 18.665 |

4 | Sat | Dinner | 10.00 | 18.240 |

5 | Sun | Dinner | 6.50 | 19.630 |

Now, I'm curious for each day's meal type, what was the total amount of tips and total bills. I'd also like a new row for the sum of the `total_tip`

and `total_bill`

columns. To implement this, from pandas, we'll call the `pivot_table()`

method and set the following arguments:

`data`

to be our DataFrame`df_tips`

`index`

to be`['day', 'time']`

since we want to aggregate by both of those columns so each row represents a unique type of meal for a day`values`

as`['total_bill', 'tip']`

since we want to perform a specific aggregate operation on each of those columns`aggfunc`

to`'sum'`

since we want to find the sum (aka total) for each column passed to the`values`

argument`margins`

to`True`

to add up the numerical columns

```
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True).reset_index().rename(columns={'tip': 'total_tip', 'total_bill': 'total_bill'})
```

day | time | total_tip | total_bill | |
---|---|---|---|---|

0 | Thur | Lunch | 168.83 | 1077.55 |

1 | Thur | Dinner | 3.00 | 18.78 |

2 | Fri | Lunch | 16.68 | 89.92 |

3 | Fri | Dinner | 35.28 | 235.96 |

4 | Sat | Lunch | NaN | NaN |

5 | Sat | Dinner | 260.40 | 1778.40 |

6 | Sun | Lunch | NaN | NaN |

7 | Sun | Dinner | 247.39 | 1627.16 |

8 | All | 731.58 | 4827.77 |

Instead of the final row saying `All`

, I'd prefer it say `total`

. We can perform this change in our `pivot_table()`

method. Utilize all the same code as above and we'll add a new argument called `margins_name`

and set it to `'total'`

.

```
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True, margins_name='total').reset_index().rename(columns={'tip': 'total_tip', 'total_bill': 'total_bill'})
```

day | time | total_tip | total_bill | |
---|---|---|---|---|

0 | Thur | Lunch | 168.83 | 1077.55 |

1 | Thur | Dinner | 3.00 | 18.78 |

2 | Fri | Lunch | 16.68 | 89.92 |

3 | Fri | Dinner | 35.28 | 235.96 |

4 | Sat | Lunch | NaN | NaN |

5 | Sat | Dinner | 260.40 | 1778.40 |

6 | Sun | Lunch | NaN | NaN |

7 | Sun | Dinner | 247.39 | 1627.16 |

8 | total | 731.58 | 4827.77 |

I don't like how in the output above, we see `NaN`

values in our columns. Those essentially mean there were no numerical values to perform in the aggregate operations so we're left with a value of `NaN`

equivalent to not a number. We can fill in those `NaN`

values with `0`

.

We can perform this change in our `pivot_table()`

method. Utilize all the same code as above and we'll add a new argument called `fill_value`

and set it to `0`

.

```
pd.pivot_table(data=df_tips, index=['day', 'time'], values=['total_bill', 'tip'], aggfunc='sum', margins=True, margins_name='total', fill_value=0).reset_index().rename(columns={'tip': 'total_tip', 'total_bill': 'total_bill'})
```

day | time | total_tip | total_bill | |
---|---|---|---|---|

0 | Thur | Lunch | 168.83 | 1077.55 |

1 | Thur | Dinner | 3.00 | 18.78 |

2 | Fri | Lunch | 16.68 | 89.92 |

3 | Fri | Dinner | 35.28 | 235.96 |

4 | Sat | Lunch | 0.00 | 0.00 |

5 | Sat | Dinner | 260.40 | 1778.40 |

6 | Sun | Lunch | 0.00 | 0.00 |

7 | Sun | Dinner | 247.39 | 1627.16 |

8 | total | 731.58 | 4827.77 |

### Aggregate Operations

Other aggregate methods you could perform with a `groupby()`

method in pandas are:

method | description |
---|---|

`sum()` |
summation |

`mean()` |
average |

`count()` |
count of all values |

`size()` |
count of non-null values |

`max()` |
maximum value |

`min()` |
minimum value |

`std()` |
standard deviation |

`median()` |
median |

```
```