crosstabs() Method: Compute Aggregated Metrics Across Categorical Columns¶
Date published: 2018-11-23
Category: Data Analysis
Subcategory: Data Wrangling
Tags: crosstabs, python, pandas
A crosstab computes aggregated metrics among two or more columns in a dataset that contains categorical values.
Import Modules¶
import pandas as pd
import seaborn as sns
Get Tips Dataset¶
Let's get the tips
dataset from the seaborn
library and assign it to the DataFrame df_tips
.
df_tips = sns.load_dataset('tips')
Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:
column name | column description |
---|---|
total_bill |
financial amount of meal in U.S. dollars |
tip |
financial amount of the meal's tip in U.S. dollars |
sex |
gender of server |
smoker |
boolean to represent if server smokes or not |
day |
day of week |
time |
meal name (Lunch or Dinner) |
size |
count of people eating meal |
Preview the first 5 rows of df_tips
.
df_tips.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Implement Crosstabs with Tips Dataset¶
Grouping By a Single Field for the Index and Column¶
Let's compute a simple crosstab across the day
and sex
column. We want our returned index to be the unique values from day
and our returned columns to be the unique values from sex
. By default in pandas, the crosstab()
computes an aggregated metric of a count (aka frequency).
So, each of the values inside our table represent a count across the index and column. For example, males served 30 unique groups across all Thursdays in our dataset.
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
sex | Male | Female |
---|---|---|
day | ||
Thur | 30 | 32 |
Fri | 10 | 9 |
Sat | 59 | 28 |
Sun | 58 | 18 |
One issue with this crosstab output is the column names are nonsensical. Just saying Male
or Female
isn't very specific. They should be renamed to be clearer. We can use the rename() method and set the argument columns
to be a dictionary in which the keys are the current column names and the values are the respective new names to set.
pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
sex | count_meals_served_by_males | count_meals_served_by_females |
---|---|---|
day | ||
Thur | 30 | 32 |
Fri | 10 | 9 |
Sat | 59 | 28 |
Sun | 58 | 18 |
Also, in the output above, see where it says sex
as the column name and day
for the row name? We can modify those names using arguments in the crosstab()
method. Let's set the colnames
argument to gender
since that's a more specific name than sex
.
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
gender | count_meals_served_by_males | count_meals_served_by_females |
---|---|---|
day | ||
Thur | 30 | 32 |
Fri | 10 | 9 |
Sat | 59 | 28 |
Sun | 58 | 18 |
In this example, we passed in two columns from our DataFrame. However, one nice feature of crosstab()
is that you don't need the data to be in a DataFrame. For the index
and columns
arguments, you can pass in two numpy arrays.
Let's double check the logic from above makes sense. Let's use filtering in pandas to verify that there were 30 meals served by a male on Thursday. Our query below matches the 30 number we see above.
len(df_tips.query("sex=='Male' and day=='Thur'"))
30
Alternatively, given the crosstab output above, you can present it in a different format that may be easier for further analysis. I won't dive into details of this operation, but in addition to the code above, you can chain the unstack()
method and then the reset_index()
method to pivot the DataFrame so each row is a unique combination of a value from sex
and day
with the appropriate count of meals served.
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"})
gender | day | count_meals_served | |
---|---|---|---|
0 | Male | Thur | 30 |
1 | Male | Fri | 10 |
2 | Male | Sat | 59 |
3 | Male | Sun | 58 |
4 | Female | Thur | 32 |
5 | Female | Fri | 9 |
6 | Female | Sat | 28 |
7 | Female | Sun | 18 |
For each row and column of this previous crosstab, we can modify an argument to get the totals. Set the argument margins
to True
to get these totals. By default, the returned output will have a column and row name of All
.
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
gender | count_meals_served_by_males | count_meals_served_by_females | All |
---|---|---|---|
day | |||
Thur | 30 | 32 | 62 |
Fri | 10 | 9 | 19 |
Sat | 59 | 28 | 87 |
Sun | 58 | 18 | 76 |
All | 157 | 87 | 244 |
In the crosstab()
method, we can also rename the All
column. First, use all the same arguments from above. Then, set the argument margins_name
to count_meals_served
.
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="count_meals_served").rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
gender | count_meals_served_by_males | count_meals_served_by_females | count_meals_served |
---|---|---|---|
day | |||
Thur | 30 | 32 | 62 |
Fri | 10 | 9 | 19 |
Sat | 59 | 28 | 87 |
Sun | 58 | 18 | 76 |
count_meals_served | 157 | 87 | 244 |
For each cell value, we can calculate what percentage it is of the row's total. To do that, set the normalize
argument to 'index'
(since index applies to each row).
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="proportion_meals_served", normalize='index').rename(columns={"Male": "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})
gender | proportion_meals_served_by_males | proportion_meals_served_by_females |
---|---|---|
day | ||
Thur | 0.483871 | 0.516129 |
Fri | 0.526316 | 0.473684 |
Sat | 0.678161 | 0.321839 |
Sun | 0.763158 | 0.236842 |
proportion_meals_served | 0.643443 | 0.356557 |
For each cell value, we can also calculate what percentage it is of the column's total. To do that, set the normalize
argument to 'columns'
.
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="proportion_meals_served", normalize='columns').rename(columns={"Male": "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})
gender | proportion_meals_served_by_males | proportion_meals_served_by_females | proportion_meals_served |
---|---|---|---|
day | |||
Thur | 0.191083 | 0.367816 | 0.254098 |
Fri | 0.063694 | 0.103448 | 0.077869 |
Sat | 0.375796 | 0.321839 | 0.356557 |
Sun | 0.369427 | 0.206897 | 0.311475 |
Given two categorical columns, the crosstab()
method can additionally utilize a column with numerical values to perform an aggregate operation. That sentence may sound daunting - so let's walk through it with a simple example.
We know there exists total_bill
values in our datasets for males that served meals on Thursday. Below, I preview the first few total_bill
values that meet this criteria.
df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].head()
77 27.20 78 22.76 79 17.29 80 19.44 81 16.66 Name: total_bill, dtype: float64
We may want to know the average bill size that meet the criteria above. So, given that series, we can calculate the mean and we arrive at a result of 18.71.
df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].mean()
18.714666666666666
Now, we can perform this same operation using the crosstab()
method. Same as before, we want our returned index to be the unique values from day
and our returned columns to be the unique values from sex
. Additionally, we want the values inside the table to be from our total_bill
column so we'll set the argument values
to be df_tips['total_bill']
. We also want to calculate the mean total bill for each combination of a day and gender so we'll set the aggfunc
argument to 'mean'
.
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], values=df_tips['total_bill'], colnames=['gender'], aggfunc='mean').rename(columns={"Male": "mean_bill_size_meals_served_by_males", "Female": "mean_bill_size_meals_served_by_females"})
gender | mean_bill_size_meals_served_by_males | mean_bill_size_meals_served_by_females |
---|---|---|
day | ||
Thur | 18.714667 | 16.715312 |
Fri | 19.857000 | 14.145556 |
Sat | 20.802542 | 19.680357 |
Sun | 21.887241 | 19.872222 |
This crosstab calculation outputted the same 18.71 value as expected!
We can pass in many other aggregate methods to the aggfunc
method too such as mean and standard deviation.
You can learn more about details of using crosstab()
from the official pandas documentation page.
Grouping By Multiple Fields for the Index and/or Columns¶
We can also use the crosstabs()
method to group by multiple pandas columns for the index
or columns
arguments.
For example, we can find out for days of the week, for each gender, what was the count of meals they served for lunch or dinner. In pandas We want our returned index to be the unique values from day
and our returned columns to be the unique values from all combinations of the sex
(gender) and time
(meal time) columns.
To interpret a value from our table below, across all Thursdays in our dataset, females served 31 lunch meals.
pd.crosstab(index=df_tips['day'], columns=[df_tips['sex'], df_tips['time']], colnames=['gender', 'meal']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"})
gender | Male | Female | ||
---|---|---|---|---|
meal | count_lunch_meals_served | count_dinner_meals_served | count_lunch_meals_served | count_dinner_meals_served |
day | ||||
Thur | 30 | 0 | 31 | 1 |
Fri | 3 | 7 | 4 | 5 |
Sat | 0 | 59 | 0 | 28 |
Sun | 0 | 58 | 0 | 18 |
We can verify the logic in a value with a separate query. I'm curious to see the count of meals on Thursday served by females during lunch time. Our result on this query is 31 which matches the value in the crosstab above!
len(df_tips.query("day=='Thur' and sex=='Female' and time=='Lunch'"))
31
If you're familiar with the groupby()
method in pandas, the code below performs a similar operation to the crosstab above, but the output format is a little different.
df_tips.groupby(['day', 'sex'])['time'].count()
day sex Thur Male 30 Female 32 Fri Male 10 Female 9 Sat Male 59 Female 28 Sun Male 58 Female 18 Name: time, dtype: int64
We can also group by multiple pandas columns in the index
argument of the crosstabs()
method.
Below is a fairly complex operation. The code below helps us find out for every combination of day of the week, gender and meal type what was the count of meals served for the various group sizes (ex - party of two people versus party of three).
In our output, the index includes unique values from the day
and sex
fields while columns include unique values from the time
and size
fields.
To interpret a value from our table below, across all Thursdays in our dataset, males 24 lunch meals that contained a group of 2 people.
pd.crosstab(index=[df_tips['day'], df_tips['sex']], columns=[df_tips['time'], df_tips['size']], colnames=['meal', 'party_people_size']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"})
meal | count_lunch_meals_served | count_dinner_meals_served | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
party_people_size | 1 | 2 | 3 | 4 | 5 | 6 | 1 | 2 | 3 | 4 | 5 | 6 | |
day | sex | ||||||||||||
Thur | Male | 0 | 24 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Female | 1 | 23 | 2 | 3 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | |
Fri | Male | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 1 | 0 | 0 |
Female | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | |
Sat | Male | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 34 | 13 | 11 | 1 | 0 |
Female | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 19 | 5 | 2 | 0 | 0 | |
Sun | Male | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 32 | 9 | 14 | 2 | 1 |
Female | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 6 | 4 | 1 | 0 |
We can verify 24 lunch meals for a party of 2 were served by males on Thursdays with the query below too.
len(df_tips.query("day=='Thur' and sex=='Male' and time=='Lunch' and size==2"))
24
Style Output¶
While the above output is daunting to read with all the cells, there's a cool trick in Seaborn to make your DataFrame output appear as a heatmap. I added some code based off the example from the pandas documention for built-in styles. Now, we can more easily identify the large values as being dark orange colors in our visualization below.
orange = sns.light_palette("orange", as_cmap=True)
pd.crosstab(index=[df_tips['day'], df_tips['sex']], columns=[df_tips['time'], df_tips['size']], colnames=['meal', 'party_people_size']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"}).style.background_gradient(cmap=orange)
meal | count_lunch_meals_served | count_dinner_meals_served | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
party_people_size | 1 | 2 | 3 | 4 | 5 | 6 | 1 | 2 | 3 | 4 | 5 | 6 | |
day | sex | ||||||||||||
Thur | Male | 0 | 24 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Female | 1 | 23 | 2 | 3 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | |
Fri | Male | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 1 | 0 | 0 |
Female | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | |
Sat | Male | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 34 | 13 | 11 | 1 | 0 |
Female | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 19 | 5 | 2 | 0 | 0 | |
Sun | Male | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 32 | 9 | 14 | 2 | 1 |
Female | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 6 | 4 | 1 | 0 |