Data Analysis Data Wrangling Tutorial

crosstabs() Method: Compute Aggregated Metrics Across Categorical Columns

A crosstab computes aggregated metrics among two or more columns in a dataset that contains categorical values.

Import Modules

import pandas as pd
import seaborn as sns

Get Tips Dataset

Let's get the tips dataset from the seaborn library and assign it to the DataFrame df_tips.

df_tips = sns.load_dataset('tips')

Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:

column name column description
total_bill financial amount of meal in U.S. dollars
tip financial amount of the meal's tip in U.S. dollars
sex gender of server
smoker boolean to represent if server smokes or not
day day of week
time meal name (Lunch or Dinner)
size count of people eating meal

Preview the first 5 rows of df_tips.

df_tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Implement Crosstabs with Tips Dataset

Grouping By a Single Field for the Index and Column

Let's compute a simple crosstab across the day and sex column. We want our returned index to be the unique values from day and our returned columns to be the unique values from sex. By default in pandas, the crosstab() computes an aggregated metric of a count (aka frequency).

So, each of the values inside our table represent a count across the index and column. For example, males served 30 unique groups across all Thursdays in our dataset.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
sex Male Female
day
Thur 30 32
Fri 10 9
Sat 59 28
Sun 58 18

One issue with this crosstab output is the column names are nonsensical. Just saying Male or Female isn't very specific. They should be renamed to be clearer. We can use the rename() method and set the argument columns to be a dictionary in which the keys are the current column names and the values are the respective new names to set.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
sex count_meals_served_by_males count_meals_served_by_females
day
Thur 30 32
Fri 10 9
Sat 59 28
Sun 58 18

Also, in the output above, see where it says sex as the column name and day for the row name? We can modify those names using arguments in the crosstab() method. Let's set the colnames argument to gender since that's a more specific name than sex.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
gender count_meals_served_by_males count_meals_served_by_females
day
Thur 30 32
Fri 10 9
Sat 59 28
Sun 58 18

In this example, we passed in two columns from our DataFrame. However, one nice feature of crosstab() is that you don't need the data to be in a DataFrame. For the index and columns arguments, you can pass in two numpy arrays.

Let's double check the logic from above makes sense. Let's use filtering in pandas to verify that there were 30 meals served by a male on Thursday. Our query below matches the 30 number we see above.

len(df_tips.query("sex=='Male' and day=='Thur'"))
30

Alternatively, given the crosstab output above, you can present it in a different format that may be easier for further analysis. I won't dive into details of this operation, but in addition to the code above, you can chain the unstack() method and then the reset_index() method to pivot the DataFrame so each row is a unique combination of a value from sex and day with the appropriate count of meals served.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"})
gender day count_meals_served
0 Male Thur 30
1 Male Fri 10
2 Male Sat 59
3 Male Sun 58
4 Female Thur 32
5 Female Fri 9
6 Female Sat 28
7 Female Sun 18

For each row and column of this previous crosstab, we can modify an argument to get the totals. Set the argument margins to True to get these totals. By default, the returned output will have a column and row name of All.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
gender count_meals_served_by_males count_meals_served_by_females All
day
Thur 30 32 62
Fri 10 9 19
Sat 59 28 87
Sun 58 18 76
All 157 87 244

In the crosstab() method, we can also rename the All column. First, use all the same arguments from above. Then, set the argument margins_name to count_meals_served .

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="count_meals_served").rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
gender count_meals_served_by_males count_meals_served_by_females count_meals_served
day
Thur 30 32 62
Fri 10 9 19
Sat 59 28 87
Sun 58 18 76
count_meals_served 157 87 244

For each cell value, we can calculate what percentage it is of the row's total. To do that, set the normalize argument to 'index' (since index applies to each row).

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="proportion_meals_served", normalize='index').rename(columns={"Male": "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})
gender proportion_meals_served_by_males proportion_meals_served_by_females
day
Thur 0.483871 0.516129
Fri 0.526316 0.473684
Sat 0.678161 0.321839
Sun 0.763158 0.236842
proportion_meals_served 0.643443 0.356557

For each cell value, we can also calculate what percentage it is of the column's total. To do that, set the normalize argument to 'columns'.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="proportion_meals_served", normalize='columns').rename(columns={"Male": "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})
gender proportion_meals_served_by_males proportion_meals_served_by_females proportion_meals_served
day
Thur 0.191083 0.367816 0.254098
Fri 0.063694 0.103448 0.077869
Sat 0.375796 0.321839 0.356557
Sun 0.369427 0.206897 0.311475

Given two categorical columns, the crosstab() method can additionally utilize a column with numerical values to perform an aggregate operation. That sentence may sound daunting - so let's walk through it with a simple example.

We know there exists total_bill values in our datasets for males that served meals on Thursday. Below, I preview the first few total_bill values that meet this criteria.

df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].head()
77    27.20
78    22.76
79    17.29
80    19.44
81    16.66
Name: total_bill, dtype: float64

We may want to know the average bill size that meet the criteria above. So, given that series, we can calculate the mean and we arrive at a result of 18.71.

df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].mean()
18.714666666666666

Now, we can perform this same operation using the crosstab() method. Same as before, we want our returned index to be the unique values from day and our returned columns to be the unique values from sex. Additionally, we want the values inside the table to be from our total_bill column so we'll set the argument values to be df_tips['total_bill']. We also want to calculate the mean total bill for each combination of a day and gender so we'll set the aggfunc argument to 'mean'.

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], values=df_tips['total_bill'], colnames=['gender'], aggfunc='mean').rename(columns={"Male": "mean_bill_size_meals_served_by_males", "Female": "mean_bill_size_meals_served_by_females"})
gender mean_bill_size_meals_served_by_males mean_bill_size_meals_served_by_females
day
Thur 18.714667 16.715312
Fri 19.857000 14.145556
Sat 20.802542 19.680357
Sun 21.887241 19.872222

This crosstab calculation outputted the same 18.71 value as expected!

We can pass in many other aggregate methods to the aggfunc method too such as mean and standard deviation.

You can learn more about details of using crosstab() from the official pandas documentation page.

Grouping By Multiple Fields for the Index and/or Columns

We can also use the crosstabs() method to group by multiple pandas columns for the index or columns arguments.

For example, we can find out for days of the week, for each gender, what was the count of meals they served for lunch or dinner. In pandas We want our returned index to be the unique values from day and our returned columns to be the unique values from all combinations of the sex (gender) and time (meal time) columns.

To interpret a value from our table below, across all Thursdays in our dataset, females served 31 lunch meals.

pd.crosstab(index=df_tips['day'], columns=[df_tips['sex'], df_tips['time']], colnames=['gender', 'meal']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"})
gender Male Female
meal count_lunch_meals_served count_dinner_meals_served count_lunch_meals_served count_dinner_meals_served
day
Thur 30 0 31 1
Fri 3 7 4 5
Sat 0 59 0 28
Sun 0 58 0 18

We can verify the logic in a value with a separate query. I'm curious to see the count of meals on Thursday served by females during lunch time. Our result on this query is 31 which matches the value in the crosstab above!

len(df_tips.query("day=='Thur' and sex=='Female' and time=='Lunch'"))
31

If you're familiar with the groupby() method in pandas, the code below performs a similar operation to the crosstab above, but the output format is a little different.

df_tips.groupby(['day', 'sex'])['time'].count()
day   sex   
Thur  Male      30
      Female    32
Fri   Male      10
      Female     9
Sat   Male      59
      Female    28
Sun   Male      58
      Female    18
Name: time, dtype: int64

We can also group by multiple pandas columns in the index argument of the crosstabs() method.

Below is a fairly complex operation. The code below helps us find out for every combination of day of the week, gender and meal type what was the count of meals served for the various group sizes (ex - party of two people versus party of three).

In our output, the index includes unique values from the day and sex fields while columns include unique values from the time and size fields.

To interpret a value from our table below, across all Thursdays in our dataset, males 24 lunch meals that contained a group of 2 people.

pd.crosstab(index=[df_tips['day'], df_tips['sex']], columns=[df_tips['time'], df_tips['size']], colnames=['meal', 'party_people_size']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"})
meal count_lunch_meals_served count_dinner_meals_served
party_people_size 1 2 3 4 5 6 1 2 3 4 5 6
day sex
Thur Male 0 24 2 2 1 1 0 0 0 0 0 0
Female 1 23 2 3 0 2 0 1 0 0 0 0
Fri Male 1 2 0 0 0 0 0 6 0 1 0 0
Female 0 3 1 0 0 0 0 5 0 0 0 0
Sat Male 0 0 0 0 0 0 0 34 13 11 1 0
Female 0 0 0 0 0 0 2 19 5 2 0 0
Sun Male 0 0 0 0 0 0 0 32 9 14 2 1
Female 0 0 0 0 0 0 0 7 6 4 1 0

We can verify 24 lunch meals for a party of 2 were served by males on Thursdays with the query below too.

len(df_tips.query("day=='Thur' and sex=='Male' and time=='Lunch' and size==2"))
24

Style Output

While the above output is daunting to read with all the cells, there's a cool trick in Seaborn to make your DataFrame output appear as a heatmap. I added some code based off the example from the pandas documention for built-in styles. Now, we can more easily identify the large values as being dark orange colors in our visualization below.

orange = sns.light_palette("orange", as_cmap=True)
pd.crosstab(index=[df_tips['day'], df_tips['sex']], columns=[df_tips['time'], df_tips['size']], colnames=['meal', 'party_people_size']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"}).style.background_gradient(cmap=orange)

meal count_lunch_meals_served count_dinner_meals_served
party_people_size 1 2 3 4 5 6 1 2 3 4 5 6
day sex
Thur Male 0 24 2 2 1 1 0 0 0 0 0 0
Female 1 23 2 3 0 2 0 1 0 0 0 0
Fri Male 1 2 0 0 0 0 0 6 0 1 0 0
Female 0 3 1 0 0 0 0 5 0 0 0 0
Sat Male 0 0 0 0 0 0 0 34 13 11 1 0
Female 0 0 0 0 0 0 2 19 5 2 0 0
Sun Male 0 0 0 0 0 0 0 32 9 14 2 1
Female 0 0 0 0 0 0 0 7 6 4 1 0