crosstabs() Method: Compute Aggregated Metrics Across Categorical Columns¶

Date published: 2018-11-23

Category: Data Analysis

Subcategory: Data Wrangling

Tags: crosstabs, python, pandas

A crosstab computes aggregated metrics among two or more columns in a dataset that contains categorical values.

Import Modules¶

In [122]:

                
                    Copied!
                    
import pandas as pd
import seaborn as sns
import pandas as pd
import seaborn as sns

Get Tips Dataset¶

Let's get the tips dataset from the seaborn library and assign it to the DataFrame df_tips.

In [123]:

                
                    Copied!
                    
df_tips = sns.load_dataset('tips')
df_tips = sns.load_dataset('tips')

Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:

column name	column description
`total_bill`	financial amount of meal in U.S. dollars
`tip`	financial amount of the meal's tip in U.S. dollars
`sex`	gender of server
`smoker`	boolean to represent if server smokes or not
`day`	day of week
`time`	meal name (Lunch or Dinner)
`size`	count of people eating meal

Preview the first 5 rows of df_tips.

In [124]:

                
                    Copied!
                    
df_tips.head()
df_tips.head()

Out[124]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

Implement Crosstabs with Tips Dataset¶

Grouping By a Single Field for the Index and Column¶

Let's compute a simple crosstab across the day and sex column. We want our returned index to be the unique values from day and our returned columns to be the unique values from sex. By default in pandas, the crosstab() computes an aggregated metric of a count (aka frequency).

So, each of the values inside our table represent a count across the index and column. For example, males served 30 unique groups across all Thursdays in our dataset.

In [125]:

                
                    Copied!
                    
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

Out[125]:

sex	Male	Female
day
Thur	30	32
Fri	10	9
Sat	59	28
Sun	58	18

One issue with this crosstab output is the column names are nonsensical. Just saying Male or Female isn't very specific. They should be renamed to be clearer. We can use the rename() method and set the argument columns to be a dictionary in which the keys are the current column names and the values are the respective new names to set.

In [126]:

                
                    Copied!
                    
pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[126]:

sex	count_meals_served_by_males	count_meals_served_by_females
day
Thur	30	32
Fri	10	9
Sat	59	28
Sun	58	18

Also, in the output above, see where it says sex as the column name and day for the row name? We can modify those names using arguments in the crosstab() method. Let's set the colnames argument to gender since that's a more specific name than sex.

In [127]:

                
                    Copied!
                    
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[127]:

gender	count_meals_served_by_males	count_meals_served_by_females
day
Thur	30	32
Fri	10	9
Sat	59	28
Sun	58	18

In this example, we passed in two columns from our DataFrame. However, one nice feature of crosstab() is that you don't need the data to be in a DataFrame. For the index and columns arguments, you can pass in two numpy arrays.

Let's double check the logic from above makes sense. Let's use filtering in pandas to verify that there were 30 meals served by a male on Thursday. Our query below matches the 30 number we see above.

In [128]:

                
                    Copied!
                    
len(df_tips.query("sex=='Male' and day=='Thur'"))
len(df_tips.query("sex=='Male' and day=='Thur'"))

Out[128]:

Alternatively, given the crosstab output above, you can present it in a different format that may be easier for further analysis. I won't dive into details of this operation, but in addition to the code above, you can chain the unstack() method and then the reset_index() method to pivot the DataFrame so each row is a unique combination of a value from sex and day with the appropriate count of meals served.

In [129]:

                
                    Copied!
                    
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"})
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"})

Out[129]:

	gender	day	count_meals_served
0	Male	Thur	30
1	Male	Fri	10
2	Male	Sat	59
3	Male	Sun	58
4	Female	Thur	32
5	Female	Fri	9
6	Female	Sat	28
7	Female	Sun	18

For each row and column of this previous crosstab, we can modify an argument to get the totals. Set the argument margins to True to get these totals. By default, the returned output will have a column and row name of All.

In [130]:

                
                    Copied!
                    
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[130]:

gender	count_meals_served_by_males	count_meals_served_by_females	All
day
Thur	30	32	62
Fri	10	9	19
Sat	59	28	87
Sun	58	18	76
All	157	87	244

In the crosstab() method, we can also rename the All column. First, use all the same arguments from above. Then, set the argument margins_name to count_meals_served .

In [131]:

                
                    Copied!
                    
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="count_meals_served").rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="count_meals_served").rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[131]:

gender	count_meals_served_by_males	count_meals_served_by_females	count_meals_served
day
Thur	30	32	62
Fri	10	9	19
Sat	59	28	87
Sun	58	18	76
count_meals_served	157	87	244

For each cell value, we can calculate what percentage it is of the row's total. To do that, set the normalize argument to 'index' (since index applies to each row).

In [132]:

                
                    Copied!
                    
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="proportion_meals_served", normalize='index').rename(columns={"Male": "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="proportion_meals_served", normalize='index').rename(columns={"Male": "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})

Out[132]:

gender	proportion_meals_served_by_males	proportion_meals_served_by_females
day
Thur	0.483871	0.516129
Fri	0.526316	0.473684
Sat	0.678161	0.321839
Sun	0.763158	0.236842
proportion_meals_served	0.643443	0.356557

For each cell value, we can also calculate what percentage it is of the column's total. To do that, set the normalize argument to 'columns'.

In [133]:

                
                    Copied!
                    
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="proportion_meals_served", normalize='columns').rename(columns={"Male": "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True, margins_name="proportion_meals_served", normalize='columns').rename(columns={"Male": "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})

Out[133]:

gender	proportion_meals_served_by_males	proportion_meals_served_by_females	proportion_meals_served
day
Thur	0.191083	0.367816	0.254098
Fri	0.063694	0.103448	0.077869
Sat	0.375796	0.321839	0.356557
Sun	0.369427	0.206897	0.311475

Given two categorical columns, the crosstab() method can additionally utilize a column with numerical values to perform an aggregate operation. That sentence may sound daunting - so let's walk through it with a simple example.

We know there exists total_bill values in our datasets for males that served meals on Thursday. Below, I preview the first few total_bill values that meet this criteria.

In [134]:

                
                    Copied!
                    
df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].head()
df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].head()

Out[134]:

77    27.20
78    22.76
79    17.29
80    19.44
81    16.66
Name: total_bill, dtype: float64

We may want to know the average bill size that meet the criteria above. So, given that series, we can calculate the mean and we arrive at a result of 18.71.

In [135]:

                
                    Copied!
                    
df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].mean()
df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].mean()

Out[135]:

18.714666666666666

Now, we can perform this same operation using the crosstab() method. Same as before, we want our returned index to be the unique values from day and our returned columns to be the unique values from sex. Additionally, we want the values inside the table to be from our total_bill column so we'll set the argument values to be df_tips['total_bill']. We also want to calculate the mean total bill for each combination of a day and gender so we'll set the aggfunc argument to 'mean'.

In [136]:

                
                    Copied!
                    
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], values=df_tips['total_bill'], colnames=['gender'], aggfunc='mean').rename(columns={"Male": "mean_bill_size_meals_served_by_males", "Female": "mean_bill_size_meals_served_by_females"})
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], values=df_tips['total_bill'], colnames=['gender'], aggfunc='mean').rename(columns={"Male": "mean_bill_size_meals_served_by_males", "Female": "mean_bill_size_meals_served_by_females"})

Out[136]:

gender	mean_bill_size_meals_served_by_males	mean_bill_size_meals_served_by_females
day
Thur	18.714667	16.715312
Fri	19.857000	14.145556
Sat	20.802542	19.680357
Sun	21.887241	19.872222

This crosstab calculation outputted the same 18.71 value as expected!

We can pass in many other aggregate methods to the aggfunc method too such as mean and standard deviation.

You can learn more about details of using crosstab() from the official pandas documentation page.

Grouping By Multiple Fields for the Index and/or Columns¶

We can also use the crosstabs() method to group by multiple pandas columns for the index or columns arguments.

For example, we can find out for days of the week, for each gender, what was the count of meals they served for lunch or dinner. In pandas We want our returned index to be the unique values from day and our returned columns to be the unique values from all combinations of the sex (gender) and time (meal time) columns.

To interpret a value from our table below, across all Thursdays in our dataset, females served 31 lunch meals.

In [137]:

                
                    Copied!
                    
pd.crosstab(index=df_tips['day'], columns=[df_tips['sex'], df_tips['time']], colnames=['gender', 'meal']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"})
pd.crosstab(index=df_tips['day'], columns=[df_tips['sex'], df_tips['time']], colnames=['gender', 'meal']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"})

Out[137]:

gender	Male		Female
meal	count_lunch_meals_served	count_dinner_meals_served	count_lunch_meals_served	count_dinner_meals_served
day
Thur	30	0	31	1
Fri	3	7	4	5
Sat	0	59	0	28
Sun	0	58	0	18

We can verify the logic in a value with a separate query. I'm curious to see the count of meals on Thursday served by females during lunch time. Our result on this query is 31 which matches the value in the crosstab above!

In [138]:

                
                    Copied!
                    
len(df_tips.query("day=='Thur' and sex=='Female' and time=='Lunch'"))
len(df_tips.query("day=='Thur' and sex=='Female' and time=='Lunch'"))

Out[138]:

If you're familiar with the groupby() method in pandas, the code below performs a similar operation to the crosstab above, but the output format is a little different.

In [139]:

                
                    Copied!
                    
df_tips.groupby(['day', 'sex'])['time'].count()
df_tips.groupby(['day', 'sex'])['time'].count()

Out[139]:

day   sex   
Thur  Male      30
      Female    32
Fri   Male      10
      Female     9
Sat   Male      59
      Female    28
Sun   Male      58
      Female    18
Name: time, dtype: int64

We can also group by multiple pandas columns in the index argument of the crosstabs() method.

Below is a fairly complex operation. The code below helps us find out for every combination of day of the week, gender and meal type what was the count of meals served for the various group sizes (ex - party of two people versus party of three).

In our output, the index includes unique values from the day and sex fields while columns include unique values from the time and size fields.

To interpret a value from our table below, across all Thursdays in our dataset, males 24 lunch meals that contained a group of 2 people.

In [140]:

                
                    Copied!
                    
pd.crosstab(index=[df_tips['day'], df_tips['sex']], columns=[df_tips['time'], df_tips['size']], colnames=['meal', 'party_people_size']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"})
pd.crosstab(index=[df_tips['day'], df_tips['sex']], columns=[df_tips['time'], df_tips['size']], colnames=['meal', 'party_people_size']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"})

Out[140]:

	meal	count_lunch_meals_served						count_dinner_meals_served
	party_people_size	1	2	3	4	5	6	1	2	3	4	5	6
day	sex
Thur	Male	0	24	2	2	1	1	0	0	0	0	0	0
Thur	Female	1	23	2	3	0	2	0	1	0	0	0	0
Fri	Male	1	2	0	0	0	0	0	6	0	1	0	0
Fri	Female	0	3	1	0	0	0	0	5	0	0	0	0
Sat	Male	0	0	0	0	0	0	0	34	13	11	1	0
Sat	Female	0	0	0	0	0	0	2	19	5	2	0	0
Sun	Male	0	0	0	0	0	0	0	32	9	14	2	1
Sun	Female	0	0	0	0	0	0	0	7	6	4	1	0

We can verify 24 lunch meals for a party of 2 were served by males on Thursdays with the query below too.

In [141]:

                
                    Copied!
                    
len(df_tips.query("day=='Thur' and sex=='Male' and time=='Lunch' and size==2"))
len(df_tips.query("day=='Thur' and sex=='Male' and time=='Lunch' and size==2"))

Out[141]:

Style Output¶

While the above output is daunting to read with all the cells, there's a cool trick in Seaborn to make your DataFrame output appear as a heatmap. I added some code based off the example from the pandas documention for built-in styles. Now, we can more easily identify the large values as being dark orange colors in our visualization below.

In [145]:

                
                    Copied!
                    
orange = sns.light_palette("orange", as_cmap=True)
pd.crosstab(index=[df_tips['day'], df_tips['sex']], columns=[df_tips['time'], df_tips['size']], colnames=['meal', 'party_people_size']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"}).style.background_gradient(cmap=orange)
orange = sns.light_palette("orange", as_cmap=True)
pd.crosstab(index=[df_tips['day'], df_tips['sex']], columns=[df_tips['time'], df_tips['size']], colnames=['meal', 'party_people_size']).rename(columns={"Lunch": "count_lunch_meals_served", "Dinner": "count_dinner_meals_served"}).style.background_gradient(cmap=orange)

Out[145]:

	meal	count_lunch_meals_served						count_dinner_meals_served
	party_people_size	1	2	3	4	5	6	1	2	3	4	5	6
day	sex
Thur	Male	0	24	2	2	1	1	0	0	0	0	0	0
Thur	Female	1	23	2	3	0	2	0	1	0	0	0	0
Fri	Male	1	2	0	0	0	0	0	6	0	1	0	0
Fri	Female	0	3	1	0	0	0	0	5	0	0	0	0
Sat	Male	0	0	0	0	0	0	0	34	13	11	1	0
Sat	Female	0	0	0	0	0	0	2	19	5	2	0	0
Sun	Male	0	0	0	0	0	0	0	32	9	14	2	1
Sun	Female	0	0	0	0	0	0	0	7	6	4	1	0