When to Use Categorical Scatterplots¶
Date published: 2018-07-22
Category: Data Visualizations
Subcategory: Best Practices
Tags: scatter plot
Businesses often want to compare the distribution of a numerical value across the categories of a categorical variable in a dataset.
To display the distribution of data within a category, people typically use a box plot or histogram. However, those visualizations can mislead when a category contains few data points. If any category has a small number of data points, roughly 5 to 80, I'd recommend you start with a categorical scatter plot for comparison.
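As a rough illustration of that rule of thumb, here is a minimal sketch that picks a plot kind from the number of points in each category. The function name `recommend_plot_kind` and the `small_max` cutoff are my own assumptions for illustration, not part of Seaborn; the cutoff of 80 simply mirrors the rough range mentioned above.

```python
import pandas as pd


def recommend_plot_kind(categories: pd.Series, small_max: int = 80) -> str:
    """Hypothetical helper: suggest a seaborn catplot `kind` from group sizes.

    `small_max` is an assumed cutoff based on the rough 5 to 80 range above.
    """
    counts = categories.value_counts()
    # If even one category is sparse, plot every point instead of a summary.
    return "swarm" if counts.min() <= small_max else "box"


# A handful of category labels, for illustration only.
table_sizes = pd.Series([2, 3, 3, 2, 4, 4, 2, 4])
print(recommend_plot_kind(table_sizes))
```

The returned string can be passed as the `kind` argument of `sns.catplot`. Treat it as a starting point rather than a hard rule.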
Below, I'll illustrate a few examples.
Import Modules¶
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Set Global Styles for Visualizations¶
sns.set_context('talk')
sns.set_style("darkgrid")
Example: Tips Dataset¶
In the examples below, I use the tips dataset provided in the Seaborn visualization library.
Each row in this dataset is a record of a meal at a restaurant. For each meal, the restaurant recorded the total bill amount in U.S. dollars, tip amount, sex of the person who paid the bill, whether the party included smokers, day, meal time and table size.
Load tips Dataset¶
df_tips = sns.load_dataset("tips")
Preview the First 8 Rows of Data¶
df_tips.head(8)
| | total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
6 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 |
7 | 26.88 | 3.12 | Male | No | Sun | Dinner | 4 |
Visualize Total Bill Amount by Table Size¶
The categorical scatter plot below, more specifically called a swarm plot, illustrates the count of records for each table size and the distribution of bill amounts by table size.
A swarm plot is ideal here because several table sizes, such as 1, 5, and 6, have very few recorded meals.
Alternatively, if we used a histogram or box plot to illustrate these sparse categories, we'd get a misleading picture of bill amounts by table size. It's bad practice to compare a summary of just 4 records for a table size of 1 against a summary of 156 records for a table size of 2. The 4 records for a table size of 1 may not be representative: those diners may all have happened to make very small purchases, resulting in small total_bill amounts in our dataset.
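To see why a summary of only 4 records is fragile, here is a small sketch of the quartiles a box plot would rest on. The four bill amounts are made-up values for illustration, not records from the tips dataset.

```python
import pandas as pd

# Made-up bill amounts for a category with only 4 records (not real data).
bills = pd.Series([3.0, 7.0, 9.0, 10.0])
quartiles = bills.quantile([0.25, 0.5, 0.75])
print(quartiles.tolist())  # [6.0, 8.0, 9.25]

# Dropping a single record moves every quartile.
quartiles_without_lowest = bills.iloc[1:].quantile([0.25, 0.5, 0.75])
print(quartiles_without_lowest.tolist())  # [8.0, 9.0, 9.5]
```

With so few points, each quartile is just an interpolation between two neighboring records, so one unusual meal can reshape the whole box. Plotting the individual points keeps that uncertainty visible.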
sns.catplot(x="size", y="total_bill", kind="swarm", data=df_tips, height=8.5, aspect=.9)
plt.xlabel("table size [number of people]", labelpad=15)
plt.ylabel("total bill amount [$]", labelpad=15)
plt.title("Distribution of Bill Amounts by Table Size", y=1.013);
Interpretation of Distribution of Bill Amounts by Table Size¶
There are few records of bill amounts for table sizes of 1, 5, and 6. The restaurant mostly gets table sizes of 2 and 3.
Generally, as the table size increases, total bill amount increases.
For a table size of 2, most bill amounts are between 10 and 20 U.S. dollars.
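These readings can be cross-checked with a quick numerical summary. The sketch below uses only the eight preview rows shown earlier so that it runs without downloading anything; the variable name `preview` is my own, and on the full data the same `groupby` call should work with `df_tips` in its place.

```python
import pandas as pd

# The eight preview rows shown earlier (total_bill and size columns only).
preview = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "size": [2, 3, 3, 2, 4, 4, 2, 4],
})

# Record count and median bill for each table size.
summary = preview.groupby("size")["total_bill"].agg(["count", "median"])
print(summary)
```

Eight rows are far too few to confirm any trend, so treat this as a template for the full dataset rather than as evidence.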
Visualize Total Bill Amount by Table Size and Gender/Sex¶
To build on the visualization above, I'm curious whether there are any patterns in bill amounts by the sex of the person who paid the bill.
sns.catplot(x="size", y="total_bill", kind="swarm", hue='sex', data=df_tips, height=8.5, aspect=.9)
plt.xlabel("table size [number of people]", labelpad=15)
plt.ylabel("total bill amount [$]", labelpad=15)
plt.title("Distribution of Bill Amounts by Table Size and Sex", y=1.013);
Interpretation of Visualization of Total Bill Amount by Table Size and Gender/Sex¶
There's no clear pattern of bill amounts by table size and gender.
At a glance, it appears males paid the bill at more tables than females. Males often hold the top few spots for highest bill amount by table size too.
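Both at-a-glance observations can be checked with counts rather than by eye. The sketch below again uses only the eight preview rows (the name `preview` is my own), so its output says nothing about the full dataset; swapping in `df_tips` should apply the same checks to every record.

```python
import pandas as pd

# The eight preview rows shown earlier (columns used in this plot only).
preview = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male", "Male", "Male"],
    "size": [2, 3, 3, 2, 4, 4, 2, 4],
})

# Number of records for each combination of table size and sex.
counts = pd.crosstab(preview["size"], preview["sex"])
print(counts)

# The row holding the highest bill within each table size.
top_bills = preview.loc[preview.groupby("size")["total_bill"].idxmax()]
print(top_bills[["size", "sex", "total_bill"]])
```

The cross-tabulation answers the "more tables" question directly, and the second table shows which sex is attached to the largest bill at each table size.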