Data Visualizations Best Practices Tutorial

When to Use Box Plots

Box plots help visualize the distribution of quantitative values in a field. They are also valuable for comparing distributions across the levels of a categorical variable and for identifying outliers, if any exist in a dataset.

Box plots typically detail the minimum value, 25th percentile (aka Q1), median (aka 50th percentile), 75th percentile (aka Q3) and the maximum value in a visual manner.
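As a minimal sketch of those five values, the snippet below computes them for a small, made-up list of numbers using only Python's standard library. The `values` list is hypothetical and not part of the datasets used later. I'm assuming `method="inclusive"`, which interpolates linearly between data points and should agree with the default behavior of pandas' `quantile()`.

```python
import statistics

# Hypothetical sample of values (not from the tutorial's datasets).
values = [80, 83, 85, 88, 90, 92, 95, 97, 100]

# quantiles(..., n=4) returns the three cut points Q1, median and Q3.
# method="inclusive" interpolates linearly between the sorted data points.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

five_number_summary = {
    "min": min(values),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "max": max(values),
}
print(five_number_summary)
```

The box in a box plot spans Q1 to Q3, and the line inside the box marks the median.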

Note: different software and libraries such as Microsoft Excel, Seaborn and others may place the end whiskers and show outliers differently on box plots. Please understand your software's implementation well when you need to interpret results.
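To make that difference concrete, here is a rough sketch of two common whisker conventions applied to the same made-up data. The `values` list and the variable names are my own illustration. I'm not claiming which convention any particular tool uses, so check your tool's documentation.

```python
import statistics

# Hypothetical sample with one unusually large value.
values = [10, 12, 13, 14, 15, 16, 18, 40]

q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Convention A: whiskers extend to the smallest and largest observations.
whiskers_min_max = (min(values), max(values))

# Convention B (Tukey-style): whiskers stop at the most extreme observations
# within 1.5 * IQR of the box; points beyond that are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [v for v in values if lower_fence <= v <= upper_fence]
whiskers_tukey = (min(inside), max(inside))

print(whiskers_min_max)
print(whiskers_tukey)
```

Under convention A the upper whisker reaches the value 40, while under convention B it stops at 18 and the value 40 would be shown as a separate outlier point.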

Oftentimes, a box plot has the following parts:

Box plot visualization

You can learn more about box-and-whisker plots in this Khan Academy article.

Percentiles are frequently used for comparisons in the real world. For example, in my high school graduating class, my GPA ranked in the top 25%, which put me at the 75th percentile. That means I had a higher GPA than 75% of students in my graduating class.
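Here is a small sketch of how such a percentile rank could be computed. The GPA values are invented for illustration, and I define the rank as the share of classmates with a strictly lower GPA, which is one of several definitions in use.

```python
# Hypothetical GPAs for a small graduating class (illustrative values only).
class_gpas = [2.1, 2.5, 2.8, 3.0, 3.1, 3.3, 3.6, 3.9]
my_gpa = 3.6

# Share of the class with a strictly lower GPA, expressed as a percentage.
n_below = sum(gpa < my_gpa for gpa in class_gpas)
percentile_rank = 100 * n_below / len(class_gpas)

print(f"{percentile_rank:.0f}th percentile")
```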

Below, I'll walk through several examples of when box plots are useful.

Import Modules

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Set larger figure sizes and fonts.

sns.set(rc={'figure.figsize':(10, 6)})
sns.set_context("talk")

Example: Resting Heart Rate (Pulse)

This public dataset contains a sample of people's heart rates. To take that measurement, people counted the number of times their heart beat in a single minute. The count of beats per minute is also called a pulse.

Load Exercise Dataset

df_exercise = sns.load_dataset('exercise')

Preview Exercise Dataset

Below, you can see a random sample of 5 rows of data. Note how each row represents health/exercise metrics for a single person and tracks their heart rate (pulse) as well as what kind of activity was done before the heart rate measurement.

df_exercise.sample(n=5)
    Unnamed: 0  id     diet  pulse    time     kind
14          14   5  low fat     91  30 min     rest
60          60  21  low fat     93   1 min  running
78          78  27   no fat    100   1 min  running
62          62  21  low fat    110  30 min  running
9            9   4  low fat     80   1 min     rest

Plot Resting Pulse Data

ax = sns.boxplot(x=df_exercise[df_exercise['kind']=='rest']['pulse'])
ax.axes.set_title("Box Plot of People's Resting Heart Rate", fontsize=20, y=1.01)
plt.xlabel("pulse [beats per minute]", labelpad=14);

[Figure: box plot of people's resting heart rate]

Interpreting Pulse Data Quartiles

The median resting heart rate is roughly 92 beats per minute.

The minimum recorded resting heart rate is 80 beats per minute and the maximum is 100 beats per minute.

75% of people recorded a resting heart rate above 85.5 beats per minute. 25% of people recorded a resting heart rate above 95.75 beats per minute.
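The reading of Q1 and Q3 above can be checked numerically. The sketch below uses a made-up list of pulses (not the exercise dataset) and counts the share of values above each quartile.

```python
import statistics

# Hypothetical resting pulses (not the tutorial's dataset).
pulses = [80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100, 102]

q1, median, q3 = statistics.quantiles(pulses, n=4, method="inclusive")

# Fraction of observations strictly above each quartile.
share_above_q1 = sum(p > q1 for p in pulses) / len(pulses)
share_above_q3 = sum(p > q3 for p in pulses) / len(pulses)

print(q1, share_above_q1)
print(q3, share_above_q3)
```

With real data that contains ties or a sample size not divisible by four, these shares will only be approximately 75% and 25%.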

To see the exact numeric values of the quartiles shown in a box and whisker plot, you can print them in a table like the one below:

df_exercise[df_exercise['kind']=='rest']['pulse'].describe()
count     30.000000
mean      90.833333
std        5.831445
min       80.000000
25%       85.500000
50%       91.500000
75%       95.750000
max      100.000000
Name: pulse, dtype: float64

Example: Heart Rate Comparison for Resting, Walking and Running

In the example above, the visual box plot tells a similar story to the printed table results.

However, the visual representation of box plots becomes more valuable with side-by-side comparisons by a categorical variable. I want to know how the distribution of heart rate differs for people resting, walking and running. I'd assume that with more exercise activity, the median heart rate increases.

Box Plot for Heart Rate Comparisons by Activity

ax2 = sns.boxplot(x='kind', y='pulse', data=df_exercise, saturation=0.65)
ax2.axes.set_title("Box Plot of People's Heart Rate by Kind of Activity", fontsize=20, y=1.01)
plt.ylabel("pulse [beats per minute]", labelpad=14)
plt.xlabel("activity", labelpad=14);

[Figure: box plot of people's heart rate by kind of activity]

Interpretation of Heart Rate by Activity

As expected, the median heart rate increases with the level of exercise activity. There's a large jump in the median heart rate from walking to running since running is a strenuous exercise activity.

The distribution of recorded heart rates for those running varies much more than the distribution for those recorded after rest or walking. The maximum recorded heart rate for running is 150 beats per minute.

Example: Distribution of Total Bills by Day of Week

This public dataset contains records of restaurant orders. Specifically, we'll look at orders by day of week and the total bill amounts in U.S. dollars.

Get Tips Dataset

df_tips = sns.load_dataset('tips')

Preview Tips Dataset

Below, you can see a preview of 5 rows of the dataset. Note how each row represents a meal order, with fields for the total bill amount and the day of the week.

df_tips.sample(n=5)
     total_bill   tip     sex smoker  day    time  size
15        21.58  3.92    Male     No  Sun  Dinner     2
102       44.30  2.50  Female    Yes  Sat  Dinner     3
231       15.69  3.00    Male    Yes  Sat  Dinner     3
206       26.59  3.41    Male    Yes  Sat  Dinner     3
101       15.38  3.00  Female    Yes  Fri  Dinner     2

Plot Distribution of Total Bill Amount by Day

ax3 = sns.boxplot(x="day", y="total_bill", data=df_tips, saturation=0.6)
ax3.axes.set_title("Box Plots of Total Bill Amounts by Day of Week", fontsize=20, y=1.01)
plt.xlabel("day", labelpad=14)
plt.ylabel("total bill [$]", labelpad=14);

[Figure: box plots of total bill amounts by day of week]

Interpretation of Outliers for Thursday

The Python visualization library I use in the examples above is called Seaborn. By default, it treats as an outlier any point more than 1.5 times the interquartile range (Q3-Q1) beyond the box: any value greater than (Q3-Q1)*1.5 + Q3 or less than Q1 - (Q3-Q1)*1.5.

For total_bill values on Thursday, the leftmost box plot, we can see 5 outliers, all on the high side. Let's calculate the upper threshold above which a total_bill counts as an outlier.

First, we need to identify the exact Q3 and Q1 values.

Q1 = df_tips[df_tips['day']=='Thur']['total_bill'].quantile(0.25)
Q3 = df_tips[df_tips['day']=='Thur']['total_bill'].quantile(0.75)
outlier_threshold = (Q3-Q1)*1.5 + Q3
round(outlier_threshold, 2)
31.72

Any total bill value greater than 31.72 U.S. dollars on Thursday is considered an outlier.
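The same 1.5 * IQR rule can be applied on both sides of the box. Below is a rough sketch of a helper that returns both thresholds and the points outside them. The function name `tukey_outliers`, its `whis` parameter and the `bills` list are hypothetical names and values of my own, not part of Seaborn or the tips dataset.

```python
import statistics

def tukey_outliers(values, whis=1.5):
    """Return (lower_fence, upper_fence, outliers) under the whis * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Keep only the points that fall outside either fence.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical bill amounts with one very low and one very high value.
bills = [1.0, 12.0, 13.0, 14.0, 16.0, 18.0, 19.0, 20.0, 45.0]

lower, upper, outliers = tukey_outliers(bills)
print(lower, upper, outliers)
```

In this made-up sample, the low value 1.0 and the high value 45.0 both fall outside the thresholds, so a box plot using this rule would draw each as a separate point.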

Let's examine the data to see how many outliers exist. The code below queries our tips dataset for orders on Thursday with a total bill greater than the threshold of 31.72 U.S. dollars. We see 5 outliers. If we look at the Thursday box plot above, we see those 5 outliers plotted.

df_tips[(df_tips['day']=='Thur') & (df_tips['total_bill']>outlier_threshold)]['total_bill'].values
array([ 32.68,  34.83,  34.3 ,  41.19,  43.11])

Interpretation of Box Plots of Total Bill Amounts By Day

For total bill amounts on Thursday, the maximum non-outlier value is ~30 U.S. dollars.

Generally, people spend more money at this restaurant on weekends (Saturday and Sunday) than on weekdays: the median total bills for Saturday and Sunday are greater than the medians for Thursday and Friday.

On weekends, there's much more variation in people's spending on meals than on weekdays, as the taller boxes and longer whiskers show.
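One way to put a number on that difference in spread is to compare the interquartile range (the height of each box) per day. The sketch below does this for a handful of made-up records. The `records` values are hypothetical and are not taken from the tips dataset.

```python
import statistics
from collections import defaultdict

# Hypothetical (day, total_bill) records, for illustration only.
records = [
    ("Thur", 10.0), ("Thur", 12.0), ("Thur", 14.0), ("Thur", 16.0), ("Thur", 18.0),
    ("Sat", 8.0), ("Sat", 14.0), ("Sat", 20.0), ("Sat", 26.0), ("Sat", 32.0),
]

# Group the bill amounts by day.
bills_by_day = defaultdict(list)
for day, bill in records:
    bills_by_day[day].append(bill)

# The IQR (Q3 - Q1) is the height of each day's box in the box plot.
iqr_by_day = {}
for day, bills in bills_by_day.items():
    q1, _, q3 = statistics.quantiles(bills, n=4, method="inclusive")
    iqr_by_day[day] = q3 - q1

print(iqr_by_day)
```

A larger IQR means the middle half of that day's bills is spread over a wider range.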

Saturday has the highest recorded outlier at over 50 U.S. dollars.