Data Visualizations Best Practices Tutorial

When to Use Histogram Plots

Histograms visualize the shape of the distribution for a single continuous variable that contains numerical values. A histogram displays data using bars of different heights.

Histograms are somewhat similar to vertical bar charts; however, with histograms, numerical values are grouped into bins. For example, you could create a histogram of the weight (in pounds) of everyone at your university. To do so, you'd create bins so that people weighing 40 to 60 pounds fall into one bin, people weighing 60 to 80 pounds fall into another bin, and so forth.
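To make the idea of binning concrete, here is a minimal sketch that counts values per 20-pound bin with NumPy. The weights are made-up numbers for illustration only, and the bin edges are my own choice, not something from a real dataset.

```python
import numpy as np

# Hypothetical weights in pounds (made-up values, for illustration only).
weights = np.array([95, 110, 118, 125, 132, 139, 140, 151, 158, 163, 177, 204])

# Bin edges every 20 pounds: [80, 100), [100, 120), ..., [200, 220].
edges = np.arange(80, 221, 20)

# np.histogram returns the number of values that fall into each bin.
counts, _ = np.histogram(weights, bins=edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right} lb: {n}")
```

Each count becomes the height of one bar in the histogram. Note that every bin includes its left edge and excludes its right edge, except the last bin, which includes both.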

In histogram plots, the bars should have no spacing between them.

Similar to box plots, histograms visualize the distribution of a dataset. However, box plots are often better suited to identifying outliers, if any exist.

Below, I'll walk through a few examples of when histograms are useful.

Import Modules

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import warnings
%matplotlib inline

Set visualization styles to make all figure sizes and components larger.

sns.set(rc={'figure.figsize':(11.5, 8.5)})
sns.set_context("talk")

I turn warnings off in this post because of an issue in SciPy that will be fixed in a later version.

warnings.filterwarnings('ignore')

Example: Bay Area Bike Share Ride Duration Data

In the San Francisco Bay Area, a company called Motivate operates a network of shared bikes across several cities. You can walk up to a bike, pay, unlock it from a dock, ride it to your destination, and park it in another dock nearby.

There's an option to become a member (aka subscriber) in which you pay a monthly subscription fee that includes unlimited rides that are up to 30 minutes long. When someone becomes a member, they can submit information on their birth year and gender.

For each ride, Motivate records data on the start time, end time, member birth year and member gender.

I'm curious to learn more about the duration of bike rides by customers.

Load Dataset on May 2018 Rides

df = pd.read_csv('201805-fordgobike-tripdata.csv')

Preview Some Data

df[['start_time', 'end_time', 'duration_sec', 'member_birth_year', 'member_gender']].head()
start_time end_time duration_sec member_birth_year member_gender
0 2018-05-31 21:41:51.4750 2018-06-01 13:28:22.7220 56791 NaN NaN
1 2018-05-31 18:39:53.7690 2018-06-01 09:19:51.5410 52797 1983.0 Male
2 2018-05-31 21:09:48.0150 2018-06-01 09:09:52.4850 43204 NaN NaN
3 2018-05-31 14:09:54.9720 2018-06-01 08:48:17.8150 67102 1979.0 Male
4 2018-05-31 16:07:23.8570 2018-06-01 08:28:47.2020 58883 1986.0 Male

Make New Column for Ride Time in Minutes

Motivate records the duration of rides in seconds because it's a granular metric. However, when we discuss bike rides with friends, we typically say 20 minutes, not 1200 seconds.

Below, I create a new column to convert the duration of rides in seconds into minutes.

df['duration_minutes'] = df['duration_sec']/60

View Descriptive Statistics of Ride Time in Minutes

Below, we can see the shortest ride was about 1 minute and the longest was about 1436.8 minutes. That's almost a 24-hour ride!

The 99th percentile value is 91.5 minutes, which is about three times as large as the 95th percentile value of 30.63 minutes.

df['duration_minutes'].describe(percentiles=[.25, .5, .75, .9, .95, .99, .999])
count    179125.000000
mean         14.248406
std          39.942553
min           1.016667
25%           5.700000
50%           9.133333
75%          14.500000
90%          22.600000
95%          30.633333
99%          91.500000
99.9%       687.312667
max        1436.800000
Name: duration_minutes, dtype: float64
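
As a quick arithmetic check on the percentile comparison, the ratio can be computed directly from the two values printed above. This small sketch hard-codes those numbers instead of reloading the dataset.

```python
# Values copied from the describe() output above.
p95_minutes = 30.633333
p99_minutes = 91.5

# How many times larger is the 99th percentile than the 95th?
ratio = p99_minutes / p95_minutes
print(f"99th / 95th percentile: {ratio:.2f}")  # → 2.99
```

The ratio comes out just under 3, which shows how quickly ride durations grow in the right tail.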

Plot Histogram of Ride Time in Minutes

Below, I limit my histogram to show rides up to 91 minutes because that's roughly the 99th percentile. I don't want my visualization to be distorted by outliers. Focusing on rides up to the 99th percentile will help us draw insights about the patterns of the majority of riders.

The Seaborn visualization library in Python automatically determines bin size using the Freedman-Diaconis rule. This is a very convenient feature to have!
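
For readers curious about what the Freedman-Diaconis rule computes, here is a minimal sketch on synthetic data (not the bike share dataset). The rule sets the bin width to 2 * IQR / n^(1/3), where IQR is the interquartile range and n is the number of observations. The comparison with NumPy's "fd" estimator is only a sanity check; as far as I know, distplot additionally caps the number of bins at 50, so treat this as an illustration of the rule rather than an exact reproduction of Seaborn's behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed "durations" in minutes (NOT the bike share data).
durations = rng.lognormal(mean=2.2, sigma=0.6, size=5000)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
q25, q75 = np.percentile(durations, [25, 75])
bin_width = 2 * (q75 - q25) / len(durations) ** (1 / 3)
n_bins = int(np.ceil((durations.max() - durations.min()) / bin_width))

# NumPy implements the same rule under the name "fd".
fd_edges = np.histogram_bin_edges(durations, bins="fd")

print(f"bin width: {bin_width:.2f} minutes, bins: {n_bins}")
print(f"NumPy 'fd' bins: {len(fd_edges) - 1}")
```

Because the width depends on the IQR rather than the standard deviation, a handful of extreme values has little effect on it.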

sns.distplot(df['duration_minutes'], kde=False, color='b', hist_kws={"range": [0, 91]})
plt.xlabel("ride duration [minutes]", labelpad=14)
plt.ylabel("frequency", labelpad=14)
plt.title("Histogram of Duration of Bike Rides (in Minutes)", fontsize=20, y=1.01);


Interpretation of Bike Ride Duration Histogram

The most common ride duration is around 7 minutes.

Most bike rides last just 3 to 15 minutes, which I consider fairly short. Therefore, I could theorize that Bay Area bike share members infrequently use these bikes for long, strenuous rides over long distances. Rather, they likely use the bikes for short trips.

The histogram illustrates positive skew. This means there's a long tail on the right side of our peak. Because of this skew, the mean ride duration is larger than the median ride duration. Below, I print the mean and median so we can verify that while the mean ride duration is 14.25 minutes, the median is smaller at 9.13 minutes.

round(df['duration_minutes'].mean(), 2)
14.25
round(df['duration_minutes'].median(), 2)
9.13
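
To see why a long right tail pulls the mean above the median, here is a tiny sketch with made-up ride durations. The numbers are illustrative only and are not taken from the dataset.

```python
from statistics import mean, median

# Made-up ride durations in minutes, with one very long ride (illustrative only).
durations = [3, 5, 6, 7, 7, 8, 9, 12, 20, 90]

mean_minutes = mean(durations)
median_minutes = median(durations)

print(mean_minutes)    # → 16.7
print(median_minutes)  # → 7.5
```

The single 90-minute ride more than doubles the mean relative to the median, while the median barely notices it. The same effect, at a larger scale, explains the gap between 14.25 and 9.13 minutes above.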

Relative Frequency Histogram

Previously, our histogram showed the frequency values on the y-axis. Another version of a histogram illustrates relative frequencies on the y-axis. This is helpful for visualizing the proportion of values in a certain range.

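In addition to the arguments set in the histogram above, below I set bins to 27 and norm_hist to True. When set to True, the norm_hist argument shows a density rather than a count on the y-axis. With a density, the area of each bar (its height times the bin width), not its height alone, equals the proportion of rides in that bin.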

sns.distplot(df['duration_minutes'], bins=27, kde=False, norm_hist=True, color='b', hist_kws={"range": [0, 91]})
plt.xlabel("ride duration [minutes]", labelpad=14)
plt.ylabel("relative frequency", labelpad=14)
plt.title("Relative Frequency Histogram of Duration of Bike Rides (in Minutes)", fontsize=20, y=1.01);


Interpretation of Bike Ride Duration Relative Frequency Histogram

Reducing the number of bins to 27 makes this histogram easier to interpret but sacrifices some detail.

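I can make visual approximations now. Between a ride duration of 0 and 20 minutes there are 6 bars, so each bar covers a span of about 3.4 minutes (91 / 27). The first bar has a height of approximately 0.023. Because the y-axis shows a density, the proportion of rides in that bar is its height times its width: roughly 0.023 × 3.4 ≈ 0.08 of the plotted rides are between 0 and 3.4 minutes long.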
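
The relationship between a density bar and a proportion can be checked numerically. This is a minimal sketch on synthetic data (not the bike share dataset) that uses the same binning as the plot, 27 equal bins between 0 and 91 minutes, and multiplies each density by its bin width.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic durations in minutes (NOT the bike share data).
durations = rng.exponential(scale=10.0, size=2000)

# Same binning as the plot: 27 equal bins between 0 and 91 minutes.
density, edges = np.histogram(durations, bins=27, range=(0, 91), density=True)
bin_widths = np.diff(edges)

# Area of each bar = proportion of the in-range values that fall in that bin.
proportions = density * bin_widths

print(f"bin width: {bin_widths[0]:.2f} minutes")
print(f"proportions sum to: {proportions.sum():.3f}")
```

The proportions sum to 1 across all bins, whereas the raw density values generally do not.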

Example: Age of Bike Riders

I'm curious to learn about the age distribution of members of the Bay Area bike share program.

Create New Column for Age (in Years)

Motivate records the birth year of members. However, age in years is easier to interpret than year of birth.

df['age_years'] = 2018 - df['member_birth_year']

View Descriptive Statistics on Riders' Age

Below, we can see the youngest rider is 18 and the oldest is 129. However, the 99th percentile is 66 years old, roughly half the maximum age.

df['age_years'].describe(percentiles=[.25, .5, 0.75, 0.99, 0.9999])
count     167376.000000
mean          35.813575
std           10.320561
min           18.000000
25%           28.000000
50%           33.000000
75%           41.000000
99%           66.000000
99.99%       118.000000
max          129.000000
Name: age_years, dtype: float64
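
A maximum age of 129 is most likely a data-entry artifact in the self-reported birth year. As a hedged sketch of one way to handle this, the snippet below computes ages from a few hypothetical birth years and keeps only ages in a plausible range. The 18 to 80 cutoff is an assumption on my part, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical birth years, including a missing value and an implausible entry.
birth_years = pd.Series([1983, 1979, np.nan, 1995, 1889, 2000])

# Missing birth years stay missing (NaN) after the subtraction.
ages = 2018 - birth_years

# Keep ages from 18 to 80 inclusive; NaN fails the comparison and is dropped.
plausible_ages = ages[ages.between(18, 80)]

print(plausible_ages.tolist())  # → [35.0, 39.0, 23.0, 18.0]
```

Filtering the column like this removes the rows entirely, while limiting the histogram range only hides them from the plot.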

Plot Histogram of Age of Bike Riders

I limit the range of the x-axis (age) to be just 18 to 80 so we can more easily visualize the bulk of riders, and disregard the outliers well over 80.

df_age_members = df[df['age_years'].notnull()]['age_years']
sns.distplot(df_age_members, color='g', kde=False, hist_kws={"range": [18, 80]})
plt.title("Histogram of Age of Bike Members", fontsize=20, y=1.01)
plt.xlabel("age [years]", labelpad=14)
plt.ylabel("frequency", labelpad=14);


Interpretation of Histogram of Age of Bike Riders

The most frequent age group of riders is people aged 24 to 35, with a peak around 29.

Past roughly 35 years of age, as people get older, they're less likely to be a member of the Bay Area bike share program. However, there are still several thousand rides by members who are 55 years or older.

There's a wide range of age of bike riders - from 18 to 60+.

This histogram is positively skewed too. There's a long tail of bins with ever-decreasing frequency that extends to the right of our peak.