Data Visualizations Best Practices Tutorial

When to Use Vertical Grouped Barplots

Vertical bar charts are useful to illustrate sizes of data using different bar heights. A vertical grouped barplot often illustrates the sizes of multiple categories using different bar heights.

For example, let's say we had a service that rented out scooters in San Francisco, California. Customers can make a one-time rental or pay a monthly subscription fee and get unlimited rides for under 30 minutes. So, for each ride, we log if the customer was a non-subscriber or subscriber. With these two categories for types of riders, we can see how they compare to one another with a multiple vertical bar chart.

I'll illustrate a few examples below of when vertical grouped bar plots are useful.

Import Modules

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
%matplotlib inline

Set visualization styles to have all fonts scaled larger and figure sizes larger.

sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set_context('talk')

Example: Scooter Rides Per Month Over Time by Account Type

Let's continue with the example mentioned above. I'm curious about the trend of individual versus subscriber scooter rides over the past year. To get a high-level perspective, I think it would be helpful to look at the count of rides per month for each of these account types.

A sample of our original data would look like:

Date Miles Ridden Account Type
6/9/2018 2.1 non-subscriber
6/10/2018 1.5 subscriber
6/10/2018 3.9 subscriber

We'd like get a count of rides on the monthly level broken out by account type. Therefore, given the data above, we'd perform a group by operation on a month-year field, then group by account type, and then count the number of rides. We'd end up with a sample of data looking like:

Month Year Count Non-Subscriber Rides Count Subscriber Rides
May 2017 31100 900
June 2017 33900 1100
July 2017 36380 1300

Generate Scooter Ride Data

month_list = [i.strftime("%b %Y") for i in pd.date_range(start='5-2017', end='5-2018', freq='MS')]
monthly_count_rides = [32000, 35000, 37680, 41500, 43300, 44000, 44350, 41000, 39000, 39500, 48000, 50000, 52000]
monthly_count_rides_subscribers = [900, 1100, 1300, 1800, 2860, 3300, 3350, 3480, 4005, 4790, 4980, 5150, 5290]
monthly_count_rides_individual = [month[0]-month[1] for month in zip(monthly_count_rides, monthly_count_rides_subscribers)]

Plot Scooter Rides

The reason a vertical grouped barplot works well in this scenario is because we're most interested to see the change in count of rides by each type - rather than a change in total rides of the account types combined. So, with separate bars for individual and subscription accounts, we can easily visualize the trend over time.

df = pd.DataFrame({'month_year': month_list, 'count_scooter_rides': monthly_count_rides, 
                    'count_scooter_rides_subscribers': monthly_count_rides_subscribers,
                   'count_scooter_rides_non-subscribers': monthly_count_rides_individual})

Preview df

df.head()
count_scooter_rides count_scooter_rides_non-subscribers count_scooter_rides_subscribers month_year
0 32000 31100 900 May 2017
1 35000 33900 1100 Jun 2017
2 37680 36380 1300 Jul 2017
3 41500 39700 1800 Aug 2017
4 43300 40440 2860 Sep 2017
df.set_index('month_year')[['count_scooter_rides_subscribers', 'count_scooter_rides_non-subscribers']].plot(kind='bar', figsize=(14, 10))
plt.xticks(rotation=60)
plt.title("Count of Scooter Rides Per Month By Account Type Over Time", fontsize=18, y=1.01)
plt.xlabel("date [month year]", labelpad=15)
plt.ylabel("count [rides]", labelpad=15)
plt.legend(["subscribers", "non-subscribers"], fontsize=16, title="rider type");

png

Explanation of Scooter Rides Plot

There is a clear trend that the amount of rides by subscribers has increased every month since May 2017. Also, the increase has been pretty significant as the count of subscriber rides per month has doubled from May 2017 to May 2018.

However, non-subscriber scooter rides increased during the warmer months of 2017 from May to November. During the winter season, count of non-subscriber scooter rides decreased per month over several months. By March 2018, when the weather was much warmer, the count of non-subscriber rides per month drastically increased.

Example: Favorite Sport to Play by Gender

Let's imagine we surveyed a group of friends to ask them their favorite sport. We recorded each person's response. We only allowed responses for the following sports:

  • basketball
  • baseball
  • soccer

A sample of our data would look like:

Name Gender Favorite Sport
Jake Male Basketball
Michele Female Baseball
Elizabeth Female Soccer

We want to visualize the data in two ways: 1) see favorite sports by gender and 2) see breakdown of a specific sport's interest by gender.

Generate Fictional Data

genders = []
random.seed(a=1) # ensures choices generated each time below are the same
for i in range(0, 26):
    gender = random.choice(['Female', 'Male'])
    genders.append(gender)
sports = ['Basketball']*6 + ['Baseball']*8 + ['Soccer']*12
df2 = pd.DataFrame({'favorite_sport': sports, 'gender': genders})

Preview df2.

df2.head()
favorite_sport gender
0 Basketball Female
1 Basketball Female
2 Basketball Male
3 Basketball Female
4 Basketball Male

Plot Categorized by Gender

ax = sns.countplot(x="gender", hue="favorite_sport", data=df2)
ax.axes.set_title("Breakdown of Genders' Favorite Sports", fontsize=22)
plt.legend(fontsize=15, title="sport");

png

Explanation of Plot Categorized by Gender

Our female friends majorly prefer soccer and our male friends nearly prefer all 3 sports equally.

Plot Categorized by Sport

ax = sns.countplot(x="favorite_sport", hue="gender", data=df2)
ax.axes.set_title("Breakdown of Favorite Sports by Gender", fontsize=22)
plt.xlabel("sport", labelpad=9);

png