Math Descriptive Statistics Article

Z-scores

A z-score is the number of standard deviations away from a mean for a data point. A z-score helps point out how unusual or usual a data point is from the other values.

Import Modules

import pandas as pd
import seaborn as sns
import scipy.stats as stats
import numpy as np
import warnings
import matplotlib.pyplot as plt
% matplotlib inline

Visualization styling code

sns.set(rc={'figure.figsize':(12, 7.5)})
sns.set_context('talk')

Turn Off Warnings

I turn warnings off in this post because of an issue in Scipy that will be fixed in a later version.

warnings.filterwarnings('ignore')

Example: Z-scores of U.S. Heights

I generate a pandas DataFrame to simulate heights of people in the U.S. from a normal distribution. Note: in order to calculate z-scores of values, you need a normal distribution.

np.random.seed(42)
population_size = 5000
df_heights = pd.DataFrame(data={'us_height_inches': np.random.normal(loc=66, scale=2.9, size=population_size)})

Preview df_heights.

df_heights.head()
us_height_inches
0 67.440471
1 65.599034
2 67.878297
3 70.416787
4 65.320955

Visualize the distribution of df_heights values.

sns.distplot(df_heights['us_height_inches'], color="maroon")
plt.xlabel("height [inches]", labelpad=14)
plt.ylabel("probability of occurence", labelpad=14)
plt.title("Distribution of Heights of People in U.S.", y=1.015, fontsize=20);

png

Calculate the population mean height in the U.S. using the pandas series mean() method.

pop_mean_us_height_inches = df_heights['us_height_inches'].mean()
pop_mean_us_height_inches
66.01624559725403

Calculate the population standard deviation height in the U.S. using the pandas series std() method.

pop_std_dev_us_height_inches = df_heights['us_height_inches'].std()
pop_std_dev_us_height_inches
2.889791503580723

Given any person's height, I can calculate the number of standard deviations that height is from the mean by using the z-score equation:

$$ z-{score} = \frac{x - \mu}{\sigma } $$

Create a new column z-score that's the z-score for each person's height.

df_heights['us_z-score'] = (df_heights['us_height_inches']-pop_mean_us_height_inches)/pop_std_dev_us_height_inches

Preview df_heights.

df_heights.head()
us_height_inches us_z-score
0 67.440471 0.492847
1 65.599034 -0.144374
2 67.878297 0.644355
3 70.416787 1.522788
4 65.320955 -0.240602

Based on the z-scores computed, I can interpret the z-score values. For example, here's an interpretation for a few people:

for person in df_heights.itertuples():
    index_person = 0
    index_height_inches = 1
    index_z_score = 2

    index = person[index_person]
    height = round(person[index_height_inches], 2)
    z_score = round(person[index_z_score], 2)

    if index <= 4:
        print("This person at index {0} has an height in inches of {1} and is approximately {2} standard deviations from the U.S. population mean height in inches".format(index, height, z_score))
This person at index 0 has an height in inches of 67.44 and is approximately 0.49 standard deviations from the U.S. population mean height in inches
This person at index 1 has an height in inches of 65.6 and is approximately -0.14 standard deviations from the U.S. population mean height in inches
This person at index 2 has an height in inches of 67.88 and is approximately 0.64 standard deviations from the U.S. population mean height in inches
This person at index 3 has an height in inches of 70.42 and is approximately 1.52 standard deviations from the U.S. population mean height in inches
This person at index 4 has an height in inches of 65.32 and is approximately -0.24 standard deviations from the U.S. population mean height in inches

In a normal distribution, there is a standard range of values that fall within a certain number of standard deviations from the mean:

% of total height values range in distribution z-scores from mean
68 \(\mu\pm\sigma\) \(\pm1.96\)
95 \(\mu\pm2\sigma\) \(\pm2.33\)
99 \(\mu\pm3\sigma\) \(\pm2.575\)

Above, I standardized a distribution by getting every value's z-score. I can now visualize the distribution of z-scores below that correspond to specific height values. The histogram below looks like a normal distribution. The official term for it is a standard normal distribution.

df_heights['us_z-score'].hist(color='slategray')
plt.title("Standard Normal Distribution", y=1.015, fontsize=22)
plt.xlabel("z-score", labelpad=14)
plt.ylabel("frequency", labelpad=14);

png

Every point in our dataset is now illustrated as the number of standard deviations away from the mean - represented by its z-score.

In the distribution above, the standard deviation is 1.

z_score_distribution_std_dev = round(df_heights['us_z-score'].std(), 2)
z_score_distribution_std_dev
1.0

Comparison of Z-Scores from Two Populations

Above, we had data on heights of people in the U.S. My friend Leslie was born in the U.S. and is 63 inches. My other friend Jamie was born in the Phillipines and is 57 inches tall. I'm curious which of my friends is relatively taller for their respective country's height.

height_leslie_inches = 63
height_jamie_inches = 57

To the relative height for each population distribution, I'll calculate the z-score for each friend's height based on the country's population distribution using the z-score equation and round each value to two decimal places.

Using the z-score allows standardization among the two distributions of heights in the U.S. and Phillipines so we can easily compare where Leslie and Jamie reside on each distribution.

The code below generates a normal distribution of heights for people in the Philippines.

df_heights['philippines_height_inches'] = np.random.normal(loc=61, scale=3.2, size=population_size)
sns.distplot(df_heights['us_height_inches'], color="maroon", label='U.S. heights')
sns.distplot(df_heights['philippines_height_inches'], color='palegoldenrod', label='Philippines heights')
plt.axvline(x=height_leslie_inches, linestyle='--', linewidth=2.5, label="Leslie's height", c='indigo')
plt.axvline(x=height_jamie_inches, linestyle='--', linewidth=2.5, label="Jamie's height", c='slategrey')
plt.xlabel("height [inches]", labelpad=14)
plt.ylabel("probability of occurence", labelpad=14)
plt.title("Distribution of Heights in the U.S. and Philippines", y=1.015, fontsize=20)
plt.legend();

png

Calculate the population mean height in the Philippines using the pandas series mean() method.

pop_mean_philippines_height_inches = df_heights['philippines_height_inches'].mean()
pop_mean_philippines_height_inches
60.96840353016177

Calculate the population standard deviation height in the Philippines using the pandas series std() method.

pop_std_dev_philippines_height_inches = df_heights['philippines_height_inches'].std()
pop_std_dev_philippines_height_inches
3.2333986708154616

Below is a function to calculate a value's z-score.

def z_score(value, population_mean, population_std_dev):
    """
    Function to calculate z-score as observation's value minus population mean divided by population standard deviation; round value to 2 decimal places.
    """
    return round((value - population_mean)/population_std_dev, 2)
z_score_leslie_us = z_score(height_leslie_inches, pop_mean_us_height_inches, pop_std_dev_us_height_inches) 
z_score_jamie_philippines = z_score(height_jamie_inches, pop_mean_philippines_height_inches, pop_std_dev_philippines_height_inches)

print("Leslie had a z-score of {0} for height relative to U.S. heights and Jamies has a z-score of {1} relative to Philippines heights".format(z_score_leslie_us, z_score_jamie_philippines))
Leslie had a z-score of -1.04 for height relative to U.S. heights and Jamies has a z-score of -1.23 relative to Philippines heights

Since Leslie's z-score is larger, she's relatively taller compared to people in her country than Jamie is.

There's also a method in the scipy stats module called the zscore() method that can calculate z-scores given just a list of values.

Probability Density Function

Above, we visualized the distribution of heights and included bars to indicate the frequency or probability for certain height values. We can visualize the same distribution below with probability of occurence on the y-axis and just include the density curve. The total area under this curve represents the probability of all outcomes happening, equal to \(1\).

sns.distplot(df_heights['us_height_inches'], color="maroon", hist=False)
plt.xlabel("height [inches]", labelpad=14)
plt.ylabel("probability of occurence", labelpad=14)
plt.title("Distribution of Heights of People in U.S.", y=1.015, fontsize=20);

png

For any person's height, we can find the proportion of values greater than or less than that height by finding the appropriate area under the curve.

Probability of Occurence Given Z-Scores

With a normal distribution, given an observation's value, I can determine the proportion of observations above or below that value. The steps to do so are to calculate the observation's z-score and then find the appropriate area under the curve.

For example, Leslie's height is 63 inches. For what proportion of people is Leslie taller than in the U.S.?

In the scipy stats module, there's a norm (meaning a normal distribution) class with a cdf method that takes an argument of a z-score for an observation and returns the proportion of values less than that observation.

proportion_of_us_ppl_leslie_taller_than = round(stats.norm.cdf(z_score_leslie_us), 3)
proportion_of_us_ppl_leslie_taller_than
0.149

Leslie is taller than just 14.9% of people in the U.S. and she ranks at the 14.9th percentile.

I can visualize her height on the normal distribution and shade the area of heights for which she's taller.

kde = stats.gaussian_kde(df_heights['us_height_inches'])
pos = np.linspace(df_heights['us_height_inches'].min(), df_heights['us_height_inches'].max(), 50000)
plt.plot(pos, kde(pos), color='darkorange')
shade = np.linspace(55, height_leslie_inches, 300)
plt.fill_between(shade, kde(shade), alpha=0.45, color='darkorange',)
plt.text(x=62, y=.010, horizontalalignment='center', fontsize=15, 
         s="Leslie is taller than this\n{0} shaded proportion\nof people in the U.S.".format(proportion_of_us_ppl_leslie_taller_than), 
         bbox=dict(facecolor='whitesmoke', boxstyle="round, pad=0.25"))
plt.title("Distribution of U.S. Heights and Leslie's Ranking", fontsize=20, y=1.012)
plt.xlabel("height [inches]", labelpad=15)
plt.ylabel("probability of occurence", labelpad=15);

png