# Z-scores

- Jan 4 • 11 min read
- Key Terms: z-score, statistics, standard deviation, normal distribution, python, pandas

A **z-score** is the number of standard deviations away from a mean for a data point. A z-score helps point out how *unusual* or *usual* a data point is from the other values. A z-score must be used with a normal distribution curve.

#### Import Modules

```
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import numpy as np
import warnings
import matplotlib.pyplot as plt
% matplotlib inline
```

#### Visualization styling code

```
sns.set(rc={'figure.figsize':(12, 7.5)})
sns.set_context('talk')
```

#### Turn Off Warnings

I turn warnings off in this post because of an issue in Scipy that will be fixed in a later version.

```
warnings.filterwarnings('ignore')
```

### Example: z-scores of U.S. Heights

I generate a pandas DataFrame to simulate heights of people in the U.S. from a normal distribution. Note: in order to calculate z-scores of values, you need a normal distribution.

```
np.random.seed(42)
population_size = 5000
df_heights = pd.DataFrame(data={'us_height_inches': np.random.normal(loc=66, scale=2.9, size=population_size)})
```

Preview `df_heights`

.

```
df_heights.head()
```

us_height_inches | |
---|---|

0 | 67.440471 |

1 | 65.599034 |

2 | 67.878297 |

3 | 70.416787 |

4 | 65.320955 |

Visualize the distribution of `df_heights`

values.

```
sns.distplot(df_heights['us_height_inches'], color="maroon")
plt.xlabel("height [inches]", labelpad=14)
plt.ylabel("probability of occurence", labelpad=14)
plt.title("Distribution of Heights of People in U.S.", y=1.015, fontsize=20);
```

Calculate the population mean height in the U.S. using the pandas series mean() method.

```
pop_mean_us_height_inches = df_heights['us_height_inches'].mean()
pop_mean_us_height_inches
```

```
66.01624559725403
```

Calculate the population standard deviation height in the U.S. using the pandas series std() method.

```
pop_std_dev_us_height_inches = df_heights['us_height_inches'].std()
pop_std_dev_us_height_inches
```

```
2.889791503580723
```

Given any person's height, I can calculate the number of standard deviations that height is from the mean by using the z-score equation:

- \(x\) is a score
- \(\sigma\) is the population standard deviation
- \(\mu\) is the population mean

Create a new column `z-score`

that's the z-score for each person's height.

```
df_heights['us_z-score'] = (df_heights['us_height_inches']-pop_mean_us_height_inches)/pop_std_dev_us_height_inches
```

Preview `df_heights`

.

```
df_heights.head()
```

us_height_inches | us_z-score | |
---|---|---|

0 | 67.440471 | 0.492847 |

1 | 65.599034 | -0.144374 |

2 | 67.878297 | 0.644355 |

3 | 70.416787 | 1.522788 |

4 | 65.320955 | -0.240602 |

Based on the z-scores computed, I can interpret the z-score values. For example, here's an interpretation for a few people:

```
for person in df_heights.itertuples():
index_person = 0
index_height_inches = 1
index_z_score = 2
index = person[index_person]
height = round(person[index_height_inches], 2)
z_score = round(person[index_z_score], 2)
if index <= 4:
print("This person at index {0} has an height in inches of {1} and is approximately {2} standard deviations from the U.S. population mean height in inches.\n".format(index, height, z_score))
```

```
This person at index 0 has an height in inches of 67.44 and is approximately 0.49 standard deviations from the U.S. population mean height in inches.
This person at index 1 has an height in inches of 65.6 and is approximately -0.14 standard deviations from the U.S. population mean height in inches.
This person at index 2 has an height in inches of 67.88 and is approximately 0.64 standard deviations from the U.S. population mean height in inches.
This person at index 3 has an height in inches of 70.42 and is approximately 1.52 standard deviations from the U.S. population mean height in inches.
This person at index 4 has an height in inches of 65.32 and is approximately -0.24 standard deviations from the U.S. population mean height in inches.
```

In a normal distribution, there is a standard range of values that *generally* fall within a certain number of standard deviations from the mean:

% of total height values | range in distribution | z-scores from mean |
---|---|---|

\(68\) | \(\mu\pm\sigma\) | \(\pm1\) |

\(95\) | \(\mu\pm1.96\sigma\) | \(\pm1.96\) |

\(99\) | \(\mu\pm2.58\sigma\) | \(\pm2.58\) |

There are \(5000\) height values in this dataset. Therefore, it's likely approximately \(68\%\) of values are within \(\pm1\) z-scores from the mean. Let's verify this with our dataset.

```
values_plus_minus_one_z_score = len(df_heights[df_heights['us_z-score'].between(-1, 1)==True])
percent_values_plus_minus_one_z_score = values_plus_minus_one_z_score/len(df_heights)*100
percent_values_plus_minus_one_z_score
```

```
68.56
```

\(68.56\%\) of values fall within \(\pm1\) z-scores from the mean which is very close to \(68\%\)

Above, I standardized a distribution by getting every value's z-score. I can now visualize the distribution of z-scores below that correspond to specific height values. The histogram below looks like a normal distribution. The official term for it is a **standard normal distribution**.

```
df_heights['us_z-score'].hist(color='slategray')
plt.title("Standard Normal Distribution", y=1.015, fontsize=22)
plt.xlabel("z-score", labelpad=14)
plt.ylabel("frequency", labelpad=14);
```

Every point in our dataset is now illustrated as the number of standard deviations away from the mean - represented by its z-score.

In the distribution above, the standard deviation is 1.

```
z_score_distribution_std_dev = round(df_heights['us_z-score'].std(), 2)
z_score_distribution_std_dev
```

```
1.0
```

### Comparison of Z-Scores from Two Populations

Above, we had data on heights of people in the U.S. My friend Leslie was born in the U.S. and is 63 inches. My other friend Jamie was born in the Phillipines and is 57 inches tall. I'm curious which of my friends is relatively *taller* for their respective country's height.

```
height_leslie_inches = 63
height_jamie_inches = 57
```

To get the *relative* height for each population distribution, I'll calculate the z-score for each friend's height based on the country's population distribution using the z-score equation and round each value to two decimal places. For this example, I created normal distributions of heights for values in the Phillipines and United States.

Using the z-score allows standardization among the two distributions of heights in the U.S. and Phillipines so we can easily compare where Leslie and Jamie reside on each distribution.

The code below generates a normal distribution of heights for people in the Philippines.

```
df_heights['philippines_height_inches'] = np.random.normal(loc=61, scale=3.2, size=population_size)
```

The visualization below illustrates the distribution of heights for people in the U.S., Phillippines as well as Leslie's and Jamie's height.

```
sns.distplot(df_heights['us_height_inches'], color="maroon", label='U.S. heights')
sns.distplot(df_heights['philippines_height_inches'], color='palegoldenrod', label='Philippines heights')
plt.axvline(x=height_leslie_inches, linestyle='--', linewidth=2.5, label="Leslie's height", c='indigo')
plt.axvline(x=height_jamie_inches, linestyle='--', linewidth=2.5, label="Jamie's height", c='slategrey')
plt.xlabel("height [inches]", labelpad=14)
plt.ylabel("probability of occurence", labelpad=14)
plt.title("Distribution of Heights in the U.S. and Philippines", y=1.015, fontsize=20)
plt.legend();
```

Calculate the population mean height in the Philippines using the pandas series mean() method.

```
pop_mean_philippines_height_inches = df_heights['philippines_height_inches'].mean()
pop_mean_philippines_height_inches
```

```
60.96840353016177
```

Calculate the population standard deviation height in the Philippines using the pandas series std() method.

```
pop_std_dev_philippines_height_inches = df_heights['philippines_height_inches'].std()
pop_std_dev_philippines_height_inches
```

```
3.2333986708154616
```

Below is a function to calculate a value's z-score.

```
def z_score(value, population_mean, population_std_dev):
"""
Function to calculate z-score as observation's value minus population mean divided by population standard deviation.
Round value to 2 decimal places.
"""
return round((value - population_mean)/population_std_dev, 2)
```

```
z_score_leslie_us = z_score(height_leslie_inches, pop_mean_us_height_inches, pop_std_dev_us_height_inches)
z_score_jamie_philippines = z_score(height_jamie_inches, pop_mean_philippines_height_inches, pop_std_dev_philippines_height_inches)
print("Leslie has a z-score of {0} for her height relative to U.S. heights and Jamie has a z-score of {1} relative to Philippines heights.".format(z_score_leslie_us, z_score_jamie_philippines))
```

```
Leslie had a z-score of -1.04 for height relative to U.S. heights and Jamies has a z-score of -1.23 relative to Philippines heights
```

Since Leslie's z-score is larger (closer to \(0\)), she's *relatively* taller compared to people in her country than Jamie is.

There's also a method in the scipy stats module called the `zscore()`

method that can calculate z-scores given just a list of values.

### Probability Density Function

Above, we visualized the distribution of heights and included bars to indicate the frequency or probability for certain height values. We can visualize the same distribution below with `probability of occurence`

on the y-axis and just include the density curve. The total area under this curve represents the probability of all outcomes happening which is equal to \(1\).

```
sns.distplot(df_heights['us_height_inches'], color="maroon", hist=False)
plt.xlabel("height [inches]", labelpad=14)
plt.ylabel("probability of occurence", labelpad=14)
plt.title("Distribution of Heights of People in U.S.", y=1.015, fontsize=20);
```

For any person's height, we can find the proportion of values greater than or less than that height by finding the appropriate area under the curve.

### Probability of Occurence Given Z-Scores

With a normal distribution, given an observation's value, I can determine the proportion of observations above or below that value. The steps to do so are to calculate the observation's z-score and then find the appropriate area under the curve on one side of the value.

For example, Leslie's height is 63 inches. For what proportion of people is Leslie taller than in the U.S.?

In the scipy stats module, there's a norm (meaning a normal distribution) class with a `cdf`

method that takes an argument of a z-score for an observation and returns the proportion of values less than that observation.

```
proportion_of_us_ppl_leslie_taller_than = round(stats.norm.cdf(z_score_leslie_us), 3)
proportion_of_us_ppl_leslie_taller_than
```

```
0.149
```

This proportion value of \(0.149\) is equivalent to \(14.9\%\). Leslie is taller than just 14.9% of people in the U.S. and she ranks at the 14.9th percentile.

I can visualize her height on the normal distribution and shade the area of heights for which she's taller.

```
kde = stats.gaussian_kde(df_heights['us_height_inches'])
pos = np.linspace(df_heights['us_height_inches'].min(), df_heights['us_height_inches'].max(), 50000)
plt.plot(pos, kde(pos), color='darkorange')
shade = np.linspace(55, height_leslie_inches, 300)
plt.fill_between(shade, kde(shade), alpha=0.45, color='darkorange',)
plt.text(x=62, y=.010, horizontalalignment='center', fontsize=15,
s="Leslie is taller than this\n{0} shaded proportion\nof people in the U.S.".format(proportion_of_us_ppl_leslie_taller_than),
bbox=dict(facecolor='whitesmoke', boxstyle="round, pad=0.25"))
plt.title("Distribution of U.S. Heights and Leslie's Ranking", fontsize=20, y=1.012)
plt.xlabel("height [inches]", labelpad=15)
plt.ylabel("probability of occurence", labelpad=15);
```