Math Inferential Statistics Article

Confidence Intervals

Confidence interval is a range of values in which there's a specified probability that the expected true population parameter lies within it.

Import Modules

import pandas as pd
import seaborn as sns
import scipy.stats as stats
import numpy as np
import random
import warnings
import matplotlib.pyplot as plt
% matplotlib inline

Visualization styling code

sns.set(rc={'figure.figsize':(13, 7.5)})
sns.set_context('talk')

Turn Off Warnings

I turn warnings off in this post because of an issue in Scipy that will be fixed in a later version.

warnings.filterwarnings('ignore')

Generate Data of People's Mass in Pounds

Below, I generate two normal distributions using the numpy random module's normal() method of mass values for men and females. I concatenate these two arrays and assign them to the column us_people_mass_pounds in a DataFrame df_ppl_mass.

np.random.seed(42)
normal_distribution_us_male_mass_pounds = np.random.normal(loc=181, scale=24, size=6000)
normal_distribution_us_female_mass_pounds = np.random.normal(loc=132, scale=22, size=6500)
all_mass_values = np.concatenate((normal_distribution_us_male_mass_pounds, normal_distribution_us_female_mass_pounds), axis=0)
df_ppl_mass = pd.DataFrame(data={'us_people_mass_pounds': all_mass_values})

Preview df_ppl_mass.

df_ppl_mass.head()
us_people_mass_pounds
0 192.921140
1 177.681657
2 196.544525
3 217.552717
4 175.380319

View Distribution of U.S. People's Mass

Use the seaborn distplot() method to create a histogram of the values in the column us_people_mass_pounds.

sns.distplot(df_ppl_mass['us_people_mass_pounds'], color="darkslategrey")
plt.xlabel("mass [pounds]", labelpad=14)
plt.ylabel("probability of occurence", labelpad=14)
plt.title("Distribution of Mass of People in U.S.", y=1.015, fontsize=20);

png

It's tough to characterize this distribution. It's got one large peak around 140 pounds, and it's not a normal distribution since there's no symmetry around a central value.

Calculate Population Summary Statistics

Calculation Population Mean

pop_mean_mass = df_ppl_mass['us_people_mass_pounds'].mean()
pop_mean_mass
155.4232805942338

Calculate Population Standard Deviation

pop_std_dev_mass = df_ppl_mass['us_people_mass_pounds'].std()
pop_std_dev_mass
33.585190883958624

Problem Setup

I later sampled 25 people who frequent the gym at least three time per week. In statistics, we call this the treatment effect. The dependent variable is still these people's mass in pounds.

The mean sample mass of people who frequent the gym at least three times per week is 163 pounds. I'm curious how this compares to our current population of people's mass. In order to compare this new sample of people from the gym, I need to compare it to an equivalent distribution of sample means from our population.

I need to calculate where this sample mean of 163 pounds would fall on the distribution of sample means from the masses of people from the U.S. population.

Create List of Sample Means with \(n=25\)

Hypothetically, if It's biased to simply select a single sample from our population

Given our population mass values, I will take 300 samples each of 25 random values with replacement. For each sample, I will calculate the mean of the sample. I store all those sample means in the list sample_means.

sample_means = []
n = 25
for sample in range(0, 300):
    # random sampling done with replacement
    sample_values = np.random.choice(a=df_ppl_mass['us_people_mass_pounds'], size=n)    
    sample_mean = np.mean(sample_values)
    sample_means.append(sample_mean)

View Distribution of Sample Means

Let's view the distribution of these sample_means values.

sns.distplot(sample_means)
plt.title("Distribution of Sample Means ($n=25$) of People's Mass in Pounds", y=1.015, fontsize=20)
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("frequency of occurence", labelpad=14);

png

Wow! This distribution of sample means looks normal! The curve is fairly symmetrical around the central value and the median is roughly equivalent to the mean (see below). Based on the central limit theorem, sampling a sufficient number of times with a sufficient size will result in a normal distribution of sample means.

Calculate Sampling Distribution (\(n=25\)) Summary Statistics

Calculate Median of Sample Means

median_of_sample_means = np.mean(sample_means)
median_of_sample_means
155.12418330193915

Calculate Mean of Sample Means

mean_of_sample_means = np.mean(sample_means)
mean_of_sample_means
155.12418330193915
pop_mean_mass
155.4232805942338

This mean_of_sample_means value is roughly equivalent to our population mean value assigned to the variable pop_mean_mass. Based on the central limit theorem, this will always be the case!

Calculate Standard Deviation of Sample Means

std_dev_of_sample_means = np.std(sample_means)
std_dev_of_sample_means
6.734407481483423

Equation for Standard Deviation of Sampling Distribution

The standard deviation of sample means is more commonly called the standard error (SE). An interesting tidbit from the central limit theorem is that I can calculate this value off the population standard deviation and sample size. The equation for standard error is:

$$SE=\frac{\sigma }{\sqrt{n}}$$
  • \(\sigma\) is population standard deviation
  • \(n\) is sample size
standard_error = pop_std_dev_mass/np.sqrt(n)
standard_error
6.717038176791725
std_dev_of_sample_means
6.734407481483423

This standard error value is the same as the value calculated above for std_dev_of_sample_means.

Compare New Sample Mean to our Sampling Distribution

Let's see how the sample of people from the gym compares to this sampling distribution. Remember, the sample mean of people from the gym was 163 pounds.

This value of 163 is a point estimate because it's a single point estimate.

I'd naively estimate all samples from this gym would have a mean of 163 pounds. However, if I were to collect additional samples from the gym, I wouldn't expect the mean of the sample means to be exactly 163.

gym_sample_mean = 163

Visualize Gym Sample Mean Compared to Sampling Distribution

sns.distplot(sample_means)
plt.title("Distribution of Sample Means ($n=25$) of People's Mass in Pounds", y=1.015, fontsize=20)
plt.axvline(x=gym_sample_mean, linestyle='--', linewidth=2.5, label="sample mean of ppl at gym", c='orange')
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("frequency of occurence", labelpad=14)
plt.legend();

png

Calculate Proportion of Sample Means Less Than Gym Sample Mean

What's the probability of seeing a sample mean with a value less than 163?

Given this new sample mean of people at the gym, I can calculate the number of standard errors this value is from the mean of the sampling distribution. Let's calculate the z-score for gym_sample_mean.

I can use the following z-score equation:

$$z=\frac{\bar{x}-\mu}{SE}$$
  • \(\bar{x}\) is the sample mean
  • \(\mu\) is the population mean
  • \(SE\) is the standard error calculated as \(\frac{\sigma }{\sqrt{n}}\)
z_score = (gym_sample_mean - mean_of_sample_means)/standard_error
z_score
1.172513314763173

The cdf() method from the scipy package and accompanying stats module returns the proportion of values smaller than the observation inputted for a normal distribution.

prop_values_less_than_gym_sample_mean =  round(stats.norm.cdf(z_score), 3)
prop_values_less_than_gym_sample_mean
0.88
print("The probability of getting a sample mean less than the gym sample mean is {0}".format(prop_values_less_than_gym_sample_mean))
The probability of getting a sample mean less than the gym sample mean is 0.88

Visualize Proportion of Sample Means Less Than Gym Sample Mean

kde = stats.gaussian_kde(sample_means)
pos = np.linspace(np.min(sample_means), np.max(sample_means), 10000)
plt.plot(pos, kde(pos), color='teal')
shade = np.linspace(135, gym_sample_mean, 300)
plt.fill_between(shade, kde(shade), alpha=0.45, color='teal',)
plt.text(x=154, y=.01, horizontalalignment='center', fontsize=15, 
         s="Sample mean of people at gym\ngreater than {0}% of sample means".format(round(prop_values_less_than_gym_sample_mean*100, 2)), 
         bbox=dict(facecolor='whitesmoke', boxstyle="round, pad=0.1"))
plt.title("Distribution of Sample Means ($n=25$) of People's Mass in Pounds", y=1.015, fontsize=20)
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("frequency of occurence", labelpad=14);

png

Critical Z-Scores

For the sampling distribution above, approximately 95% of the values will fall within 2 standard errors of the mean. Let's calculate the exact z-score values that bound 95% of the data. We know that the tails on each end should be 2.5% or 0.025 of values.

Below I use a z-score of 1.96 to calculate a tail of 0.025.

tail = 1 - stats.norm.cdf(1.96)
tail
0.024997895148220484

The area under the curve between the z-scores is approximately 0.95

1 - (tail * 2)
0.950004209703559

Let's calculate the exact bounds for the area in which 95% of sample means lie from the population mean.

lower_95_perc_bound = mean_of_sample_means - 1.96*standard_error
lower_95_perc_bound
141.95878847542738
upper_95_perc_bound = mean_of_sample_means + 1.96*standard_error
upper_95_perc_bound
168.28957812845093
kde = stats.gaussian_kde(sample_means)
pos = np.linspace(np.min(sample_means), np.max(sample_means), 10000)
plt.plot(pos, kde(pos), color='teal')
shade = np.linspace(lower_95_perc_bound, upper_95_perc_bound, 300)
plt.fill_between(shade, kde(shade), alpha=0.45, color='teal',)
plt.text(x=154, y=.01, horizontalalignment='center', fontsize=17, 
         s="95% of sample means fall within\nthis shaded area of plus or minus\n1.96 z-scores from the mean", 
         bbox=dict(facecolor='whitesmoke', boxstyle="round, pad=0.1"))
plt.title("Distribution of Sample Means ($n=25$) of People's Mass in Pounds", y=1.015, fontsize=20)
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("frequency of occurence", labelpad=14);

png

Other critical z-scores:

z-score % of area under a normal distribution curve
1.96 95
2.33 98
2.58 99

Confidence Interval for Gym Members' Mass in Pounds

If we sampled everyone who frequented the gym, we know there'd likely be a new population of values of people's mass, but we don't know where it'd be. There's a 95% chance we'd see the gym_sample_mean in this new distribution of masses for people that frequent the gym.

However, be aware that gym_sample_mean is a point estimate that's likely to have sampling error because the people in the sample may be underrepresented from other samples with those characteristics. It's possible most other samples of people who frequent the gym at least three times per week have a sample mean mass value far different than 163 pounds.

We can calculate where we'd expect 95% of the values to lie in this new distribution. 95% of new sample means of people from the gym would fall within 1.96 standard errors from the expected population mean.

\(\mu_{gym}\) is the new population mean of people's masses who frequent the gym. Our best guess is gym_sample_mean though we know it'll likely be in some interval around that.

We can express a generic formula for a 95% confidence interval for the expected population mass of gym goers as:

$$(\mu_{gym} - 1.96*\frac{\sigma}{\sqrt{n}}, \mu_{gym} + 1.96*\frac{\sigma}{\sqrt{n}}$$
  • \(1.96\) is the z-score for a 95% confidence interval (CI)
  • \(\frac{\sigma}{\sqrt{n}}\) is equivalent to the standard error (std. dev. of the sampling distribution)
  • \(\mu_{gym}\) is the population mean of the new people sampled that frequent the gym 3x per week

We'd expect 95% of sample means to fall within the interval above, so we can express that using our point estimate as:

$$(\mu_{gym} + 1.96*\frac{\sigma}{\sqrt{n}} < GymSampleMean < \mu_{gym} - 1.96*\frac{\sigma}{\sqrt{n}})$$

This equation above simplifies to:

$$GymSampleMean + 1.96*\frac{\sigma}{\sqrt{n}} > \mu_{gym} > GymSampleMean - 1.96*\frac{\sigma}{\sqrt{n}}$$

This term of \(z*SE\) above is commonly referred to as the margin of error.

Let's calculate these upper and lower bounds for our 95% confidence interval.

lower_bound_95_perc = gym_sample_mean - 1.96*standard_error
lower_bound_95_perc
149.83460517348823
upper_bound_95_perc = gym_sample_mean + 1.96*standard_error
upper_bound_95_perc
176.16539482651177

The generic formula for a confidence interal is:

$$\bar{x}-z*\frac{\sigma}{\sqrt{n}}, \bar{x}-z*\frac{\sigma}{\sqrt{n}}$$
  • \(\bar{x}\) is the sample mean
  • \(z*\frac{\sigma}{\sqrt{n}}\) is the magnitude of the numbero of standard errors away from a value

Visualize 95% Confidence Interval

sns.distplot(sample_means, hist=False)
plt.title("Distribution of Sample Means ($n=25$) of People's Mass in Pounds", y=1.015, fontsize=20)
plt.axvline(x=gym_sample_mean, linestyle='--', linewidth=2.5, label="sample mean gym ppl (point estimate)", c='orange')
plt.axvline(x=lower_bound_95_perc, linestyle='--', linewidth=2.5, label="lower bound 95% CI (gym_sample_mean - 1.96*SE)", c='g')
plt.axvline(x=upper_bound_95_perc, linestyle='--', linewidth=2.5, label="upper bound 95% CI (gym_sample_mean + 1.96*SE)", c='purple')
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("frequency of occurence", labelpad=14)
plt.legend(bbox_to_anchor=(1.01, 1), loc="upper left");

png

The population mean could range from the green to the purple horizontal dotted lines above. Below is another way of visualizing that.

sns.distplot(sample_means, hist=False)
<matplotlib.axes._subplots.AxesSubplot at 0x1a1f18d898>

png

values = np.random.normal(loc=lower_bound_95_perc, scale=standard_error, size=5000)
sns.distplot(values, hist=False)
values2 = np.random.normal(loc=upper_bound_95_perc, scale=standard_error, size=5000)
sns.distplot(values2, hist=False)
plt.axvline(x=gym_sample_mean, linestyle='--', linewidth=2.5, label="sample mean gym ppl (point estimate)", c='purple')
plt.axvline(x=lower_bound_95_perc, linestyle='--', linewidth=2, c='crimson', label='potential lower bound gym goers distr.')
plt.axvline(x=upper_bound_95_perc, linestyle='--', linewidth=2, c='gray', label='potential upper bound gym goers distr.')
plt.title("95% Confidence Interval for Bounds of Population of Masses of Frequent Gym Goers", y=1.015, fontsize=20)
bbox_props = dict(boxstyle="round", fc="w", ec="0.2", alpha=0.9)
plt.text(gym_sample_mean, 0.02, "Shaded area represents range\nof expected population\nmean ($\mu_{gym}$) - equivalent\nto interval estimate of 95% CI", ha="center", 
         va="center", size=16, bbox=bbox_props)
plt.axvspan(lower_bound_95_perc, upper_bound_95_perc, alpha=0.5, color='lightsteelblue')
plt.legend(bbox_to_anchor=(1.01, 1), loc="upper left");

png

The interval estimate for the shaded area above is:

$$(GymSampleMean - 1.96*\frac{\sigma}{\sqrt{n}}, GymSampleMean + 1.96*\frac{\sigma}{\sqrt{n}}$$

Effect of Sample Size on Confidence Interval

Our \(n\) for our frequent gym-goers sample was 25 people. If we took a sample with an even smaller \(n\), we'd get a larger range for our confidence interval bounds. The reason for this is because our standard error would be smaller.

Bigger samples give us more precise estimates by giving us a smaller confidence interval in which the true population mean likely lies in. This is a more ideal situation to help us estimate our population parameters.