# Confidence Intervals

Date published: 2018-01-06

Category: Math

Subcategory: Inferential Statistics

Tags: confidence interval, z-score, standard error, statistics, standard deviation, normal distribution, python

A **confidence interval** is a range of values that, with a specified level of confidence, contains the true population parameter.
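As a quick sketch with made-up numbers (a sample mean of 163 pounds, a population standard deviation of 33.6, and a sample size of 25 are all hypothetical here), a 95% confidence interval is the point estimate plus or minus 1.96 standard errors:

```python
import math

# hypothetical inputs for illustration only
sample_mean = 163  # point estimate
sigma = 33.6       # population standard deviation
n = 25             # sample size

standard_error = sigma / math.sqrt(n)        # SE = sigma / sqrt(n)
lower = sample_mean - 1.96 * standard_error  # lower bound of 95% interval
upper = sample_mean + 1.96 * standard_error  # upper bound of 95% interval
print(round(lower, 2), round(upper, 2))      # 149.83 176.17
```

The rest of this post builds up each piece of that calculation: the standard error, the 1.96 critical z-score, and the bounds.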

#### Import Modules

```
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import numpy as np
import random
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
```

#### Visualization Styling Code

```
sns.set(rc={'figure.figsize':(13, 7.5)})
sns.set_context('talk')
```

#### Turn Off Warnings

I turn warnings off in this post because of an issue in Scipy that will be fixed in a later version.

```
warnings.filterwarnings('ignore')
```

### Generate Data of People's Mass in Pounds

Below, I generate two normal distributions of mass values, one for U.S. males and one for U.S. females, using the NumPy random module's `normal()` method. I concatenate the two arrays and assign them to the column `us_people_mass_pounds` in a DataFrame `df_ppl_mass`.

```
np.random.seed(42)
normal_distribution_us_male_mass_pounds = np.random.normal(loc=181, scale=24, size=6000)
normal_distribution_us_female_mass_pounds = np.random.normal(loc=132, scale=22, size=6500)
all_mass_values = np.concatenate((normal_distribution_us_male_mass_pounds, normal_distribution_us_female_mass_pounds), axis=0)
df_ppl_mass = pd.DataFrame(data={'us_people_mass_pounds': all_mass_values})
```

Preview `df_ppl_mass`.

```
df_ppl_mass.head()
```

| | us_people_mass_pounds |
|---|---|
| 0 | 192.921140 |
| 1 | 177.681657 |
| 2 | 196.544525 |
| 3 | 217.552717 |
| 4 | 175.380319 |

### View Distribution of U.S. People's Mass

Use the seaborn `distplot()` method to create a histogram of the values in the column `us_people_mass_pounds`.

```
sns.distplot(df_ppl_mass['us_people_mass_pounds'], color="darkslategrey")
plt.xlabel("mass [pounds]", labelpad=14)
plt.ylabel("probability of occurrence", labelpad=14)
plt.title("Distribution of Mass of People in U.S.", y=1.015, fontsize=20);
```

This distribution is tough to characterize. It has one large peak around 140 pounds, and it's not a normal distribution: it's a mixture of the two normal distributions generated above, so there's no symmetry around a single central value.

### Calculate Population Summary Statistics

#### Calculate Population Mean

```
pop_mean_mass = df_ppl_mass['us_people_mass_pounds'].mean()
pop_mean_mass
```

155.4232805942338

#### Calculate Population Standard Deviation

```
pop_std_dev_mass = df_ppl_mass['us_people_mass_pounds'].std()
pop_std_dev_mass
```

33.585190883958624
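A subtlety worth flagging here (my note, not part of the original analysis): pandas' `.std()` defaults to the *sample* standard deviation (`ddof=1`), while NumPy's `np.std()` defaults to the *population* standard deviation (`ddof=0`). With 12,500 values the difference is negligible, but the two defaults are not identical:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_std = values.std()          # ddof=1: divides by n - 1
population_std = np.std(values)    # ddof=0: divides by n
print(sample_std > population_std)           # True
print(np.std(values, ddof=1) == sample_std)  # True: ddof=1 matches pandas
```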

### Problem Setup

I later sampled 25 people who frequent the gym at least three times per week. In statistics, the gym attendance is the **treatment**, and any change in mass associated with it is the **treatment effect**. The **dependent variable** is still these people's mass in pounds.

The mean sample mass of people who frequent the gym at least three times per week is 163 pounds. I'm curious how this compares to our current population of people's mass. In order to compare this new sample of people from the gym, I need to compare it to an equivalent distribution of sample means from our population.

I need to calculate where this sample mean of 163 pounds would fall on the distribution of sample means from the masses of people from the U.S. population.

### Create List of Sample Means with $n=25$

It would be biased to simply select a single sample from our population and compare it to the gym goers, since any one sample's mean varies by chance.

Given our population mass values, I will take 300 samples, each of 25 random values drawn with replacement. For each sample, I calculate its mean, and I store all 300 sample means in the list `sample_means`.

```
sample_means = []
n = 25
for sample in range(300):
    # random sampling done with replacement
    sample_values = np.random.choice(a=df_ppl_mass['us_people_mass_pounds'], size=n, replace=True)
    sample_mean = np.mean(sample_values)
    sample_means.append(sample_mean)
```
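As an aside, the loop above can be replaced with a single vectorized draw: request a 300 × 25 matrix of values and take row means. This sketch uses a standalone NumPy array as a stand-in for the DataFrame column:

```python
import numpy as np

rng = np.random.default_rng(42)
# stand-in for df_ppl_mass['us_people_mass_pounds']
population = rng.normal(loc=155, scale=34, size=12500)

# draw all 300 samples of 25 values at once, with replacement
samples = rng.choice(population, size=(300, 25), replace=True)
sample_means = samples.mean(axis=1)  # one mean per row
print(sample_means.shape)            # (300,)
```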

### View Distribution of Sample Means (Also Called the Sampling Distribution)

Let's view the distribution of these `sample_means` values.

```
sns.distplot(sample_means)
plt.title("Distribution of Sample Means ($n=25$) of People's Mass in Pounds", y=1.015, fontsize=20)
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("frequency of occurrence", labelpad=14);
```

Wow! This distribution of sample means looks normal! The curve is fairly symmetrical around the central value, and the median is roughly equivalent to the mean (see below). This is the central limit theorem at work: sampling a sufficient number of times with a sufficient sample size yields an approximately normal distribution of sample means, even though the underlying population isn't normal.

### Calculate Sampling Distribution ($n=25$) Summary Statistics

Below are summary statistic calculations for the sampling distribution visualized above.

#### Calculate Median of Sample Means

```
median_of_sample_means = np.median(sample_means)
median_of_sample_means
```

155.12418330193915

#### Calculate Mean of Sample Means

```
mean_of_sample_means = np.mean(sample_means)
mean_of_sample_means
```

155.12418330193915

```
pop_mean_mass
```

155.4232805942338

This `mean_of_sample_means` value is roughly equivalent to our population mean value assigned to the variable `pop_mean_mass`. The central limit theorem guarantees this: the mean of the sampling distribution equals the population mean, so with enough samples the two values will always be close.

#### Calculate Standard Deviation of Sample Means

```
std_dev_of_sample_means = np.std(sample_means)
std_dev_of_sample_means
```

6.734407481483423

### Equation for Standard Deviation of Sampling Distribution

The standard deviation of sample means is more commonly called the **standard error (SE)**. A useful consequence of the central limit theorem is that I can calculate this value from the population standard deviation and the sample size. The equation for standard error is:

$$SE=\frac{\sigma}{\sqrt{n}}$$

- $\sigma$ is the population standard deviation
- $n$ is the sample size

```
standard_error = pop_std_dev_mass/np.sqrt(n)
standard_error
```

6.717038176791725

```
std_dev_of_sample_means
```

6.734407481483423

This standard error value is approximately equal to the `std_dev_of_sample_means` value calculated above.
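In practice the population standard deviation is usually unknown. For that common case, scipy's `stats.sem()` estimates the standard error directly from a sample, substituting the sample standard deviation (`ddof=1`) for $\sigma$. The sample values below are hypothetical:

```python
import numpy as np
import scipy.stats as stats

sample = np.array([150.0, 160.0, 171.0, 145.0, 166.0])  # hypothetical masses

se_scipy = stats.sem(sample)                               # sample std / sqrt(n)
se_manual = np.std(sample, ddof=1) / np.sqrt(len(sample))  # same formula by hand
print(np.isclose(se_scipy, se_manual))  # True
```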

### Compare New Sample Mean to our Sampling Distribution

Let's see how the sample of people from the gym compares to this sampling distribution. Remember, the sample mean of people from the gym was 163 pounds.

This value of 163 is a **point estimate**: a single value used to estimate a population parameter.

I'd naively estimate all samples from this gym would have a mean of 163 pounds. However, if I were to collect additional samples from the gym, I wouldn't expect the mean of the sample means to be *exactly* 163.

```
gym_sample_mean = 163
```

#### Visualize Gym Sample Mean Compared to Sampling Distribution

```
sns.distplot(sample_means)
plt.title("Distribution of Sample Means ($n=25$) of People's Mass in Pounds", y=1.015, fontsize=20)
plt.axvline(x=gym_sample_mean, linestyle='--', linewidth=2.5, label="sample mean of ppl at gym", c='orange')
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("frequency of occurrence", labelpad=14)
plt.legend();
```

#### Calculate Proportion of Sample Means Less Than Gym Sample Mean

What's the probability of seeing a sample mean with a value less than 163?

Given this new sample mean of people at the gym, I can calculate how many standard errors this value is from the mean of the sampling distribution. Let's calculate the z-score for `gym_sample_mean`.

I can use the following z-score equation:

$$z=\frac{\bar{x}-\mu}{SE}$$

- $\bar{x}$ is the sample mean
- $\mu$ is the population mean
- $SE$ is the standard error, calculated as $\frac{\sigma}{\sqrt{n}}$

```
z_score = (gym_sample_mean - mean_of_sample_means)/standard_error
z_score
```

1.172513314763173

The `cdf()` method from scipy's `stats` module returns, for a standard normal distribution, the proportion of values smaller than the input.

```
prop_values_less_than_gym_sample_mean = round(stats.norm.cdf(z_score), 3)
prop_values_less_than_gym_sample_mean
```

0.88

```
print("The probability of getting a sample mean less than the gym sample mean is {0}".format(prop_values_less_than_gym_sample_mean))
```

The probability of getting a sample mean less than the gym sample mean is 0.88
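An equivalent shortcut skips the manual z-score entirely: `stats.norm.cdf()` accepts `loc` and `scale` parameters, so the raw sample mean can be evaluated against the sampling distribution directly. The rounded values below stand in for the exact ones computed above:

```python
import scipy.stats as stats

mean_of_sample_means = 155.12  # rounded value from above
standard_error = 6.72          # rounded value from above
gym_sample_mean = 163

# evaluate the raw mean against a normal distribution with the sampling
# distribution's mean and standard error, instead of standardizing first
prop = stats.norm.cdf(gym_sample_mean, loc=mean_of_sample_means, scale=standard_error)
print(round(prop, 2))  # 0.88
```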

#### Visualize Proportion of Sample Means Less Than Gym Sample Mean

```
kde = stats.gaussian_kde(sample_means)
pos = np.linspace(np.min(sample_means), np.max(sample_means), 10000)
plt.plot(pos, kde(pos), color='teal')
shade = np.linspace(135, gym_sample_mean, 300)
plt.fill_between(shade, kde(shade), alpha=0.45, color='teal',)
plt.text(x=154, y=.01, horizontalalignment='center', fontsize=15,
s="Sample mean of people at gym\ngreater than {0}% of sample means\nfrom the general population".format(round(prop_values_less_than_gym_sample_mean*100, 2)),
bbox=dict(facecolor='whitesmoke', boxstyle="round, pad=0.1"))
plt.title("Distribution of Sample Means ($n=25$) of People's Mass in Pounds", y=1.015, fontsize=20)
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("frequency of occurrence", labelpad=14);
```

### Critical Z-Scores

For the sampling distribution above, approximately 95% of the values fall within 2 standard errors of the mean. Let's calculate the exact z-score values that bound 95% of the data. We know the tail on each end should be 2.5%, or 0.025, of the total area under the curve of the sampling distribution.

Below I use a z-score of 1.96 and verify that it leaves an upper tail of 0.025.

```
tail = 1 - stats.norm.cdf(1.96)
tail
```

0.024997895148220484

The area under the curve between the critical z-scores is approximately 0.95.

```
1 - (tail * 2)
```

0.950004209703559
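The 1.96 value itself doesn't have to be memorized: `stats.norm.ppf()`, the inverse of `cdf()`, returns the z-score below which a given proportion of the standard normal distribution falls.

```python
import scipy.stats as stats

# z-score with 97.5% of the distribution below it (2.5% in the upper tail)
z_critical = stats.norm.ppf(0.975)
print(round(z_critical, 2))  # 1.96
```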

Let's calculate the exact bounds of the interval, centered on the mean of the sampling distribution, in which 95% of sample means lie.

```
lower_95_perc_bound = mean_of_sample_means - 1.96*standard_error
lower_95_perc_bound
```

141.95878847542738

```
upper_95_perc_bound = mean_of_sample_means + 1.96*standard_error
upper_95_perc_bound
```

168.28957812845093
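scipy can also compute both bounds in one call: `stats.norm.interval()` returns the endpoints of the central interval for a given confidence level. The rounded values below stand in for the exact ones computed above:

```python
import scipy.stats as stats

mean_of_sample_means = 155.12  # rounded value from above
standard_error = 6.72          # rounded value from above

# central 95% interval, equivalent to mean +/- 1.96 * standard_error
lower, upper = stats.norm.interval(0.95, loc=mean_of_sample_means, scale=standard_error)
print(round(lower, 2), round(upper, 2))
```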

```
kde = stats.gaussian_kde(sample_means)
pos = np.linspace(np.min(sample_means), np.max(sample_means), 10000)
plt.plot(pos, kde(pos), color='teal')
shade = np.linspace(lower_95_perc_bound, upper_95_perc_bound, 300)
plt.fill_between(shade, kde(shade), alpha=0.45, color='teal',)
plt.text(x=154, y=.01, horizontalalignment='center', fontsize=17,
s="95% of sample means fall within\nthis shaded area of plus or minus\n1.96 z-scores from the mean",
bbox=dict(facecolor='whitesmoke', boxstyle="round, pad=0.1"))
plt.title("Distribution of Sample Means ($n=25$) of People's Mass in Pounds", y=1.015, fontsize=20)
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("frequency of occurrence", labelpad=14);
```