
Hypothesis Testing: One and Two-Tailed Tests

Hypothesis testing is the process of validating whether a sample statistic (such as the mean) from a group that received a treatment is *significantly* different from a population parameter.

I'll explain why the word significantly is in italics later on.

Before reading this post, I'd recommend familiarity with the following terms: z-scores, normal distribution, standard deviation, standard error and the central limit theorem.

Import Modules

import seaborn as sns
import scipy.stats as stats
import numpy as np
import random
import warnings
import matplotlib.pyplot as plt
%matplotlib inline

Visualization styling code

sns.set(rc={'figure.figsize':(13, 7.5)})
sns.set_context('talk')

Turn Off Warnings

I turn warnings off in this post because of an issue in Scipy that will be fixed in a later version.

warnings.filterwarnings('ignore')

Hypothesis Tests Order of Steps

Below are the steps for performing a hypothesis test that compares a single sample group with treatment to a population and determines statistical significance. This article will mainly detail steps 2 - 4 and 6 - 8.

1) Take measurements from a large control group we'll call the population.

2) Declare a null and alternative hypothesis. - This information provides background for a one-tail or two-tail test.

3) Decide on an alpha level - the probability threshold below which obtaining a sample mean is considered statistically unlikely.

4) Apply a treatment to a new sample group of size \(n\) and record measurements. - We want to compare this treatment group to our population to see if this treatment had an effect.

5) Take a sufficient number of samples of size \(n\) from the population values and record a single statistic of each sample such as the mean or median. This is called the sampling distribution. - A distribution of these sample means (or medians) would be a normal distribution based on the Central Limit Theorem.

6) Calculate the standard deviation of the sampling distribution - called the standard error.

7) Compute how many standard errors the statistic (such as the mean) of the sample group with treatment is from the mean of the sampling distribution. This is called the z-score.

8) Determine if this sample group with treatment is significantly different based on the pre-decided alpha level.
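Before diving into the details, here is a minimal simulation sketch of steps 5 - 8. It is not part of the gym example later in this post; the population, sample size and treatment mean are invented purely for illustration, and it reuses the numpy import from above.

demo_population = np.random.normal(loc=100, scale=15, size=100000)  # stand-in population measurements
n_demo = 30  # sample size
demo_treatment_mean = 106  # hypothetical mean of a sample group that received a treatment

# step 5: sampling distribution - means of many samples of size n drawn from the population
demo_sample_means = [np.mean(np.random.choice(demo_population, size=n_demo)) for _ in range(5000)]

# step 6: the standard error is the standard deviation of the sampling distribution
demo_standard_error = np.std(demo_sample_means)

# step 7: z-score of the treatment group's mean relative to the sampling distribution
demo_z = (demo_treatment_mean - np.mean(demo_sample_means)) / demo_standard_error

# step 8: compare to the z-critical value for the chosen alpha level (1.645 for a one-tail test at 0.05)
demo_z, demo_z > 1.645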

One-Tail Hypothesis Tests

Visualization of One-Tail

This visualization shades just the right tail of the distribution - an area of 0.05, equivalent to 5%.

values = np.random.normal(loc=0, scale=10, size=6000)
one_tail_cutoff = np.mean(values) + np.std(values)*1.645
kde = stats.gaussian_kde(values)
pos = np.linspace(np.min(values), np.max(values), 10000)
plt.plot(pos, kde(pos), color='teal')
shade = np.linspace(one_tail_cutoff, 40, 300)
plt.fill_between(shade, kde(shade), alpha=0.45, color='teal')
plt.title("Distribution of Sample Means for One-Tail Hypothesis Test", y=1.015, fontsize=20)
plt.xlabel("sample mean value", labelpad=14)
plt.ylabel("frequency of occurence", labelpad=14);

For a one-tail test, the critical region (shaded above) can be on the left or right side of the distribution. The critical region defines statistically unlikely values.

Levels of Likelihood for One-Tail Test

The table below lists values for the probability of obtaining a sample mean from a sampling distribution, along with the equivalent percentage likelihood and the z-score cutoff for a one-tail test.

The probability values are called \(\alpha\) levels, and all represent unlikely occurrences. I call them unlikely because a sample mean this extreme likely didn't occur by random chance; there may have been an effect on that sample mean.

| probability of obtaining sample mean (\(\alpha\) level) | equivalent percentage likelihood | z-score cutoff for one tail |
| --- | --- | --- |
| 0.05 | 5% | 1.645 |
| 0.01 | 1% | 2.33 |
| 0.001 | 0.1% | 3.1 |

Under a normal distribution, a z-score of roughly 1.645 has an area under the curve to the left of the z-value of 0.95 (95% probability). Any z-score of 1.645 or greater means it's unlikely to draw such a sample mean from the sampling distribution. The region to the right of that z-value is called the critical region. In this instance, the z-score is called the z-critical value.

The cdf() method of stats.norm from the scipy package returns the proportion of values in a normal distribution that are smaller than the inputted observation. Let's see the area under the curve to the left of each z-score listed in the table above.

round(stats.norm.cdf(1.645), 3)
round(stats.norm.cdf(2.33), 3)
round(stats.norm.cdf(3.1), 3)

For any sample mean, we can determine whether it's statistically significant - essentially telling us how likely or unlikely this sample mean is within the sampling distribution. The table below only shows examples for a one-tailed hypothesis test.

If the probability of getting a particular sample mean is less than \(\alpha\), it is "unlikely" to occur.

Note below, \(p\) is equivalent to probability and \(\bar{x}\) is the sample mean.

The table below shows interpretations of popular significance levels. Square brackets such as \([\) signify inclusivity of the value next to it while parentheses such as \((\) signify exclusivity of the value next to it.

| z-score range | statistical interpretation | layman's interpretation |
| --- | --- | --- |
| [1.645, 2.33) | \(\bar{x}\) is significant at p<0.05 | probability of sampling \(\bar{x}\) from the sampling distribution is less than 0.05 |
| [2.33, 3.1) | \(\bar{x}\) is significant at p<0.01 | probability of sampling \(\bar{x}\) from the sampling distribution is less than 0.01 |
| [3.1, \(\infty\)) | \(\bar{x}\) is significant at p<0.001 | probability of sampling \(\bar{x}\) from the sampling distribution is less than 0.001 |

Additionally, the z-score ranges in the table above could be all positive or negative for a one-sided test depending on the initial hypotheses.

When constructing a hypothesis test, it is best to choose a significance level such as the ones above before you perform the test. You can later report the results as significant at a certain critical level after obtaining the result.

If you simply analyze the results for all statistical significance levels, you may be "fishing" for results that don't meet the original purpose of your hypothesis test.
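To make the interpretations above concrete, here is a small helper function that is not from the original post; it maps a one-tail z-score to the strongest significance level it reaches, using the cutoffs from the table above.

def one_tail_significance(z_score):
    # return the strongest alpha level from the table above that a one-tail z-score reaches
    if z_score >= 3.1:
        return "significant at p < 0.001"
    elif z_score >= 2.33:
        return "significant at p < 0.01"
    elif z_score >= 1.645:
        return "significant at p < 0.05"
    return "not significant at the 0.05 level"

one_tail_significance(2.5)

For example, a z-score of 2.5 falls in the range [2.33, 3.1), so the function reports significance at p<0.01.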

Two-Tailed Hypothesis Tests

This visualization shades two equal areas on each tail of the normal distribution of sample means.

Each tail of the distribution has a shaded area of 0.025.

values = np.random.normal(loc=0, scale=10, size=6000)
alpha_05_positive = np.mean(values) + np.std(values)*1.96
alpha_05_negative = np.mean(values) - np.std(values)*1.96
kde = stats.gaussian_kde(values)
pos = np.linspace(np.min(values), np.max(values), 10000)
plt.plot(pos, kde(pos), color='dodgerblue')
shade = np.linspace(alpha_05_positive, 40, 300)
plt.fill_between(shade, kde(shade), alpha=0.45, color='dodgerblue')
shade2 = np.linspace(alpha_05_negative, -40, 300)
plt.fill_between(shade2, kde(shade2), alpha=0.45, color='dodgerblue')
plt.title("Distribution of Sample Means for Two-Tail Hypothesis Test", y=1.015, fontsize=20)
plt.xlabel("sample mean value", labelpad=14)
plt.ylabel("frequency of occurence", labelpad=14);

The table below lists values for the probability of obtaining a sample mean from a sampling distribution, along with the probability on each tail and the z-score cutoff for each tail.

| probability of obtaining sample mean (\(\alpha\) level) | probability on each tail | z-score cutoff for each tail |
| --- | --- | --- |
| 0.05 | 0.025 | \(\pm1.96\) |
| 0.01 | 0.005 | \(\pm2.575\) |
| 0.001 | 0.0005 | \(\pm3.29\) |
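These cutoffs can be derived with the percent point function ppf() of stats.norm, the inverse of the CDF, which returns the z-value below which a given proportion of the distribution falls. For a two-tail test we look up \(1 - \alpha/2\), which gives the cutoffs above (approximately 1.96, 2.58 and 3.29). This derivation isn't in the original post, but ppf() is a standard scipy method.

round(stats.norm.ppf(1 - 0.05/2), 3)
round(stats.norm.ppf(1 - 0.01/2), 3)
round(stats.norm.ppf(1 - 0.001/2), 3)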

Let's see the area under the curve to the right of each z-score cutoff listed in the table above - these should match the probabilities on each tail.

round(1-stats.norm.cdf(1.96), 3)
round(1-stats.norm.cdf(2.575), 3)
round(1-stats.norm.cdf(3.29), 3)

The table below shows interpretations of popular significance levels. Square brackets such as \([\) signify inclusivity of the value next to it while parentheses such as \((\) signify exclusivity of the value next to it.

| z-score range | interpretation |
| --- | --- |
| [1.96, 2.575) | \(\bar{x}\) is significant at p<0.05 |
| [2.575, 3.29) | \(\bar{x}\) is significant at p<0.01 |
| [3.29, \(\infty\)) | \(\bar{x}\) is significant at p<0.001 |

Hypotheses

In the field of inferential statistics, we make hypotheses - proposed explanations, typically made on the basis of limited evidence, that are used as a starting point for further investigation.

The steps for a simple hypothesis test are:

1) Make a hypothesis

2) Choose an alpha level

3) Collect evidence

4) Calculate the probability of obtaining that sample mean with treatment

5) Interpret results as statistically significant or not

The null hypothesis, commonly denoted as \(H_{o}\), assumes no significant difference between the current population parameters and what will be the new population parameters after some intervention (otherwise called a treatment). This is expressed in math notation for comparisons of means as \(\mu = \mu_{i}\) (with the \(i\) representing an intervention). However, the two sides don't have to be exactly equal. Rather, they shouldn't be significantly different from one another.

The alternative hypothesis, commonly denoted as \(H_{a}\), predicts there will be a significant difference between the current population parameters and the new population parameters after some intervention.

The three possible scenarios for one and two-tailed tests with explanations for each are as follows:

| alternative hypothesis | interpretation | number of tails | tail(s) of distribution to reject the null |
| --- | --- | --- | --- |
| \(\mu < \mu_{i}\) | current population parameter will be less than the new population parameter after an intervention | one | right |
| \(\mu > \mu_{i}\) | current population parameter will be greater than the new population parameter after an intervention | one | left |
| \(\mu \neq \mu_{i}\) | no prediction on a direction for the treatment | two | left or right |

Side note: the two-tailed test is the most conservative because the probability on each tail is smaller than the single tail of a one-sided test at the same \(\alpha\) level.
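For example, at \(\alpha = 0.05\), a sample mean with a z-score of 1.8 falls in the critical region of a right-tailed test (cutoff of 1.645) but not in the critical region of a two-tailed test (cutoff of \(\pm1.96\)).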

Interpretation of Hypotheses

Here's what it means to reject the null hypothesis:

  • The sample mean falls within the critical region.
  • The z-score of the sample mean is greater than the z-critical value.
  • The probability of obtaining the sample mean is less than the alpha level.

In the scenario of the alternative hypothesis, the sample mean will lie in the critical region.

We can't prove the null hypothesis is true. We can only obtain evidence to reject the null hypothesis.

Example Hypotheses

Hypothesis Example #1: Tree Branches

\(H_{o}\): Most trees have more than 20 branches (most = more than 50%).

\(H_{a}\): Most trees have less than 20 branches.

The 50% represents the criterion for our alpha level. This is an example of a one-sided test since the alternative hypothesis specifies a direction (less than the current population parameter).

\(H_{o}\): \(\mu = \mu_{i}\)

\(H_{a}\): \(\mu > \mu_{i}\)

Let's say we sample 10 trees and find that all have more than 20 branches.

This sample is evidence that most (greater than 50%) trees have more than 20 branches. We will fail to reject the null hypothesis.

Hypothesis Example #2: Tree Branches

\(H_{o}\): Most trees have more than 20 branches (most = more than 50%).

\(H_{a}\): Most trees have less than 20 branches.

The 50% represents the criterion for our alpha level. This is an example of a one-sided test since the alternative hypothesis specifies a direction (less than the current population parameter).

\(H_{o}\): \(\mu = \mu_{i}\)

\(H_{a}\): \(\mu > \mu_{i}\)

Let's say we sample 10 trees and find 6 trees have less than 20 branches.

This sample is evidence that most (greater than 50%) trees have less than 20 branches. This is evidence to reject the null hypothesis.

Hypothesis Example #3: Personal Trainer at Gym Effect on Mass

Null hypothesis: having Joe as a personal trainer for three weightlifting workouts per week over the course of a year has no effect on one's mass.

Alternative hypothesis: having Joe as a personal trainer for three weightlifting workouts per week over the course of a year has an effect on one's mass.

\(H_{o}\): \(\mu = \mu_{joe}\)

\(H_{a}\): \(\mu \neq \mu_{joe}\)

To gather data for the sampling distribution, assume I visited the gym at 25 different times and each time sampled 20 people on their mass who weightlifted at the gym on average three times per week over the last year, but didn't have Joe as a personal trainer.

To gather data for the treatment group, assume I visited the gym one day and sampled 20 people on their mass who weightlifted at the gym on average three times per week over the past year with Joe's personal training guidance.

To reject the null hypothesis, the sample mean with treatment could be on either end of the sampling distribution. This is considered a two-tailed test.

Let's set an alpha level for this experiment of \(0.05\).

Below are our initial statistics of the population and treatment groups.

population_mean_pounds = 160  # mean mass of gym goers who didn't train with Joe
N = 500  # number of simulated sample means used for the visualization below
population_std_dev_pounds = 23  # standard deviation of the population's mass

n = 20  # size of the sample group that trained with Joe
treatment_sample_mean_pounds = 171  # mean mass of the sample group that trained with Joe

Let's compute the standard error, which is the standard deviation of the sampling distribution of means. The equation is below:

$$SE=\frac{\sigma }{\sqrt{n}}$$
  • \(\sigma\) is population standard deviation
  • \(n\) is sample size
standard_error_pounds = population_std_dev_pounds / np.sqrt(n)
standard_error_pounds

Let's calculate how many standard errors treatment_sample_mean_pounds is from population_mean_pounds and express this as a z-score.

I can use the following z-score equation:

$$z=\frac{\bar{x}-\mu}{SE}$$
  • \(\bar{x}\) is the sample mean
  • \(\mu\) is the population mean
  • \(SE\) is the standard error calculated as \(\frac{\sigma }{\sqrt{n}}\)
z_score = (treatment_sample_mean_pounds - population_mean_pounds)/standard_error_pounds
z_score
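With the numbers above, the standard error is \(23/\sqrt{20} \approx 5.14\) pounds, so the z-score works out to \((171 - 160)/5.14 \approx 2.14\).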

Let's visualize where treatment_sample_mean_pounds lies on the distribution of sample mean masses of groups collected at the gym that didn't have the treatment of Joe's personal training.

sample_means_without_joe_training = np.random.normal(loc=population_mean_pounds, scale=standard_error_pounds, size=N)  # simulated sampling distribution of mean masses (its spread is the standard error, not the population standard deviation)
sns.distplot(sample_means_without_joe_training, hist=False)
plt.axvline(x=treatment_sample_mean_pounds, linestyle='--', linewidth=2.5, label="sample mean with Joe personal trainer", c='purple')
plt.title("Distribution of Sample Means of Gym Goers' Mass [Pounds]", y=1.015, fontsize=20)
plt.xlabel("sample mean mass [pounds]", labelpad=14)
plt.ylabel("frequency of occurence", labelpad=14)
plt.legend();

Our initial alpha level was \(0.05\). The equivalent z-score cutoff for each tail is \(\pm1.96\). Above, we calculated a z-score value of \(2.14\) which is greater than \(1.96\).

treatment_sample_mean_pounds is significant at p<0.05 so we have obtained sufficient evidence to reject the null hypothesis. There's evidence of an effect of mass gain through 3x per week personal training sessions with Joe over the course of a year.
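As a quick cross-check that isn't in the original analysis, we can also convert that z-score into a two-tailed p-value by doubling the right-tail area under the normal curve; it comes out to roughly 0.03, below our alpha level of 0.05.

# two-tailed p-value: double the area to the right of the observed z-score
p_value_two_tailed = 2 * (1 - stats.norm.cdf(z_score))
p_value_two_tailed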

Hypothesis Testing Caveats

This type of statistical analysis is prone to misinterpretation. It's possible that those sampled and treated by Joe's personal training already had a history of major weight gain through lifting or had not yet hit a plateau in their potential to gain weight. So, they could have been ideal candidates for Joe to push them to gain muscle rapidly. If either of those scenarios is the case, it's possible that the new estimated population parameters with the treatment are slightly biased.

Given this misinterpretation or another issue, our hypothesis testing may result in an error.

Decision Errors

In the figure below, each quadrant has a meaning about correct or incorrect decisions in hypothesis testing.

Typically, in hypothesis testing, the "truth about the population" is unknown at first. For the example of Joe's personal training, the true population mean of people's mass would only be known if we could apply the treatment to a sufficiently large number of samples and calculate the mean of the sample means with this treatment.

I provide explanations of each quadrant (image source) below.

[Figure: hypothesis testing decision errors quadrant diagram]

The top left quadrant means the ground truth is that the null hypothesis is true, yet our research concluded to reject the null hypothesis (significant effect from the treatment). We made an incorrect statistical decision. This is considered a Type 1 error.

The top right quadrant means the ground truth is that the alternative hypothesis is true and our research concluded to reject the null hypothesis (significant effect from the treatment). This is an ideal situation.

The bottom left quadrant means the ground truth is that the null hypothesis is true and our research concluded that there's no evidence to reject the null hypothesis. This is an ideal situation.

The bottom right quadrant means the ground truth is that the alternative hypothesis is true, yet our research concluded that there's no evidence to reject the null hypothesis. We made an incorrect statistical decision. This is considered a Type 2 error.
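A quick simulation, not in the original post, illustrates why the Type 1 error rate matches the alpha level: if we repeatedly sample from a population where the null hypothesis is true (reusing the gym population numbers from above) and apply a two-tailed test at alpha = 0.05, we reject the null in roughly 5% of the experiments.

# simulate many experiments in which the null hypothesis is true (no treatment effect)
n_experiments = 10000
sample_size = 20
false_rejections = 0
for _ in range(n_experiments):
    sample = np.random.normal(loc=population_mean_pounds, scale=population_std_dev_pounds, size=sample_size)
    z = (np.mean(sample) - population_mean_pounds) / (population_std_dev_pounds / np.sqrt(sample_size))
    if abs(z) > 1.96:  # two-tailed test at alpha = 0.05
        false_rejections += 1
false_rejections / n_experiments  # comes out to roughly 0.05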

How to Reduce Decision Errors

We minimize our chances of making the wrong decision when we have a large enough sample size, randomize our sample and implement proper experimental controls.

Example: Statistical Decision Errors in Rain Scenario

\(H_{o}\): It's not going to rain later today (so I don't need an umbrella).

\(H_{a}\): It's going to rain later today (so I should bring an umbrella).

Four possible scenarios:

  • I think it's going to rain later so I bring an umbrella. However, later it doesn't rain. This is a Type 1 error.
  • I think it's going to rain later so I bring an umbrella. I'm correct - it rained later and I was right in bringing an umbrella. (This is equivalent to the top right quadrant.)
  • I think it's not going to rain later so I don't bring an umbrella. I'm correct - it doesn't rain later so I was correct in not bringing an umbrella. (This is equivalent to the bottom left quadrant.)
  • I think it's not going to rain later so I don't bring an umbrella. However, later it rains. This is a Type 2 error.

Continue Gym Personal Trainer Example with Hypothesis Test (of Known True Population Mean)

Earlier, I concluded treatment_sample_mean_pounds was significant at p<0.05 because I obtained sufficient evidence to reject the null hypothesis. There was evidence of an effect of mass gain through 3x per week personal training sessions with Joe over the course of a year.

Additionally, let's assume we found the true population mean of people's mass sampled at the gym for everyone after one year of 3x per week personal training sessions with Joe was 162 pounds. In real life, we wouldn't know this value of 162 because the treatment hasn't taken effect on the population in a study. However, in this example, let's pretend we do.

Let's assign the variable true_population_mean_pounds_with_joe_training to 162.

true_population_mean_pounds_with_joe_training = 162

In the equation for the z-score, we can utilize this new value to see how many standard errors treatment_sample_mean_pounds lies from true_population_mean_pounds_with_joe_training.

z_true = (treatment_sample_mean_pounds - true_population_mean_pounds_with_joe_training)/standard_error_pounds
z_true
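With the numbers above, this works out to \((171 - 162)/5.14 \approx 1.75\) standard errors.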

Our initial alpha level was \(0.05\). The equivalent z-score cutoff for each tail is \(\pm1.96\).

The z_true value is less than the z-critical value of 1.96; the ground truth about the population is that \(H_{o}\) is true. Having Joe as a personal trainer for three weightlifting workouts per week over the course of the year has no significant effect on one's mass. However, our earlier decision based on the sample with treatment was to reject the null hypothesis. In this instance, we committed a Type 1 error!

Effect of Parameters on Treatment Effect

For hypothesis testing, we typically use the following two equations to calculate the standard error (SE) and a z-score for a sample mean:

$$SE=\frac{\sigma }{\sqrt{n}}$$
  • \(\sigma\) is population standard deviation
  • \(n\) is sample size
$$z=\frac{\bar{x}-\mu}{SE}$$
  • \(\bar{x}\) is the sample mean
  • \(\mu\) is the population mean
  • \(SE\) is the standard error calculated as \(\frac{\sigma }{\sqrt{n}}\)

An effect that exists is more likely to be detected when the following variables used in the equations above change as described:

  • \(n\) is larger (results in a smaller SE and a z-score farther from 0)
  • \(\sigma\) is smaller (results in a smaller SE and a z-score farther from 0)
  • \(\bar{x}\) is farther from \(\mu\) (results in a z-score farther from 0)
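To see these relationships numerically, the short sketch below (not in the original post) recomputes the gym example's z-score for a few different sample sizes while holding the means and \(\sigma\) fixed; a larger \(n\) shrinks the standard error and pushes the z-score farther from 0, making an existing effect easier to detect.

# recompute the gym example's z-score for different sample sizes
for sample_size in [5, 20, 80]:
    se = population_std_dev_pounds / np.sqrt(sample_size)
    z = (treatment_sample_mean_pounds - population_mean_pounds) / se
    print(sample_size, round(se, 2), round(z, 2))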