An independent samples t-test compares the means of two independent samples to determine whether there is evidence that the corresponding population means differ significantly. The two samples contain different subjects, but the same dependent variable must be measured in each sample. For example, you might want to know whether the average coffee price at independent coffee shops in Manhattan (part of New York City) differs significantly from the average in San Francisco, California.
In order to fully grasp the concepts in this post, it'll help to have familiarity with the following concepts: z-tests, hypothesis testing, p-values, normal distribution, standard deviation, standard error and the central limit theorem.
Typically, you want a larger \(n\) from each sample to control for individual differences.
Independent t-tests use the same criteria as dependent t-tests for setting hypotheses and interpreting results. However, the equations for the standard error and t-statistic are different; they are covered below in the example.
Data Requirements for Independent Samples t-tests
The conditions below are largely taken from Kent State University's tutorial on t-tests.
Your data must meet the following requirements:
- Dependent variable that is continuous (i.e., interval or ratio level)
- Independent variable that is categorical (i.e., two or more groups)
- Independent samples/groups (i.e., independence of observations)
  - There is no relationship between the subjects in each sample. This means that subjects in the first group cannot also be in the second group, and subjects in either group cannot influence subjects in the other group
  - No group can influence the other group
  - Violation of this assumption will yield an inaccurate p-value
- Random sample of data from the population
- Normal distribution (approximately) of the dependent variable for each group
  - Non-normal population distributions, especially those that are thick-tailed or heavily skewed, considerably reduce the power of the test
  - Among moderate or large samples, a violation of normality may still yield accurate p-values
- No outliers
from scipy.stats import t
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats import weightstats as statsmodelsweightstats
%matplotlib inline
Set Visualization Style
Steps to Perform Independent Samples t-test
1) Set up the experiment to record measurements from two samples.
2) Set an alpha level for the test, a null hypothesis and alternative hypothesis.
3) Run the experiment and collect the data.
4) Determine if data meets requirements to perform an independent samples t-test.
5) Calculate the t-critical value.
6) Calculate the t-statistic.
7) Compare the t-statistic to the t-critical value. Interpret results of the experiment based on the original hypotheses.
1) Set Up Experiment to Follow Initial Data Requirements
Below is a fictional data scenario. I want to know if the price of coffee is significantly different between a sample of coffee shops in Manhattan versus San Francisco (SF). To retrieve the data, I will randomly select 5 coffee shops from each of several neighborhoods in each city. For each shop, I'll record the price of the cheapest small drip coffee option.
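The sampling plan above can be sketched in a few lines. This is only an illustration, not the procedure actually used to build the data set: the neighborhood and shop names are hypothetical placeholders, and `sample_shops` is a helper name introduced here.

```python
import random

# Hypothetical sampling frame: neighborhood -> independent coffee shops.
# These names are placeholders, not real data.
shops_by_neighborhood = {
    "neighborhood_a": [f"shop_a{i}" for i in range(1, 13)],
    "neighborhood_b": [f"shop_b{i}" for i in range(1, 10)],
    "neighborhood_c": [f"shop_c{i}" for i in range(1, 8)],
}


def sample_shops(frame, shops_per_neighborhood=5, seed=0):
    """Randomly pick the same number of shops from every neighborhood."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {
        neighborhood: rng.sample(shops, shops_per_neighborhood)
        for neighborhood, shops in frame.items()
    }


selected = sample_shops(shops_by_neighborhood)
for neighborhood, shops in selected.items():
    print(neighborhood, shops)
```

Sampling without replacement inside each neighborhood keeps any shop from being measured twice, which supports the independence requirement listed earlier.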
2) Set an Alpha Level and Original Hypotheses
The alpha level will be \(0.05\).
alpha = 0.05
Is there a significant difference in the average price of drip coffee at independent coffee shops in Manhattan versus San Francisco?
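The null hypothesis is that there is no difference between the true population mean prices of a cup of drip coffee in San Francisco and Manhattan: \(H_0: \mu_{Manhattan} = \mu_{SF}\). The alternative hypothesis is that the population means differ: \(H_1: \mu_{Manhattan} \neq \mu_{SF}\).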
This is a two-tailed test.
In this independent samples t-test, I'm trying to use sample means from each of the two cities in order to infer the true population parameters.
3) Collect the Data
Ultimately, I collected 30 coffee prices in Manhattan and 35 in San Francisco.
manhattan_coffee_prices = [1.5]*2 + [1.79]*4 + [1.85]*5 + [1.99]*7 + [2]*3 + [2.19]*2 + [2.29]*5 + [2.5]*2
sf_coffee_prices = [1.99]*2 + [2.29]*4 + [2.49]*9 + [2.79]*7 + [2.95]*6 + [2.99]*4 + [3.49]*3
4) Determine if Data Meets Requirements to Perform an Independent Samples t-test
The dependent variable, the price of coffee in U.S. dollars, is continuous. The independent variable, the city, is categorical with two groups.
In the data collection process, I ensured the coffee shops were all independent of one another. There is no relationship in ownership or name for any of the shops.
I'm assuming neither group influences the other in setting the price of coffee because the shops have distinct ownership and are in different cities.
I mentioned I randomly sampled coffee shops from various neighborhoods in each city.
Let's check the distribution of coffee prices in each city.
Distribution of Prices in Each City
The distribution of coffee prices for the sample in Manhattan below looks approximately normal.
plt.figure(figsize=(10, 7))
# histplot with kde=True replaces the deprecated distplot and plots counts on the y-axis
sns.histplot(manhattan_coffee_prices, kde=True, color='crimson')
plt.title("Distribution of Coffee Prices in Manhattan", y=1.015, fontsize=22)
plt.xlabel("price of cup of coffee [$]", labelpad=14)
plt.ylabel("count of occurrences", labelpad=14);
The distribution of coffee prices for the sample in San Francisco below looks approximately normal.
plt.figure(figsize=(10, 7))
# histplot with kde=True replaces the deprecated distplot and plots counts on the y-axis
sns.histplot(sf_coffee_prices, kde=True, color='darkcyan')
plt.title("Distribution of Coffee Prices in San Francisco", y=1.015, fontsize=22)
plt.xlabel("price of cup of coffee [$]", labelpad=14)
plt.ylabel("count of occurrences", labelpad=14);
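A histogram is a visual check, so a rough numeric complement can help. The sketch below computes the sample skewness of the San Francisco prices using only the standard library; values near 0 suggest a roughly symmetric distribution. The helper `sample_skewness` and the rule of thumb in the comment are assumptions added for illustration, not part of the original analysis.

```python
import statistics

sf_coffee_prices = [1.99]*2 + [2.29]*4 + [2.49]*9 + [2.79]*7 + [2.95]*6 + [2.99]*4 + [3.49]*3


def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation / (population std dev) ** 3."""
    mean = statistics.fmean(values)
    pop_std = statistics.pstdev(values)
    third_moment = sum((v - mean) ** 3 for v in values) / len(values)
    return third_moment / pop_std ** 3


skewness = sample_skewness(sf_coffee_prices)
# Rule of thumb (an assumption here): |skewness| < 0.5 is roughly symmetric.
print(round(skewness, 2))
```

A skewness check does not prove normality; it only flags the heavy skew that the requirements above warn about.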
Check for Outliers in Each City
The boxplot of Manhattan coffee prices below shows no outliers.
plt.figure(figsize=(9, 5))
# pass the prices as x explicitly so the box is drawn horizontally
sns.boxplot(x=manhattan_coffee_prices, color='crimson', saturation=0.9)
plt.title("Distribution of Coffee Prices in Manhattan", y=1.015)
plt.xlabel("coffee prices [$]", labelpad=14);
The boxplot of San Francisco coffee prices below shows no outliers either.
plt.figure(figsize=(9, 5))
# pass the prices as x explicitly so the box is drawn horizontally
sns.boxplot(x=sf_coffee_prices, color='darkcyan', saturation=0.9)
plt.title("Distribution of Coffee Prices in San Francisco", y=1.015)
plt.xlabel("coffee prices [$]", labelpad=14);
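The boxplot check can also be done numerically. The sketch below applies the common 1.5 × IQR whisker rule to the San Francisco prices using only the standard library. The helper name `iqr_outliers` is introduced here, and the quartile method (`inclusive`) is an assumption, so the fences may differ slightly from what a given plotting library draws.

```python
import statistics

sf_coffee_prices = [1.99]*2 + [2.29]*4 + [2.49]*9 + [2.79]*7 + [2.95]*6 + [2.99]*4 + [3.49]*3


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# No prices in the SF sample fall outside the fences.
print(iqr_outliers(sf_coffee_prices))
# A hypothetical extreme price appended to the sample is flagged.
print(iqr_outliers(sf_coffee_prices + [9.99]))
```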
Overall, I think this data fits the requirements to perform an independent samples t-test.
5) Calculate t-critical Value
Assign variables for the count of observations in each group.
manhattan_count_observations = len(manhattan_coffee_prices)
sf_count_observations = len(sf_coffee_prices)
Unlike other types of t-tests, the degrees of freedom for an independent samples t-test are calculated as:
\[ df = (n_1 - 1) + (n_2 - 1) \]
This can be simplified to:
\[ df = n_1 + n_2 - 2 \]
degrees_of_freedom = sf_count_observations + manhattan_count_observations - 2
degrees_of_freedom
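The calculation above gives the degrees of freedom for the pooled-variance form of the test. The SciPy call in step 6 sets equal_var=False, which is Welch's t-test, and that variant uses the Welch-Satterthwaite approximation instead. The sketch below compares the two. The helper name `welch_degrees_of_freedom` and the two small price lists are hypothetical, chosen only to show the calculation.

```python
import statistics


def welch_degrees_of_freedom(sample_1, sample_2):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    n1, n2 = len(sample_1), len(sample_2)
    # Squared standard error contributed by each sample: s^2 / n
    v1 = statistics.variance(sample_1) / n1
    v2 = statistics.variance(sample_2) / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


# Hypothetical prices, for illustration only
group_1 = [1.80, 1.95, 2.00, 2.10, 2.25]
group_2 = [2.20, 2.60, 2.90, 3.10, 3.40, 3.60]

pooled_df = len(group_1) + len(group_2) - 2
welch_df = welch_degrees_of_freedom(group_1, group_2)
print(pooled_df, round(welch_df, 1))
```

The Welch value never exceeds \(n_1 + n_2 - 2\), so a t-critical value looked up with the pooled degrees of freedom is slightly less conservative than one based on the Welch value.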
This is a two-tailed t-test so each tail probability on the t-distribution is \(0.025\). We use that value below.
alpha = 0.05
two_tailed_test_prob_tail = alpha/2
t_critical = round(stats.t.ppf(two_tailed_test_prob_tail, degrees_of_freedom), 3)
t_critical
Since this is a two-tailed test, the t-critical value is actually \(\pm1.998\).
6) Calculate the t-statistic
SciPy has a function for performing independent t-tests called ttest_ind(). We set the following arguments:
- a to one sample of values
- b to the second sample of values
- equal_var to False since we assume the samples have unequal population variances
- nan_policy to omit so that any NaN values in the samples are ignored (neither sample contains any here, so this setting does not change the result; unequal sample sizes are handled regardless of this argument)
The function computes the mean of a minus the mean of b. Our returned t-statistic is negative since the mean of b (the average price for a cup of coffee in SF) is larger than the mean of a (the average price for a cup of coffee in Manhattan).
stats.ttest_ind(a=manhattan_coffee_prices, b=sf_coffee_prices, equal_var=False, nan_policy='omit')
I can also use the t-statistic equation directly:
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} \]
First, I'll calculate the mean of each sample.
mean_manhattan_coffee_price = np.mean(manhattan_coffee_prices)
mean_sf_coffee_price = np.mean(sf_coffee_prices)
I calculate the standard deviation for each sample using the sample standard deviation formula.
# ddof=1 gives the sample (n - 1) standard deviation
std_dev_manhattan_coffee_prices = np.std(manhattan_coffee_prices, ddof=1)
std_dev_sf_coffee_prices = np.std(sf_coffee_prices, ddof=1)
I calculate the standard error. This standard error is the standard deviation of the estimated sampling distribution of the difference in sample means.
Below is the formula for the standard error of two samples in an independent t-test. It accounts for any difference in the sizes of the samples.
\[ SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]
The subscript \(1\) denotes variables for one group and the subscript \(2\) denotes variables for the second group.
- \(s\) is the sample standard deviation
- \(n\) is the number of observations in a sample
standard_error = np.sqrt(
    (std_dev_manhattan_coffee_prices**2 / manhattan_count_observations)
    + (std_dev_sf_coffee_prices**2 / sf_count_observations)
)
standard_error
Finally, use the intermediary calculations above to calculate the exact t-statistic value.
t_statistic = (mean_manhattan_coffee_price - mean_sf_coffee_price) / standard_error
t_statistic
This manual version of calculating the t-statistic returned the same result as the programmatic way.
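The manual steps can be wrapped into one reusable function. The sketch below uses only the standard library and the unequal-variance standard error described above. The function name `independent_t_statistic` and the small integer samples are hypothetical, used only to show the mechanics.

```python
import math
import statistics


def independent_t_statistic(sample_1, sample_2):
    """t = (mean_1 - mean_2) / sqrt(s_1^2 / n_1 + s_2^2 / n_2)."""
    mean_difference = statistics.fmean(sample_1) - statistics.fmean(sample_2)
    standard_error = math.sqrt(
        statistics.variance(sample_1) / len(sample_1)
        + statistics.variance(sample_2) / len(sample_2)
    )
    return mean_difference / standard_error


# Hypothetical values, for illustration only
t_value = independent_t_statistic([1, 2, 3, 4], [2, 3, 4, 5])
print(t_value)
```

As with the SciPy call, the sign depends on the order of the arguments: swapping the two samples flips the sign of the result.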
Interpretation of Results
The t-statistic of \(-9.23\) is far below the negative t-critical value of \(-1.998\). Therefore, there is sufficient evidence to reject the null hypothesis. At an alpha level of \(0.05\), there is a significant difference between the price of a cup of drip coffee at independent shops in San Francisco and in Manhattan. I would not expect to see such a large discrepancy in prices due to random chance alone.
The difference in mean coffee prices between Manhattan and SF is \(9.23\) standard errors away from zero, the value expected if the null hypothesis were true. Put another way, the observed difference is \(9.23\) times larger than the typical difference produced by random sampling variation.
The returned p-value is \(4.39e-13\), which is equivalent to a probability of \(0.000000000000439\) or \(0.0000000000439\%\). If both cities had the same average price for a cup of drip coffee in the long run (the null hypothesis), I would expect a difference at least as large as the one observed in just \(0.0000000000439\%\) of repeated samples of shops. That's a very low probability! With such a small probability, it would be nearly impossible to see a difference in coffee prices this large between the cities due to random chance alone.
I would infer that I'd expect to see similar results for the entire population.
Effect Size Measures
What proportion of the variability in the price of a cup of coffee can be attributed to the city, SF or Manhattan?
The \(r^2\) returned below is \(0.575\).
r_squared_coffee = round(t_statistic**2 / (t_statistic**2 + degrees_of_freedom), 3)
r_squared_coffee
About \(57.5\%\) of the variability in the price of a cup of drip coffee across these shops can be attributed to whether the shop is in SF or Manhattan.
Independent t-test Example: Price of Beverages in SF Greater Than in Manhattan by $0.43
Let's say some website claimed the average price of a single beverage item served at a retail location in San Francisco is $0.43 more than in Manhattan. I can use this information to form new hypotheses for a t-test. I'm curious whether a small cup of drip coffee at independent coffee shops in San Francisco costs more than $0.43 above the price in Manhattan.
The null hypothesis is \(H_0: \mu_{SF} - \mu_{Manhattan} \leq 0.43\) and the alternative hypothesis is \(H_1: \mu_{SF} - \mu_{Manhattan} > 0.43\). This is now a one-tailed, positive-direction independent samples t-test.
Calculate t-critical value
This is a one-tailed, positive-direction t-test with an alpha level of \(0.05\), so the area under the t-distribution curve to the left of the t-critical value should be \(0.95\).
alpha = 0.05
area_left_under_curve = 1 - alpha
t_critical = round(stats.t.ppf(area_left_under_curve, degrees_of_freedom), 3)
t_critical
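I can also use the t-statistic equation from earlier:
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} \]
However, I want to incorporate the $0.43 difference in the numerator, so the t-statistic calculation will be:
\[ t = \frac{(\bar{x}_{SF} - \bar{x}_{Manhattan}) - 0.43}{SE} \]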
For the t-statistic equation, the numerator should be closer to 0 with this additional value of the difference in means under the null hypothesis.
Previously, the numerator for the two-tailed test was:
original_numerator = mean_manhattan_coffee_price - mean_sf_coffee_price
original_numerator
The new numerator is now:
updated_numerator = mean_sf_coffee_price - mean_manhattan_coffee_price - 0.43
updated_numerator
Calculate the t-statistic using the formula above.
t_statistic_sf_greater = updated_numerator / standard_error
t_statistic_sf_greater
StatsModels also has a ttest_ind() function. We set the following arguments:
- x1 to the list of SF coffee prices
- x2 to the list of Manhattan coffee prices
- alternative to larger since the alternative hypothesis is to identify if the t-statistic is larger than a positive t-critical value
- usevar to unequal because the standard deviations of the samples are unequal and there are unequal sample sizes (this is Welch's variation of the t-test)
- value to 0.43 since that's the value we want subtracted from the difference in means
results = statsmodelsweightstats.ttest_ind(x1=sf_coffee_prices, x2=manhattan_coffee_prices, alternative='larger', usevar='unequal', value=0.43)
# ttest_ind returns a tuple of (t-statistic, p-value, degrees of freedom)
t_statistic_sf_greater2 = results[0]
p_value = results[1]
degrees_of_freedom = results[2]
Using the StatsModels library, we get the same t-statistic as with the formula above.
The t-critical value is \(1.669\) and the t-statistic is \(3.67\).
Since this is a one-tailed positive direction test and the t-statistic is greater than the t-critical value, there is sufficient evidence to reject the null hypothesis.
The price of a small cup of drip coffee at independent shops in San Francisco is significantly more than $0.43 higher than in Manhattan.
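Both examples end with the same kind of comparison between a t-statistic and a t-critical value. The sketch below captures that decision rule for two-tailed and one-tailed tests, using the rounded values reported above. The function `reject_null` is a hypothetical helper; its option names mirror the alternative argument of the StatsModels function for convenience.

```python
def reject_null(t_statistic, t_critical, alternative="two-sided"):
    """Return True when the t-statistic falls in the rejection region."""
    t_critical = abs(t_critical)  # work with the positive critical value
    if alternative == "two-sided":
        return abs(t_statistic) > t_critical
    if alternative == "larger":
        return t_statistic > t_critical
    if alternative == "smaller":
        return t_statistic < -t_critical
    raise ValueError(f"unknown alternative: {alternative!r}")


# Two-tailed test from the first example
print(reject_null(-9.23, 1.998))
# One-tailed, positive-direction test from the second example
print(reject_null(3.67, 1.669, alternative="larger"))
```

Both calls report that the null hypothesis is rejected, matching the interpretations above.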