Correlation Does Not Imply Causation¶
Date published: 2019-03-05
Category: Math
Subcategory: Inferential Statistics
Tags: correlation, causation, scatter, python, pandas
The statement correlation does not imply causation is one of the most famous in the field of statistics. It's incredibly important to understand so we properly understand the relation between two variables of numeric data.
Correlation¶
Correlation is a measure of the relation of two numeric variables. For example, we'd expect a positive correlation between the temperature outside and ice cream sales at a shop. If it's hotter outside, we'd expect more people to buy ice cream. Ice cream sales likely positively correlate with increased temperature. There are exact numerical measures of correlation such as the Pearson correlation coefficient and the Spearman's rank correlation coefficient.
Causation¶
Causation indicates a relation between two variables in which one variable if affected by another. For example, there have been numerous studies that provide evidence that smoking causes lung cancer. A study, in statistical terms, is a detailed investigation and analysis of a situation. This post won't go into additional details of studies as they require lots of careful planning and implementation to perform successfully.
Correlation vs. Causation¶
Often times, people naively state a change in one variable causes a change in another variable. They may have evidence from real-world experiences that indicate a correlation between the two variables, but correlation does not imply causation! For example, more sleep will cause you to perform better at work. Or, more cardio will cause you to lose your belly fat. These statements could be factually correct. However, with these statements, we need evidence from a properly completed study to factually state there is a causaul relation between the two variables.
If someone states a potentially spurious casual statement like this, I'd encourage them to perform research on independent studies to gather official evidence. Studies are often done by research-driven institutions and universities. Here is a paper published by the Journal of Obesity that cites several studies that provide evidence that high-intensity intermittent exercise may be effective to cause people to lose abdominal body fat.
Tyler Vigen has an interesting page on his website that visualizes spurious correlations. Below is an example that shows a strong positive linear correlation with U.S. spending on science, space and technology with suicides by hanging, strangulation and suffocation.
However, do you think U.S. spending in this field causes hanging suicides? My hypothesis is that there's no evidence to support a causal relationship between these two variables.
While this example from Tyler's website seems extreme, it's poking fun at how people can immediately visualize a relationship between two numerical variables and naively jump to the conclusion that there's a causal relationship.
Lastly, I want to show a funny comic from the comic website XKCD about correlation and causation. "
The joke is that the guy on the right feels he doesn't have strong evidence (such as through a study) to prove his statistics class caused him to believe that fact is true.
Additional Misconceptions on Correlation vs. Causation¶
A mediator variable is a variable that explains the relationship between independent and dependent variables. For example, we may notice a positive correlation with increased ice cream shop sales with increased heat. However, a potential mediator variable could be the count of people sweating. It's possible an increase in the count of people sweating in the local area influences ice cream sales. If this were true, you may want to open an ice cream store near a sauna rather than simply in a hot weather area.
To make a causal relationship, we need to rule out lurking variables. These are variables that are not included in the independent or dependent variable but can affect the relationship between the two. The definition of the mediator variable above is considered a lurking variable too. This idea of a third variable is another name for a potential third variable that affects the causal relationship between the independent and dependent variables.
Another example is that a soccer coach (naively) noticed that players who practiced additionally after games caused them to love soccer more. However, we don't know if the players playing more came before their love of soccer. Perhaps those players loved the game of soccer before the season started and that could have caused them to want to practice more after games. In this situation, there's ambiguous temporal precedence - the unknown of which variable came first for inferring causality.
Another example is a supplements company claimed that people who drink their pre-workout shake directly before their workout complete approximately 2 more reps for each exercise and therefore have a better workout. The company claimed their pre-workout shake caused increased workout reps. This is considered a post hoc fallacy - an action taken before another action doesn't mean it directly caused the next action.