From xkcd, a comic by Randall Munroe
In a recent article, Paul Borsch described how correlations get bandied about without the understanding that a correlation is not a cause. And he’s right…except that, as others have pointed out, “correlation is not causation” often gets used to discount correlations entirely. And correlations often do have causes.
So how do we think critically about correlation and causation, without being either enamored of correlations or dismissive of them?
In this article, the first of a three-part series, I’ll tackle eight things to know about correlation. In the next two articles, I’ll address key things to know about causes, and then present tips for determining whether a correlation has an underlying cause and, if so, what it is.
So what exactly is correlation? Correlation refers to the degree to which two measurements tend to vary together (mathematically, correlation has a more technical meaning).
An example of a correlation would be temperature and electricity usage during the summer: the hotter it gets, the more electricity gets used; the cooler it gets, the less electricity gets used. In this case, the temperature directly affects the electricity usage as people turn on their air conditioners. But not all correlations describe causes.
Take the correlation between ice cream sales and drowning deaths. As ice cream sales increase, so do drowning deaths. Does that mean selling ice cream causes people to drown? Probably not. More likely is that people swim more and eat more ice cream the hotter it gets, so both are driven by the outside temperature.
So what are some key things to know about correlation?
- Correlation does not equal causation
Just because a correlation exists between two factors doesn’t mean one factor causes the other factor, or in fact, that there is any relationship at all between the two factors. If you’re unsure why this is the case, read Paul Borsch’s article, or the many others about this adage.
- But correlation hints at causation
Correlation does not mean a cause exists, but it does indicate you should dig deeper because it makes a cause more likely. On the flip side, not having a correlation doesn’t prove there isn’t a causal relationship, but it does make it less likely.
- Correlation has different causes
If two metrics correlate with each other, it may be:
- the first caused the second
- the second caused the first
- a third factor caused both
- the two factors caused each other in a feedback loop
- a pure coincidence
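To make the "third factor" case concrete, here is a minimal Python sketch of the ice cream and drowning example. Everything in it is an assumption made up for illustration: the temperature range, the coefficients, the noise levels, and the helper name pearson_r. In this invented model, ice cream sales and drownings each depend only on temperature and never on each other.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson's r computed from its textbook definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical model: temperature drives both quantities independently.
temps = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream = [20 * t + rng.gauss(0, 50) for t in temps]
drownings = [0.3 * t + rng.gauss(0, 1) for t in temps]

r_raw = pearson_r(ice_cream, drownings)

# Subtract the temperature effect (known here only because we invented it).
ice_left = [s - 20 * t for s, t in zip(ice_cream, temps)]
drown_left = [d - 0.3 * t for d, t in zip(drownings, temps)]
r_adjusted = pearson_r(ice_left, drown_left)

print(f"raw correlation:            {r_raw:.2f}")
print(f"after removing temperature: {r_adjusted:.2f}")
```

The raw correlation comes out strong even though neither quantity influences the other, and it largely disappears once the shared driver is removed. With real data you would not know the temperature effect in advance and would have to estimate it.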
- Correlation has no single metric
It can be measured in different ways. Each metric, called a correlation coefficient, has its own strengths and weaknesses. Knowing the limitations of the metric you are using helps you determine if the correlation is relevant.
For instance, the most common metric, Pearson’s r, only measures linear relationships. Our temperature versus electricity usage example above is linear only if we look at the summer months. If we add in the winter months, when people use electric heating, the relationship flips: the warmer it is outside, the less electricity gets used. Across the whole year, Pearson’s r might show little or no correlation, even though we know a relationship exists. For this we need a metric that can capture non-linear relationships.
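A small Python sketch shows how this can happen. The temperatures and usage figures below are invented for illustration (usage is modeled as a U-shaped curve with its minimum at 10 degrees), and pearson_r is just a helper written out from the textbook formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from its textbook definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up data: usage rises as the temperature moves away from 10 degrees
# in either direction (heating below, air conditioning above).
temps = [-10, -5, 0, 5, 10, 15, 20, 25, 30]
usage = [100 + (t - 10) ** 2 for t in temps]

# Warm half only: the relationship is close to a rising line.
warm = [(t, u) for t, u in zip(temps, usage) if t >= 10]
r_warm = pearson_r([t for t, _ in warm], [u for _, u in warm])

# Whole year: the two halves of the U cancel each other out.
r_year = pearson_r(temps, usage)

print(f"warm months only: r = {r_warm:.2f}")
print(f"whole year:       r = {r_year:.2f}")
```

Usage is completely determined by temperature in this toy data, yet Pearson's r for the whole year is essentially zero because the relationship is not a straight line.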
- Correlation has a strength
Correlations range from weak to strong. Weak correlations, even if a cause can be determined, may not be worth taking action on.
To evaluate the strength of a correlation expressed using Pearson’s r, ignore the sign and look at its magnitude:
- 0.0 – 0.1: None
- 0.1 – 0.3: Small
- 0.3 – 0.5: Medium
- 0.5 – 1.0: Strong
These ranges are only approximate. Whether a correlation should be considered weak or strong depends on the measurement method and other factors.
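As a rough sketch, the rule of thumb above can be written as a small Python function. The function name describe_strength is hypothetical, and because the listed ranges share their endpoints, the choice to put a boundary value such as 0.3 in the higher category is my assumption.

```python
def describe_strength(r):
    """Label a Pearson's r value using the approximate ranges above."""
    magnitude = abs(r)  # ignore the sign, look only at the size
    if magnitude < 0.1:
        return "None"
    if magnitude < 0.3:
        return "Small"
    if magnitude < 0.5:
        return "Medium"
    return "Strong"

for value in (0.05, 0.2, -0.42, 0.8):
    print(value, describe_strength(value))
```

Note that -0.42 and 0.42 get the same label: the sign tells you the direction of the relationship, not its strength.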
- Correlation has a range
Correlation requires measuring at least two variables, each with a margin of error. In addition, you may be measuring only a subset of the entire population of people or items you could measure. Which subset you pick can change your results, which introduces sampling error.
These errors mean you don’t have an absolute correlation strength, but a range of values called your confidence interval. The actual correlation value likely lies within this range. You can use the confidence interval to get a sense of how accurate a correlation is.
Note, however, that the confidence interval doesn’t have to be symmetric. You can have a correlation strength of 0.5, with a range from 0.4 (i.e., 0.1 below) to 0.8 (i.e., 0.3 above).
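One common way to get such a range for Pearson's r is the Fisher z-transformation, sketched below in Python. This article does not say which method produced its example numbers, so treat the method, the function name fisher_interval, the sample size of 30, and the 1.96 multiplier (for a 95% level) as assumptions for illustration.

```python
import math

def fisher_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for Pearson's r from n data pairs.

    Transforms r to Fisher's z scale, where the sampling error is roughly
    normal with standard error 1 / sqrt(n - 3), then transforms back.
    """
    z = math.atanh(r)
    margin = z_crit / math.sqrt(n - 3)
    return math.tanh(z - margin), math.tanh(z + margin)

low, high = fisher_interval(0.5, 30)
print(f"r = 0.5, n = 30: from {low:.2f} to {high:.2f}")
print(f"distance below: {0.5 - low:.2f}, distance above: {high - 0.5:.2f}")
```

With these inputs the interval is lopsided: it reaches further toward zero than toward 1. Other methods, such as bootstrapping, can give differently shaped ranges.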
- Correlation ranges have a confidence level
Correlation ranges are not perfect. Due to the margins of error, the actual correlation value may lie outside the range given by the confidence interval. To describe how certain you can be that the value lies within the range, you can look at the confidence level.
Confidence levels indicate how often a range calculated this way contains the actual value. Correlations are typically reported with a confidence level of 95%, meaning that, on average, about 1 in 20 of the ranges you look at will not contain the actual correlation value.
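The "1 in 20" idea can be checked with a simulation. The Python sketch below invents a population whose true correlation is 0.5, draws many small samples from it, builds a 95% range for each, and counts how often the range contains 0.5. The Fisher z method, the function names, the sample size, and the number of trials are all assumptions chosen for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson's r computed from its textbook definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def fisher_interval(r, n, z_crit=1.96):
    """Approximate 95% range for r using the Fisher z-transformation."""
    z = math.atanh(r)
    margin = z_crit / math.sqrt(n - 3)
    return math.tanh(z - margin), math.tanh(z + margin)

def coverage(true_r, n, trials, seed=0):
    """Share of simulated samples whose range contains the true correlation."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(1 - true_r ** 2)
    hits = 0
    for _ in range(trials):
        # Draw n pairs from a population with correlation true_r.
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [true_r * x + noise_scale * rng.gauss(0, 1) for x in xs]
        low, high = fisher_interval(pearson_r(xs, ys), n)
        if low <= true_r <= high:
            hits += 1
    return hits / trials

share = coverage(true_r=0.5, n=30, trials=2000)
print(f"ranges containing the true value: {share:.1%}")
```

The share should land close to 95%, with the remaining ranges missing the true value purely because of which sample happened to be drawn.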
- Correlation metrics can be wrong
Correlation gets calculated using a mathematical formula that can be tricked. The result of this formula can indicate a correlation when examination of the data would indicate no correlation exists. A great visual example of this is Anscombe’s quartet.
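Anscombe's quartet is small enough to check by hand. The Python sketch below computes Pearson's r for each of its four datasets; the values are the published quartet, while the helper pearson_r and the variable names are mine.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from its textbook definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets 1 to 3
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "1 (roughly linear)": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                                  7.24, 4.26, 10.84, 4.82, 5.68]),
    "2 (smooth curve)": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                                6.13, 3.10, 9.13, 7.26, 4.74]),
    "3 (line plus one outlier)": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                                         6.08, 5.39, 8.15, 6.42, 5.73]),
    "4 (one point sets the slope)": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                                          5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in results.items():
    print(f"dataset {name}: r = {r:.3f}")
```

All four datasets give nearly the same coefficient, yet only the first looks like what that number suggests. In the fourth, every x value but one is identical, so a single point produces the entire correlation. Plotting the data before trusting the coefficient is the usual safeguard.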
Those are the basics of correlation. For an in-depth look at correlation, check out the online book Understanding Correlation by R.J. Rummel. Next time I’ll tackle key things you need to know about causes.
So, did this article give you a better understanding of correlation? Comment below.
Disclaimer: I am not a statistician. Rather I am interested in how statistical concepts can be translated into business language and be used to make better decisions. If I got anything wrong above, please leave a comment and I’ll try to correct it.