Why Correlation Is Not Causation

In the summer of 1999, ice cream sales in New York City climbed sharply. So did drownings. If you plotted the two numbers month by month, they would track each other almost perfectly. A careless reader could conclude that ice cream causes drowning, or perhaps that drowning causes ice cream sales. Both claims are absurd, and yet the correlation is real. The numbers move together. The puzzle is what to make of that fact.

This is the canonical lesson behind the slogan "correlation is not causation." Two variables can rise and fall in lockstep without one producing the other. The phrase is repeated so often that it can start to sound like a verbal tic, a way of waving away inconvenient data. But the underlying point is precise, and learning to use it well requires more than skepticism. It requires knowing what else could be going on.

When two variables are correlated, at least four explanations are live. The first is genuine causation: A really does cause B. The second is reverse causation: B causes A, and we have the arrow pointing the wrong way. The third is a confounding variable: some third factor, C, causes both A and B, producing a correlation between them even though neither influences the other. In the ice cream case, the confounder is summer heat, which drives people to buy cold desserts and also to swim, which is where drownings happen. The fourth is coincidence: in any large dataset, some pairs of variables will move together by chance, especially if an analyst tries many combinations before reporting one.
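The confounding explanation can be made concrete with a small simulation. The sketch below is illustrative only: the temperatures, coefficients, and noise levels are invented assumptions, not real New York City data, and the function names `pearson` and `simulate_days` are hypothetical. Heat drives both simulated variables, and neither one influences the other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate_days(n_days, seed=0):
    """Generate made-up daily data in which heat is the only shared cause."""
    rng = random.Random(seed)
    heat, ice_cream, drownings = [], [], []
    for _ in range(n_days):
        temp = rng.uniform(0, 35)  # assumed daily temperature, degrees Celsius
        heat.append(temp)
        # Each outcome depends on temperature plus its own independent noise.
        # Ice cream never appears in the drowning formula, or vice versa.
        ice_cream.append(10 * temp + rng.gauss(0, 40))
        drownings.append(0.2 * temp + rng.gauss(0, 1))
    return heat, ice_cream, drownings

heat, ice_cream, drownings = simulate_days(5000)

# Across all days, the two outcomes move together.
overall = pearson(ice_cream, drownings)

# Hold the confounder roughly fixed by looking only at days in a narrow
# temperature band, then measure the correlation again.
band = [(i, d) for h, i, d in zip(heat, ice_cream, drownings) if 24 <= h <= 26]
within_band = pearson([i for i, _ in band], [d for _, d in band])

print(f"all days: {overall:.2f}")
print(f"days between 24 and 26 degrees: {within_band:.2f}")
```

Under these assumptions the correlation across all days should come out strongly positive, while the correlation among days of similar temperature should sit near zero. Once the confounder is held still, there is little left for the two variables to share.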

These four possibilities are not exotic. They are the default menu, and a careful reasoner runs through them every time a correlational claim arrives. Studies linking coffee drinking to longer life have to contend with the possibility that healthier people drink more coffee, not the other way around. Studies showing that students who attend selective colleges earn more than students who attend less selective ones have to contend with the possibility that the traits which got them admitted, not the colleges themselves, account for the income gap. Almost every observational finding in nutrition, education, and economics lives under this cloud.

The honest response is not to dismiss correlations. They are often the first signal that something interesting is happening, and many true causal relationships were spotted as correlations long before they were confirmed. The honest response is to ask what would distinguish causation from the alternatives. Sometimes a controlled experiment can do it: randomly assign people to a treatment, and any difference in outcomes larger than chance alone would produce can be attributed to the treatment, because randomization breaks the link between the treatment and any pre-existing trait. Sometimes a natural experiment helps: a policy change or geographic boundary that splits otherwise similar people into groups can mimic randomization. Sometimes the best one can do is measure the obvious confounders and adjust for them statistically, while acknowledging that unmeasured confounders may remain.
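To see why random assignment helps, consider a minimal simulation, offered as a sketch under invented assumptions rather than a model of any real study. A hypothetical pre-existing trait called `health` raises the outcome, the treatment X has no true effect at all, and the names `run_study` and `difference_in_means` are illustrative. The only thing that changes between the two runs is how people end up taking X.

```python
import random

def mean(values):
    return sum(values) / len(values)

def difference_in_means(treated_flags, outcomes):
    """Average outcome of those who took X minus the average of those who did not."""
    treated = [y for t, y in zip(treated_flags, outcomes) if t]
    untreated = [y for t, y in zip(treated_flags, outcomes) if not t]
    return mean(treated) - mean(untreated)

def run_study(n_people, randomized, seed=0):
    """Simulate a study in which X has zero true effect on the outcome."""
    rng = random.Random(seed)
    treated_flags, outcomes = [], []
    for _ in range(n_people):
        health = rng.gauss(0, 1)  # assumed pre-existing trait
        if randomized:
            # Controlled experiment: a coin flip decides who takes X.
            takes_x = rng.random() < 0.5
        else:
            # Observational setting: healthier people are more likely to choose X.
            takes_x = health + rng.gauss(0, 1) > 0
        # The outcome depends on health and noise only; X is not in the formula.
        outcome = 5 * health + rng.gauss(0, 5)
        treated_flags.append(takes_x)
        outcomes.append(outcome)
    return difference_in_means(treated_flags, outcomes)

observational_gap = run_study(20000, randomized=False)
randomized_gap = run_study(20000, randomized=True)

print(f"gap when people choose X themselves: {observational_gap:.2f}")
print(f"gap when X is assigned by coin flip: {randomized_gap:.2f}")
```

In the self-selected run, people who take X should look clearly better off even though X does nothing, because the healthier people chose it. In the coin-flip run the gap should shrink to roughly zero, since assignment no longer depends on health.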

Notice what this discipline asks of you. It does not ask you to disbelieve every correlation you see. It asks you to hold a correlational claim with the right grip — tight enough to take it seriously, loose enough to revise it when a better explanation appears. When a headline announces that people who do X live longer, the useful question is not "do I believe this?" but "what would have to be true for X itself to be doing the work, rather than the kind of person who does X, or some condition that produces both X and long life?"

That question is harder than the slogan. It is also what the slogan is for. "Correlation is not causation" is not a dismissal but an invitation: to name the alternatives, to ask which evidence could rule them out, and to keep the arrow of cause pointing only as confidently as the evidence allows.

Vocabulary

correlation: A statistical relationship in which two variables tend to move together — when one rises or falls, the other does as well — without any implication about why.
reverse causation: The error of identifying A as the cause of B when in fact B is the cause of A; the relationship is real but the direction of the arrow is wrong.
confounding variable: A third factor that independently influences both of the variables being compared, producing a correlation between them even when neither causes the other.
controlled experiment: A study in which researchers actively assign participants to conditions, typically at random, so that observed differences in outcomes can be attributed to the assigned condition rather than to pre-existing differences.
natural experiment: A situation occurring outside the lab — such as a policy change or geographic boundary — that divides otherwise similar people into groups in a way that mimics random assignment, allowing comparisons that approach causal inference.
observational finding: A result derived from watching variables as they naturally occur, without the researcher controlling who does what; such findings can show patterns but cannot by themselves establish causation.

Check your understanding

According to the passage, what is the confounding variable that explains the correlation between ice cream sales and drownings?

Closing question

Think of a recent claim you encountered that linked two things together, perhaps a habit and an outcome. Which of the alternatives to genuine causation (reverse causation, confounding, or coincidence) would be hardest to rule out, and what evidence might do it?
