Why the Replication Crisis Reshaped Psychology

In 2015, a team of nearly three hundred researchers published the results of an unusually ambitious project. They had tried to repeat one hundred psychology experiments drawn from leading journals, following the original procedures as closely as the published methods allowed. Fewer than forty of the studies produced results that clearly matched the originals. Some famous findings — about willpower depletion, about how striking a powerful pose changes hormone levels, about how subtle cues prime stereotyped behavior — failed entirely. The field had a name for what it was looking at by the time the report appeared: the replication crisis.

The crisis was not really about a few bad studies. It was about the gap between how psychology presented itself and how it actually worked. A published paper looks like a clean story: a hypothesis, a study, a significant result, a conclusion. The reality behind many papers was messier. Researchers ran several versions of an analysis and reported the one that crossed the threshold for statistical significance. They tested several outcome measures and wrote up the one that worked. They stopped collecting data when the result looked good and kept collecting when it did not. None of these moves felt like cheating from the inside; each could be defended as a judgment call. Together, they are now called p-hacking, and they reliably manufacture findings that will not survive an honest second look.
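
The inflation this produces is easy to demonstrate. The short Python simulation below is a minimal sketch, not a reconstruction of any real study: it assumes a researcher who measures five independent outcomes with thirty participants per group and reports whichever comparison comes out significant. With no true effect anywhere in the data, the nominal 5 percent false-positive rate roughly quadruples.

# A toy simulation of one p-hacking pattern: testing several outcome
# measures and reporting whichever crosses p < .05. Both groups are
# drawn from the same distribution, so every "finding" is a false positive.
# The specific numbers (5 measures, 30 per group) are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_measures, n_per_group = 10_000, 5, 30

false_positives = 0
for _ in range(n_studies):
    # Five independent outcome measures, no true group difference.
    group_a = rng.normal(size=(n_measures, n_per_group))
    group_b = rng.normal(size=(n_measures, n_per_group))
    # Test every measure and keep the best p-value, as a p-hacker would.
    p_values = [stats.ttest_ind(a, b).pvalue for a, b in zip(group_a, group_b)]
    if min(p_values) < 0.05:
        false_positives += 1

print("Nominal false-positive rate: 5%")
print(f"Observed rate with 5 measures tried: {false_positives / n_studies:.1%}")

Each individual test in the simulation is honest; it is the unreported search across tests that corrupts the result, which is why the practice felt defensible from the inside.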

A related practice, HARKing — hypothesizing after the results are known — gave the manufactured findings a theoretical gloss. A researcher would notice an interesting pattern in the data and write the introduction as if that pattern had been the prediction all along. The resulting paper read as a confirmed hypothesis when it was really a description of one dataset. Combined with publication bias, the tendency of journals to publish positive results and quietly reject null ones, these habits filled the literature with claims that were too clean to be true.

What made the crisis specifically reshape psychology, rather than merely embarrass it, was that the response went structural. Reformers argued that individual virtue was not enough; the incentives had to change. The most consequential reform was preregistration: researchers now publicly record their hypotheses, sample size, and analysis plan before collecting data, so that the difference between predicted and exploratory findings is visible to readers. Registered reports go further, sending the study design out for peer review before any data exist, with acceptance contingent on the design rather than the outcome. Open data and open materials have become standard expectations at many journals, allowing other researchers to check analyses directly.

The crisis also changed what counts as evidence. A single significant study, however elegant, is now treated as a starting point rather than a conclusion. Meta-analyses are read more skeptically, since they inherit the biases of the literature they summarize. Effect sizes — how large a phenomenon actually is — get more attention than bare statistical significance, which tells you only that an effect is unlikely to be exactly zero. Sample sizes have grown, sometimes dramatically, because the small studies of earlier decades were underpowered to detect the modest effects that human behavior usually produces.
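
The arithmetic behind that shift can be made concrete with a standard power calculation. The sketch below uses Python's statsmodels library; the assumed effect size of d = 0.3, a plausible stand-in for the modest effects the passage describes, is an illustrative choice rather than a number from the replication project.

# A quick power calculation for a two-group comparison, assuming a
# "modest" true effect of d = 0.3; the numbers are illustrative only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Chance that a classic small study (20 participants per group)
# detects a true effect of d = 0.3 at the .05 significance level.
power_small = analysis.power(effect_size=0.3, nobs1=20, alpha=0.05)

# Participants per group needed to reach the conventional 80% power.
n_needed = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.8)

print(f"Power with 20 per group: {power_small:.0%}")   # roughly 15%
print(f"Per-group n for 80% power: {n_needed:.0f}")    # roughly 175

Under these assumptions, a twenty-person-per-group study detects the effect only about one time in seven; meeting the conventional 80 percent power standard requires nearly nine times as many participants, which is why modern samples have grown so sharply.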

Not every corner of psychology was hit equally. Some areas, particularly in cognitive psychology and psychophysics, replicated well; the crisis fell hardest on social and personality research, where effects are often subtle and contexts hard to standardize. And the reforms have their own costs. Preregistration constrains the exploratory work from which genuine discoveries sometimes emerge, the kind that come from following the data wherever they lead. Larger samples are expensive and slow. Some critics worry that a culture of suspicion makes the field timid, more interested in defending small claims than in pursuing big ones.

Still, the deeper change is hard to undo. Psychology has shifted from treating a published result as a fact to treating it as a claim — one whose credibility depends on the transparency of the process behind it. The crisis was painful because it revealed that decades of textbook findings rested on weaker ground than anyone had admitted. It reshaped the field because psychology, instead of patching individual studies, began to ask what kind of evidence a science of human behavior can honestly produce.

Vocabulary

replication crisis
The recognition, beginning in the early 2010s, that a substantial fraction of published findings in psychology and adjacent fields cannot be reproduced when studies are carefully repeated.
p-hacking
The practice of trying multiple analyses, measures, or stopping rules and reporting only the version that yields a statistically significant result, which inflates the rate of false positives in the literature.
HARKing
Hypothesizing After the Results are Known: framing a pattern noticed during data exploration as if it had been the study's original prediction.
publication bias
The tendency of journals and authors to publish studies with positive or significant results while leaving null findings unpublished, distorting the apparent state of evidence in a field.
preregistration
The public recording of a study's hypotheses, sample size, and analysis plan before data collection, so that confirmatory predictions can be distinguished from exploratory findings.
registered reports
A publication format in which a study's design is peer-reviewed and provisionally accepted before data are collected, with final acceptance based on the quality of the methods rather than the direction of the results.
effect sizes
Quantitative measures of the magnitude of a phenomenon, as distinct from whether the effect is statistically distinguishable from zero.

Check your understanding

Question 1 of 5 (recall)

According to the passage, roughly what fraction of the one hundred psychology experiments in the 2015 replication project produced results that clearly matched the originals?

Closing question

If transparency about methods is now central to credibility in psychology, what does that imply for fields outside science — journalism, history, policy analysis — that also build cumulative claims from individual investigations?
