The Reproducibility Crisis in Research: Causes, Evidence, and Solutions (2026)

June 2, 2026

Abstract. The reproducibility crisis — the systematic failure of published scientific findings to hold up when independently repeated — represents one of the most consequential methodological challenges facing contemporary empirical science. This article synthesises the landmark empirical evidence documenting the scale of the problem, from the Open Science Collaboration’s 2015 mass-replication project in psychology to the Reproducibility Project: Cancer Biology; examines the structural causes that generate irreproducible findings, including undisclosed analytical flexibility, hypothesising after results are known (HARKing), publication bias, and chronically underpowered study designs; and evaluates the reform strategies — preregistration, open data, registered reports, and sample-size planning — that have demonstrated measurable improvements in replicability. The analysis draws on primary sources from Science, Nature, PLOS Medicine, eLife, and Nature Reviews Neuroscience.

Key Finding: The reproducibility crisis in research describes the empirical discovery that a large proportion of published scientific results fail to replicate when independently repeated. The Open Science Collaboration (2015) found that only 36% of 100 psychological studies produced statistically significant results on replication; Baker’s (2016) survey of 1,576 researchers found more than 70% had failed to reproduce another scientist’s experiments. The primary causes are p-hacking, HARKing, publication bias, and low statistical power. Validated solutions include preregistration, open data mandates, registered reports, and larger, better-powered samples.

1. Introduction: A Crisis in Plain Sight

Science derives its epistemic authority from a deceptively simple principle: results obtained through rigorous methods should hold up when independently repeated. Reproducibility is not merely one desirable property among many — it is the operational criterion by which a scientific claim graduates from private observation to public knowledge. When that criterion fails systematically, the entire evidentiary edifice built on those claims becomes suspect.

The term “reproducibility crisis” entered widespread circulation after 2011, though the underlying problems had been documented decades earlier. In its most concrete form, the crisis refers to the empirical finding that a substantial fraction of published results in fields ranging from psychology to oncology cannot be independently replicated under conditions that closely follow the original methodology. This is not primarily a story about outright fraud — misconduct accounts for a small minority of non-reproducible findings. Rather, it is a story about how a combination of legitimate-seeming analytical choices, structural incentives, and statistical misunderstandings can collectively generate a literature that is less reliable than it appears.

Understanding the reproducibility crisis in full requires engaging with three distinct levels of analysis: the empirical record (how large is the problem and where is it concentrated?), the causal mechanisms (what practices and incentives generate irreproducible findings?), and the reform landscape (which interventions have demonstrated measurable improvements?). This article addresses all three levels, drawing on primary peer-reviewed sources throughout.

It is also worth distinguishing three related but distinct concepts that are often used interchangeably:

Reproducibility (or computational reproducibility): the ability to obtain the same numerical results from the same data using the same analysis code.
Replicability: the ability to obtain consistent results when a study is repeated with new participants/data using the same or equivalent methods.
Generalisability (or external validity): the degree to which findings extend beyond the specific sample, setting, and operationalisation of the original study.

The crisis primarily concerns replicability, though failures of computational reproducibility are also widespread and compound the problem. The philosophical commitments embedded in different research paradigms — positivism, constructivism, critical realism — shape how replicability is conceptualised and why it matters differently across disciplines.

Researchers who want to understand how to prevent these problems in their own work should consult our step-by-step guide on how to preregister a study on OSF and AsPredicted, the guide to FAIR data principles for research data management, and the treatment of inter-rater reliability and Cohen’s kappa for studies involving human coders. For researchers new to critical engagement with the literature, our guide on how to read an academic paper using the three-pass method provides essential scaffolding for evaluating reproducibility claims in published work.

2. Landmark Empirical Evidence

2.1 Open Science Collaboration (2015): Estimating the Reproducibility of Psychological Science

The most influential single study documenting the replication crisis remains the Open Science Collaboration’s (OSC) 2015 project, published in Science under the title “Estimating the Reproducibility of Psychological Science” (doi: 10.1126/science.aac4716). The OSC assembled a distributed team of researchers who attempted to replicate 100 original studies published in three leading psychology journals: Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory, and Cognition. Replications used high-powered designs and, wherever possible, original materials obtained directly from the study authors.

The headline result was striking: only 36% of replications produced statistically significant results, compared with 97% of the originals. Several additional metrics reinforced this finding:

47% of original effect sizes fell within the 95% confidence interval of the replication effect size, meaning that in slightly more than half of cases (53%) the replication estimate was statistically incompatible with the original.
Replication effect sizes were, on average, roughly half the magnitude of the original effects — a pattern consistent with publication bias inflating initial estimates.
Subjective expert ratings found that only 39% of replications were judged to have successfully reproduced the original result.
Cognitive psychology studies replicated at a higher rate (50%) than social psychology studies (25%), suggesting that replicability varies meaningfully across sub-disciplines.
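
The confidence-interval criterion in the list above can be made concrete with a short sketch. It assumes effects are expressed as correlations and uses the Fisher z approximation for the interval, which is one common approach and not necessarily the exact procedure used for every OSC study. The function names and all input values are hypothetical.

```python
import math
from statistics import NormalDist


def replication_ci(r_rep, n_rep, level=0.95):
    """Approximate confidence interval for a correlation (Fisher z method)."""
    z = math.atanh(r_rep)                      # Fisher z transform of the estimate
    se = 1.0 / math.sqrt(n_rep - 3)            # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


def original_within_ci(r_orig, r_rep, n_rep):
    """True if the original effect lies inside the replication's interval."""
    low, high = replication_ci(r_rep, n_rep)
    return low <= r_orig <= high


# Hypothetical values for illustration only (not taken from the OSC data).
print(original_within_ci(r_orig=0.40, r_rep=0.15, n_rep=120))  # original outside the interval
print(original_within_ci(r_orig=0.25, r_rep=0.15, n_rep=120))  # original inside the interval
```

Under this criterion a replication can "fail" even when its own estimate is positive and significant, which is one reason the different OSC metrics give different success rates.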

The OSC findings provoked immediate controversy. A commentary by Daniel Gilbert and colleagues (2016) argued that the replications had systematically underestimated original effect sizes due to methodological differences. The OSC team responded, and an independent statistical re-analysis broadly supported the original conclusions. The debate itself illustrated a broader epistemic challenge: without pre-specified criteria for what counts as a successful replication, reasonable researchers can disagree about whether a given attempt has succeeded or failed.

2.2 Baker (2016): The Nature Survey of 1,576 Scientists

While the OSC project quantified replication failures from within a specific discipline, Monya Baker’s (2016) survey, published in Nature (vol. 533, pp. 452–454), captured the broader perception of the problem across scientific fields. Baker surveyed 1,576 researchers from disciplines spanning chemistry, biology, physics, medicine, and the social sciences.

The survey produced several findings that have since become canonical reference points:

More than 70% of researchers reported having tried and failed to reproduce another scientist’s experiments.
More than half had failed to reproduce their own previous experiments.
52% agreed there was a significant “crisis” of reproducibility, though fewer than 31% believed that failure to reproduce published results necessarily meant the original result was wrong.
Researchers overwhelmingly identified solutions: nearly 90% ticked “more robust experimental design,” “better statistics,” and “better mentorship” as important remedies.
Fewer than 20% reported ever being contacted by another researcher unable to reproduce their work — suggesting that failed replications rarely enter the formal record.

Crucially, the survey found that researchers simultaneously acknowledged a systemic problem while largely trusting the published literature they relied upon — a cognitive pattern that helps explain why the crisis persisted for so long before attracting sustained institutional attention.

2.3 Reproducibility Project: Cancer Biology

Concern about replicability in preclinical biomedical science intensified after anecdotal reports from pharmaceutical companies that high-profile cancer biology findings were failing to translate. The Reproducibility Project: Cancer Biology, coordinated by the Center for Open Science and Science Exchange, was designed to provide systematic evidence. Results were published in a series of papers in eLife, with the comprehensive summary appearing as Errington et al. (2021) “Investigating the replicability of preclinical cancer biology” (doi: 10.7554/eLife.71601).

The project originally planned 193 experiments drawn from 53 high-impact papers. In practice, only 50 experiments from 23 papers were completed — itself a finding, reflecting the practical difficulty of replication work including insufficient methodological detail in original publications, unavailability of key reagents, and resistance from some original authors.

Among completed replications, the results were sobering:

79% of positive effects replicated in the same direction.
However, when applying the more stringent criterion of statistical significance plus same direction, only 43% succeeded.
Using the criterion that the original effect size falls within the replication’s confidence interval, only 18% met the bar.
Most striking was the magnitude of effect-size shrinkage: replication effect sizes were, by median, approximately 85% smaller than original effects, and 92% of replication effect sizes were smaller than their originals.

These figures must be interpreted with caution — the sample of 23 papers is not representative of the field as a whole, and the feasibility constraints that shaped which experiments could be completed may have biased the selection. Nevertheless, the direction of findings is consistent across multiple criteria and aligns with the broader literature on effect-size inflation in the published record.

2.4 Ioannidis (2005): The Theoretical Groundwork

Before mass-replication projects provided direct empirical evidence, John Ioannidis offered a theoretical argument that set the intellectual stage. In “Why Most Published Research Findings Are False,” published in PLOS Medicine (2005; doi: 10.1371/journal.pmed.0020124), Ioannidis used Bayesian reasoning to show that, under realistic assumptions about statistical power, pre-study probability of true effects, and researcher degrees of freedom, the positive predictive value of a statistically significant finding is often below 50% — meaning the probability that a published significant result reflects a true effect is less than the probability that it does not.

The argument does not depend on any specific number; it follows from the logic of conditional probability. When base rates of true hypotheses are low (as in exploratory research), when power is modest, and when the multiple-comparison problem goes uncorrected, the mathematics dictate that most published positive findings will be false positives. The paper attracted both enormous influence and substantive criticism — notably from biostatisticians Jager and Leek, who estimated the false positive rate in biomedical abstracts at approximately 14% rather than the majority Ioannidis suggested. The true figure is probably discipline-dependent and impossible to determine precisely, but the Ioannidis framework remains analytically valuable for understanding why the problem exists at all.
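
The conditional-probability logic can be sketched in a few lines. The function below follows the positive predictive value expression given in Ioannidis (2005), written in terms of the pre-study odds R, power (1 − β), the significance threshold α, and a bias term u. The function name and the example inputs are illustrative assumptions, not values from the paper.

```python
def ppv(prior_odds, power, alpha=0.05, bias=0.0):
    """Positive predictive value of a 'significant' finding.

    prior_odds: ratio of true to false hypotheses tested in a field (R)
    power:      probability of detecting a true effect (1 - beta)
    alpha:      significance threshold
    bias:       share of would-be non-significant analyses reported as significant (u)
    """
    true_positives = prior_odds * (power + bias * (1 - power))
    false_positives = alpha + bias * (1 - alpha)
    return true_positives / (true_positives + false_positives)


# Hypothetical scenarios for illustration only.
print(round(ppv(prior_odds=1.0, power=0.80), 3))            # well-powered test, even prior odds
print(round(ppv(prior_odds=0.1, power=0.20), 3))            # exploratory field, low power
print(round(ppv(prior_odds=0.1, power=0.20, bias=0.2), 3))  # same field with some bias
```

With even prior odds and 80% power the PPV is high; with 1:10 prior odds and 20% power it falls below one half, and adding bias lowers it further. The qualitative conclusion depends on the inputs, which is why estimates of the real-world false-positive rate differ.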

3. Structural Causes of the Crisis

3.1 P-Hacking and Researcher Degrees of Freedom

The term “p-hacking” (the exploitation of analytical flexibility to drive a p-value below the conventional 0.05 threshold) is most closely associated with Simmons, Nelson, and Simonsohn, whose landmark 2011 paper “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant” appeared in Psychological Science (doi: 10.1177/0956797611417632). The authors showed by simulation that four individually defensible analytical choices (choosing between two dependent variables, adding participants after an initial test, controlling for a covariate, and dropping a condition) could jointly inflate the Type I error rate to over 60%.

Critically, Simmons and colleagues showed this not through obvious data fabrication but through a playful demonstration: by deploying these legitimate-seeming flexibilities, they “proved” (via conventional p < 0.05 results) that listening to the Beatles song “When I’m Sixty-Four” caused participants to become literally younger. The demonstration exposed that the problem lies not with individual bad actors but with a system that incentivises results without mandating transparency about the analytical garden of forking paths.

Researcher degrees of freedom encompass decisions made before, during, and after data collection:

Deciding when to stop collecting data (optional stopping)
Choosing which of several outcome measures to report
Deciding which participants to exclude (and on what grounds)
Selecting from multiple available statistical tests
Transforming variables post hoc to achieve normality
Adding or removing covariates until results become significant

None of these decisions is inherently illegitimate; problems arise when they are made after inspecting results and when that process is undisclosed to readers and reviewers.
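
A minimal simulation shows how just one of these choices, optional stopping, inflates the false-positive rate. This is a simplified sketch, not the Simmons et al. design: it assumes a one-sample, two-sided z-test with known unit variance, a true effect of exactly zero, and a single extra look at the data. All names and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def false_positive_rate(peek, n_first=20, n_extra=10, sims=20000, alpha=0.05, seed=1):
    """Share of null-effect studies declared significant."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(sims):
        # The sum of n draws from N(0, 1) is N(0, n); the true effect is zero.
        total = rng.gauss(0.0, math.sqrt(n_first))
        n = n_first
        significant = abs(total) / math.sqrt(n) > crit
        if peek and not significant:
            # Optional stopping: collect more data only because p was above alpha.
            total += rng.gauss(0.0, math.sqrt(n_extra))
            n += n_extra
            significant = abs(total) / math.sqrt(n) > crit
        hits += significant
    return hits / sims


fixed = false_positive_rate(peek=False)
peeking = false_positive_rate(peek=True)
print(f"fixed sample size: {fixed:.3f}")
print(f"one extra look:    {peeking:.3f}")
```

The fixed-sample rate should sit near the nominal 5%, while the rate with one undisclosed extra look is noticeably higher. Stacking several such flexibilities compounds the inflation.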

3.2 HARKing: Hypothesising After Results Are Known

HARKing — an acronym coined by Norbert Kerr (1998) in Personality and Social Psychology Review — describes the practice of presenting post hoc hypotheses as if they had been formulated prior to data collection. A researcher who finds an unexpected significant effect and rewrites the introduction to “predict” it has engaged in HARKing. The result is a paper that looks like a confirmatory test of a prior prediction but is in fact an exploratory pattern-finding exercise presented with false confirmatory framing.

HARKing interacts dangerously with p-hacking: once a researcher has found a significant result through analytical flexibility, HARKing provides the narrative scaffolding that makes the finding appear theoretically motivated. Readers and reviewers, seeing a hypothesis confirmed at p < 0.05, have no way of knowing that the hypothesis was written after the result was observed. The inflation of the literature with HARKed findings helps explain why many results feel compelling on their face but fail to replicate under prospective testing conditions.

3.3 Publication Bias and the File-Drawer Problem

Publication bias refers to the systematic tendency for journals to publish statistically significant positive results at higher rates than null or inconclusive results. The “file-drawer problem” — originally described by Robert Rosenthal (1979) — captures the complementary phenomenon: studies that produce null results are more likely to go unpublished and remain in researchers’ file drawers (or, today, on abandoned hard drives), where they cannot correct the published record.

The consequences for the literature are predictable and empirically documented. Meta-analyses that rely exclusively on the published record overestimate effect sizes because they are drawing from a non-representative sample of completed studies. When publication bias is severe, the mean effect size in a meta-analysis can substantially exceed the true population effect size. Several funnel-plot asymmetry tests — including Egger’s test and the trim-and-fill procedure — attempt to detect and correct for publication bias, but these methods have well-known limitations and cannot substitute for a less biased primary literature.
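
The overestimation mechanism can be illustrated with a small simulation. It is a sketch under stated assumptions: every study estimates the same true standardised effect, estimates are normally distributed with standard error sqrt(2/n), and only significant positive results are "published". The true effect, sample size, and function name are hypothetical.

```python
import math
import random
from statistics import mean


def simulate_literature(true_d=0.2, n_per_group=20, studies=20000, seed=7):
    """Return (mean of all estimates, mean of published estimates, share published)."""
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)  # approximate SE of a standardised mean difference
    all_estimates, published = [], []
    for _ in range(studies):
        d_hat = rng.gauss(true_d, se)
        all_estimates.append(d_hat)
        if d_hat / se > 1.96:          # selection filter: significant and positive
            published.append(d_hat)
    return mean(all_estimates), mean(published), len(published) / studies


all_mean, published_mean, share = simulate_literature()
print(f"mean of all studies:       {all_mean:.2f}")
print(f"mean of published studies: {published_mean:.2f}")
print(f"share published:           {share:.2f}")
```

The full set of studies recovers the true effect on average, but the published subset can only contain estimates large enough to cross the significance threshold, so a meta-analysis restricted to it is biased upward.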

It bears emphasising that publication bias is produced primarily by editorial and peer-review gatekeeping rather than by researchers unilaterally suppressing their work. Journals that explicitly reward novelty and significance create a selection environment in which null results are systematically disadvantaged at the submission stage, independent of their methodological quality.

3.4 Low Statistical Power

Statistical power — the probability of detecting a true effect of a given size with a given sample — is the foundational quantity that links experimental design to inferential reliability. Chronically underpowered studies compound every other problem in the reproducibility landscape: they produce inflated effect-size estimates when they do achieve significance (the “winner’s curse”), they generate variable results that are difficult to reconcile across independent laboratories, and they make replication particularly challenging because a direct replication using the same small sample size has a high probability of returning a null result even when the original effect is genuine.

Button et al. (2013), in “Power failure: Why small sample size undermines the reliability of neuroscience,” published in Nature Reviews Neuroscience (doi: 10.1038/nrn3475), estimated statistical power across the neurosciences by reanalysing published meta-analyses. Their analysis found that the median statistical power across surveyed studies was approximately 21%, dropping to around 8% in some sub-fields. Because power was computed against the effect size estimated by each meta-analysis, this means that the typical neuroscience study had roughly a one-in-five chance of detecting the effect it was looking for, even if that effect was real. The implications are severe: most studies are more likely to miss a genuine effect than to find it, even under the optimistic assumption that the hypothesis under test is correct.

Low power is not primarily a product of carelessness; it reflects the genuine difficulty and expense of data collection in many fields, the absence of formal power analysis requirements at funding and publication stages, and a widespread misunderstanding of what statistical power means in practice. Many researchers do carry out a formal sample-size calculation but feed it an optimistic effect-size estimate (often borrowed from a prior underpowered study whose estimate was itself inflated). The resulting sample is too small for the true effect, and the cycle reinforces itself.
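
To give a sense of scale, the sketch below approximates power for a two-group comparison using the normal distribution. It is a rough approximation (a t-test would give slightly lower power at small samples), and the effect sizes and sample sizes are hypothetical values chosen for illustration, not figures from Button et al.

```python
import math
from statistics import NormalDist


def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for standardised effect d."""
    norm = NormalDist()
    crit = norm.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)  # expected z-statistic when the effect is real
    # Probability that the z-statistic lands in either rejection region.
    return (1 - norm.cdf(crit - shift)) + norm.cdf(-crit - shift)


# Hypothetical scenarios for illustration only.
print(f"d = 0.50, n = 10 per group: {power_two_group(0.50, 10):.2f}")
print(f"d = 0.50, n = 64 per group: {power_two_group(0.50, 64):.2f}")
print(f"d = 0.25, n = 64 per group: {power_two_group(0.25, 64):.2f}")
```

The first line shows a design with power of roughly one in five. The second and third show the planning trap: a sample sized for about 80% power at an assumed effect of 0.5 delivers far less if the true effect is only half that size.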

3.5 Incentive Structures in Academic Publishing

All the specific practices described above — p-hacking, HARKing, selective reporting, small samples — are rational responses to a publication ecosystem that rewards novelty and significance and penalises null results and replication attempts. Researchers who invest months in a careful, pre-registered, high-powered replication study and obtain a null result face steep publication costs: few journals actively solicit direct replications, and a null result from a replication is often publishable only in specialty outlets with limited reach.

Career advancement in academic science is heavily tied to publication counts, citation rates, and the prestige of outlets — metrics that all reward the production of positive findings. Grant funding, hiring, promotion, and tenure decisions frequently privilege the same signals. In this environment, the rational career strategy is not to maximise the accuracy of the individual researcher’s contribution to the cumulative scientific record but to maximise the probability of producing publishable positive results. The crisis is, at its structural root, a principal-agent problem: society asks scientists to advance knowledge, but universities and journals have created incentive systems that reward the performance of discovery rather than its substance.

4. Variation Across Disciplines

The reproducibility crisis is not uniformly distributed. Evidence suggests that some fields and sub-fields are substantially more affected than others, though measurement is complicated by the fact that fields differ in their norms around replication, the nature of the phenomena they study, and the methodological tools available to them.

Discipline	Documented Replication Rate / Key Evidence	Primary Source
Social psychology	~25% of studies replicated significantly	Open Science Collaboration, 2015
Cognitive psychology	~50% of studies replicated significantly	Open Science Collaboration, 2015
Preclinical cancer biology	43% replicated (sig. + direction); 18% by effect-size CI criterion	Errington et al., 2021
Neuroscience	Median statistical power ~21%; severe replication challenges	Button et al., 2013
Economics	~61% of experimental economics studies replicated significantly	Camerer et al., 2016 (Science)
Chemistry / physics	Systematic replication projects less common; concerns focused on computational reproducibility	Various

The apparent relative robustness of fields like physics and chemistry should not be interpreted as evidence of immunity. These disciplines benefit from stronger traditions of sharing raw data and analytical code, tighter theoretical constraints that make p-hacking less rewarding, and experimental paradigms that are often more precisely specified. However, concerns about computational reproducibility — the ability to re-run published analyses on shared data and obtain the same numerical results — are substantial across all quantitative disciplines.

5. Evidence-Based Reform Strategies

5.1 Preregistration

Preregistration involves the formal public registration of a study’s hypotheses, design, data-collection procedures, and analysis plan prior to data collection. By creating a time-stamped, publicly accessible record of what was planned before data were observed, preregistration allows readers and reviewers to distinguish confirmatory tests (which carry strong inferential weight) from exploratory analyses (which require replication). The primary platforms include the Open Science Framework (OSF; osf.io) and AsPredicted (aspredicted.org). For a full step-by-step walkthrough, see our companion guide on how to preregister a study in 2026.

Nosek et al. (2018), in “The preregistration revolution,” published in PNAS (doi: 10.1073/pnas.1708274114), provided the theoretical and practical framework. Preregistration does not prevent exploratory analysis — it demarcates it, enabling readers to apply appropriate epistemic weight to each type of inference. Critics note that preregistration can be gamed (researchers may preregister vague hypotheses or deviate from the plan without disclosure), but these concerns address implementation quality rather than the principle itself.

Perhaps the strongest evidence for preregistration’s value comes from within-field comparisons: pre-registered studies in fields that have adopted the practice tend to show smaller, more conservative effect sizes than contemporaneous non-preregistered studies, consistent with the elimination of post hoc inflation.

5.2 Registered Reports

Registered Reports represent a structural innovation in the publication process, introduced at the journal Cortex in 2013 by Chris Chambers. In the Registered Reports format, authors submit their introduction and methodology for peer review before any data are collected. If the proposal is accepted — evaluated solely on the importance of the research question and the rigor of the methods — the journal grants “in-principle acceptance”: a commitment to publish the results regardless of their direction or statistical significance. Data collection and analysis then proceed, and the completed manuscript undergoes a second review focused on adherence to the registered protocol.

Registered Reports directly address publication bias at its source: editorial gatekeeping based on results. An accepted proposal will be published whether it produces p = 0.001 or p = 0.9. As of 2026, more than 300 journals across disciplines from neuroscience to clinical medicine have adopted the format, and several thousand studies are estimated to have been completed under it.

Comparison studies suggest that Registered Reports show higher rates of null results (approximately 40–50% of completed reports, compared with roughly 10% in the standard published literature) — which is precisely what would be expected if publication bias had been substantially reduced.

5.3 Open Data and Materials Sharing

Open data mandates, which require authors to make raw data, analysis code, and materials publicly available at the time of publication, both address computational reproducibility and provide the infrastructure necessary for independent replication. When data and code are available, a replication attempt can verify not only the statistical conclusions but also the computational chain that produced them.

Journals including Psychological Science, PLOS ONE, and many Nature-family journals have implemented tiered open data badges (following the framework introduced by Kidwell et al., 2016, in PLOS Biology), incentivising data sharing through visible crediting. Mandates from major funders — including the UK’s UKRI and the US National Institutes of Health — have further accelerated adoption, though compliance and completeness of shared data remain variable.

A key obstacle is the absence of discipline-wide standards for data formatting, metadata, and archival. The FAIR Principles — requiring data to be Findable, Accessible, Interoperable, and Reusable — provide a widely adopted framework that moves beyond mere deposition towards genuinely reusable research objects. Our detailed guide to FAIR data principles for researchers covers the fifteen sub-principles and repository selection in full.

5.4 Statistical Power Planning and Larger Samples

Mechanically, the most direct route to improved replicability is to conduct studies with adequate statistical power. The convention of 80% power — detecting a true effect of a specified size 80% of the time — is a minimum rather than a gold standard; many methodologists recommend 90% or 95% power for confirmatory studies. Adequate power requires realistic effect-size inputs, which in turn requires either theoretical derivation or well-powered prior evidence rather than point estimates borrowed from small exploratory studies.
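
How sensitive the required sample is to the effect-size input can be seen from the standard normal-approximation formula for a two-group comparison, n per group ≈ 2((z₁₋α/₂ + z_power) / d)². The sketch below is an approximation (dedicated power software based on the t-distribution returns slightly larger samples), and the function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample comparison."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Hypothetical planning scenarios for illustration only.
for d in (0.50, 0.25):
    for target in (0.80, 0.90):
        print(f"d = {d:.2f}, power = {target:.0%}: n per group = {n_per_group(d, target)}")
```

Because the effect size enters as a squared denominator, halving the assumed effect roughly quadruples the required sample, which is why an inflated input taken from a small exploratory study is so costly.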

In practice, power planning must be combined with pre-specified stopping rules to prevent optional stopping from negating the power calculation’s validity. Sequential methods — including Bayesian sequential testing and group sequential designs — provide statistically valid frameworks for updating analyses as data accumulate, addressing the legitimate need for flexibility in data collection without inflating Type I error rates.
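
A group sequential design can be sketched with the same kind of simulation. The example assumes two equally sized looks, a one-sample z-test with known unit variance, and a true effect of zero. It compares the naive threshold of 1.96 at each look with the Pocock critical value for two looks at an overall two-sided α of 0.05 (approximately 2.178). Function and constant names are hypothetical.

```python
import math
import random

NAIVE_CRIT = 1.96           # unadjusted threshold reused at every look
POCOCK_CRIT_2_LOOKS = 2.178  # Pocock boundary, two equal looks, overall alpha = 0.05


def type_i_error(crit, looks=2, n_per_look=50, sims=40000, seed=3):
    """Share of null-effect studies that cross the boundary at any look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        total, n = 0.0, 0
        for _look in range(looks):
            # Add one batch: the sum of n_per_look draws from N(0, 1).
            total += rng.gauss(0.0, math.sqrt(n_per_look))
            n += n_per_look
            if abs(total) / math.sqrt(n) > crit:
                hits += 1
                break  # stop early once the boundary is crossed
    return hits / sims


naive = type_i_error(NAIVE_CRIT)
pocock = type_i_error(POCOCK_CRIT_2_LOOKS)
print(f"naive 1.96 at both looks: {naive:.3f}")
print(f"Pocock boundary:          {pocock:.3f}")
```

The naive rule exceeds the nominal 5% error rate, while the pre-specified boundary keeps it close to 5% and still allows early stopping. Other boundaries (for example O'Brien-Fleming) trade off early and late stringency differently.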

5.5 Multi-Lab and Adversarial Collaboration

Multi-lab projects, in which many research groups independently test the same hypothesis using standardised materials and protocols, have emerged as one of the most powerful tools for establishing reliable effect estimates. The “Many Labs” series in psychology — in which dozens of laboratories simultaneously replicate a set of classic findings — has provided unprecedented large-sample evidence about which effects are robust and which are fragile, and has documented substantial variability in effect sizes across laboratories and cultural contexts that single-site studies cannot detect.

Adversarial collaboration — in which scientists who hold competing hypotheses jointly design a study to discriminate between them — is a complementary approach that builds the replication test into the primary investigation. While logistically demanding, adversarial collaborations have a strong track record of producing findings that both parties accept as definitive.

6. Philosophical Implications

The reproducibility crisis has prompted a re-examination of some of the deepest assumptions in philosophy of science. The crisis does not uniformly support any single philosophical position, but it has renewed interest in several contested questions.

For those working within broadly positivist and post-positivist frameworks — which hold that empirical research reveals mind-independent regularities — the crisis is primarily a methodological problem: the right findings are out there; current practices are insufficiently reliable for finding them. Reform efforts within this tradition focus on improving statistical practice, transparency, and replication infrastructure.

For those working within constructivist and interpretivist traditions, the crisis is sometimes read as evidence for a broader scepticism about the project of discovering context-independent laws of behaviour. Social phenomena are constitutively shaped by cultural and historical context, and the failure to replicate a social psychology finding across laboratories in different countries might reflect genuine contextual variation rather than methodological failure. On this view, pursuing universal laws through laboratory experiments is an epistemologically misguided enterprise, and the energy devoted to replication might better be directed toward thick description and case-specific understanding.

A mediating position — consistent with critical realism — holds that underlying causal mechanisms are real but context-dependent in their expression: the same mechanism may produce different observable outcomes depending on the configuration of contextual conditions. Replication failures, on this view, are informative data about the boundary conditions of a mechanism, not merely failures of measurement. This framing preserves the value of replication while accommodating genuine contextual variation.

These debates are not merely academic. How a researcher understands the nature of the problem shapes the interventions they regard as appropriate — from the methodological reforms described above to more fundamental restructuring of how science is communicated, credited, and funded.

7. Frequently Asked Questions

What exactly does the “reproducibility crisis” mean in research?

The reproducibility crisis in research refers to the empirical finding that a substantial proportion of published scientific studies fail to produce the same results when independently repeated under comparable conditions. The term entered widespread circulation after 2011, and its best-known evidence came when the Open Science Collaboration (2015) found that only 36% of 100 psychology studies replicated with statistical significance. The crisis is understood as primarily a systemic problem produced by p-hacking, publication bias, low statistical power, and perverse incentive structures, not primarily by fraud.

Is the reproducibility crisis limited to psychology?

No. While early prominent evidence came from psychology, the crisis has been documented across disciplines. The Reproducibility Project: Cancer Biology (Errington et al., 2021, eLife) found substantial effect-size shrinkage in preclinical cancer research. Baker’s (2016) Nature survey found more than 70% of researchers across many fields had failed to replicate others’ work. Computational reproducibility failures have been documented in economics, genetics, neuroimaging, and climate science. The extent of the problem varies by discipline and is shaped by differences in statistical power norms, data-sharing culture, and theoretical constraint.

What is p-hacking and why does it cause reproducibility failures?

P-hacking describes the use of undisclosed analytical flexibility (trying different statistical tests, adding or removing covariates, deciding when to stop data collection, or selecting which outcome measures to report) until a p-value below 0.05 is achieved. Simmons, Nelson, and Simonsohn (2011, Psychological Science) demonstrated that individually defensible choices could collectively inflate the false-positive rate to over 60%. Because p-hacked findings often reflect capitalisation on chance rather than genuine effects, they tend not to replicate when the analysis is pre-specified and run on an independent sample.

Does preregistration solve the reproducibility crisis?

Preregistration is a valuable tool that addresses the specific problem of undisclosed analytical flexibility by separating planned confirmatory tests from exploratory analysis. It does not address all causes of the crisis: a poorly powered preregistered study still produces unreliable estimates, and publication bias can still affect whether preregistered null results get submitted and accepted. Preregistration is most effective as part of a broader reform package that also includes open data, adequate statistical power, and structural changes to journal publishing practices such as Registered Reports.
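The point about statistical power can be illustrated with a short sample-size calculation. This is a sketch under stated assumptions, not a substitute for a dedicated power-analysis tool: it uses the normal approximation for a two-sided comparison of two equal-sized groups, takes the standardised effect size (Cohen's d) as given, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1


def approx_power(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Normal approximation; ignores the negligible chance of a significant
    result in the wrong direction.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    noncentrality = effect_size_d * math.sqrt(n_per_group / 2)
    return _STD_NORMAL.cdf(noncentrality - z_crit)


def required_n_per_group(effect_size_d, power=0.80, alpha=0.05):
    """Smallest n per group reaching the target power under the same
    normal approximation: n = 2 * (z_crit + z_power)^2 / d^2."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * (z_crit + z_power) ** 2 / effect_size_d ** 2)


if __name__ == "__main__":
    for d in (0.8, 0.5, 0.2):
        n = required_n_per_group(d)
        print(f"d = {d}: about {n} participants per group for 80% power")
    print(f"Power with d = 0.5 and n = 20 per group: {approx_power(0.5, 20):.2f}")
```

Under these assumptions, a study with 20 participants per group has well under a 50% chance of detecting a medium effect (d = 0.5), whether or not it was preregistered. Calculations based on the t distribution give slightly larger sample sizes than this approximation.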

What are Registered Reports and how do they differ from standard peer review?

Registered Reports are a publication format in which the introduction, methodology, and planned analyses are peer-reviewed before data collection begins. If the proposal passes review, the journal grants “in-principle acceptance”: a commitment to publish the completed paper regardless of the direction or significance of the results, provided the authors follow the approved protocol. This is fundamentally different from standard peer review, which evaluates completed manuscripts and is therefore prone to selecting for positive findings. Registered Reports were introduced at the journal Cortex in 2013 by Chris Chambers and have since been adopted by more than 300 journals across disciplines.

How should readers of the scientific literature respond to the reproducibility crisis?

Informed readers of the scientific literature should calibrate their confidence in individual findings according to several factors: whether the study was preregistered, whether data and analysis code are available, the statistical power and sample size, the number of independent replications, and whether the finding has been meta-analytically confirmed across multiple studies. Single statistically significant findings from small, non-preregistered studies, even those published in high-impact journals, should be treated as preliminary evidence requiring corroboration. Systematic reviews and meta-analyses that adjust for publication bias provide more reliable summary evidence than individual studies.
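One reason for this caution is that statistical significance acts as a filter: when small studies are published only if they reach p < 0.05, the published effect sizes overstate the true effect. The sketch below illustrates this under assumptions chosen purely for illustration: a true standardised effect of d = 0.2, 20 participants per group, known variance, and estimates drawn directly from their sampling distribution. The function name and parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_estimates(true_d=0.2, n_per_group=20, n_studies=20000,
                       alpha=0.05, seed=7):
    """Return (all estimates, estimates that reached significance).

    Each estimate is drawn from N(true_d, sqrt(2 / n_per_group)), the
    sampling distribution of a difference in means with unit variance.
    """
    rng = random.Random(seed)
    standard_error = math.sqrt(2 / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    all_estimates = [rng.gauss(true_d, standard_error)
                     for _ in range(n_studies)]
    significant = [d for d in all_estimates
                   if abs(d) / standard_error > z_crit]
    return all_estimates, significant


if __name__ == "__main__":
    all_estimates, significant = simulate_estimates()
    share = len(significant) / len(all_estimates)
    print(f"Mean estimate, all studies:       {mean(all_estimates):.2f}")
    print(f"Mean estimate, significant only:  {mean(significant):.2f}")
    print(f"Share of studies significant:     {share:.2f}")
```

In this setup the average across all simulated studies recovers the true effect, while the average across the significant subset is several times larger. A reader who sees only the significant subset has no way to tell from a single result how much of the reported effect is inflation.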

References

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452–454. https://doi.org/10.1038/533452a
Errington, T. M., Mathur, M., Soderberg, C. K., Denis, A., Perfito, N., Iorns, E., & Nosek, B. A. (2021). Investigating the replicability of preclinical cancer biology. eLife, 10, e71601. https://doi.org/10.7554/eLife.71601
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114
Chambers, C. D. (2013). Registered reports: A new publishing initiative at Cortex. Cortex, 49(3), 609–610. https://doi.org/10.1016/j.cortex.2012.12.016
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196–217. https://doi.org/10.1207/s15327957pspr0203_4
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641. https://doi.org/10.1037/0033-2909.86.3.638

