What Is Replication in Psychology Research?
Examples of replication in psychology.
Replication refers to the repetition of a research study, generally with different situations and subjects, to determine if the basic findings of the original study can be applied to other participants and circumstances.
In other words, when researchers replicate a study, it means they reproduce the experiment to see if they can obtain the same outcomes.
Once a study has been conducted, researchers might be interested in determining if the results hold true in other settings or for other populations. In other cases, scientists may want to replicate the experiment to further demonstrate the results.
At a Glance
In psychology, replication is defined as reproducing a study to see if you get the same results. It's an important part of the research process that strengthens our understanding of human behavior. It's not always a perfect process, however, and extraneous variables and other factors can interfere with results.
For example, imagine that health psychologists perform an experiment showing that hypnosis can be effective in helping middle-aged smokers kick their nicotine habit. Other researchers might want to replicate the same study with younger smokers to see if they reach the same result.
Exact replication is not always possible. Ethical standards may prevent modern researchers from replicating studies that were conducted in the past, such as Stanley Milgram's infamous obedience experiments.
That doesn't mean that researchers don't perform replications; it just means they have to adapt their methods and procedures. For example, researchers have replicated Milgram's study using lower shock thresholds and improved informed consent and debriefing procedures.
Why Replication Is Important in Psychology
When studies are replicated and achieve the same or similar results as the original study, it gives greater validity to the findings. If a researcher can replicate a study’s results, it is more likely that those results can be generalized to the larger population.
Human behavior can be inconsistent and difficult to study. Even when researchers are cautious about their methods, extraneous variables can still create bias and affect results.
That's why replication is so essential in psychology. It strengthens findings, helps detect potential problems, and improves our understanding of human behavior.
How Do Scientists Replicate an Experiment?
When conducting a study or experiment, it is essential to have clearly defined operational definitions. In other words, what is the study attempting to measure?
When replicating earlier research, experimenters will follow the same procedures but with a different group of participants. If the researcher obtains the same or similar results in follow-up experiments, it means that the original results are less likely to be a fluke.
The steps involved in replicating a psychology experiment often include the following:
- Review the original experiment: The goal of replication is to use the exact methods and procedures the researchers used in the original experiment. Reviewing the original study to learn more about the hypothesis, participants, techniques, and methodology is important.
- Conduct a literature review: Review the existing literature on the subject, including any other replications or previous research. Considering these findings can provide insights into your own research.
- Perform the experiment: The next step is to conduct the experiment. During this step, keeping your conditions as close as possible to the original experiment is essential. This includes how you select participants, the equipment you use, and the procedures you follow as you collect your data.
- Analyze the data: As you analyze the data from your experiment, you can better understand how your results compare to the original results.
- Communicate the results: Finally, you will document your processes and communicate your findings. This is typically done by writing a paper for publication in a professional psychology journal. Be sure to carefully describe your procedures and methods, describe your findings, and discuss how your results compare to the original research.
What If Replication Fails?
So what happens if the original results cannot be reproduced? Does that mean that the experimenters conducted bad research or that, even worse, they lied or fabricated their data?
In many cases, non-replicated research is caused by differences in the participants or in other extraneous variables that might influence the results of an experiment. Sometimes the differences might not be immediately clear, but other researchers might be able to discern which variables could have impacted the results.
For example, minor differences in things like the way questions are presented, the weather, or even the time of day the study is conducted might have an unexpected impact on the results of an experiment. Researchers might strive to perfectly reproduce the original study, but variations are expected and often impossible to avoid.
Are the Results of Psychology Experiments Hard to Replicate?
In 2015, a group of 271 researchers published the results of their five-year effort to replicate 100 different experimental studies previously published in three top psychology journals. The replicators worked closely with the original researchers of each study in order to replicate the experiments as closely as possible.
The results were less than stellar. Of the 100 experiments in question, 61% could not be replicated with the original results. Of the original studies, 97% of the findings were deemed statistically significant. Only 36% of the replicated studies were able to obtain statistically significant results.
As one might expect, these dismal findings caused quite a stir. You may have heard this referred to as the "replication crisis" in psychology.
Similar replication attempts have produced similar results. Another study published in 2018 replicated 21 social and behavioral science studies. In these studies, the researchers were only able to successfully reproduce the original results about 62% of the time.
So why are psychology results so difficult to replicate? Writing for The Guardian, John Ioannidis suggested that there are a number of reasons why this might happen, including competition for research funds and the powerful pressure to obtain significant results. There is little incentive to retest, so many results obtained purely by chance are simply accepted without further research or scrutiny.
The American Psychological Association suggests that the problem stems partly from the research culture. Academic journals are more likely to publish novel, innovative studies rather than replication research, creating less of an incentive to conduct that type of research.
Reasons Why Research Cannot Be Replicated
The project authors suggest that there are three potential reasons why the original findings could not be replicated.
- The original results were a false positive.
- The replicated results were a false negative.
- Both studies were correct but differed due to unknown differences in experimental conditions or methodologies.
How Replication Can Be Strengthened
The Nobel Prize-winning psychologist Daniel Kahneman has suggested that because published studies are often too vague in describing methods used, replications should involve the authors of the original studies to more carefully mirror the methods and procedures used in the original research.
In fact, one investigation found that replication rates are much higher when original researchers are involved.
While some might be tempted to look at the results of such replication projects and assume that psychology is more art than science, many suggest that such findings actually help make psychology a stronger science. Human thought and behavior is a remarkably subtle and ever-changing subject to study.
In other words, it's normal and expected for variations to exist when observing diverse populations and participants.
Some research findings might be wrong, but digging deeper, pointing out the flaws, and designing better experiments helps strengthen the field. The APA notes that replication research represents a great opportunity for students. It can help strengthen research skills and contribute to science in a meaningful way.
Nosek BA, Errington TM. What is replication? PLoS Biol. 2020;18(3):e3000691. doi:10.1371/journal.pbio.3000691
Burger JM. Replicating Milgram: Would people still obey today? Am Psychol. 2009;64(1):1-11. doi:10.1037/a0010932
Makel MC, Plucker JA, Hegarty B. Replications in psychology research: How often do they really occur? Perspectives on Psychological Science. 2012;7(6):537-542. doi:10.1177/1745691612460688
Aarts AA, Anderson JE, Anderson CJ, et al. Estimating the reproducibility of psychological science. Science. 2015;349(6251). doi:10.1126/science.aac4716
Camerer CF, Dreber A, Holzmeister F, et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat Hum Behav. 2018;2(9):637-644. doi:10.1038/s41562-018-0399-z
American Psychological Association. Leaning into the replication crisis: Why you should consider conducting replication research.
Kahneman D. A new etiquette for replication. Social Psychology. 2014;45(4):310-311.
By Kendra Cherry, MSEd Kendra Cherry, MS, is a psychosocial rehabilitation specialist, psychology educator, and author of the "Everything Psychology Book."
Sarahanne M. Field, Rink Hoekstra, Laura Bringmann, Don van Ravenzwaaij; When and Why to Replicate: As Easy as 1, 2, 3? Collabra: Psychology 1 January 2019; 5 (1): 46. doi: https://doi.org/10.1525/collabra.218
The crisis of confidence in psychology has prompted vigorous and persistent debate in the scientific community concerning the veracity of the findings of psychological experiments. This discussion has led to changes in psychology’s approach to research, and several new initiatives have been developed, many with the aim of improving our findings. One key advancement is the marked increase in the number of replication studies conducted. We argue that while it is important to conduct replications as part of regular research protocol, it is neither efficient nor useful to replicate results at random. We recommend adopting a methodical approach toward the selection of replication targets to maximize the impact of the outcomes of those replications, and minimize waste of scarce resources. In the current study, we demonstrate how a Bayesian reanalysis of existing research findings followed by a simple qualitative assessment process can drive the selection of the best candidate article for replication.
In 2005, Ioannidis published a theoretical article (2005) in which he argued that more than half of published findings may be false. The landmark mass replication effort of the Open Science Collaboration (henceforth OSC; 2015) gave empirical support for Ioannidis’s claims a decade after they were made, but reported an even bleaker narrative. Only 36% of replication studies were successful in yielding a result comparable to that reported in the original article (more recent mass replication attempts have revealed similarly low reproducibility levels: Klein et al., 2018; Camerer et al., 2018). To put this finding in context: had all of the original results been true, a minimum reproducibility rate of 89% would be expected, according to the OSC (2015). These figures reflect the gravity of what is now known as the crisis of confidence, or replicability crisis, in science. Though the discussion began in psychology, reports of unsatisfactory reproducibility rates have come from many different fields in the scientific community (Baker, 2016; Begley & Ioannidis, 2015; Chang & Li, 2018).
The literature has suggested a number of potential causes for poor reproducibility of research findings. One of the most obvious candidate causes is the publish or perish culture in academia (Grimes, Bauch, & Ioannidis, 2018), which describes the pressure on researchers to publish much and often in order to maintain their university faculty positions, or to move up the hierarchical ‘ladder’. Another possible cause is the alarmingly high prevalence of QRPs (questionable research practises) in which researchers engage. HARKing (hypothesizing after the results are known), p-hacking (where one massages the data to procure a significant p-value) and the ‘file drawer’ problem (where researchers do not attempt to publish their null results) are all examples of QRPs (Kerr, 1998; John, Loewenstein, & Prelec, 2012; Rosenthal, 1979). They lead to a literature that is unreliable, and apparently in many cases (and often as a result), impossible to replicate.
Irrespective of the causes of the crisis of confidence, its consequence is irrefutable: scientific communities are questioning the veracity of many of the key findings of psychology, and are hesitant to trust the conclusions upon which they are based. A recent online Nature news story suggested that most scientific results should not be trusted (Baker, 2015). Research psychologists are asking whether science is “broken” (Woolston, 2015); others have referred to the “terrifying unraveling” of the field (Aschwanden, 2016). Proposed solutions to this crisis of confidence have revolved around reviewers demanding openness as a condition to provide reviews (Morey et al., 2016), guidelines for more openness and transparency (Nosek et al., 2015), preregistration and registered reports (Dablander, 2017), and funding schemes directly aimed at replication (such as those of the Netherlands Organisation for Scientific Research: https://www.nwo.nl/en/news-and-events/news/2017/social-sciences/repeating-important-research-thanks-to-replication-studies.html).
These initiatives, while a first step in the right direction, only go so far to remedy the problem because they are preemptive in nature, only prescribing best practice for the future. They cannot help untangle the messy current literature body we continue to build upon. The most direct way to get more clarity about previously reported findings may well be through replication (but see Zwaan, Etz, Lucas, & Donnellan, 2018, who discuss some caveats). Therefore, psychological science needs a way to separate the wheat from the chaff; a way to determine which findings to trust and which to disregard. Replication of existing empirical research articles is a practical way to meet this dire need. The need for replications introduces a second generation of complications related to interest in conducting replication studies: a flood of new replications of existing research can be found in the literature, and more are being conducted.
In theory, this uptick in the number of replications being conducted is a good development for the field (especially given that up until recently, replication studies only occupied about 1 percent of the literature body: see Makel, Plucker, & Hegarty, 2012). In practice, however, so much interest in conducting replications leads to a logistical problem: there exists a vast body of literature that could be subject to replication. The question is: how does one select which studies to replicate from the ever-increasing pool of candidates? Which replications retread already ‘well-trodden ground’, and which move research forward (Chawla, 2016)? These questions have serious practical implications, given the scarcity of resources (such as participants and time) in many scientific research fields.
Several recommendations to point us in the right direction exist in the literature already. A great number of these happen to be conveniently grouped as commentaries to Zwaan and colleagues’ recent impactful article, ‘Making Replication Mainstream’ (2018). For instance, Coles, Tiokhin, Scheel, Isager and Lakens (2018), and Hardwicke and colleagues (2018) urge potential replicators to use a formalized decision-making process, and only conduct a replication when the results of a cost-benefit analysis suggest that the benefits of such a replication outweigh the associated costs. Additionally, they emphasize considering other factors such as the prior plausibility of the original article’s reported effects. Kuehberger and Schulte-Mecklenbeck (2018) argue against selecting replication studies at random and discuss potential biases that can emerge in the process of selecting studies to replicate. Little and Smith single out problems with existing literature as reasons for replication failure (such as weak theory measurement), which can be reasons for targeting some original studies for replication over others. Finally, Witte and Zenker (2018) recommend replicating only those studies which provide theoretically important findings to the literature. Coming from the opposite angle, Schimmack (2018) provides reasoning as to why not to replicate certain studies, which, naturally, is also useful in refining one’s selection criteria for replication targets.
One could say that, broadly, there are three different facets to selecting replication targets, associated with the different information contained in a published article: statistical, theoretical and methodological. In the next two subsections, we first discuss statistical considerations and then theoretical and methodological considerations in turn.
Statistical Considerations
First, studies can be selected for replication when their claims require additional corroboration, based on the statistical evidence reported in the publication. This is a statistical approach to determining what should be replicated first. Null-hypothesis significance testing (or NHST) dominates the literature, meaning that the bulk of statistical testing involves reporting p-values.
Although there are numerous downsides to using NHST to quantify scientific evidence (for a discussion, see Wagenmakers, 2007), we focus on one key drawback here which relates directly to our discussion. The p-value only allows us to reject the null hypothesis: there is a single evidence threshold, meaning that we cannot use the p-value to gather evidence in favor of the null hypothesis, no matter how much evidence may exist for it. Given that it is unlikely that each study reporting an effect is based on a true main effect (Ioannidis, 2005), but that studies rarely use statistical techniques to quantify evidence for the absence of an effect, there is a mismatch in what we can conclude and what we want to conclude from our statistical inference (Haucke, Miosga, Hoekstra, & van Ravenzwaaij, 2019).
One alternative to quantifying statistical evidence with the conventional NHST framework is by means of Bayes factors. Throughout this paper, we will use a relatively diffuse default prior distribution for effect size to reflect the fact that we do not possess strong prior information (see also Etz & Vandekerckhove, 2016). In this paper we examine scenarios calling for a t-test. For such designs, one of the most prominent default specifications uses what is known as the Jeffreys-Zellner-Siow (JZS) class of priors. The development of these so-called JZS Bayes factors has built on the pioneering work of Jeffreys (1961) and Zellner and Siow (1980). The JZS Bayes factor quantifies the likelihood of the data under the null hypothesis (with effect size δ = 0) relative to the likelihood of the data under the alternative hypothesis. For a two-sided test, the range of alternative hypotheses is given by a prior on the effect size parameter δ, which follows a Cauchy distribution with a scale parameter r = 1/√2 (see Rouder, Speckman, Sun, Morey, & Iverson, 2009, equation in note 4 on page 237). In terms of interpretation, a Bayes factor of BF10 = 5 means the data are 5 times more likely to have occurred under the alternative hypothesis than under the null hypothesis. In comparison, a Bayes factor of BF10 = 1/5 (or the inverse of 5) means the observed data are five times more likely to have occurred under the null hypothesis than under the alternative hypothesis.
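For readers who want the quantity spelled out, the JZS Bayes factor for a t statistic can be written as below, following the form given by Rouder and colleagues (2009) with a general prior scale r. The symbols N (the effective sample size: n for a one-sample or paired test, n1·n2/(n1 + n2) for an independent-samples test), ν (the degrees of freedom) and g (an auxiliary variance parameter that is integrated out) are not defined in the text above and are introduced here only for illustration.

```latex
\mathrm{BF}_{01}
  = \frac{\left(1 + \frac{t^{2}}{\nu}\right)^{-(\nu+1)/2}}
         {\displaystyle\int_{0}^{\infty}
            (1 + N g)^{-1/2}
            \left(1 + \frac{t^{2}}{(1 + N g)\,\nu}\right)^{-(\nu+1)/2}
            \frac{r}{\sqrt{2\pi}}\; g^{-3/2}\, e^{-r^{2}/(2g)}\; dg},
\qquad
\mathrm{BF}_{10} = \frac{1}{\mathrm{BF}_{01}}
```

The numerator is the likelihood kernel under the null hypothesis (δ = 0); the denominator averages the same kernel over the Cauchy prior on δ, expressed as a normal prior whose variance g has an inverse-gamma distribution.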
The application of the JZS Bayes factor for a large-scale reanalysis of published results is not without precedent (Hoekstra, Monden, van Ravenzwaaij, & Wagenmakers, 2018). We build upon the work of Hoekstra and colleagues in taking the results of such a Bayesian reanalysis as a starting point for selecting replication targets (for a similar approach, see Pittelkow, Hoekstra, & van Ravenzwaaij, 2019).
So why not simply use p-values as our selection mechanism for existing statistical evidence? When NHST results are reanalyzed and transformed into Bayes factors, the relationship between Bayes factors and p-values can be strong if the analyzed studies have mostly comparable sample sizes (Wetzels et al., 2011; Aczel, Palfi, & Szaszi, 2017). However, when studies have differing sample sizes, this relationship is no longer straightforward (for instance, see Hoekstra and colleagues (2018), who show that for non-significant findings the strength of pro-null evidence is better predicted by N than by the p-value, and that larger N studies are “more likely to provide compelling evidence”, p. 6). Consider the following example for illustration.
We have two results of classical statistical inference:
Scenario 1: t(198) = 1.97, p = .05
Scenario 2: t(199998) = 1.96, p = .05
In both cases, the p-value is significant at the conventional alpha level of .05. However, because the sample sizes in the two scenarios are so different, these two sets of results reflect very different levels of evidential strength. The Bayes factor, unlike the p-value, can differentiate between these two sets of results. Through the lens of the Bayes factor, Scenario 1 presents ambiguous evidence: BF10 = 0.94 (i.e., the data are about equally likely to occur under the null hypothesis as under the alternative hypothesis). The Bayes factor for Scenario 2 presents strong evidence in favor of the null: BF10 = 0.03 (i.e., the data are about 29 times more likely to occur under the null hypothesis than under the alternative hypothesis). Using the p-value as a criterion for which study to replicate would not differentiate between these two scenarios, whereas the Bayes factor allows us to decide that in the case of Scenario 2, we have strong evidence that the null hypothesis is true (and so, arguably, no further replication is needed), whereas in the case of Scenario 1, the evidence is ambiguous and replication is warranted.
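The two scenarios can be checked numerically. The sketch below is not the authors' code (they used the BayesFactor package in R); it is a minimal standard-library Python stand-in that integrates the JZS Bayes factor in the form given by Rouder and colleagues (2009). Several things are assumptions made for illustration: the function name `jzs_bf10`, the reading of both scenarios as independent-samples tests with equal group sizes (100 and 100,000 per group, inferred from the degrees of freedom), a Cauchy prior scale of r = √2/2, and the choice of Simpson's rule on a log-scaled grid.

```python
import math


def jzs_bf10(t, n1, n2=None, r=math.sqrt(2) / 2):
    """Approximate the JZS Bayes factor BF10 for a t statistic.

    Assumed form: Rouder et al. (2009), with Cauchy prior scale r.
    Pass only n1 for a one-sample or paired test, n1 and n2 for an
    independent-samples test.
    """
    if n2 is None:
        n_eff, df = float(n1), n1 - 1
    else:
        n_eff, df = n1 * n2 / (n1 + n2), n1 + n2 - 2

    # Log of the likelihood kernel under the null hypothesis (delta = 0).
    log_null = -(df + 1) / 2 * math.log1p(t * t / df)

    def integrand(u):
        # Integrate over u = log(g) so that very small and very large
        # values of g are both covered by an evenly spaced grid.
        g = math.exp(u)
        a = 1 + n_eff * g
        log_f = (
            -0.5 * math.log(a)
            - (df + 1) / 2 * math.log1p(t * t / (a * df))
            + math.log(r) - 0.5 * math.log(2 * math.pi)  # inverse-gamma prior on g
            - 1.5 * u - r * r / (2 * g)
            + u  # Jacobian of the substitution g = exp(u)
        )
        # Dividing by the null kernel here yields BF10 directly.
        return math.exp(log_f - log_null)

    # Composite Simpson's rule on [-30, 30]; the integrand is negligible outside.
    lo, hi, steps = -30.0, 30.0, 6000
    h = (hi - lo) / steps
    total = integrand(lo) + integrand(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(lo + i * h)
    return total * h / 3


if __name__ == "__main__":
    bf_small = jzs_bf10(1.97, 100, 100)        # Scenario 1: t(198) = 1.97
    bf_large = jzs_bf10(1.96, 100000, 100000)  # Scenario 2: t(199998) = 1.96
    print(f"Scenario 1: BF10 = {bf_small:.2f}")
    print(f"Scenario 2: BF10 = {bf_large:.3f} (1/BF10 = {1 / bf_large:.0f})")
```

Under these assumptions the two Bayes factors should land close to the values quoted in the text, which illustrates how nearly identical p-values can correspond to very different evidential strength once sample size is taken into account.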
In this paper, we apply a Bayesian reanalysis to several recent research findings, the end-goal being to demonstrate a technique one can use to reduce a large pool of potential replication targets to a manageable list. The Bayesian reanalysis is diagnostic in the sense that it can assist us in separating findings into three classes, or tiers of results: (1) results for which the statistical evidence pro-alternative is compelling (no replication is needed); (2) results for which the statistical evidence pro-null is compelling (no replication is needed); (3) results for which the statistical evidence is ambiguous (replication may be needed depending on theoretical and methodological considerations). We reiterate here that, crucially, p-values are unable to differentiate between results which belong in the second of these categorical classes, and those that belong in the third. The third class of studies will be carried into the next ‘phase’ of our demonstration, wherein we further scrutinize study results with ambiguous statistical evidence on theoretical and methodological considerations that might factor into the decision to replicate.
Theoretical and Methodological Considerations
Mackey (2012) provides some pointers on how one may select a replication target based on the theoretical content of a reported research finding. She suggests that in order to qualify as a ‘candidate’ for replication, a study should address theoretically important (for short, ‘theoretical importance’) and currently relevant research questions (‘relevance’). A study also qualifies if it concerns findings that are accepted as true in the field, but have yet to be sufficiently investigated (‘insufficient investigation’). The theoretical approach will be explained as we describe it in a practical application later in the paper.
The last facet to selecting replication studies concerns methodological information. While many aspects of a study’s methodology are highly specific to the paradigm of the article in question (e.g., the use of certain materials like visual stimuli), some elements of methodology can be discussed in general (e.g., sample size). As with the theoretical facet, methodology will be discussed in more detail during the later demonstration.
A replication study itself is beyond the scope of this paper; however, we offer a demonstration of how the combined use of theory and Bayesian statistics can drive a methodical and qualitative approach to selecting replication targets in the psychological sciences. Additionally, we offer theoretical and methodological recommendations, in case such a replication were to be conducted. Please note that although the theoretical context and methodology of a study are important for selecting studies for replication, our demonstration focuses primarily on applying the Bayesian reanalysis to this challenge.
The remainder of this paper is organized as follows. In the method section, we share details of our treatment of the replication candidate pool in the reanalysis phase. We then describe the results of the initial selection process, before moving on to describing the qualitative phase of the filtering process. We make recommendations based on our selection process, for a fictional replication study. The article ends with our discussion, wherein we justify certain subjective choices we have made and consider philosophical issues, and share the limitations of our method.
We extracted statistical details from articles in the 2015 and 2016 issues of Psychological Science and performed a Bayesian reanalysis to make a first selection of which studies could be targets for replication, based on the evidential strength of the results reported. Once this initial selection was made, we further refined the selection based on the theoretical soundness of the conclusions drawn from the selected studies, and considered the support for the finding which exists in the literature already. The approach combined quantitative and qualitative methods: on the one hand, the initial selection was based on an empirical process, and on the other, the refinement of the selection was based on a process involving judgments of the findings in the context of the literature and theory. The process took the first author less than a working week to complete. Given that we provide the reader with the reanalysis code, and the spreadsheet with the necessary values to complete the reanalysis, we believe that attempts by others to use our method for a similarly sized sample would not be any more time-intensive than our original execution.
All Psychological Science articles from 2015 and 2016 issues were searched for reported significant statistical tests (one-sample, paired, and independent t-tests), associated with primary research questions. As mentioned, we used statistical significance as our criterion for selecting results to reanalyze. All of the articles reporting t-tests to test their main hypotheses used p-values to quantify their findings. We extracted the t-values and other details required for the reanalysis (including N and p-value) for 30 articles which contained t-tests (the data spreadsheet which logs these details for each statistic extracted is on the project’s Open Science Framework (OSF) page at https://doi.org/10.17605/OSF.IO/3RF8B).
Incomplete or unclear reporting practices posed a challenge in the first step of selecting which articles to reanalyze. Determining whether the executed tests were one- or two-sided was often difficult, as articles frequently failed to report the type of test conducted. Several articles which used t-tests as part of their main analysis strategy were ultimately not included in the reanalysis, as not all information was available (not even to the extent that we could reverse engineer other necessary details). One article, which reported two t-tests in support of their main finding, was excluded from the final reanalysis. Due to unclear reporting, we were unable to identify what the study’s method entailed, and, therefore, how the reported results were reached. We explore the reporting problem in detail in the discussion section.
In total, from the 24 issues of 2015/2016 Psychological Science, 326 ‘research articles’ and ‘research reports’ were manually scanned for studies in which a major hypothesis was tested using a t-test. Of these, 57 results were derived from 30 individual articles. Several articles reported more than one primary experimental finding which was analyzed using a t-test. We used several criteria to judge whether or not a finding was of focal importance. First, if a specific finding was reported in the abstract, it would be selected (where possible). The rationale for this approach was that the abstract only has space for documenting the most important results of the study, thus only key findings will be reported in it. A finding was also selected if somewhere in the article it was tested in a primary hypothesis, or was explicitly noted by the authors of the article as being important for the study’s conclusions. Many articles reported several t-tests in support of a single broader hypothesis. In such cases we attempted to select the results which most directly supported the authors’ conclusions.
Descriptive Results
P-values, test statistics, sample sizes and test sidedness were collected for the purpose of the reanalysis. The p-values ranged in value; the largest was .047. The test statistics and sample sizes obtained also ranged greatly. The absolute test statistics ranged from 2.00 to 7.49. The sample sizes ranged from N = 16 to N = 484. The distribution of study sample sizes is heavily right-skewed. The median for this sample is 54, which is smaller than recent estimates of typical sample sizes in psychological research (Marszalek, Barber, Kohlhart, & Cooper, 2011).
In the Bayesian reanalysis, we converted reported information extracted from articles into Bayes factors, to assess the strength of evidence given by each result. The Bayes factors range widely: 0.97 to 1.9 × 10^10, or approximately 19 billion. Almost half of them are between 1 and 5.
A clear negative relationship between the Bayes factors and the reported p-values is shown in Figure 1. Despite the nature of this relationship, some small p-values are associated with a range of Bayes factors (around the p = .04 mark, for instance). A positive relationship between Bayes factors and sample sizes can be seen in Figure 2. Unsurprisingly, larger sample sizes are generally associated with larger Bayes factors (r = .71), though it is not the case that large sample sizes are always associated with more compelling Bayes factors. For instance, many cases in the N = 200 region are associated with somewhat weak Bayes factors. In one case, the overall N of 30 converts to a Bayes factor of over 151,000; in another case, the overall N of 35 is associated with a Bayes factor of over 21,000.
Figure 1. Scatterplot of Bayes factors and p-values plotted on a log-log scale. The horizontal dashed lines indicate Jeffreys’ thresholds for anecdotal evidence (3, for pro-alternative cases, and the inverse for pro-null cases). The vertical red line demarcates the conventional significance level for p-values.
Figure 2. Scatterplot of Bayes factors and sample size plotted on a log-log scale. The horizontal dashed lines indicate Jeffreys’ thresholds for anecdotal evidence (3, for pro-alternative cases; the inverse for pro-null cases). The cases in which we are interested for the reanalysis, those in tier 3, lie between the two finely dashed lines.
Quantitative Target Selection
In this paper, we make an initial selection based on the studies in tier 3, that is, those whose results yield only ambiguous evidence for their reported hypotheses. For this purpose, we judge a result to be ambiguous, or of low evidential strength, when its BF10 lies between 1/3 and 3, which, by Jeffreys’ (1961) classification system, provides no more than ‘anecdotal’ evidence for one hypothesis over the other.
Using the BayesFactor package in R (Morey, Rouder, & Jamil, 2015), we calculated a Bayes factor (BF) for each result from the information extracted from the articles: p-values, test statistics, sample sizes and sidedness of the test. While the vast majority of articles did not explicitly state that their tests were confirmatory, most results were presented as though they were. The analysis code and the associated data spreadsheet can be found at the project’s OSF page: https://doi.org/10.17605/OSF.IO/3RF8B .
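To make this conversion concrete, the following is a minimal Python sketch of a two-sided default (JZS) Bayes factor computed from a t statistic. It is an illustration only, not the authors' code, which is written in R and available on the OSF page. It assumes a Cauchy prior scale of r = √2/2 (the "medium" default of the BayesFactor package), replaces the package's numerical routines with a plain Simpson's-rule integration, and ignores test sidedness. The function name `jzs_bf10` and its parameters are hypothetical.

```python
import math


def jzs_bf10(t, n_eff, df, r=math.sqrt(2) / 2, upper=12.0, steps=6000):
    """Two-sided JZS Bayes factor (BF10) for a t statistic.

    n_eff: n for a one-sample or paired test; n1*n2/(n1+n2) for two samples.
    df:    degrees of freedom of the t-test.
    r:     scale of the Cauchy prior on effect size (assumed: sqrt(2)/2).
    upper, steps: integration range and (even) number of Simpson intervals.
    """
    # Likelihood of t under H0, up to a constant shared with H1.
    null_like = (1 + t * t / df) ** (-(df + 1) / 2)

    # Under H1 the Cauchy prior is a normal prior with variance r^2 * g,
    # where g follows an inverse-chi-square(1) distribution. Writing
    # g = 1 / z^2 with z ~ N(0, 1) turns the integral over g into an
    # expectation over a standard normal, which is easy to integrate.
    def integrand(z):
        shrink = z * z / (z * z + n_eff * r * r)  # equals 1 / (1 + n_eff * r^2 * g)
        phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        return phi * math.sqrt(shrink) * (1 + t * t * shrink / df) ** (-(df + 1) / 2)

    # Composite Simpson's rule on [0, upper]; the integrand is even in z,
    # so the integral over the whole real line is twice this value.
    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    alt_like = 2 * total * h / 3

    return alt_like / null_like


if __name__ == "__main__":
    # Result 4a in Appendix A: t = 2.605, df = 17, overall N = 18.
    print(round(jzs_bf10(2.605, n_eff=18, df=17), 2))
```

Under these assumptions, the call for result 4a (t = 2.605, df = 17, N = 18) gives a Bayes factor of approximately 3.19, in line with the value listed in Appendix A. One-sided tests and independent-samples designs would need the sidedness and the effective sample size handled accordingly.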
The Bayesian reanalysis placed 20 results in evidence tier 3. One of these yielded a Bayes factor below 1 (0.97), which, by Jeffreys’ classification system, constitutes anecdotal pro-null evidence. The remaining results lie in tier 1. Because we only considered articles in which an effect was reported, no results in this dataset fall in tier 2 (those with compelling pro-null evidence). The reanalysis thus reduced the pool of candidates for replication from 57 results to 20. We now move on to the next stage of target selection.
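The tier assignment can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: tier 1 is taken to mean BF10 above 3 (pro-alternative evidence), tier 2 to mean BF10 below 1/3 (pro-null evidence), and tier 3 everything in between, with the boundary values themselves counted as tier 3. The function name `evidence_tier` is hypothetical; the Bayes factors are the ones reported in this paper for articles 4, 8 and 12.

```python
def evidence_tier(bf10, threshold=3.0):
    """Assign an evidence tier to a Bayes factor (BF10).

    Tier 1: BF10 > threshold       (evidence for the alternative)
    Tier 2: BF10 < 1 / threshold   (evidence for the null)
    Tier 3: anything in between    (ambiguous, 'anecdotal' evidence)
    Treating the boundaries as tier 3 is an assumption of this sketch.
    """
    if bf10 > threshold:
        return 1
    if bf10 < 1 / threshold:
        return 2
    return 3


# Bayes factors reported in the text, keyed by result label.
reported = {
    "4a": 3.19, "4b": 1.99, "4c": 2.02,
    "8a": 0.97, "8b": 5.05,
    "12a": 1.96, "12b": 1.68, "12c": 1.83,
}

# Keep only the ambiguous (tier 3) results as replication candidates.
candidates = [label for label, bf in reported.items() if evidence_tier(bf) == 3]

if __name__ == "__main__":
    print(candidates)
```

Applied to these eight results, the filter keeps 4b, 4c, 8a and 12a through 12c, and sets aside 4a and 8b as tier 1.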
Qualitative Target Selection
Of the 20 results in tier 3, we select those demonstrating the weakest evidence for their effects. If there is an article for which many results fall in tier 3, these will also be considered. 4 We will then conduct an assessment based on the qualitative criteria of Mackey ( 2012 ): theoretical importance, relevance, and insufficient investigation. Alongside Mackey’s criteria, we consider the need for the finding in question to be replicated under different study conditions or with a different sample than the original (to establish the external validity of the effect in question), as well as replication feasibility (for instance, can this study be replicated by generally-equipped labs, or are more specific experimental set-ups necessary?). We will refer to the articles by the article number we have given them (the article and reanalysis details corresponding to these can be found in Appendix A; a full table of all the details can be found on the OSF page for this project, at https://doi.org/10.17605/OSF.IO/3RF8B ).
The first to consider is the article revealed by the reanalysis to contain anecdotal pro-null evidence in one of its studies: article 8, from Dai, Milkman and Riis (2015). The authors of article 8 report on the so-called ‘fresh start effect’, which refers to the use of temporal landmarks to initiate goal pursuit. More specifically, the authors report support for the claim that certain times of year (for instance, New Year’s Eve) are especially potent motivators for starting new habits (such as working out, or eating more wisely). Although some evidence in this article is weakly pro-null (result 8a), one strike against naming article 8 as a suitable target for replication is that it contains a second reanalyzed result (result 8b), which yielded a Bayes factor of 5.05 (constituting pro-alternative evidence: Gronau et al., 2017). 5
In terms of Mackey’s (2012) criteria, the study is difficult to judge as a replication target. Article 8’s topic is theoretically important and certainly currently relevant: understanding the relationship between motivation and initiating healthy eating behavior is important for many reasons (for developing strategies to lower the global burden of preventable disease, for instance). However, the link between temporal landmarks and motivation has been demonstrated often and by different research groups (Peetz & Wilson, 2013; Mogilner, Hershfield, & Aaker, 2018; Urminsky, 2017), as well as in other studies by related groups (Dai, Milkman, & Riis, 2014; Lee & Dai, 2017), including a randomized clinical trial measuring adherence to medical treatment (Dai et al., 2017).
Although the content of article 8 lends itself to interesting replications in which one varies, for instance, the culture of the sample, this phenomenon has already been the subject of many different studies, and existing literature in the area already demonstrates the effect in cultures other than the USA (e.g., Germany: Peetz & Wilson, 2013). It is therefore not a clear replication target, in our assessment.
The majority of the remaining results in tier 3 show Bayes factors that are homogeneous in magnitude: for instance, half of the results have a Bayes factor between 1 and 2. Additionally, among articles with multiple reanalyzed results, we see only one case in which every result fell into tier 3. A tier 3 result may therefore be just one of many in an article which overall, through other tests, provides strong evidence of a main effect. Both of these reasons render the majority of the sample less attractive as replication candidates.
Despite this, two articles (both featuring multiple low Bayes factors each) are potential targets. 6 We now commit these to the qualitative assessment to determine their suitability for replication, in no particular order.
One potential replication target is article 4 (Reinhart, McClenahan, and Woodman, 2015), which tested the hypothesis that using mental imagery, or ‘visualizing’, can improve attention to targets in a visual search scene. The authors recorded reaction times (RT) and event-related potentials (quantified as N2pc amplitudes, which reflect ongoing neural processes, in this case attention) in response to the provided stimuli. They reported support for their hypothesis: imagining the visual search for certain targets did increase the speed at which participants focused on the specified targets (indexed by the ERP), before the motor response of pressing a button to confirm they had located the target. This article yielded three t-tests (each testing the experimental conditions on RT), which are of interest to us. We refer to them as results 4a through 4c, respectively. They appear in the results for the first experiment, which we judged to be a clear test of their primary hypothesis. Each of these t-tests corresponds to a small Bayes factor. The RT tests correspond to Bayes factors of 3.19, 1.99 and 2.02, while the EEG tests yielded Bayes factors of 1.83 and 2.53 (the two other t-tests in the sets were not significant, and thus are not of interest to us for the purposes of this reanalysis).
This article meets several of the qualitative criteria too. First, the topic is theoretically important and currently relevant. Training the brain for better performance has been gaining momentum in the past decade, partly prompted by several articles that support a positive link between video-gaming and improved mental performance in different cognitive domains (such as attention: Green & Bavelier, 2012). Exploring the link further with studies such as this can be beneficial to many areas of psychology and medicine (e.g., for working with patients with brain damage who are undergoing rehabilitation). Second, there is little supporting evidence for the link between visualization and improved attention; importantly, some of the literature aiming to reinforce the findings of article 4 contradicts it. For instance, the preregistered failed replication and extension of article 4’s experiments conducted by Clarke, Barr and Hunt (2016) showed that repeated searching, not visualization, improved attention. Other factors to consider are generalization and feasibility. The suitability of article 4 as a replication target is supported by the fact that this article has already been a target for replication, and that the replication did not conclusively reinforce its conclusions. It is possible that this study should be weighted differently in the sample due to the previous replication. Indeed, one could numerically account for the evidence contributed by the existing replication (e.g., Gronau et al., 2017). We consider that to be outside of the scope of this paper.
Their sample for experiment one comprised adults between the ages of 18 and 35, with a gender split of 62% to 38% in favor of women. The findings of article 4 could benefit from a replication using a different sample: for instance, one with individuals from an older age range. Although age is not thought to impair neuroplasticity, plasticity in older persons occurs in different regions of the brain than in younger persons, which influences the mechanisms underlying visual perceptual learning (Yotsumoto et al., 2014) and may influence their response to the stimuli presented in the experiments in the article. This has implications for the generalizability of the results. Another potentially important factor for consideration is gender. A recent review article by Dachtler and Fox (2017) reports clear gender differences in plasticity that are likely to influence several cognitive domains (including learning and memory), due to circulating hormones such as estrogen, which are known to influence synaptogenesis. To summarize, we find article 4 to be suitable as a replication candidate. Specifically, some of its findings could benefit from external reinforcement in the form of a conceptual replication in which factors such as age and gender are taken into consideration. Further, the results may benefit from a more in-depth exploration into the effect of searching versus visualization on attention.
Another replication target that our sample yielded is article 12: Kupor, Laurin and Levav (2015). As mentioned above, all reanalyzed results of this article (i.e., a through c) fall into tier 3. Article 12 (which includes 5 studies, each with sub-studies) explores the general hypothesis that reminders of God increase risk-taking behavior. In study 1, on which this reanalysis focused exclusively (as it most directly tested the key hypothesis), four sub-studies are identified: 1a, 1b, 1c and 1d. The first three contain t-tests, while the fourth contains a chi-square test. We consider only the results of 1a through c (12a through c) for the current reanalysis.
In the study corresponding to result 12a, participants performed a priming task involving scrambled sentences. Half the participants were primed with concepts of God, by way of exposure to words such as “divine” (p. 375). The other half, which formed the control group, were exposed only to neutral words. Once participants were primed, they completed a self-report risk-taking scale, which was described to them as an unrelated study. This scale measured their likely risk-taking behavior on a one-to-five Likert scale. In the study yielding result 12b, following the manipulation, participants described the likelihood that they would attempt a risky recreational task that they had described themselves at an earlier point. In the study corresponding to result 12c, participants were tested on their interest in risk-taking via a behavioral measure, once they were primed in the first phase of the experiment. In each of these three experiments, participants primed with concepts of God reported or behaved as predicted: more predisposed to risk-taking than their neutrally-primed counterparts. Despite these three experiments yielding significant p-values, the reanalysis revealed three Bayes factors all suggesting the evidence is ambiguous: 1.96, 1.68 and 1.83, respectively for results 12a–c.
We now assess article 12 on the qualitative factors we described earlier. First, we consider the theoretical importance and current relevance of this article. Given that the majority of the world identifies as being religious (84%, according to recent statistics: Hackett, Stonawski, Potančoková, Grim, & Skirbekk, 2015 ), understanding the role of religion in moderating behavior is important, to say the least. According to the authors of article 12, behavior modification programs such as those employed for drug and alcohol rehabilitation use concepts of God and religion as a tool to reduce delinquent behavior. While this topic has attracted the attention of several research groups globally (meaning the article does not naturally meet the ‘insufficient evidence’ criterion), the reanalyzed results in article 12 go against the majority of this body of work: “… we propose that references to God can have the opposite effect, and increase the tendency to take certain types of risks” (p. 374), and do not seem to have direct strong support in the literature as yet (a paucity of indirect support can be found, e.g., Wu & Cutright, 2018 ).
In assessing the characteristics of article 12’s sample, some details indicating the suitability of article 12 for replication come to light. First, article 12 reports using Amazon’s Mechanical Turk online workforce, which comprises approximately 80% U.S.-based workers and 20% Indian workers. Given that the majority of the Mechanical Turk workers are from the U.S., and the overwhelming majority of the U.S. reports being affiliated with Christianity, we expect that the majority of this sample responded with a mindset of trusting in a God who is thought to intervene on behalf of the faithful, responding to prayers for things like healing, guidance and help with personal troubles. The results of article 12 might be very different if the participant pool contained mostly practitioners of Buddhism (for example), as Buddhism emphasizes the importance of enlightenment (when an individual achieves an understanding of life’s truth) and personal effort, rather than the intervention of a divine being (which is relevant given that feelings of security are thought to increase willingness to engage in certain behaviors: p. 374).
The age of article 12’s sample is also relevant to their results, considering that the majority of workers (>50%) were born in the 1980s. Recent polls indicate that younger individuals across Europe, the USA and Australia are less religious than their older counterparts ( Harris, 2018 ; Wang, 2015 ; Schneiders, 2013 ), meaning that a successful replication of article 12’s results with a predominantly aged population (as opposed to the mean ages of 23, 31 and 34 years, reported in the article) would demonstrate the generalizability of the finding that God-priming increases risk taking. 7 Another possibility also relates to age – perhaps the effect is greatly decreased in aged persons, simply by virtue of maturity: Risk-taking, even for rewards, decreases as a function of age ( Rutledge et al., 2016 ).
Our reanalysis of article 12’s results, in conjunction with the other methodological and theoretical considerations, marks this article as a promising replication target whose results are in need of independent corroboration. We recommend a direct (or pure) replication, so that the findings can be verified exactly as they are presented. In addition, we recommend a conceptual replication in which significant changes to the characteristics of the sample are made (e.g., as mentioned, on the basis of the participants’ ages and religions).
In this paper, we performed a large scale reanalysis of the results of a selection of articles published in Psychological Science in the years of 2015 and 2016 for which primary research findings were quantified by t-tests. Reanalyzing these results narrowed the pool of potential replication targets from 57 to 20 candidates. The Bayes factors for these candidate studies were between 0.97 and 2.85. To further our demonstration, we selected three articles, and subjected them to the second phase of the selection process, involving qualitative assessment. The qualitative process revealed that two of these articles are suitable for replication: their findings are theoretically important and relevant, but the literature largely lacks direct corroborating evidence for the claims thus far. It revealed that the results could benefit from changes to the magnitude of the samples, and that several variables should be included in conceptual replications to help generalize the reported results beyond the original articles.
A set of replications for articles 4 and 12 could first provide support for the existence of an effect, given the results of the Bayesian reanalysis. Once an underlying true effect is found to likely exist via a direct replication, further conceptual replications could be designed to explicitly explore other cohorts to better establish the generalizability of the findings beyond the original experimental cohort. In the case of article 4, specifically targeting participants of certain age groups may be beneficial to help determine the malleability of the effect across the lifespan. For article 12, targeting specific religious groups may assist in helping establish whether the God priming effect extends to other religions for which God is not a figure directly associated with intervention. These conceptual replications could also feature designs which vary from the originals – for instance, a replication of article 4 could feature a design in which gender is a blocking variable, or even included as a variable of interest.
Replications for both articles should use much larger sample sizes, to help eliminate issues of reliability. In order to conduct a compelling replication study, one may need a sample size greater than that in the original study, depending on how large the original sample is. Low experimental power undermines the reliability of original findings, leading to poor reproducibility even when other experimental and methodological conditions are ideal, which they rarely are (Button et al., 2013; Wagenmakers & Forstmann, 2014).
A simulation by Button and colleagues ( 2013 ) demonstrates an argument against the common misconception that if a replication study has a similar effect size to the original, the replication will have sufficient power to detect an effect. They show that “… a study that tries to replicate a significant effect that only barely achieved nominal statistical significance (that is, p ~ 0.05) and that uses the same sample size as the original study, will only achieve ~50% power, even if the original study accurately estimated the true effect size” (p. 367). This indicates that in order to obtain sufficient power (say, 1– β = .8) for a medium effect size in a replication study, the original sample would need to be more than doubled. In terms of the sample size in question, this indicates an increase from N = 105 to N = 212 for each of the replication studies.
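The reasoning behind the roughly 50% figure, and behind the need to about double the sample, can be illustrated with a small Python sketch. It is an assumption-laden approximation, not the simulation of Button and colleagues: it treats the test statistic as normally distributed, assumes the true effect equals the originally observed one, and uses hypothetical function names. Because of the normal approximation, its output will not exactly match sample sizes derived from a t-based power analysis.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def replication_power(z_obs, n_ratio=1.0, alpha=0.05):
    """Approximate two-sided power of a replication study.

    z_obs:   test statistic observed in the original study.
    n_ratio: replication sample size divided by original sample size.
    Assumes the true effect equals the originally observed effect.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    expected_z = z_obs * n_ratio ** 0.5  # the statistic grows with sqrt(N)
    upper_tail = 1 - _STD_NORMAL.cdf(z_crit - expected_z)
    lower_tail = _STD_NORMAL.cdf(-z_crit - expected_z)
    return upper_tail + lower_tail


def sample_multiplier(z_obs, power=0.8, alpha=0.05):
    """Factor by which to scale the original N to reach the target power
    (ignores the negligible wrong-direction tail)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ((z_crit + z_power) / z_obs) ** 2


if __name__ == "__main__":
    # An original result that only just reached p = .05 (z = 1.96).
    print(round(replication_power(1.96), 2))
    print(round(sample_multiplier(1.96), 2))
```

For an original result at z = 1.96, the same-sized replication has power of about one half, and the multiplier needed for 80% power comes out slightly above 2 under this approximation.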
Choice Justifications
Prior Choice
Though we do not want to rehash decades of debate about prior selection, our use of a Bayesian approach in the reanalysis stage necessitates a brief discussion of our choice of prior. We have chosen to use the default prior, the Cauchy, in the BayesFactor package. This choice is suitable for our goals for a few reasons (and we recommend that the typical user use the package defaults for the same reasons). First, the Cauchy prior’s properties make it an ideal choice for a weakly informative prior based on ‘general desiderata’ (Jeffreys, 1955). Second, even if we did want to use a subjective prior, the most obvious approach to doing so would yield unreliable results. Using the existing literature on an effect to inform one’s prior choice would be a poor idea due to publication bias. Other factors exist that complicate subjective prior use. For instance, the existing literature on a particular phenomenon might be conflicting (in which case, the ‘right’ subjective prior might not exist), or may be very sparse (in which case little information would be available to adequately inform the prior). That said, some potential users of our method may have sufficient expertise to navigate this complex situation and wish to select an alternative to the Cauchy prior. We refer such users to Verhagen and Wagenmakers (2014) or to Gronau, Ly and Wagenmakers (2019), both of which deal with Bayesian t-tests with explicit prior information available.
Selection Based on Significance
We used statistical significance as the criterion for selecting results for the Bayesian reanalysis. One may wonder why we have not chosen to inspect the claims of the non-existence of an effect based on a non-significant p-value. We have two reasons for using statistical significance (that is, when original article authors used statistical significance to justify their claims). First, although we believe statistical significance is hardly diagnostic of a true effect, the lack of statistical significance being related to no effect is even more complicated. If one were to try to replicate a non-significant result, what would the result say of the original effect? This problem does not exist for, say, an original study with a strong pro-null Bayes factor result, as the Bayes factor allows us to actually quantify pro-null evidence.
Finally, some applications of our method could be constrained by the capabilities or resources of replicating labs: not all suitable replication candidates can be replicated by all interested parties, as shown in our description above. The study of article 4 is worthwhile as a replication target and warrants further investigation; however, it requires specialized equipment and specific expertise to be recreated, and is therefore only feasible for select labs to seriously attempt. On the other hand, article 12 features a less specialized set of materials that could be recreated by a research group using easily accessible university-provided software (e.g., Qualtrics) and web browsers.
Limitations of the reanalysis should be noted. It is not always clear from the reporting articles which test statistic is most suitable to extract for purposes of reanalysis. One main reason for this difficulty was outlined earlier in the methods section of the study – inconsistent reporting practices. Despite a clear and detailed article published in American Psychologist by the APA in 2008 that discusses desirable reporting standards in psychology, and other initiatives in other fields to improve research reporting (e.g., the guidelines developed to improve the reporting of randomized-controlled trials in health-related research: Moher, Schulz, & Altman, 2001 ), many researchers in the social sciences have failed to adopt them (Mayo-Wilson, 2013). To be clear, poor standards of reporting are not the norm only in psychological science. To illustrate: Mackey ( 2012 ) in linguistics research states that insufficient reporting of details important for replication is problematic in many studies (p. 26); Button and colleagues ( 2013 ) in biomedical research, discuss the relationship between insufficient reporting of statistical details and false positives in results. We also recognize that it is difficult to manage a good balance between adequate reporting and the word limit in many (especially higher-impact) journals. Though, on the other hand, authors can upload supplementary documents to the various platforms available (the Open Science Framework, or Curate Science, for instance), or submit.
Another limitation concerns our reanalysis of only t-tests. While reanalysis of more complex designs is possible using the BayesFactor package, we demonstrate only with the simpler design of the t-test. We intend this demonstration as a proof of concept of a methodical and evidence-driven approach to choosing targets for replication. The Bayesian reanalysis is a clear strength on which replicating labs can draw; however, we do not advocate using a Bayesian reanalysis alone. We must consider factors that place the article and its content in context. We must consider its appropriateness as a study for replication (is a replication feasible for less well-equipped or specialist labs?), as well as the body of literature it is part of. Is the study generally well supported, or does it tell a story conflicting with existing findings? Is it theoretically important, and does it hold relevance in its current historical, social and cultural context?
The reader may wonder why we have chosen not to assess the soundness of certain aspects of the methodologies of the original studies as a criterion for which studies to replicate. Although we argue that such a set of assessments is outside of the scope of the article, we recognize that attempting to replicate an effect elicited by a poor methodological set-up is ill-advised. We recommend that users of our method use their own judgment to determine whether or not an original article’s methods are sound, and consider each experiment of their final filtered sample in turn. If the methods of the final sample of potential targets are difficult for a user to assess (for example, perhaps one ends up with two targets using highly technical methods that the typical user may be unfamiliar with), the user may want to limit themselves to those studies for which they are confident assessing the soundness of the chosen methodology.
A practical yet somewhat philosophical question must be raised about how one might use the Bayesian reanalysis to prioritize replication targets. The reader critical of Bayes factors may suggest that no matter what classification one uses (Jeffreys or otherwise), Bayes factors still do not provide a complete measure of the information contained in a given original study. This reader would be right, though the same can be said of any currently used quantification approach. We stress that we are not advertising the Bayesian reanalysis as the only route to a search for replication targets. We argue that it is a tool one can apply to reveal valuable information, which can be used to distinguish between pro-null evidence and ambiguous study results. In this demonstration, it was valuable as a kind of centrifuge, filtering the studies into different ‘weight’ categories based on the evidence from the results, which helps us determine which studies should be replicated first. Most interested users can conduct the Bayesian reanalysis relatively easily with the statistical software R, using the code we have provided on our OSF page ( https://doi.org/10.17605/OSF.IO/3RF8B ), to reduce the number of potential replication targets, allowing individuals to direct their resources in a justifiable and systematic manner.
In this paper, we have chosen to have our statistical considerations be guided by the strength of evidence for the existence of an effect. Strong evidence can result from a large sample drawn from a relatively modest true effect, or a modest sample drawn from a large true effect. Other criteria are conceivable, such as those based on the precision of the effect size estimate.
A final important consideration for the reader concerns the role of publication bias in the pool of potential targets, and therefore final target selection. The work of Etz and Vandekerckhove ( 2016 ) suggests that if one were to take all studies as the possible pool of targets (that is, take publication bias into account), the average effect size will be smaller, and, presumably, the pool of viable targets much larger. Although their results suggest that an estimate of average strength of evidence based on published results is an overestimate, under the assumptions that (1) a single study has not been replicated many times in the same lab and only the most compelling result reported; and (2) a single study has not been duplicated exactly somewhere else in the world but was never reported; the reported test statistic can be safely reanalyzed in the way we have in our paper.
Aside from this, to date over 200 academic journals use the registered report format (for an up-to-date figure, see https://cos.io/rr/ ), and the number is steadily climbing. We consider it likely that as time passes and more people take advantage of this submission format, publication bias prevalence will decrease.
We would like to stress that the articles discussed in detail in this study were selected for illustration purposes only. The demonstration serves as proof of concept, and by no means aims to criticize specific studies or question their veracity. In fact, one of the three articles has two OSF badges (for more information see https://cos.io/our-services/open-science-badges-details/ ): one for open data, and one for open materials, indicating that the authors have made their data and study materials openly available on their project’s OSF page. One of the other articles has the badge for open materials. The third article has provided access to its study materials in a supplemental folder available on the Psychological Science website. Such a commitment to transparent scientific practices is associated with research that is of higher quality, and therefore likely to be more reproducible (see the OSF badge page: https://cos.io/our-services/open-science-badges-details/ for a discussion).
The current debate over poor reproducibility in psychology has led to a number of new ideas for how to improve our research going forward. An increased number of replication studies is one such advance, which has been taken up wholeheartedly by many concerned researchers. While such an initiative marks a positive and constructive move toward remedying a serious problem in our field, it is neither efficient nor useful to replicate results randomly. In this article, we have argued for and demonstrated an approach to the selection of replication targets which is methodical and systematic, supplemented by careful and defensible qualitative analysis.
The approach we advocate and apply in this article can be simple and relatively fast to conduct, and affords the user access to important information about the strength of evidence contained in a published study. As well as being efficient, this approach has the potential to maximize the impact of the outcomes of those replications, and to minimize the waste of resources that could result from a haphazard approach to replication. Combining a quantitative reanalysis with a qualitative assessment of a large group of potential replication targets, in a simple approach such as the one presented in this paper, allows information from multiple sources to inform the prioritization of replication targets, and can assist in refining the methodology of the replication study.
Table showing details of each reanalyzed result, and relevant information associated with each article. A full spreadsheet of all information can be found at the project’s OSF page https://doi.org/10.17605/OSF.IO/3RF8B .
| Result | Authors | Year | t | df | Overall N | p-value reported | BF10 | Evidence Tier |
|---|---|---|---|---|---|---|---|---|
| a | Ding et al. | 2015 | 4.42 | 40 | 42 | <.001 | 283.12 | 1 |
| b | Ding et al. | 2015 | 3.49 | 40 | 42 | <.001 | 27 | 1 |
| | Metcalfe et al. | 2015 | 7.28 | 87 | 89 | <.0001 | 106765637.21 | 1 |
| a | Reinhart et al. | | 2.605 | 17 | 18 | 0.018 | 3.19 | 1 |
| a | Fan et al. | 2015 | 2.81 | 46 | 48 | 0.007 | 6.25 | 1 |
| b | Fan et al. | 2015 | 2.51 | 46 | 48 | 0.016 | 3.44 | 1 |
| a | Schroeder et al. | 2015 | 3.79 | 157 | 160 | <.01 | 213 | 1 |
| a | Mackey et al. | 2015 | 4.4 | 56 | 58 | 0.0001 | 412.32 | 1 |
| b | Mackey et al. | 2015 | 4.7 | 56 | 58 | <.0001 | 1030.14 | 1 |
| a | Dai et al. | 2015 | 2.47 | 214 | 216 | 0.01 | 5.05 | 1 |
| a | Okonofua et al. | 2015 | 4.06 | 23 | 25 | <.001 | 139.62 | 1 |
| b | Okonofua et al. | 2015 | -4.99 | 23 | 25 | <.001 | 1158.7 | 1 |
| b | Olson et al. | 2015 | 4.3 | 16 | 17 | 0.001 | 126.81 | 1 |
| a | Olson et al. | 2015 | 3.89 | 29 | 30 | 0.001 | 114.57 | 1 |
| c | Olson et al. | 2015 | 6.75 | 29 | 30 | <.001 | 151537.61 | 1 |
| a | Yin et al. | 2015 | 5.73 | 15 | 16 | <.001 | 644.57 | 1 |
| b | Yin et al. | 2015 | 3.23 | 15 | 16 | 0.006 | 8.84 | 1 |
| d | Yin et al. | 2015 | 5.88 | 15 | 16 | <.001 | 1646.23 | 1 |
| e | Yin et al. | 2015 | 2.59 | 15 | 16 | 0.021 | 6 | 1 |
| f | Yin et al. | 2015 | 2.84 | 15 | 16 | 0.012 | 9.07 | 1 |
| | Storm et al. | 2015 | 3.23 | 19 | 20 | 0.004 | 10.21 | 1 |
| | Perilloux et al. | 2015 | 7.49 | 482 | 484 | <.001 | 18931144326.12 | 1 |
| | Porter et al. | 2016 | 2.89 | 85 | 88 | 0.005 | 7.91 | 1 |
| | Skinner et al. | 2016 | 4.25 | 66 | 67 | <.001 | 297.55 | 1 |
| | Kirk et al. | 2016 | 3.59 | 43.35 | 54 | 0.001 | 40.35 | 1 |
| a | Cooney et al. | 2016 | 3.76 | 29 | 30 | 0.001 | 83.98 | 1 |
| b | Cooney et al. | 2016 | 4.27 | 57 | 59 | <.001 | 285.37 | 1 |
| c | Cooney et al. | 2016 | 6.83 | 149 | 150 | <.001 | 42432905.55 | 1 |
| | Zhou et al. | 2016 | 7.26 | 70 | 73 | <.001 | 19638415.24 | 1 |
| b | Saint-Aubin et al. | 2016 | 6.02 | 34 | 35 | <.0001 | 21066.77 | 1 |
| a | Saint-Aubin et al. | 2016 | 5.6 | 45 | 46 | <.0001 | 13805.45 | 1 |
| a | Li et al. | 2016 | 4.08 | 22 | 24 | 0.0005 | 53.71 | 1 |
| b | Li et al. | 2016 | 3.86 | 26 | 28 | 0.00068 | 38.61 | 1 |
| | Sloman et al. | 2016 | -3.4 | 69 | 70 | 0.001 | 22.86 | 1 |
| b | Picci et al. | 2016 | 2.8 | 27 | 28 | 0.001 | 10.5 | 1 |
| c | Picci et al. | 2016 | 4.4 | 27 | 28 | 0.001 | 273.96 | 1 |
| d | Picci et al. | 2016 | 3.14 | 29 | 30 | 0 | 20.56 | 1 |
| | Madore et al. | 2015 | 2.49 | 22 | 23 | 0.021 | 2.67 | 3 |
| b | Reinhart et al. | | 2.318 | 17 | 18 | 0.033 | 1.99 | 3 |
| c | Reinhart et al. | | 2.326 | 17 | 18 | 0.033 | 2.02 | 3 |
| d | Reinhart et al. | | 2.263 | 17 | 18 | 0.04 | 1.83 | 3 |
| e | Reinhart et al. | | 2.466 | 17 | 18 | 0.027 | 2.53 | 3 |
| b | Schroeder et al. | 2015 | 2.09 | 215 | 218 | 0.04 | 1.14 | 3 |
| b | Dai et al. | 2015 | 2 | 211 | 213 | 0.047 | 0.97 | 3 |
| c | Yin et al. | 2015 | 2.47 | 15 | 16 | 0.026 | 2.52 | 3 |
| a | Kupor et al. | | 2.21 | 59 | 61 | 0.031 | 1.96 | 3 |
| c | Kupor et al. | | 2.22 | 98 | 100 | 0.029 | 1.83 | 3 |
| b | Kupor et al. | | 2.27 | 200 | 202 | 0.024 | 1.68 | 3 |
| | Farooqui et al. | 2015 | 2.2 | 20 | 21 | 0.04 | 1.64 | 3 |
| | Olsson et al. | 2016 | 2.44 | 97 | 100 | 0.02 | 2.85 | 3 |
| | Watson-Jones et al. | 2016 | 2.05 | 86 | 88 | 0.043 | 1.38 | 3 |
| | Hung et al. | 2016 | -2.51 | 19 | 20 | 0.02 | 2.75 | 3 |
| b | Hsee et al. | 2016 | 2.35 | 17 | 20 | <.031 | 2.37 | 3 |
| a | Hsee et al. | 2016 | 2.25 | 52 | 54 | 0.029 | 2.13 | 3 |
| | Constable et al. | 2016 | 2.1 | 35 | 38 | 0.04 | 1.7 | 3 |
| | Chen et al. | 2016 | 2.39 | 187 | 189 | 0.018 | 2.21 | 3 |
| a | Picci et al. | 2016 | 2.25 | 29 | 30 | 0.032 | 2.15 | 3 |
The database, including all article information and reanalyzed Bayes factors, is available, along with the analysis and plot R scripts, on the project's OSF page: https://doi.org/10.17605/OSF.IO/3RF8B.
For a more detailed primer on the Bayes factor, please see Appendix A in Field and colleagues (2016); for a full exposé, see Etz and Vandekerckhove (2018).
We note that some of Mackey's guidelines lead to subjective decisions about what is theoretically relevant and important. What is theoretically important in one field may not be worth investigating in another, so it is vital to consider the context of a potential replication target and to root one's judgments in quantifiable argumentation.
Bayes factors can show evidential strength in favor of an alternative hypothesis (denoted BF10), or be inverted to show support for the null hypothesis (denoted BF01). In this article, we discuss Bayes factors only in terms of their support for the alternative, and so refrain from using the specific subscript notation or verbal indication.
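The inversion mentioned in this note can be written out explicitly. The block below is a standard statement of the relationship, not a formula taken from the article; the symbols D (the observed data), H_1 (the alternative), and H_0 (the null) are notation introduced here for illustration.

```latex
\mathrm{BF}_{10} = \frac{p(D \mid H_1)}{p(D \mid H_0)},
\qquad
\mathrm{BF}_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)} = \frac{1}{\mathrm{BF}_{10}}
```

For example, the Bayes factor of 0.97 listed in the table for result b of Dai et al. corresponds to BF_01 = 1/0.97, or roughly 1.03, which is essentially no evidence in either direction.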
We originally planned to consider the articles with the smallest Bayes factors. However, as we discuss later, many results have similar Bayes factors (e.g., 1.64, 1.68, and 1.70), which makes that criterion alone somewhat arbitrary.
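A small sketch may help show why a cutoff based on the smallest Bayes factors alone is hard to defend. It ranks six of the Evidence Tier 3 results from the table and flags neighbors in the ranking that are nearly tied. The function names and the tie tolerance of 0.1 are assumptions made for this illustration; they are not part of the authors' R scripts.

```python
# Illustrative sketch only: rank a few reanalyzed results by Bayes factor.
# The rows are copied from the Evidence Tier 3 entries of the table above.
# The names and the 0.1 tolerance are hypothetical, chosen for illustration.

TIER3_RESULTS = [
    ("Schroeder et al. (b)", 1.14),
    ("Dai et al. (b)", 0.97),
    ("Kupor et al. (b)", 1.68),
    ("Farooqui et al.", 1.64),
    ("Watson-Jones et al.", 1.38),
    ("Constable et al.", 1.70),
]


def rank_by_bf(results):
    """Sort results from weakest to strongest evidence (ascending Bayes factor)."""
    return sorted(results, key=lambda row: row[1])


def near_ties(results, tolerance=0.1):
    """Return adjacent pairs in the ranking whose Bayes factors differ by less than tolerance."""
    ranked = rank_by_bf(results)
    return [
        (first[0], second[0])
        for first, second in zip(ranked, ranked[1:])
        if second[1] - first[1] < tolerance
    ]


if __name__ == "__main__":
    for label, bf in rank_by_bf(TIER3_RESULTS):
        print(f"{label:24s} BF = {bf:.2f}")
    print(near_ties(TIER3_RESULTS))
```

With these rows, the three results at 1.64, 1.68, and 1.70 form two near-tied pairs, so a rule such as "take the five smallest" would split results that are practically indistinguishable. This is the arbitrariness the note describes, and it is one reason to pair the ranking with a qualitative assessment.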
More complicated approaches exist for handling multiple studies in a single paper that corroborate a given claim in the manuscript, for instance a Bayesian model-averaged meta-analysis.
We only target these articles to practically demonstrate how our approach can be used. We do not imply that they are of low veracity or that the results were obtained by questionable means.
Of course, the replication as described here would need to feature different risk-taking activities, as aged persons may be averse in general to activities such as skydiving.
The authors have no competing interests to declare.
DvR and RH conceived of the idea of reanalyzing Bayes factors to quantify the evidential strength of original article results; SMF conceived of the qualitative analysis and the overall process.
SMF extracted all article information, which formed the data file analyzed in the study.
DvR wrote the code for the reanalysis phase. SMF analyzed and interpreted the findings derived from it; DvR, RH and LB refined the interpretations and plots for the final manuscript.
SMF drafted the article; SMF, DvR, RH and LB further revised it.
SMF approved the submitted version for publication.
The author(s) of this paper chose the Open Review option, and the peer review comments are available at: http://doi.org/10.1525/collabra.218.pr
- Online ISSN 2474-7394
- © Copyright 2024 by the Regents of the University of California. All rights reserved.
Annual Review of Psychology
Volume 73, 2022, Review Article
Replicability, Robustness, and Reproducibility in Psychological Science
- Brian A. Nosek 1,2, Tom E. Hardwicke 3, Hannah Moshontz 4, Aurélien Allard 5, Katherine S. Corker 6, Anna Dreber 7, Fiona Fidler 8, Joe Hilgard 9, Melissa Kline Struhl 2, Michèle B. Nuijten 10, Julia M. Rohrer 11, Felipe Romero 12, Anne M. Scheel 13, Laura D. Scherer 14, Felix D. Schönbrodt 15, and Simine Vazire 16
- Affiliations:
  - 1 Department of Psychology, University of Virginia, Charlottesville, Virginia 22904, USA; email: [email protected]
  - 2 Center for Open Science, Charlottesville, Virginia 22903, USA
  - 3 Department of Psychology, University of Amsterdam, 1012 ZA Amsterdam, The Netherlands
  - 4 Addiction Research Center, University of Wisconsin–Madison, Madison, Wisconsin 53706, USA
  - 5 Department of Psychology, University of California, Davis, California 95616, USA
  - 6 Psychology Department, Grand Valley State University, Allendale, Michigan 49401, USA
  - 7 Department of Economics, Stockholm School of Economics, 113 83 Stockholm, Sweden
  - 8 School of Biosciences, University of Melbourne, Parkville VIC 3010, Australia
  - 9 Department of Psychology, Illinois State University, Normal, Illinois 61790, USA
  - 10 Meta-Research Center, Tilburg University, 5037 AB Tilburg, The Netherlands
  - 11 Department of Psychology, Leipzig University, 04109 Leipzig, Germany
  - 12 Department of Theoretical Philosophy, University of Groningen, 9712 CP Groningen, The Netherlands
  - 13 Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands
  - 14 University of Colorado Anschutz Medical Campus, Aurora, Colorado 80045, USA
  - 15 Department of Psychology, Ludwig Maximilian University of Munich, 80539 Munich, Germany
  - 16 School of Psychological Sciences, University of Melbourne, Parkville VIC 3052, Australia
- Vol. 73:719-748 (Volume publication date January 2022) https://doi.org/10.1146/annurev-psych-020821-114157
- First published as a Review in Advance on October 19, 2021
- Copyright © 2022 by Annual Reviews. All rights reserved
Replication—an important, uncommon, and misunderstood practice—is gaining appreciation in psychology. Achieving replicability is important for making research progress. If findings are not replicable, then prediction and theory development are stifled. If findings are replicable, then interrogation of their meaning and validity can advance knowledge. Assessing replicability can be productive for generating and testing hypotheses by actively confronting current understandings to identify weaknesses and spur innovation. For psychology, the 2010s might be characterized as a decade of active confrontation. Systematic and multi-site replication projects assessed current understandings and observed surprising failures to replicate many published findings. Replication efforts highlighted sociocultural challenges such as disincentives to conduct replications and a tendency to frame replication as a personal attack rather than a healthy scientific practice, and they raised awareness that replication contributes to self-correction. Nevertheless, innovation in doing and understanding replication and its cousins, reproducibility and robustness, has positioned psychology to improve research practices and accelerate progress.
Literature Cited
- Alogna VK , Attaya MK , Aucoin P , Bahník Š , Birch S et al. 2014 . Registered Replication Report: Schooler and Engstler-Schooler (1990). Perspect. Psychol. Sci. 9 : 5 556– 78 [Google Scholar]
- Altmejd A , Dreber A , Forsell E , Huber J , Imai T et al. 2019 . Predicting the replicability of social science lab experiments. PLOS ONE 14 : 12 e0225826 [Google Scholar]
- Anderson CJ , Bahník Š , Barnett-Cowan M , Bosco FA , Chandler J et al. 2016 . Response to Comment on “Estimating the reproducibility of psychological science. Science 351 : 6277 1037 [Google Scholar]
- Anderson MS , Martinson BC , De Vries R. 2007 . Normative dissonance in science: results from a national survey of U.S. scientists. J. Empir. Res. Hum. Res. Ethics 2 : 4 3– 14 [Google Scholar]
- Appelbaum M , Cooper H , Kline RB , Mayo-Wilson E , Nezu AM , Rao SM. 2018 . Journal article reporting standards for quantitative research in psychology: the APA Publications and Communications Board task force report. Am. Psychol. 73 : 1 3 – 25 Corrigendum 2018 . Am. Psychol 73 : 7 947 [Google Scholar]
- Armeni K , Brinkman L , Carlsson R , Eerland A , Fijten R et al. 2020 . Towards wide-scale adoption of open science practices: the role of open science communities. MetaArXiv, Oct. 6 https://doi.org/10.31222/osf.io/7gct9 [Crossref]
- Artner R , Verliefde T , Steegen S , Gomes S , Traets F et al. 2020 . The reproducibility of statistical results in psychological research: an investigation using unpublished raw data. Psychol. Methods. In press. https://doi.org/10.1037/met0000365 [Crossref] [Google Scholar]
- Baker M. 2016 . Dutch agency launches first grants programme dedicated to replication. Nat. News. https://doi.org/10.1038/nature.2016.20287 [Crossref] [Google Scholar]
- Bakker M , van Dijk A , Wicherts JM. 2012 . The rules of the game called psychological science. Perspect. Psychol. Sci. 7 : 6 543– 54 [Google Scholar]
- Bakker M , Wicherts JM. 2011 . The (mis)reporting of statistical results in psychology journals. Behav. Res. Methods 43 : 3 666– 78 [Google Scholar]
- Baribault B , Donkin C , Little DR , Trueblood JS , Oravecz Z et al. 2018 . Metastudies for robust tests of theory. PNAS 115 : 11 2607– 12 [Google Scholar]
- Baron J , Hershey JC. 1988 . Outcome bias in decision evaluation. J. Pers. Soc. Psychol. 54 : 4 569– 79 [Google Scholar]
- Baumeister RF. 2016 . Charting the future of social psychology on stormy seas: winners, losers, and recommendations. J. Exp. Soc. Psychol. 66 : 153– 58 [Google Scholar]
- Baumeister RF , Vohs KD. 2016 . Misguided effort with elusive implications. Perspect. Psychol. Sci. 11 : 4 574– 75 [Google Scholar]
- Benjamin DJ , Berger JO , Johannesson M , Nosek BA , Wagenmakers E-J et al. 2018 . Redefine statistical significance. Nat. Hum. Behav. 2 : 1 6– 10 [Google Scholar]
- Botvinik-Nezer R , Holzmeister F , Camerer CF , Dreber A , Huber J et al. 2020 . Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582 : 7810 84– 88 [Google Scholar]
- Bouwmeester S , Verkoeijen PPJL , Aczel B , Barbosa F , Bègue L et al. 2017 . Registered Replication Report: Rand, Greene, and Nowak (2012). Perspect. Psychol. Sci. 12 : 3 527– 42 [Google Scholar]
- Brown NJL , Heathers JAJ. 2017 . The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology. Soc. Psychol. Pers. Sci. 8 : 4 363– 69 [Google Scholar]
- Button KS , Ioannidis JPA , Mokrysz C , Nosek BA , Flint J et al. 2013 . Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14 : 5 365– 76 [Google Scholar]
- Byers-Heinlein K , Bergmann C , Davies C , Frank M , Hamlin JK et al. 2020 . Building a collaborative psychological science: lessons learned from ManyBabies 1. Can. Psychol. Psychol. Can. 61 : 4 349– 63 [Google Scholar]
- Camerer CF , Dreber A , Forsell E , Ho T-H , Huber J et al. 2016 . Evaluating replicability of laboratory experiments in economics. Science 351 : 6280 1433– 36 [Google Scholar]
- Camerer CF , Dreber A , Holzmeister F , Ho T-H , Huber J et al. 2018 . Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat. Hum. Behav. 2 : 9 637– 44 [Google Scholar]
- Carter EC , Schönbrodt FD , Gervais WM , Hilgard J. 2019 . Correcting for bias in psychology: a comparison of meta-analytic methods. Adv. Methods Pract. Psychol. Sci. 2 : 2 115– 44 [Google Scholar]
- Cent. Open Sci 2020 . APA joins as new signatory to TOP guidelines. Center for Open Science Nov. 10. https://www.cos.io/about/news/apa-joins-as-new-signatory-to-top-guidelines [Google Scholar]
- Cesario J. 2014 . Priming, replication, and the hardest science. Perspect. Psychol. Sci. 9 : 1 40– 48 [Google Scholar]
- Chambers C. 2019 . What's next for Registered Reports?. Nature 573 : 7773 187– 89 [Google Scholar]
- Cheung I , Campbell L , LeBel EP , Ackerman RA , Aykutoğlu B et al. 2016 . Registered Replication Report: Study 1 from Finkel, Rusbult, Kumashiro, & Hannon (2002). Perspect. Psychol. Sci. 11 : 5 750– 64 [Google Scholar]
- Christensen G , Wang Z , Paluck EL , Swanson N , Birke DJ , Miguel E , Littman R. 2019 . Open science practices are on the rise: the State of Social Science (3S) Survey. MetaArXiv, Oct. 18. https://doi.org/10.31222/osf.io/5rksu [Crossref]
- Christensen-Szalanski JJ , Willham CF. 1991 . The hindsight bias: a meta-analysis. Organ. Behav. Hum. Decis. Process. 48 : 1 147– 68 [Google Scholar]
- Cohen J. 1962 . The statistical power of abnormal-social psychological research: a review. J. Abnorm. Soc. Psychol. 65 : 3 145– 53 [Google Scholar]
- Cohen J. 1973 . Statistical power analysis and research results. Am. Educ. Res. J. 10 : 3 225– 29 [Google Scholar]
- Cohen J. 1990 . Things I have learned (so far). Am. Psychol. 45 : 1304– 12 [Google Scholar]
- Cohen J. 1992 . A power primer. Psychol. Bull. 112 : 1 155– 59 [Google Scholar]
- Cohen J. 1994 . The earth is round (p < .05). Am. Psychol. 49 : 12 997– 1003 [Google Scholar]
- Colling LJ , Szücs D , De Marco D , Cipora K , Ulrich R et al. 2020 . Registered Replication Report on Fischer, Castel, Dodd, and Pratt (2003). Adv. Methods Pract. Psychol. Sci 3 : 2 143– 62 [Google Scholar]
- Cook FL. 2016 . Dear Colleague Letter: robust and reliable research in the social, behavioral, and economic sciences. National Science Foundation Sept. 20. https://www.nsf.gov/pubs/2016/nsf16137/nsf16137.jsp [Google Scholar]
- Crandall CS , Sherman JW. 2016 . On the scientific superiority of conceptual replications for scientific progress. J. Exp. Soc. Psychol. 66 : 93– 99 [Google Scholar]
- Crisp RJ , Miles E , Husnu S 2014 . Support for the replicability of imagined contact effects. Soc. Psychol. 45 : 4 303– 4 [Google Scholar]
- Cronbach LJ , Meehl PE. 1955 . Construct validity in psychological tests. Psychol. Bull. 52 : 4 281– 302 [Google Scholar]
- Dang J , Barker P , Baumert A , Bentvelzen M , Berkman E et al. 2021 . A multilab replication of the ego depletion effect. Soc. Psychol. Pers. Sci. 12 : 1 14– 24 [Google Scholar]
- Devezer B , Nardin LG , Baumgaertner B , Buzbas EO. 2019 . Scientific discovery in a model-centric framework: reproducibility, innovation, and epistemic diversity. PLOS ONE 14 : 5 e0216125 [Google Scholar]
- Dijksterhuis A. 2018 . Reflection on the professor-priming replication report. Perspect. Psychol. Sci. 13 : 2 295– 96 [Google Scholar]
- Dreber A , Pfeiffer T , Almenberg J , Isaksson S , Wilson B et al. 2015 . Using prediction markets to estimate the reproducibility of scientific research. PNAS 112 : 50 15343– 47 [Google Scholar]
- Duhem PMM. 1954 . The Aim and Structure of Physical Theory Princeton, NJ: Princeton Univ. Press [Google Scholar]
- Ebersole CR , Alaei R , Atherton OE , Bernstein MJ , Brown M et al. 2017 . Observe, hypothesize, test, repeat: Luttrell, Petty and Xu (2017) demonstrate good science. J. Exp. Soc. Psychol. 69 : 184– 86 [Google Scholar]
- Ebersole CR , Atherton OE , Belanger AL , Skulborstad HM , Allen JM et al. 2016a . Many Labs 3: evaluating participant pool quality across the academic semester via replication. J. Exp. Soc. Psychol. 67 : 68– 82 [Google Scholar]
- Ebersole CR , Axt JR , Nosek BA 2016b . Scientists’ reputations are based on getting it right, not being right. PLOS Biol . 14 : 5 e1002460 [Google Scholar]
- Ebersole CR , Mathur MB , Baranski E , Bart-Plange D-J , Buttrick NR et al. 2020 . Many Labs 5: testing pre-data-collection peer review as an intervention to increase replicability. Adv. Methods Pract. Psychol. Sci. 3 : 3 309– 31 [Google Scholar]
- Eerland A , Sherrill AM , Magliano JP , Zwaan RA , Arnal JD et al. 2016 . Registered Replication Report: Hart & Albarracín (2011). Perspect. Psychol. Sci. 11 : 1 158– 71 [Google Scholar]
- Ellemers N , Fiske ST , Abele AE , Koch A , Yzerbyt V. 2020 . Adversarial alignment enables competing models to engage in cooperative theory building toward cumulative science. PNAS 117 : 14 7561– 67 [Google Scholar]
- Epskamp S , Nuijten MB. 2018 . Statcheck: extract statistics from articles and recompute p values. Statistical Software https://CRAN.R-project.org/package=statcheck [Google Scholar]
- Errington TM , Denis A , Perfito N , Iorns E , Nosek BA 2021 . Challenges for assessing reproducibility and replicability in preclinical cancer biology. eLife In press [Google Scholar]
- Etz A , Vandekerckhove J. 2016 . A Bayesian perspective on the reproducibility project: psychology. PLOS ONE 11 : 2 e0149794 [Google Scholar]
- Fanelli D. 2010 .. “ Positive” results increase down the hierarchy of the sciences. PLOS ONE 5 : 4 e10068 [Google Scholar]
- Fanelli D. 2012 . Negative results are disappearing from most disciplines and countries. Scientometrics 90 : 3 891– 904 [Google Scholar]
- Feest U. 2019 . Why replication is overrated. Philos. Sci. 86 : 5 895– 905 [Google Scholar]
- Ferguson MJ , Carter TJ , Hassin RR. 2014 . Commentary on the attempt to replicate the effect of the American flag on increased Republican attitudes. Soc. Psychol. 45 : 4 301– 2 [Google Scholar]
- Fetterman AK , Sassenberg K. 2015 . The reputational consequences of failed replications and wrongness admission among scientists. PLOS ONE 10 : 12 e0143723 [Google Scholar]
- Forsell E , Viganola D , Pfeiffer T , Almenberg J , Wilson B et al. 2019 . Predicting replication outcomes in the Many Labs 2 study. J. Econ. Psychol. 75 : 102117 [Google Scholar]
- Franco A , Malhotra N , Simonovits G. 2014 . Publication bias in the social sciences: unlocking the file drawer. Science 345 : 6203 1502– 5 [Google Scholar]
- Franco A , Malhotra N , Simonovits G. 2016 . Underreporting in psychology experiments: evidence from a study registry. Soc. Psychol. Pers. Sci. 7 : 1 8– 12 [Google Scholar]
- Frank MC , Bergelson E , Bergmann C , Cristia A , Floccia C et al. 2017 . A collaborative approach to infant research: promoting reproducibility, best practices, and theory-building. Infancy 22 : 4 421– 35 [Google Scholar]
- Funder DC , Ozer DJ. 2019 . Evaluating effect size in psychological research: sense and nonsense. Adv. Methods Pract. Psychol. Sci. 2 : 2 156– 68 [Google Scholar]
- Gelman A , Carlin J. 2014 . Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. Perspect. Psychol. Sci. 9 : 6 641– 51 [Google Scholar]
- Gelman A , Loken E. 2013 . The garden of forking paths: why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time Work. Pap., Columbia Univ. New York: [Google Scholar]
- Gergen KJ. 1973 . Social psychology as history. J. Pers. Soc. Psychol. 26 : 2 309– 20 [Google Scholar]
- Gervais WM , Jewell JA , Najle MB , Ng BKL. 2015 . A powerful nudge? Presenting calculable consequences of underpowered research shifts incentives toward adequately powered designs. Soc. Psychol. Pers. Sci. 6 : 7 847– 54 [Google Scholar]
- Ghelfi E , Christopherson CD , Urry HL , Lenne RL , Legate N et al. 2020 . Reexamining the effect of gustatory disgust on moral judgment: a multilab direct replication of Eskine, Kacinik, and Prinz (2011). Adv. Methods Pract. Psychol. Sci. 3 : 1 3– 23 [Google Scholar]
- Gilbert DT , King G , Pettigrew S , Wilson TD 2016 . Comment on “Estimating the reproducibility of psychological science. Science 351 : 6277 1037 [Google Scholar]
- Giner-Sorolla R. 2012 . Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science. Perspect. Psychol. Sci. 7 : 6 562– 71 [Google Scholar]
- Giner-Sorolla R. 2019 . From crisis of evidence to a “crisis” of relevance? Incentive-based answers for social psychology's perennial relevance worries. Eur. Rev. Soc. Psychol. 30 : 1 1– 38 [Google Scholar]
- Gollwitzer M. 2020 . DFG Priority Program SPP 2317 Proposal: A meta-scientific program to analyze and optimize replicability in the behavioral, social, and cognitive sciences (META-REP). PsychArchives, May 29. http://dx.doi.org/10.23668/psycharchives.3010 [Crossref]
- Gordon M , Viganola D , Bishop M , Chen Y , Dreber A et al. 2020 . Are replication rates the same across academic fields? Community forecasts from the DARPA SCORE programme. R. Soc. Open Sci. 7 : 7 200566 [Google Scholar]
- Götz M , O'Boyle EH , Gonzalez-Mulé E , Banks GC , Bollmann SS 2020 . The “Goldilocks Zone”: (Too) many confidence intervals in tests of mediation just exclude zero. Psychol. Bull. 147 : 1 95– 114 [Google Scholar]
- Greenwald AG. 1975 . Consequences of prejudice against the null hypothesis. Psychol. Bull. 82 : 1 1– 20 [Google Scholar]
- Hagger MS , Chatzisarantis NLD , Alberts H , Anggono CO , Batailler C et al. 2016 . A multilab preregistered replication of the ego-depletion effect. Perspect. Psychol. Sci. 11 : 4 546– 73 [Google Scholar]
- Hanea AM , McBride MF , Burgman MA , Wintle BC , Fidler F et al. 2017 . I nvestigate D iscuss E stimate A ggregate for structured expert judgement. Int. J. Forecast. 33 : 1 267– 79 [Google Scholar]
- Hardwicke TE , Bohn M , MacDonald KE , Hembacher E , Nuijten MB et al. 2021 . Analytic reproducibility in articles receiving open data badges at the journal Psychological Science : an observational study. R. Soc. Open Sci. 8 : 1 201494 [Google Scholar]
- Hardwicke TE , Mathur MB , MacDonald K , Nilsonne G , Banks GC et al. 2018 . Data availability, reusability, and analytic reproducibility: evaluating the impact of a mandatory open data policy at the journal Cognition . R. Soc. Open Sci. 5 : 8 180448 [Google Scholar]
- Hardwicke TE , Serghiou S , Janiaud P , Danchev V , Crüwell S et al. 2020a . Calibrating the scientific ecosystem through meta-research. Annu. Rev. Stat. Appl. 7 : 11– 37 [Google Scholar]
- Hardwicke TE , Thibault RT , Kosie JE , Wallach JD , Kidwell M , Ioannidis J. 2020b . Estimating the prevalence of transparency and reproducibility-related research practices in psychology (2014–2017). MetaArXiv, Jan. 2. https://doi.org/10.31222/osf.io/9sz2y [Crossref]
- Hedges LV , Schauer JM. 2019 . Statistical analyses for studying replication: meta-analytic perspectives. Psychol. Methods 24 : 5 557– 70 [Google Scholar]
- Hoogeveen S , Sarafoglou A , Wagenmakers E-J. 2020 . Laypeople can predict which social-science studies will be replicated successfully. Adv. Methods Pract. Psychol. Sci. 3 : 3 267– 85 [Google Scholar]
- Hughes BM. 2018 . Psychology in Crisis London: Palgrave Macmillan [Google Scholar]
- Inbar Y. 2016 . Association between contextual dependence and replicability in psychology may be spurious. PNAS 113 : 34 E4933– 34 [Google Scholar]
- Ioannidis JPA. 2005 . Why most published research findings are false. PLOS Med 2 : 8 e124 [Google Scholar]
- Ioannidis JPA. 2008 . Why most discovered true associations are inflated. Epidemiology 19 : 5 640– 48 [Google Scholar]
- Ioannidis JPA. 2014 . How to make more published research true. PLOS Med 11 : 10 e1001747 [Google Scholar]
- Ioannidis JPA , Trikalinos TA. 2005 . Early extreme contradictory estimates may appear in published research: the Proteus phenomenon in molecular genetics research and randomized trials. J. Clin. Epidemiol. 58 : 6 543– 49 [Google Scholar]
- Isager PM , van Aert RCM , Bahník Š , Brandt M , DeSoto KA et al. 2020 . Deciding what to replicate: A formal definition of “replication value” and a decision model for replication study selection. MetaArXiv, Sept. 2. https://doi.org/10.31222/osf.io/2gurz [Crossref]
- John LK , Loewenstein G , Prelec D. 2012 . Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23 : 5 524– 32 [Google Scholar]
- Kahneman D. 2003 . Experiences of collaborative research. Am. Psychol. 58 : 9 723– 30 [Google Scholar]
- Kerr NL. 1998 . HARKing: Hypothesizing after the results are known. Pers. Soc. Psychol. Rev. 2 : 3 196– 217 [Google Scholar]
- Kidwell MC , Lazarević LB , Baranski E , Hardwicke TE , Piechowski S et al. 2016 . Badges to acknowledge open practices: a simple, low-cost, effective method for increasing transparency. PLOS Biol 14 : 5 e1002456 [Google Scholar]
- Klein RA , Cook CL , Ebersole CR , Vitiello C , Nosek BA et al. 2019 . Many Labs 4: failure to replicate mortality salience effect with and without original author involvement. PsyArXiv, Dec. 11. https://doi.org/10/ghwq2w [Crossref]
- Klein RA , Ratliff KA , Vianello M , Adams RB , Bahník Š et al. 2014 . Investigating variation in replicability: a “many labs” replication project. Soc. Psychol. 45 : 3 142– 52 [Google Scholar]
- Klein RA , Vianello M , Hasselman F , Adams BG , Adams RB et al. 2018 . Many Labs 2: investigating variation in replicability across samples and settings. Adv. Methods Pract. Psychol. Sci. 1 : 4 443– 90 [Google Scholar]
- Kunda Z. 1990 . The case for motivated reasoning. Psychol. Bull. 108 : 3 480– 98 [Google Scholar]
- Lakens D. 2019 . The value of preregistration for psychological science: a conceptual analysis. PsyArXiv, Nov. 18. https://doi.org/10.31234/osf.io/jbh4w [Crossref]
- Lakens D , Adolfi FG , Albers CJ , Anvari F , Apps MA et al. 2018 . Justify your alpha. Nat. Hum. Behav. 2 : 3 168– 71 [Google Scholar]
- Landy JF , Jia ML , Ding IL , Viganola D , Tierney W et al. 2020 . Crowdsourcing hypothesis tests: making transparent how design choices shape research results. Psychol. Bull. 146 : 5 451– 79 [Google Scholar]
- Leary MR , Diebels KJ , Davisson EK , Jongman-Sereno KP , Isherwood JC et al. 2017 . Cognitive and interpersonal features of intellectual humility. Pers. Soc. Psychol. Bull. 43 : 6 793– 813 [Google Scholar]
- LeBel EP , McCarthy RJ , Earp BD , Elson M , Vanpaemel W. 2018 . A unified framework to quantify the credibility of scientific findings. Adv. Methods Pract. Psychol. Sci. 1 : 3 389– 402 [Google Scholar]
- Leighton DC , Legate N , LePine S , Anderson SF , Grahe J 2018 . Self-esteem, self-disclosure, self-expression, and connection on Facebook: a collaborative replication meta-analysis. Psi Chi J. Psychol. Res. 23 : 2 98– 109 [Google Scholar]
- Leising D , Thielmann I , Glöckner A , Gärtner A , Schönbrodt F. 2020 . Ten steps toward a better personality science—how quality may be rewarded more in research evaluation. PsyArXiv, May 31. https://doi.org/10.31234/osf.io/6btc3 [Crossref]
- Leonelli S 2018 . Rethinking reproducibility as a criterion for research quality. Research in the History of Economic Thought and Methodology 36 L Fiorito, S Scheall, CE Suprinyak 129– 46 Bingley, UK: Emerald [Google Scholar]
- Lewandowsky S , Oberauer K. 2020 . Low replicability can support robust and efficient science. Nat. Commun. 11 : 1 358 [Google Scholar]
- Maassen E , van Assen MALM , Nuijten MB , Olsson-Collentine A , Wicherts JM. 2020 . Reproducibility of individual effect sizes in meta-analyses in psychology. PLOS ONE 15 : 5 e0233107 [Google Scholar]
- Machery E. 2020 . What is a replication?. Philos. Sci. 87 : 4 545 – 67 [Google Scholar]
- ManyBabies Consort 2020 . Quantifying sources of variability in infancy research using the infant-directed-speech preference. Adv. Methods Pract. Psychol. Sci. 3 : 1 24– 52 [Google Scholar]
- Marcus A , Oransky I 2018 . Meet the “data thugs” out to expose shoddy and questionable research. Science Feb. 18. https://www.sciencemag.org/news/2018/02/meet-data-thugs-out-expose-shoddy-and-questionable-research [Google Scholar]
- Marcus A , Oransky I. 2020 . Tech firms hire “Red Teams.” Scientists should, too. WIRED July 16. https://www.wired.com/story/tech-firms-hire-red-teams-scientists-should-too/ [Google Scholar]
- Mathur MB , VanderWeele TJ. 2020 . New statistical metrics for multisite replication projects. J. R. Stat. Soc. A 183 : 3 1145– 66 [Google Scholar]
- Maxwell SE. 2004 . The persistence of underpowered studies in psychological research: causes, consequences, and remedies. Psychol. Methods 9 : 2 147– 63 [Google Scholar]
- Maxwell SE , Lau MY , Howard GS. 2015 . Is psychology suffering from a replication crisis? What does “failure to replicate” really mean?. Am. Psychol. 70 : 6 487– 98 [Google Scholar]
- Mayo DG. 2018 . Statistical Inference as Severe Testing Cambridge, UK: Cambridge Univ. Press [Google Scholar]
- McCarthy R , Gervais W , Aczel B , Al-Kire R , Baraldo S et al. 2021 . A multi-site collaborative study of the hostile priming effect. Collabra Psychol 7 : 1 18738 [Google Scholar]
- McCarthy RJ , Hartnett JL , Heider JD , Scherer CR , Wood SE et al. 2018 . An investigation of abstract construal on impression formation: a multi-lab replication of McCarthy and Skowronski (2011). Int. Rev. Soc. Psychol. 31 : 1 15 [Google Scholar]
- Meehl PE. 1978 . Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. J. Consult. Clin. Psychol. 46 : 4 806– 34 [Google Scholar]
- Meyer MN , Chabris C. 2014 . Why psychologists' food fight matters. Slate Magazine July 31. https://slate.com/technology/2014/07/replication-controversy-in-psychology-bullying-file-drawer-effect-blog-posts-repligate.html [Google Scholar]
- Mischel W. 2008 . The toothbrush problem. APS Observer Dec. 1. https://www.psychologicalscience.org/observer/the-toothbrush-problem [Google Scholar]
- Moran T , Hughes S , Hussey I , Vadillo MA , Olson MA et al. 2020 . Incidental attitude formation via the surveillance task: a Registered Replication Report of Olson and Fazio (2001). PsyArXiv, April 17. https://doi.org/10/ghwq2z [Crossref]
- Moshontz H , Campbell L , Ebersole CR , IJzerman H , Urry HL et al. 2018 . The Psychological Science Accelerator: advancing psychology through a distributed collaborative network. Adv. Methods Pract. Psychol. Sci. 1 : 4 501– 15 [Google Scholar]
- Munafò MR , Chambers CD , Collins AM , Fortunato L , Macleod MR. 2020 . Research culture and reproducibility. Trends Cogn. Sci. 24 : 2 91– 93 [Google Scholar]
- Muthukrishna M , Henrich J. 2019 . A problem in theory. Nat. Hum. Behav. 3 : 3 221– 29 [Google Scholar]
- Natl. Acad. Sci. Eng. Med 2019 . Reproducibility and Replicability in Science Washington, DC: Natl. Acad. Press [Google Scholar]
- Nelson LD , Simmons J , Simonsohn U. 2018 . Psychology's renaissance. Annu. Rev. Psychol. 69 : 511– 34 [Google Scholar]
- Nickerson RS. 1998 . Confirmation bias: a ubiquitous phenomenon in many guises. Rev. Gen. Psychol. 2 : 2 175– 220 [Google Scholar]
- Nosek B. 2019a . Strategy for culture change. Center for Open Science June 11. https://www.cos.io/blog/strategy-for-culture-change [Google Scholar]
- Nosek B. 2019b . The rise of open science in psychology, a preliminary report. Center for Open Science June 3. https://www.cos.io/blog/rise-open-science-psychology-preliminary-report [Google Scholar]
- Nosek BA , Alter G , Banks GC , Borsboom D , Bowman SD et al. 2015 . Promoting an open research culture. Science 348 : 6242 1422– 25 [Google Scholar]
- Nosek BA , Beck ED , Campbell L , Flake JK , Hardwicke TE et al. 2019 . Preregistration is hard, and worthwhile. Trends Cogn. Sci. 23 : 10 815– 18 [Google Scholar]
- Nosek BA , Ebersole CR , DeHaven AC , Mellor DT. 2018 . The preregistration revolution. PNAS 115 : 11 2600– 6 [Google Scholar]
- Nosek BA , Errington TM. 2020a . What is replication?. PLOS Biol 18 : 3 e3000691 [Google Scholar]
- Nosek BA , Errington TM. 2020b . The best time to argue about what a replication means? Before you do it. Nature 583 : 7817 518– 20 [Google Scholar]
- Nosek BA , Gilbert EA. 2017 . Mischaracterizing replication studies leads to erroneous conclusions. PsyArXiv, April 18. https://doi.org/10.31234/osf.io/nt4d3 [Crossref]
- Nosek BA , Spies JR , Motyl M. 2012 . Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspect. On Psychol. Sci. 7 : 6 615– 31 [Google Scholar]
- Nuijten MB , Bakker M , Maassen E , Wicherts JM. 2018 . Verify original results through reanalysis before replicating. Behav. Brain Sci. 41 : e143 [Google Scholar]
- Nuijten MB , Hartgerink CHJ , van Assen MALM , Epskamp S , Wicherts JM 2016 . The prevalence of statistical reporting errors in psychology (1985–2013). Behav. Res. Methods 48 : 4 1205– 26 [Google Scholar]
- Nuijten MB , van Assen MA , Veldkamp CL , Wicherts JM. 2015 . The replication paradox: Combining studies can decrease accuracy of effect size estimates. Rev. Gen. Psychol. 19 : 2 172– 82 [Google Scholar]
- O'Donnell M , Nelson LD , Ackermann E , Aczel B , Akhtar A et al. 2018 . Registered Replication Report: Dijksterhuis and van Knippenberg (1998). Perspect. Psychol. Sci. 13 : 2 268– 94 [Google Scholar]
- Olsson-Collentine A , Wicherts JM , van Assen MALM. 2020 . Heterogeneity in direct replications in psychology and its association with effect size. Psychol. Bull. 146 : 10 922– 40 [Google Scholar]
- Open Sci. Collab 2015 . Estimating the reproducibility of psychological science. Science 349 : 6251 aac4716 [Google Scholar]
- Patil P , Peng RD , Leek JT. 2016 . What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspect. Psychol. Sci. 11 : 4 539– 44 [Google Scholar]
- Pawel S , Held L. 2020 . Probabilistic forecasting of replication studies. PLOS ONE 15 : 4 e0231416 [Google Scholar]
- Perugini M , Gallucci M , Costantini G. 2014 . Safeguard power as a protection against imprecise power estimates. Perspect. Psychol. Sci. 9 : 3 319– 32 [Google Scholar]
- Protzko J , Krosnick J , Nelson LD , Nosek BA , Axt J et al. 2020 . High replicability of newly-discovered social-behavioral findings is achievable. PsyArXiv, Sept. 10. https://doi.org/10.31234/osf.io/n2a9x [Crossref]
- Rogers EM. 2003 . Diffusion of Innovations New York: Free Press, 5th ed.. [Google Scholar]
- Romero F. 2017 . Novelty versus replicability: virtues and vices in the reward system of science. Philos. Sci. 84 : 5 1031– 43 [Google Scholar]
- Rosenthal R. 1979 . The file drawer problem and tolerance for null results. Psychol. Bull. 86 : 3 638– 41 [Google Scholar]
- Rothstein HR , Sutton AJ , Borenstein M 2005 . Publication bias in meta-analysis. Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments HR Rothstein, AJ Sutton, M Borenstein 1– 7 Chichester, UK: Wiley & Sons [Google Scholar]
- Rouder JN. 2016 . The what, why, and how of born-open data. Behav. Res. Methods 48 : 3 1062– 69 [Google Scholar]
- Scheel AM , Schijen M , Lakens D. 2020 . An excess of positive results: comparing the standard psychology literature with Registered Reports. PsyArXiv, Febr. 5. https://doi.org/10.31234/osf.io/p6e9c [Crossref]
- Schimmack U. 2012 . The ironic effect of significant results on the credibility of multiple-study articles. Psychol. Methods 17 : 4 551– 66 [Google Scholar]
- Schmidt S. 2009 . Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Rev. Gen. Psychol. 13 : 2 90– 100 [Google Scholar]
- Schnall S 2014 . Commentary and rejoinder on Johnson, Cheung, and Donnellan (2014a). Clean data: Statistical artifacts wash out replication efforts. Soc. Psychol. 45 : 4 315– 17 [Google Scholar]
- Schwarz N , Strack F. 2014 . Does merely going through the same moves make for a “direct” replication? Concepts, contexts, and operationalizations. Soc. Psychol. 45 : 4 305– 6 [Google Scholar]
- Schweinsberg M , Madan N , Vianello M , Sommer SA , Jordan J et al. 2016 . The pipeline project: pre-publication independent replications of a single laboratory's research pipeline. J. Exp. Soc. Psychol. 66 : 55– 67 [Google Scholar]
- Sedlmeier P , Gigerenzer G. 1992 . Do studies of statistical power have an effect on the power of studies?. Psychol. Bull. 105 : 2 309– 16 [Google Scholar]
- Shadish WR , Cook TD , Campbell DT 2002 . Experimental and Quasi-Experimental Designs for Generalized Causal Inference Boston: Houghton Mifflin [Google Scholar]
- Shiffrin RM , Börner K , Stigler SM. 2018 . Scientific progress despite irreproducibility: a seeming paradox. PNAS 115 : 11 2632– 39 [Google Scholar]
- Shih M , Pittinsky TL 2014 . Reflections on positive stereotypes research and on replications. Soc. Psychol. 45 : 4 335– 38 [Google Scholar]
- Silberzahn R , Uhlmann EL , Martin DP , Anselmi P , Aust F et al. 2018 . Many analysts, one data set: making transparent how variations in analytic choices affect results. Adv. Methods Pract. Psychol. Sci. 1 : 3 337– 56 [Google Scholar]
- Simmons JP , Nelson LD , Simonsohn U 2011 . False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22 : 11 1359– 66 [Google Scholar]
- Simons DJ. 2014 . The value of direct replication. Perspect. Psychol. Sci. 9 : 1 76– 80 [Google Scholar]
- Simons DJ , Shoda Y , Lindsay DS. 2017 . Constraints on generality (COG): a proposed addition to all empirical papers. Perspect. Psychol. Sci. 12 : 6 1123– 28 [Google Scholar]
- Simonsohn U. 2015 . Small telescopes: detectability and the evaluation of replication results. Psychol. Sci. 26 : 5 559– 69 [Google Scholar]
- Simonsohn U , Simmons JP , Nelson LD. 2020 . Specification curve analysis. Nat. Hum. Behav. 4 : 1208– 14 [Google Scholar]
- Smaldino PE , McElreath R. 2016 . The natural selection of bad science. R. Soc. Open Sci. 3 : 9 160384 [Google Scholar]
- Smith PL , Little DR. 2018 . Small is beautiful: in defense of the small-N design. Psychon. Bull. Rev. 25 : 6 2083– 101 [Google Scholar]
- Soderberg CK. 2018 . Using OSF to share data: a step-by-step guide. Adv. Methods Pract. Psychol. Sci. 1 : 1 115– 20 [Google Scholar]
- Soderberg CK , Errington T , Schiavone SR , Bottesini JG , Thorn FS et al. 2021 . Initial evidence of research quality of Registered Reports compared with the standard publishing model. Nat. Hum. Behav 5 : 8 990 – 97 [Google Scholar]
- Soto CJ. 2019 . How replicable are links between personality traits and consequential life outcomes? The life outcomes of personality replication project. Psychol. Sci. 30 : 5 711– 27 [Google Scholar]
- Spellman BA. 2015 . A short (personal) future history of revolution 2.0. Perspect. Psychol. Sci. 10 : 6 886– 99 [Google Scholar]
- Steegen S , Tuerlinckx F , Gelman A , Vanpaemel W 2016 . Increasing transparency through a multiverse analysis. Perspect. Psychol. Sci. 11 : 5 702– 12 [Google Scholar]
- Sterling TD. 1959 . Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. J. Am. Stat. Assoc. 54 : 285 30– 34 [Google Scholar]
- Sterling TD , Rosenbaum WL , Weinkam JJ. 1995 . Publication decisions revisited: the effect of the outcome of statistical tests on the decision to publish and vice versa. Am. Stat. 49 : 108– 12 [Google Scholar]
- Stroebe W , Strack F. 2014 . The alleged crisis and the illusion of exact replication. Perspect. Psychol. Sci. 9 : 1 59– 71 [Google Scholar]
- Szucs D , Ioannidis JPA. 2017 . Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLOS Biol 15 : 3 e2000797 [Google Scholar]
- Tiokhin L , Derex M. 2019 . Competition for novelty reduces information sampling in a research game—a registered report. R. Soc. Open Sci. 6 : 5 180934 [Google Scholar]
- Van Bavel JJ , Mende-Siedlecki P , Brady WJ , Reinero DA 2016 . Contextual sensitivity in scientific reproducibility. PNAS 113 : 23 6454– 59 [Google Scholar]
- Vazire S. 2018 . Implications of the credibility revolution for productivity, creativity, and progress. Perspect. Psychol. Sci. 13 : 4 411– 17 [Google Scholar]
- Vazire S , Schiavone SR , Bottesini JG. 2020 . Credibility beyond replicability: improving the four validities in psychological science. PsyArXiv, Oct. 7. https://doi.org/10.31234/osf.io/bu4d3 [Crossref]
- Verhagen J , Wagenmakers E-J. 2014 . Bayesian tests to quantify the result of a replication attempt. J. Exp. Psychol. Gen. 143 : 4 1457– 75 [Google Scholar]
- Verschuere B , Meijer EH , Jim A , Hoogesteyn K , Orthey R et al. 2018 . Registered Replication Report on Mazar, Amir, and Ariely (2008). Adv. Methods Pract. Psychol. Sci. 1 : 3 299– 317 [Google Scholar]
- Vosgerau J , Simonsohn U , Nelson LD , Simmons JP 2019 . 99% impossible: a valid, or falsifiable, internal meta-analysis. J. Exp. Psychol. Gen. 148 : 9 1628– 39 [Google Scholar]
- Wagenmakers E-J , Beek T , Dijkhoff L , Gronau QF , Acosta A et al. 2016 . Registered Replication Report: Strack, Martin, & Stepper (1988). Perspect. Psychol. Sci. 11 : 6 917– 28 [Google Scholar]
- Wagenmakers E-J , Wetzels R , Borsboom D , van der Maas HL. 2011 . Why psychologists must change the way they analyze their data. The case of psi: comment on Bem (2011). J. Pers. Soc. Psychol. 100 : 3 426– 32 [Google Scholar]
- Wagenmakers E-J , Wetzels R , Borsboom D , van der Maas HL , Kievit RA. 2012 . An agenda for purely confirmatory research. Perspect. Psychol. Sci. 7 : 6 632– 38 [Google Scholar]
- Wagge J , Baciu C , Banas K , Nadler JT , Schwarz S et al. 2018 . A demonstration of the Collaborative Replication and Education Project: replication attempts of the red-romance effect. PsyArXiv, June 22. https://doi.org/10.31234/osf.io/chax8 [Crossref]
- Whitcomb D , Battaly H , Baehr J , Howard-Snyder D. 2017 . Intellectual humility: owning our limitations. Philos. Phenomenol. Res. 94 : 3 509– 39 [Google Scholar]
- Wicherts JM , Bakker M , Molenaar D. 2011 . Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLOS ONE 6 : 11 e26828 [Google Scholar]
- Wiktop G. 2020 . Systematizing Confidence in Open Research and Evidence (SCORE). Defense Advanced Research Projects Agency https://www.darpa.mil/program/systematizing-confidence-in-open-research-and-evidence [Google Scholar]
- Wilkinson MD , Dumontier M , Aalbersberg IJ , Appleton G , Axton M et al. 2016 . The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3 : 1 160018 [Google Scholar]
- Wilson BM , Harris CR , Wixted JT. 2020 . Science is not a signal detection problem. PNAS 117 : 11 5559– 67 [Google Scholar]
- Wilson BM , Wixted JT. 2018 . The prior odds of testing a true effect in cognitive and social psychology. Adv. Methods Pract. Psychol. Sci. 1 : 2 186– 97 [Google Scholar]
- Wintle B , Mody F , Smith E , Hanea A , Wilkinson DP et al. 2021 . Predicting and reasoning about replicability using structured groups. MetaArXiv, May 4. https://doi.org/10.31222/osf.io/vtpmb [Crossref]
- Yang Y , Youyou W , Uzzi B. 2020 . Estimating the deep replicability of scientific findings using human and artificial intelligence. PNAS 117 : 20 10762– 68 [Google Scholar]
- Yarkoni T. 2019 . The generalizability crisis. PsyArXiv, Nov. 22. https://doi.org/10.31234/osf.io/jqw35 [Crossref]
- Yong E 2012 . A failed replication draws a scathing personal attack from a psychology professor. National Geographic March 10. https://www.nationalgeographic.com/science/phenomena/2012/03/10/failed-replication-bargh-psychology-study-doyen/ [Google Scholar]
Supplementary Data
Download the Supplemental Material as a single PDF. Includes Supplemental Text, Supplemental Tables 1-12, and Supplemental Figures 1-2.
- Article Type: Review Article
The Importance of Replication
- Living reference work entry
- First Online: 18 April 2018
- Christopher J. Holden 3 &
- Garrett Goodwin 4
Replication ; Reproducibility ; Reproducibility project
Replication is a crucial part of scientific practice. Replication involves following the methods and procedures of a previous study in an attempt to reproduce the findings of that study. This is done to ensure that the findings and any theoretical conclusions are valid.
Introduction
Replication and falsification are at the core of scientific practice. Replication is the effort to reproduce findings from previous research using the same, or similar, methods to further validate these findings and build theory. Similarly, falsification is the practice of generating predictions (i.e., hypotheses) that can be demonstrated to be false and adjusting theory as observations either support or refute these predictions. These principles extend back to the likes of Sir Francis Bacon, Karl Popper, and R.A. Fisher (Bacon 1859; Fisher 1925; Popper 1992). Furthermore, they are often used to delineate between science and...
Bacon, R. (1859). Fr. Rogeri Bacon Opera quædam hactenus inedita. Vol. I. Containing I. – Opus tertium. II. – Opus minus. III. – Compendium philosophiæ . London: Longman, Green, Longman and Roberts. Retrieved from http://books.google.com/books?id=wMUKAAAAYAAJ . (Original work published 1267).
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100 , 407–425.
Bruns, S. B., & Ioannidis, J. P. (2016). P-curve and p-hacking in observational research. PLoS One, 11 . https://doi.org/10.1371/journal.pone.0149144 .
Fisher, R. (1925). Theory of statistical estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 22 , 700–725.
Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLoS Biology, 13 . https://doi.org/10.1371/journal.pbio.1002165 .
Gelman, A. (2016, October 03). Why does the replication crisis seem worse in psychology? Retrieved from http://www.slate.com/articles/health_and_science/science/2016/10/why_the_replication_crisis_seems_worse_in_psychology.html
Hüffmeier, J., Mazei, J., & Schultze, T. (2016). Reconceptualizing replication as a sequence of different studies: A replication typology. Journal of Experimental Social Psychology, 66 , 81–92.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23 , 524–532.
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2 , 196–217.
Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist, 70 , 487–498.
Open Science Collaboration. (2012). An open large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7 , 657–660.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349 , 1–8.
Popper, K. (1992). The logic of scientific discovery . New York: Routledge. (Original work published 1934).
Yong, E. (2016, March 04). Psychology’s replication crisis can’t be wished away. Retrieved from https://www.theatlantic.com/science/archive/2016/03/psychologys-replication-crisis-cant-be-wished-away/472272/
Author information
Authors and affiliations.
Appalachian State University, Boone, NC, USA
Christopher J. Holden
Western Carolina University, Cullowhee, NC, USA
Garrett Goodwin
Corresponding author
Correspondence to Christopher J. Holden .
Editor information
Editors and affiliations.
Oakland University, Rochester, USA
Virgil Zeigler-Hill
Todd K. Shackelford
Section Editor information
Humboldt-Universität zu Berlin, Berlin, Germany
Matthias Ziegler
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this entry
Cite this entry.
Holden, C.J., Goodwin, G. (2018). The Importance of Replication. In: Zeigler-Hill, V., Shackelford, T. (eds) Encyclopedia of Personality and Individual Differences. Springer, Cham. https://doi.org/10.1007/978-3-319-28099-8_1352-1
DOI : https://doi.org/10.1007/978-3-319-28099-8_1352-1
Received : 28 March 2018
Accepted : 03 April 2018
Published : 18 April 2018
Publisher Name : Springer, Cham
Print ISBN : 978-3-319-28099-8
Online ISBN : 978-3-319-28099-8
eBook Packages : Springer Reference Behavioral Science and Psychology Reference Module Humanities and Social Sciences Reference Module Business, Economics and Social Sciences
Why is Replication in Research Important?
Replication in research is important because it allows for the verification and validation of study findings, building confidence in their reliability and generalizability. It also fosters scientific progress by promoting the discovery of new evidence, expanding understanding, and challenging existing theories or claims.
Updated on June 30, 2023
Often viewed as a cornerstone of science, replication builds confidence in the scientific merit of a study’s results. The philosopher Karl Popper argued that “we do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them.”
As such, creating the potential for replication is a common goal for researchers. The methods section of scientific manuscripts is vital to this process as it details exactly how the study was conducted. From this information, other researchers can replicate the study and evaluate its quality.
This article discusses replication as a rational concept integral to the philosophy of science and as a process validating the continuous loop of the scientific method. By considering both the ethical and practical implications, we may better understand why replication is important in research.
What is replication in research?
As a fundamental tool for building confidence in the value of a study’s results, replication has power. Some would say it has the power to make or break a scientific claim when, in reality, it is simply part of the scientific process, neither good nor bad.
When Nosek and Errington propose that replication is a study for which any outcome would be considered diagnostic evidence about a claim from prior research, they revive its neutrality. The true purpose of replication, therefore, is to advance scientific discovery and theory by introducing new evidence that broadens the current understanding of a given question.
Why is replication important in research?
The great philosopher and scientist Aristotle asserted that a science is possible if and only if there are knowable objects involved. There cannot be a science of unicorns, for example, because unicorns do not exist. Therefore, a ‘science’ of unicorns lacks knowable objects and is not a ‘science’.
This philosophical foundation of science illustrates why replication is important in research. Basically, when an outcome is not replicable, it is not knowable and does not truly exist. This means that each time a study or a result is successfully replicated, its credibility and validity expand.
The lack of replicability is just as vital to the scientific process. It pushes researchers in new and creative directions, compelling them to continue asking questions and to never become complacent. Replication is as much a part of the scientific method as formulating a hypothesis or making observations.
Types of replication
Historically, replication has been divided into two broad categories:
- Direct replication: performing a new study that follows a previous study’s original methods and then comparing the results. While direct replication follows the protocols from the original study, the samples and conditions, time of day or year, lab space, research team, etc. are necessarily different. In this way, a direct replication uses empirical testing to reflect the prevailing beliefs about what is needed to produce a particular finding.
- Conceptual replication: performing a study that employs different methodologies to test the same hypothesis as an existing study. By applying diverse manipulations and measures, conceptual replication aims to operationalize a study’s underlying theoretical variables. In doing so, conceptual replication promotes collaborative research and explanations that are not based on a single methodology.
Though these general divisions provide a helpful starting point for both conducting and understanding replication studies, they are not polar opposites. There are nuances that produce countless subcategories such as:
- Internal replication: when the same research team conducts the same study while taking negative and positive factors into account
- Microreplication: conducting partial replications of the findings of other research groups
- Constructive replication: both manipulations and measures are varied
- Participant replication: changes only the participants
Many researchers agree these labels should be confined to study design, as direction for the research team, not a preconceived notion. In fact, Nosek and Errington conclude that distinctions between “direct” and “conceptual” are at least irrelevant and possibly counterproductive for understanding replication and its role in advancing knowledge.
How do researchers replicate a study?
Like all research studies, replication studies require careful planning. The Open Science Framework (OSF) offers a practical guide which details the following steps:
- Identify a study that is feasible to replicate given the time, expertise, and resources available to the research team.
- Determine and obtain the materials used in the original study.
- Develop a plan that details the type of replication study and research design intended.
- Outline and implement the study’s best practices.
- Conduct the replication study, analyze the data, and share the results.
These broad guidelines are expanded in Brown and Wood’s article, “Which tests not witch hunts: a diagnostic approach for conducting replication research.” Their findings are further condensed by Brown into a blog post outlining four main procedural categories:
- Assumptions: identifying the contextual assumptions of the original study and research team
- Data transformations: using the study data to answer questions about data transformation choices by the original team
- Estimation: determining if the most appropriate estimation methods were used in the original study and if the replication can benefit from additional methods
- Heterogeneous outcomes: establishing whether the data from an original study lends itself to exploring separate heterogeneous outcomes
At the suggestion of peer reviewers from the e-journal Economics, Brown elaborates with a discussion of what not to do when conducting a replication study that includes:
- Do not use critiques of the original study’s design as a basis for replication findings.
- Do not perform robustness testing before completing a direct replication study.
- Do not omit communicating with the original authors, before, during, and after the replication.
- Do not label the original findings as errors solely based on different outcomes in the replication.
Again, replication studies are full-blown, legitimate research endeavors that contribute directly to scientific knowledge. They require the same levels of planning and dedication as any other study.
What happens when replication fails?
There are some obvious and agreed-upon contextual factors that can result in the failure of a replication study, such as:
- The detection of unknown effects
- Inconsistencies in the system
- The inherent nature of complex variables
- Substandard research practices
- Pure chance
While these variables affect all research studies, they have particular impact on replication as the outcomes in question are not novel but predetermined.
The constant flux of contexts and variables makes assessing replicability, and determining success or failure, very tricky. A publication from the National Academy of Sciences points out that replicability is obtaining consistent, not identical, results across studies aimed at answering the same scientific question. They further provide eight core principles that are applicable to all disciplines.
While there are no straightforward criteria for determining if a replication is a failure or a success, the National Library of Science and the Open Science Collaboration suggest asking some key questions, such as:
- Does the replication produce a statistically significant effect in the same direction as the original?
- Is the effect size in the replication similar to the effect size in the original?
- Does the original effect size fall within the confidence or prediction interval of the replication?
- Does a meta-analytic combination of results from the original experiment and the replication yield a statistically significant effect?
- Do the results of the original experiment and the replication appear to be consistent?
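Several of the questions above can be checked numerically. The following is a minimal sketch, assuming each study is summarized by a single effect estimate and its standard error and that a normal approximation is acceptable. The `Estimate` class, the `compare` function, and the example numbers are hypothetical and chosen only for illustration; they do not come from any of the cited projects.

```python
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class Estimate:
    effect: float  # e.g. a standardized mean difference (illustrative)
    se: float      # standard error of that effect


def two_sided_p(est: Estimate) -> float:
    """Two-sided p-value under a normal approximation."""
    z = est.effect / est.se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def compare(original: Estimate, replication: Estimate, alpha: float = 0.05) -> dict:
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    # Question 1: significant replication effect in the same direction?
    same_direction = original.effect * replication.effect > 0
    rep_significant = two_sided_p(replication) < alpha

    # Question 3: does the original effect fall inside the replication's
    # confidence interval?
    ci_low = replication.effect - z_crit * replication.se
    ci_high = replication.effect + z_crit * replication.se

    # Question 4: fixed-effect (inverse-variance) combination of both studies.
    w_orig = 1 / original.se ** 2
    w_rep = 1 / replication.se ** 2
    pooled = Estimate(
        effect=(w_orig * original.effect + w_rep * replication.effect) / (w_orig + w_rep),
        se=(w_orig + w_rep) ** -0.5,
    )

    return {
        "replication_significant_same_direction": same_direction and rep_significant,
        "original_within_replication_ci": ci_low <= original.effect <= ci_high,
        "pooled_effect": pooled.effect,
        "pooled_significant": two_sided_p(pooled) < alpha,
    }


# Made-up numbers: a large, imprecise original and a small, precise replication.
result = compare(Estimate(effect=0.60, se=0.20), Estimate(effect=0.10, se=0.10))
print(result)
```

With these invented inputs the criteria disagree: the replication is not significant on its own and its interval excludes the original estimate, yet the pooled estimate is still significant. This is one reason no single question settles whether a replication "succeeded."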
While many clearly have an opinion about how and why replication fails, calling a replication a failure is at best a null statement and at worst an unfair accusation. It misses the point and sidesteps the role of replication as a mechanism to further scientific endeavor by presenting new evidence on an existing question.
Can the replication process be improved?
The need both to restructure the definition of replication to account for variations in scientific fields and to recognize the degrees of potential outcomes when comparing the original data comes in response to the replication crisis. Listen to this Hidden Brain podcast from NPR for an intriguing case study on this phenomenon.
Considered academia’s self-made disaster, the replication crisis is spurring other improvements in the replication process. Most broadly, it has prompted the resurgence and expansion of metascience, a field with roots in both philosophy and science that is widely referred to as "research on research" and "the science of science." By holding a mirror up to the scientific method, metascience is not only elucidating the purpose of replication but also guiding the rigors of its techniques.
Further efforts to improve replication are threaded throughout the industry, from updated research practices and study design to revised publication practices and oversight organizations, such as:
- Requiring full transparency of the materials and methods used in a study
- Pushing for statistical reform, including redefining the significance of the p-value
- Using preregistration reports that present the study’s plan for methods and analysis
- Adopting result-blind peer review allowing journals to accept a study based on its methodological design and justifications, not its results
- Founding organizations like the EQUATOR Network that promote transparent and accurate reporting
Final thoughts
In the realm of scientific research, replication is a form of checks and balances. Neither the probability of a finding nor the prominence of a scientist makes a study immune to the process.
And, while a single replication does not validate or nullify the original study’s outcomes, accumulating evidence from multiple replications does boost the credibility of its claims. At the very least, the findings offer insight to other researchers and enhance the pool of scientific knowledge.
After exploring the philosophy and the mechanisms behind replication, it is clear that the process is not perfect, but evolving. Its value lies within the irreplaceable role it plays in the scientific method. Replication is no more or less important than the other parts, simply necessary to perpetuate the infinite loop of scientific discovery.
Charla Viera, MS
- Open access
- Published: 01 June 2016
Psychology, replication & beyond
- Keith R. Laws 1
BMC Psychology volume 4, Article number: 30 (2016)
Modern psychology is apparently in crisis and the prevailing view is that this partly reflects an inability to replicate past findings. If a crisis does exist, then it is some kind of ‘chronic’ crisis, as psychologists have been censuring themselves over replicability for decades. While the debate in psychology is not new, the lack of progress across the decades is disappointing. Recently though, we have seen a veritable surfeit of debate alongside multiple orchestrated and well-publicised replication initiatives. The spotlight is being shone on certain areas and although not everyone agrees on how we should interpret the outcomes, the debate is happening and is impassioned. The issue of reproducibility occupies a central place in our whig history of psychology.
In the parlance of Karl Popper, the notion of falsification is seductive – some seem to imagine that it identifies an act as opposed to a process. It often carries the misleading implication that hypotheses can be readily discarded in the face of something called a ‘failed’ replication. Popper [ 46 ] was quite transparent when he declared “… a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a reproducible effect which refutes the theory. In other words, we only accept the falsification if a low level empirical hypothesis which describes such an effect is proposed and corroborated.” (p.203: my italics). Popper’s view might reassure those whose psychological models have recently come under scrutiny through replication initiatives. We cannot, nor should we, close the door on a hypothesis because a study fails to be replicated. The hypothesis is not nullified and ‘nay-saying’ alone is an insufficient response from scientists. Like Popper, we might expect a testable alternative hypothesis that attempts to account for the discrepancy across studies; and one that itself may be subject to testing rather than merely being ad hoc. In other words, a ‘failed’ replication is not, in itself, the answer to a question, but a further question.
Replication, replication, replication
At least two key types of replication exist: direct and conceptual. Conceptual replication generally refers to cases where researchers ‘tweak’ the methods of previous studies [ 43 ] and when successful, may be informative with regard to the boundaries and possible moderators of an effect. When a conceptual replication fails, however, fewer clear implications exist for the original study because of likely differences in procedure or stimuli and so on. For this reason, we have seen an increased weight given to direct replications.
How often do direct and conceptual replications occur in psychology? Screening 100 of the most-cited psychology journals since 1900, Makel, Plucker & Hegarty [ 40 ] found that approximately 1.6 % of all psychology articles used the term replication in the text. A further, more detailed analysis of 500 randomly selected articles revealed that only 68 % of those using the term replication were actual replications. They calculated an overall replication rate of 1.07 % and found that only 18 % of those were direct rather than conceptual replications.
The lack of replication in psychology is systemic and widespread, particularly with regard to the bias against publishing direct replications. In their survey of social science journal editors, Neuliep & Crandall [ 42 ] found almost three quarters preferred to publish novel findings rather than replications. In a parallel survey of reviewers for social science journals, Neuliep & Crandall [ 43 ] found over half (54 %) stated a preference for new findings over replications. Indeed, reviewers stated that replications were “Not newsworthy” or even a “Waste of space”. By contrast, comments from natural science journal editors present a more varied picture, ranging from “Replication without some novelty is not accepted” to “Replication is rarely an issue for us…since we publish them.” [ 39 ].
Despite an enduring historical abandonment of replication, the tide appears to be turning. Makel et al. [ 40 ] found that the replication rate after the year 2000 was 1.84 times higher than for the period between 1950 and 1999. In a more recent evolution, several large-scale direct replication projects have emerged during the past 2 years including: the Many Labs project [ 33 ]; a set of preregistered replications published in a special issue of Social Psychology (Edited by [ 44 ]); the Reproducibility Project of the Open Science Collaboration [ 45 ]; and the Pipeline Project by Schweinsberg et al. [ 50 ]. In two of these projects (Many Labs by [ 33 ]; Pipeline Project by [ 50 ]), a group of researchers replicated samples of studies, with each group replicating all studies. In the two remaining projects, a number of research groups each replicated one study, selected from a sample of studies (Registered Reports by [ 44 ]; Open Science Collaboration, [ 45 ]). Each project ensured that replications were sufficiently powered (typically in excess of 90 %, thus offering a very good probability of detecting true effects) and, where possible, used the original materials and stimuli as provided by the original authors. It is worth considering each in more detail.
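As a brief aside on what "in excess of 90 %" power demands in practice, the sketch below uses the standard normal-approximation formula for a two-group comparison of means. The function name and the effect sizes are illustrative assumptions, not values taken from any of the four projects, and an exact t-test calculation would give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, power=0.90, alpha=0.05):
    """Approximate participants needed per group for a two-sided,
    two-group comparison of means (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)


# Hypothetical standardized effect sizes (Cohen's d), chosen only for illustration.
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Because the required sample grows with the inverse square of the effect size, a replication of a small effect needs far more participants than the original study may have used.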
Many Labs involved 36 research groups across 12 countries who replicated 13 psychological studies in over 6,000 participants. Studies of classic and newer effects were selected partly because they had simple designs that could be adapted for online administration. Reassuringly perhaps, 10 of the 13 effects replicated consistently across 36 different samples with, of course, some variability in the effect size reported compared to the original studies – some smaller but also some larger. One effect received weak support. Only two studies consistently failed to replicate and both involved what are described as ‘social priming’ phenomena. In one study, ‘accidental’ exposure to a US flag had resulted in increased conservatism amongst Americans [ 11 ]. Participants viewed four photos and were asked simply to estimate the time of day in each photo – the US flag appeared in two photos. Following this, they completed an 8-item questionnaire assessing their views toward various political issues (e.g., abortion, gun control). In the second priming study, exposure to ‘money’ had resulted in endorsement of the current social system [ 12 ]. In this study, participants completed demographic questions against a background that showed a faint picture of US $100 bills or the same background but blurred. Each of these two priming experiments had a single significant p-value (out of 36 replications) and for flag priming, it was in the opposite direction to that expected.
We turn next to the special issue of Social Psychology edited by Nosek & Lakens [ 44 ], which contained a series of articles replicating important results in social psychology. Important was broadly defined as “…often cited, a topic of intense scholarly or public interest, a challenge to established theories), but should also have uncertain truth value (e.g., few confirmations, imprecise estimates of effect sizes).” One might euphemistically describe the studies as curios. The articles were first submitted as Registered Reports and reviewed prior to data collection, with authors being assured their findings would be published regardless of outcome, as long as they adhered to the registered protocol. Attempted replications included the “Romeo and Juliet effect” – does parental interference lead to increases in love and commitment (Original: [ 17 ]; Replication: Sinclair, Hood, & Wright, [ 53 ]); does experiencing physical warmth (warm therapeutic packs) increase judgments of interpersonal warmth (Original: [ 58 ]; Replication: Lynott, Corker, Wortman, Connell, Donnellan, Lucas, & O’Brien, [ 38 ]); does recalling unethical behavior lead participants to see the room as darker (Original: [ 3 ]; Replication: [ 10 ]); and does physical cleanliness reduce the severity of moral judgments (Original: [ 49 ]; Replication: [ 28 ]). In contrast to the high replication rate of Many Labs, the Registered Reports replications failed to confirm the results in 10 of 13 studies.
In the largest crowdsourced effort to date, the OSC Reproducibility project involved 270 collaborators attempting to replicate 100 findings from 3 major psychology journals: Psychological Science (PSCI), Journal of Personality and Social Psychology (JPSP), and Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP: LMC). While 97 of 100 studies originally reported statistically significant results, only 36 % of the replications did so, with a mean effect size of around half of that reported in the original studies.
All of the journals exhibited a large reduction of around 50 % in effect sizes, with replications from JPSP particularly affected, shrinking by 75 % from 0.29 to 0.07. The replicability in one domain of psychology (good or poor) in no way guarantees what will happen in another domain. One thing we know from this project is that “…reproducibility was stronger in studies and journals representing cognitive psychology than social psychology topics. For example, combining across journals, 14 of 55 (25 %) of social psychology effects replicated by the P < 0.05 criterion, whereas 21 of 42 (50 %) of cognitive psychology effects did so.” The reasons for such a difference are debatable, but provide no licence to either congratulate cognitive psychologists or berate social psychologists. Indeed, the authors paint a considered and faithful picture of what their findings mean when they conclude “…how many of the effects have we established are true? Zero. And how many of the effects have we established are false? Zero. Is this a limitation of the project design? No. It is the reality of doing science”. (Open Science Collaboration p.4716-7)
The studies that were not selected for replication are informative – they were described as “…deemed infeasible to replicate because of time, resources, instrumentation, dependence on historical events, or hard-to-access samples… [and some] required specialized samples (such as macaques or people with autism), resources (such as eye tracking machines or functional magnetic resonance imaging), or knowledge making them difficult to match with teams”. Thus, the main drivers of replication are often economic in terms of time, money and human investment. High cost studies are likely to remain castles in the air, leaving us with little insight about replicability rates in some areas such as functional imaging (e.g. [ 9 ]), clinical and health psychology (see Coyne, this issue), and neuropsychology.
The ‘Pipeline project’ by Schweinsberg et al. [ 50 ] intentionally used a non-adversarial approach. They crowdsourced 25 research teams across various countries to replicate a series of 10 unpublished moral-judgment experiments from the lead author’s (Uhlmann) lab, i.e., in the pipeline. This speaks directly to Lykken’s [ 37 ] proposal from nearly 50 years ago that “…ideally all experiments would be replicated before publication” although at that time, he deemed it ‘impractical’.
Pipeline replications included: the Bigot–misanthrope effect – whether participants judge a manager who selectively mistreats racial minorities as a more blameworthy person than a manager who mistreats all of his employees; the Bad tipper effect – are people who leave a full tip, but entirely in pennies, judged more negatively than someone who leaves less money, but in notes; the Burn-in-hell effect – do people perceive corporate executives as more likely to burn in hell than members of social categories defined by antisocial behaviour, such as vandals. Six of ten findings replicated across all of their replication criteria, one further finding replicated but with a significantly smaller effect size than the original, one finding replicated consistently in the original culture but not outside of it (bad tipper replicated in the US and not outside), and two effects were unsupported.
The headline replication rates differed considerably across projects – occurring more frequently for Many Labs (77 %) and the Pipeline Project (60 %) than Registered Reports (30 %) and the Open Science Collaboration (36 %). Why are replication rates lower in the latter two projects? Possible explanations include the choice of likely versus unlikely replication candidates. Amongst the Many Labs studies, some had already previously been replicated and were selected knowing this fact. By contrast, the studies in the Pipeline project had not been previously replicated (indeed, not even previously published). Also important from a different perspective is whether each study was replicated only once by one group or multiple times by many groups.
In the Many Labs and Pipeline projects, 36 and 25 separate research groups were replicating each of 13 and 10 studies respectively. Multiple analyses lend themselves to meta-analytic techniques and analysis of the heterogeneity across research groups examining the same effect – the extent to which they accord in their effect sizes or not. The Many Labs project reported I² values, which estimate the proportion of variation due to heterogeneity rather than chance. In the majority of cases, heterogeneity was small to moderate or even non-existent (e.g. across the 36 replications for both of the social priming studies: flag and money). Indeed, heterogeneity of effect sizes was greater between studies than within studies. When heterogeneity was greater, it was - perhaps surprisingly - where mean effect sizes were largest. Nonetheless, Many Labs reassuringly shows that some effects are highly replicable across research groups, countries, and presentational differences (online versus face to face).
Counter-intuitive and even fanciful psychological hypotheses are not necessarily more likely to be false, but believing them to be so may influence researchers – even implicitly – in terms of how replications are conducted. In their extensive literature search, Makel et al. [ 40 ] reported that most direct replications are conducted by authors who proposed the original findings. This raises the thorny question: who should replicate? Almost 50 years ago Bakan [ 2 ] sagely warned that “If an investigator attempts to replicate his own investigation at another time, he will inevitably be under the influence of what he has already done…He should challenge, for example, his personal identification with the results he has already obtained, and prepare himself for finding both novelty and contradiction with respect to his earlier investigation” and that “…If one investigator is interested in replicating the investigation of another investigator, he should carefully take into account the possibility of suggestion, or his willingness to accept the results of the earlier investigator. …He should take careful cognizance of possible motivation for showing the earlier investigator to be in error, etc. [p. 110].” The irony is that as psychologists, we should be acutely aware of such biases - we cannot ignore the psychology of replication in the replication of psychology.
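To make the I² statistic concrete, here is a minimal sketch of the usual calculation from Cochran's Q, assuming a fixed-effect (inverse-variance) pooled estimate. The function name and both data sets are hypothetical illustrations, not Many Labs data.

```python
def i_squared(effects, variances):
    """Return (pooled_effect, Q, I2) using inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I2 is the share of total variation beyond what chance alone predicts,
    # truncated at zero when Q falls below its degrees of freedom.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical effect sizes from four labs, each with sampling variance 0.01.
consistent_labs = [0.30, 0.32, 0.28, 0.31]
divergent_labs = [0.05, 0.60, 0.20, 0.90]
for label, effects in (("consistent", consistent_labs), ("divergent", divergent_labs)):
    pooled, q, i2 = i_squared(effects, [0.01] * len(effects))
    print(f"{label}: pooled = {pooled:.2f}, Q = {q:.2f}, I2 = {i2:.0%}")
```

When the labs agree to within sampling error, Q falls below its degrees of freedom and I² is reported as zero; when they diverge, most of the observed variation is attributed to heterogeneity.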
What are we replicating and why?
The cheap and easy
Few areas of psychology have fallen under the replication lens and where they have, they are psychology’s equivalent to take-away meals – easy to prepare studies (e.g. often using online links to questionnaires). Hence, the focus has tended to be on studies from social and cognitive psychology, and not for example developmental or clinical studies, which are more prohibitive. Other notable examples exist such as cognitive neuropsychology, where the single case study has been predominant for decades – how can anyone recreate the brain injury and subsequent cognitive testing in a second patient?
The contentious
We cannot assert that the totality – or even a representative sample – of psychology has been scrutinised for replication. We can also see why some may feel targeted – replication does not (and probably cannot) occur in a random fashion. The vast majority of psychological studies are overlooked. To date, psychologists have targeted the unexpected, the curious, and newsworthy findings; and largely within a narrow range of areas (cognitive and social primarily). As psychologists, the need to sample more widely ought to go without saying; and one corollary of this is that it makes no sense to claim that psychology is in crisis.
Too often perhaps, psychologists have been attracted to replicating contentious topics such as social priming, ego-depletion, psychic ability and so on. Some high impact journals have become repositories for the attention-grabbing, strange, unexpected and unbelievable findings. This goes to the systemic heart of the matter. Hartshorne & Schachner [ 27 ] amongst many others have noted “…replicability is not systematically considered in measuring paper, researcher, and journal quality. As a result, the current incentive structure rewards the publication of non-replicable findings …” (p.3 my italics). This is nothing new in science, as the quest for scientific prestige has historically resulted in a conflict between the goals of science and the personal goals of the scientist (see [ 47 ]).
The preposterous
“If there is no ESP, then we want to be able to carry out null experiments and get no effect, otherwise we cannot put much belief in work on small effects in non-ESP situations. If there is ESP, that is exciting. However, thus far it does not look as if it will replace the telephone” (Mosteller [ 41 ], p 396)
From the opposite perspective, Jim Coyne (this issue) maintains that psychology would benefit from some “…provision for screening out candidates for replication for which a consensus could be reached that the research hypotheses were improbable and not warranting the effort and resources required for a replication to establish this.” The frustration of some psychologists is palpable as they peruse apparently improbable hypotheses. Coyne’s concern echoes that of Edwards [ 18 ] who half a century ago similarly remarked, “If a hypothesis is preposterous to start with, no amount of bias against it can be too great. On the other hand, if it is preposterous to start with, why test it?” (Edwards, p 402). How preposterous can we get? According to Simmons et al. [ 51 ], it is “…unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis.” (p. 1359). Indeed, they managed to show, by manipulating what they describe as researcher degrees of freedom (e.g. ‘data-peeking’, deciding when to stop testing participants, whether to exclude outlying data points), that people appear to forget their age and claim to be 1.5 years younger after listening to the Beatles song “When I’m 64”.
The fact that seemingly incredible findings can be published raises disquiet about the methods normally employed by psychologists and in some circles, this has inflated to concerns about psychology more generally. Within the methodological and statistical frameworks that psychologists normally operate, we have to face the unpalatable possibility that the wriggle room for researchers is unacceptably large. Further, it is implicitly reinforced, as Coyne notes, by the actions of some journals as well as media outlets – and until that is adequately addressed, little will change.
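One researcher degree of freedom named above, ‘data-peeking’, can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a reconstruction of the analysis by Simmons et al.: the data are pure noise, the test is a one-sample z-test with known standard deviation, and the function name, look schedule and seed are all hypothetical choices.

```python
import random
from math import sqrt


def simulate(n_sims=2000, looks=(10, 20, 30, 40, 50), z_crit=1.96, seed=1):
    """Return (fixed_rate, peeking_rate) of false positives under a true null."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(n_sims):
        # The null is true: scores come from a normal distribution, mean 0, sd 1.
        data = [rng.gauss(0.0, 1.0) for _ in range(looks[-1])]

        def significant(n):
            # One-sample z-test with known sd = 1: z = mean * sqrt(n)
            return abs(sum(data[:n]) / n * sqrt(n)) > z_crit

        # Honest analysis: test once, at the planned sample size.
        fixed_hits += significant(looks[-1])
        # Peeking: test at every look and claim an effect if any look "works".
        peeking_hits += any(significant(n) for n in looks)
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate()
print(f"false positives, single planned test: {fixed_rate:.3f}")
print(f"false positives, stopping at first significant look: {peeking_rate:.3f}")
```

The single planned test should sit near the nominal 5 % level, whereas checking at five points and stopping at the first significant result typically pushes the false-positive rate well above 10 %, even though no effect exists.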
The negative
Interestingly, the four replication projects outlined above almost wholly neglected null findings. To date, replication efforts are invariably aimed at positive findings. Should we not also try to replicate null findings? Given the propensity for positive findings to become nulls, what is the likelihood of reverse effects in more adequately powered studies? The emphasis on replicating positive outcomes betrays the wider bias that psychologists have against null findings per se (Laws [ 36 ]). The overwhelming majority of published findings in psychology are positive (93.5 %: [ 54 ]) and the aversion to null findings may well be worse in psychology than other sciences [ 20 ]. Intriguingly, we can see a hint of this issue in the OSC reproducibility project, which did include 3 % of sampled findings that were null initially - and while two were confirmed as nulls, one did indeed become significant. As psychologists, we might ponder how the bias against publishing null findings finds a clear echo in the bias against replicating null findings.
A conflict between belief and evidence
The wriggle room is fertile ground for psychologists to exploit the disjunction between belief and evidence that seems quite pervasive in psychology. As Francis remarked, “Contrary to its central role in other sciences, it appears that successful replication is sometimes not related to belief about an effect in experimental psychology. A high rate of successful replication is not sufficient to induce belief in an effect [ 8 ], nor is a high rate of successful replication necessary for belief [ 22 ].” The Bem [ 8 ] study documented “experimental evidence for anomalous retroactive influences on cognition and affect” or in plain language…precognition. Using multiple tasks, and nine experiments involving over 1,000 participants, Bem had implausibly demonstrated that the performance of participants reflected what happened after they had made their decision. For example, on a memory test, participants were more likely to remember words that they were later asked to practise, i.e. memory rehearsal seemingly worked back in time. In another task, participants had to select which of two curtains on a computer screen hid an erotic image, and they did so at a level significantly greater than chance, but not when the hidden images were less titillating. Furthermore, Bem and colleagues [ 7 ] later meta-analysed 90 previous studies to establish a significant effect size of 0.22.
Bem presents nine replications of a phenomenon and a large meta-analysis, yet we do not believe it, while other phenomena do not so readily replicate (e.g. bystander apathy [ 22 ]) but we do believe in them. Francis [ 23 ] bleakly concludes, “The scientific method is supposed to be able to reveal truths about the world, and the reliability of empirical findings is supposed to be the final arbiter of science; but this method does not seem to work in experimental psychology as it is currently practiced.” Whether we believe in Bem’s precognition, social priming, or indeed, any published psychological finding – researchers are operating within the methodological and statistical wriggle room. The task for psychologists is to view these phenomena like any other scientific question, i.e. in need of explanation. If they can close down the wriggle room, then we might expect such curios and anomalies to evaporate in a cloud of nonsignificant results.
While some might view the disjunction between belief and evidence as ‘healthy skepticism’, others might also describe it as resistance to evidence or even anti-science. A pertinent example comes from Lykken [ 37 ] who described a study in which people who see frogs in a Rorschach test – ‘frog responders’ – were more likely to have an eating disorder [ 48 ] – a finding interpreted as evidence of harboring oral impregnation fantasies and an unconscious belief in anal birth. Lykken asked 20 clinician colleagues to estimate the likelihood of this ‘cloacal theory of birth’ before and after seeing Sapolsky’s evidence. Beforehand, they reported a “…median value of 0.01, which can be interpreted to mean, roughly, ‘I don't believe it’” and after being shown the confirmatory evidence “… the median unchanged at 0.01. I interpret this consensus to mean, roughly, ‘I still don’t believe it.’” (p. 151–152). Lykken remarked that normally when a prediction is confirmed by experiment, we might expect “…a nontrivial increment in one’s confidence in that theory should result, especially when one’s prior confidence is low… [but that] this rule is wrong not only in a few exceptional instances but as it is routinely applied to the majority of experimental reports in the psychological literature” (p.152). Often such claims give rise to a version of Feynman’s maxim that “Extraordinary claims require extraordinary evidence”. The remarkableness of a claim, however, is not necessarily relevant to either the type or the scale of evidence required. Instead of setting different criteria for the ordinary and extraordinary, we need to continue to close the wriggle room.
Beliefs and the failure to self-correct
“Scientists should not be in the business of simply ignoring literature that they do not like because it contests their view.” [ 30 ]
Taking this to the opposite extreme, some researchers may choose to ignore the findings of meta-analyses in favour of selected individual studies that accord more with their view. Giner-Sorolla [ 24 ] maintained that “…meta-analytic validation is not seen as necessary to proclaim an effect reliable. Textbooks, press reports, and narrative reviews often rest conclusions on single influential articles rather than insisting on a replication across independent labs and multiple contexts” (p 564, my italics).
Stroebe & Strack rightly point out, “Even multiple failures to replicate an established finding would not result in a rejection of the original hypothesis, if there are also multiple studies that supported that hypothesis.” [and] ‘believers’ “…will keep on believing, pointing at the successful replications and derogating the unsuccessful ones, whereas the nonbelievers will maintain their belief system drawing on the failed replications for support of their rejection of the original hypothesis.” (p.64). Psychology rarely – if ever – proceeds with an unequivocal knock-out blow delivered by a negative finding or even a meta-analysis. Indeed, psychology often has more of the feel of trench warfare, where models and hypotheses are ultimately abandoned largely because researchers lose interest [ 26 ].
Jussim et al. [ 30 ] provide some interesting examples of precisely how social psychology doesn’t seem to correct itself when big findings fail to replicate. If doubts are raised about an original finding then, as Jussim et al. point out, we might expect citations to reflect this debate and the uncertainty, and as such the original and the unsuccessful replications would be expected to be fairly equally cited.
In a classic study, Darley & Gross [ 15 ] found people applied a stereotype about social class when they saw a young girl taking a maths test after seeing her playing against either an affluent or a poor background. After obtaining the original materials and following the procedure carefully, Baron et al. [ 6 ] published two failed replications using more than twice as many participants. Not only did they fail to replicate, the evidence was in the opposite direction. Such findings ought to encourage debate with relatively equal attention to the pro and con studies in the literature - alas no. Jussim et al. reported that “…since 1996, the original study has been cited 852 times, while the failed replications have been cited just 38 times (according to Google Scholar searches conducted on 9/11/15).”
This is not an unusual case, as Jussim et al. report several examples of failed replications not being cited, while original studies continue to be readily cited. The infamous and seminal study by Bargh and colleagues [ 5 ] showed that unconsciously priming people with an ‘elderly stereotype’ (unscrambling jumbled sentences that contained words like: old, lonely, bingo, wrinkle) makes them subsequently walk more slowly. However, Doyen et al. [ 16 ] failed to replicate the finding using more accurate measures of walking speed. Since 2013, Bargh et al. has been cited 900 times and Doyen et al. 192. Similarly, a meta-analysis of 88 studies by Jost et al. [ 29 ], showing that conservatism is a syndrome characterized by rigidity, dogmatism, prejudice, and fear, was not replicated by a larger, better-controlled meta-analysis conducted by Van Hiel and colleagues [ 57 ]. Since 2010, the former has been cited 1030 times while the latter a mere 60 by comparison. Jussim et al. suggest “This pattern of ignoring correctives likely leads social psychology to overstate the extent to which evidence supports the original study’s conclusions…[] it behooves researchers to grapple with the full literature, not just the studies conducive to their preferred arguments”.
Meta-analysis: rescue remedy or statistical alchemy?
Some view meta-analysis as the closest thing we have to a definitive approach for establishing the veracity and reliability of an effect. In the context of discussing social priming experiments, John Bargh [ 4 ] declared that “… In science the way to answer questions about replicability of effects is through statistical techniques such as meta-analysis”. Others are more skeptical: “Meta-analysis is a reasonable way to search for patterns in previously published research. It has serious limitations, however, as a method for confirming hypotheses and for establishing the replicability of experiments” (Hyman, 2010, p. 486). Meta-analysis is not a magic dust that we can sprinkle over primary literatures to elucidate necessary truths. Likewise, totemically accumulating replicated findings, in itself, does not necessarily prove anything (pace Popper). Does it matter if we replicate a finding once, twice, or 20 times? What ratio of positive to negative outcomes do we find acceptable? Answers or rules of thumb do not exist – it often comes down to our beliefs in psychology.
This special issue of BMC Psychology contains 4 articles (Taylor & Munafo, [ 56 ]; Lakens, Hilgard & Staaks [ 34 ]; Coppens, Verkoeijen, Bouwmeester & Rikers, [ 13 ]; Coyne [ 14 ]) and in each, meta-analysis occupies a pivotal place. As shown by Taylor & Munafo (current issue), meta-analyses have proliferated, are highly cited and “…most worryingly, the perceived authority of the conclusions of a meta-analysis means that it has become possible to use a meta-analysis in the hope of having the final word in an academic debate.” As with all methods, meta-analysis has its own limitations, and retrospective validation via meta-analysis is not a substitute for prospective replication using adequately powered trials, but it does have a substantive role to play in the reproducibility question.
Judging the weight of evidence is never straightforward and whether a finding sustains in psychology often reflects our beliefs almost as much as the evidence. Indeed, meta-analysis rightly or wrongly enables some ideas to persist despite a lack of support at the level of individual study or trial. This has certainly been argued in the use of meta-analyses to establish a case for psychic abilities, where Storm, Tressoldi & Di Risio [ 55 ] identify how “It distorts what scientists mean by confirmatory evidence. It confuses retrospective sanctification with prospective replicability.” (p.489)
This is a kind of ‘free-lunch’ notion of meta-analysis. Feinstein [ 21 ] even stated that “meta-analysis is analogous to statistical alchemy for the 21st century … the main appeal is that it can convert existing things into something better. “Significance” can be attained statistically when small group sizes are pooled into big ones” (p. 71). Undoubtedly, the conclusions of meta-analyses may prove unreliable where small numbers of nonsignificant trials are pooled to produce significant effects [ 19 ]. Nonetheless, it is also quite feasible for a literature to contain a majority of negative outcomes and still produce a reliable overall significant effect size (e.g. streptokinase: [ 35 ]).
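The pooling arithmetic behind Feinstein's remark can be sketched with a fixed-effect (inverse-variance) meta-analysis. The five trials below are invented for illustration and do not come from any study cited here; the function names are likewise hypothetical, and a normal approximation stands in for whatever test each trial would actually use.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, se):
    """Two-sided p-value for estimate / se under a normal approximation."""
    z = abs(estimate / se)
    return 2 * (1 - NormalDist().cdf(z))


def pool_fixed_effect(estimates, ses):
    """Inverse-variance pooled estimate and its standard error."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))


# Hypothetical small trials: similar modest effects, none significant alone.
estimates = [0.20, 0.25, 0.15, 0.22, 0.18]
ses = [0.15] * 5

for e, se in zip(estimates, ses):
    print(f"trial: effect = {e:.2f}, p = {two_sided_p(e, se):.3f}")

pooled, pooled_se = pool_fixed_effect(estimates, ses)
print(f"pooled: effect = {pooled:.2f}, p = {two_sided_p(pooled, pooled_se):.4f}")
```

Pooling leaves the estimated effect essentially unchanged but shrinks its standard error, which is why five individually nonsignificant trials can yield a significant pooled result. Whether that result is trustworthy depends on how the trials were selected, which the arithmetic cannot show.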
Two of the papers presented here (Lakens et al. this issue; Taylor & Munafo this issue) offer extremely good suggestions relating to some of these conflicts in meta-analytic findings. Lakens and colleagues offer 6 recommendations, including permitting others to “re-analyze the data to examine how sensitive the results are to subjective choices such as inclusion criteria” and enabling this by providing links to data files that permit such analysis. Currently, we also need to address data sharing in regular papers. Sampling papers published in one year in the top 50 high-impact journals, Alsheikh-Ali et al. [ 1 ] reported that a substantial proportion of papers published in high-impact journals “…are either not subject to any data availability policies, or do not adhere to the data availability instructions in their respective journals”. Such efforts for transparency are extremely welcome and indeed, echo the posting online of our interactive CBT for schizophrenia meta-analysis database ( http://www.cbtinschizophrenia.com/ ), which has been used by others to test new hypotheses (e.g. [ 25 ]).
Taylor & Munafo (this issue) advise greater triangulation of evidence, in this particular instance by supplementing traditional meta-analysis with P-curve analysis [ 52 ]. In passing, Taylor & Munafo also mention “…adversarial collaboration, where primary study authors on both sides of a particular debate contribute to an agreed protocol and work together to interpret the results”. The version of adversarial collaboration proposed by Kahneman [ 31 ] urged scientists to engage in a “good-faith effort to conduct debates by carrying out joint research” (p. 729). More recently, he elaborated on this in the context of the furore over failed replications (Kahneman [ 32 ]). Coyne covers some aspects of this latest paper on replication etiquette and finds some of it wanting. It may, however, be possible to find some new adversarial middle ground, but this crucially depends upon psychologists being more open. Indeed, some aspects of adversarial collaboration could dovetail with Lakens et al.’s proposal regarding hosting relevant data on web platforms. In such a scenario, those holding opposing views could test their hypotheses in a public arena using a shared database.
In the context of adversarial collaboration, some uncertainty and difference of opinion exists about how we might accommodate the views of those being replicated. One possibility again requires openness and that is for those who are replicated to be asked to submit a review; and crucially, the review and replicator’s responses are then published alongside the paper. Indeed, this happened with the paper of Coppens et al. (this issue). They replicated the ‘testing effect’ reported by Carpenter (2009) – that information which has been retrieved from memory is better recalled than that which has simply been studied. Their replications and meta-analysis partially replicate the original findings, and Carpenter was one of the reviewers whose review is available alongside the paper (along with the author responses). Indeed, from its initiation, BMC Psychology has published all reviews and responses to reviewers alongside published papers. This degree of openness is unusual in psychology journals, but does offer readers a glimpse into the process behind a replication (or any paper), allows the person being replicated to contribute and comment on the replication, to reply and be published in the same journal at the same time.
Ultimately, the issues that psychologists face over replication are as much about our beliefs, biases and openness as anything else. We are not dispassionate about the outcomes that we measure. Maybe because the substance of our spotlight is people, cognition and brains, we sometimes care too much about the ‘truths’ we choose to declare. They have implications. Similarly, we should not ignore the incentive structures and conflicts between the personal goals of psychologists and the goals of science. They have implications. Finally, the attitudes of psychologists to the transparency of our science need to change. They have implications.
Alsheikh-Ali AA, Qureshi W, Al-Mallah MH, Ioannidis JP. Public availability of published research data in high-impact journals. PLoS One. 2011;6(9):e24357.
Bakan D. On method. San Francisco: Jossey-Bass; 1967.
Banerjee P, Chatterjee P, & Sinha J. Is it light or dark? Recalling moral behavior changes perception of brightness. Psychol Sci 2012. 0956797611432497.
Bargh JA. Priming effects replicate just fine, thanks. Psychology Today 2012. Retrieved from https://www.psychologytoday.com/blog/the-natural-unconscious/201205/priming-effects-replicate-just-fine-thanks
Bargh JA, Chen M, Burrows L. Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. J Pers Soc Psychol. 1996;71(2):230.
Baron RM, Albright L, Malloy TE. Effects of behavioral and social class information on social judgment. Pers Soc Psychol Bull. 1995;21(4):308–15.
Bem D, Tressoldi P, Rabeyron T, Duggan M. Feeling the future: A meta-analysis of 90 experiments on the anomalous anticipation of random future events. F1000Research. 2015;4:1188.
Bem DJ. Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. J Pers Soc Psychol. 2011;100:407–25.
Bennett CM, Miller MB. How reliable are the results from functional magnetic resonance imaging? Ann N Y Acad Sci. 2010;1191(1):133–55.
Brandt MJ, IJzerman H, Blanken I. Does recalling moral behavior change the perception of brightness? A replication and meta-analysis of Banerjee, Chatterjee, and Sinha (2012). Soc Psychol. 2014;45:246–252.
Carter TJ, Ferguson MJ, Hassin RR. A single exposure to the American flag shifts support toward Republicanism up to 8 months later. Psychol Sci. 2011;22:1011–8.
Caruso EM, Vohs KD, Baxter B, Waytz A. Mere exposure to money increases endorsement of free-market systems and social inequality. J Exp Psychol Gen. 2013;142:301–6.
Coppens LC, Verkoeijen PJL, Bouwmeester S & Rikers RMJP (in press, this issue) The testing effect for mediator final test cues and related final test cues in online and laboratory experiments. BMC Psychology.
Coyne JC (in press, this issue) Replication initiatives will not salvage the trustworthiness of psychology. BMC Psychology.
Darley JM, Gross PH. A hypothesis-confirming bias in labeling effects. J Pers Soc Psychol. 1983;44(1):20.
Doyen S, Klein O, Pichon C-L, Cleeremans A. Behavioral priming: It’s all in the mind but whose mind? PLoS One. 2012;7:1–7. doi:10.1371/journal.pone.0029081.
Driscoll R, Davis KE, Lipetz ME. Parental interference and romantic love: the Romeo and Juliet effect. J Pers Soc Psychol. 1972;24(1):1.
Edwards W. Tactical note on the relation between scientific and statistical hypotheses. Psychological Bulletin. 1965;63:400–402.
Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315:629–34.
Fanelli D. Negative results are disappearing from most disciplines and countries. Scientometrics. 2012;90:891–904.
Feinstein AR. Meta-analysis: Statistical alchemy for the 21st century. J Clin Epidemiol. 1995;48:71–9.
Fischer P, Krueger JI, Greitemeyer T, Vogrincic C, Kastenmüller A, Frey D, Heene M, Wicher M, & Kainbacher M The bystander-effect: A meta-analytic review on bystander intervention in dangerous and non-dangerous emergencies. Psychol Bull. 2011;137:517–37.
Francis G. Publication bias and the failure of replication in experimental psychology. Psychon Bull Rev 2012;1-17.
Giner-Sorolla R. Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science. Perspect Psychol Sci. 2012;7(6):562–71.
Gold C. Dose and effect in CBT for schizophrenia. Br J Psychiatry. 2015;207(3):269. doi:10.1192/bjp.207.3.269.
Greenwald AG. There is nothing so theoretical as a good method. Perspect Psychol Sci. 2012;7:99–108.
Hartshorne J, Schachner A. Tracking replicability as a method of post-publication open evaluation. Front Comput Neurosci. 2012;6:1–14.
Johnson DJ, Cheung F, Donnellan MB. Does cleanliness influence moral judgments? A direct replication of Schnall, Benton, and Harvey (2008). Soc Psychol. 2014;45:209–215.
Jost JT, Glaser J, Kruglanski AW, Sulloway FJ. Political conservatism as motivated social cognition. Psychol Bull. 2003;129(3):339.
Jussim L, Crawford JT, Anglin SM, Stevens ST, Duarte JL. Interpretations and methods: Towards a more effectively self-correcting social psychology. J Exp Soc Psychol. 2016. (in press)
Kahneman D. Experiences of collaborative research. Am Psychol. 2003;58(9):723.
Kahneman D. A new etiquette for replication. Social Psychology. 2014;45(4):310.
Klein RA, Ratliff K, Vianello M, Adams Jr AB, Bahník S, Bernstein NB, Cemalcilar Z. Investigating variation in replicability. A “Many Labs” Replication Project. Soc Psychol. 2014;45:142–152.
Lakens D, Hilgard J & Staaks J (in press, this issue) On the Reproducibility of Meta-Analyses: Six Practical Recommendations. BMC Psychology.
Lau J, Antman EM, Jimenez-Silva J, Kupelnick B, Mosteller F, Chalmers TC. Cumulative meta-analysis of therapeutic trials for myocardial infarction. N Engl J Med. 1992;327:248–54.
Laws KR. Negativland–A home for all findings in psychology. BMC Psychology. 2013;1(2):1–8. doi:10.1186/2050-7283-1-2.
Lykken DT. Statistical significance in psychological research. Psychol Bull. 1968;7:151.
Lynott D, Corker KS, Wortman J, Connell L, Donnellan MB, Lucas RE, & O’Brien K. Replication of “Experiencing physical warmth promotes interpersonal warmth” by Williams and Bargh (2008). Soc Psychol. 2014;45:216–222.
Madden CS, Easley RW, Dunn MG. How journal editors view replication research. J Advert. 1995;24:78–87.
Makel MC, Plucker JA, Hegarty B. Replications in Psychology Research: How Often Do They Really Occur? Perspect Psychol Sci. 2012;7:537–42.
Mosteller F. “Comment” on Jessica Utts, “Replication and metaanalysis in parapsychology”. Statistical Science. 1991;6(4):395–396.
Neuliep JW, Crandall R. Editorial bias against replication research. J Soc Behav Pers. 1990;5:85–90.
Neuliep JW, Crandall R. Reviewer bias against replication research. J Soc Behav Pers. 1993;8:21–9.
Nosek BA, Lakens D. Registered reports: A method to increase the credibility of published results. Soc Psychol. 2014;45:137–141.
Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716.
Popper KR. The logic of scientific discovery. New York: Routledge. 1959.
Reif F. The competitive world of the pure scientist. Science. 1961;134:1957–62.
Sapolsky A. An effort at studying Rorschach content symbolism: The frog response. J Consult Psychol. 1964;28(5):469.
Schnall S, Benton J, Harvey S. With a clean conscience: Cleanliness reduces the severity of moral judgments. Psychol Sci. 2008;19(12):1219–22.
Schweinsberg M, Madan N, Vianello M, Sommer SA, Jordan J, Tierney W, Srinivasan M. The pipeline project: Pre-publication independent replications of a single laboratory’s research pipeline. J Exp Soc Psychol. 2016. (in press)
Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 2011;22:1359–66.
Simonsohn U, Nelson LD, Simmons JP. P-curve: a key to the file-drawer. J Exp Psychol Gen. 2014;143:534–47.
Sinclair HC, Hood K, Wright B. Revisiting the Romeo and Juliet (Driscoll, Davis, & Lipetz, 1972): Reexamining the links between social network opinions and romantic relationship outcomes. Soc Psychol. 2014;45:170–178.
Sterling TD, Rosenbaum WL, Weinkam JJ. Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. Am Stat. 1995;49(1):108–12.
Storm L, Tressoldi PE, Di Risio L. Meta-analysis of free-response studies, 1992–2008: Assessing the noise reduction model in parapsychology. Psychol Bull. 2010;136(4):471.
Taylor AE & Munafò MR (in press, this issue) Triangulating Meta-Analyses: The example of the serotonin transporter gene, stressful life events and major depression. BMC Psychology.
Van Hiel A, Onraet E, De Pauw S. The Relationship between Social‐Cultural Attitudes and Behavioral Measures of Cognitive Style: A Meta‐Analytic Integration of Studies. J Pers. 2010;78(6):1765–800.
Williams LE, Bargh JA. Experiencing physical warmth promotes interpersonal warmth. Science. 2008;322(5901):606–7.
Acknowledgements
Not applicable.
Availability of data and materials
Authors’ contributions, competing interests.
Keith R Laws is a Section Editor for BMC Psychology, who declares no competing interests.
Consent for publication
Ethics approval and consent to participate, author information, authors and affiliations.
School of Life and Medical Sciences, University of Hertfordshire, Hatfield, UK
Keith R. Laws
Corresponding author
Correspondence to Keith R. Laws.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article.
Laws, K.R. Psychology, replication & beyond. BMC Psychol 4, 30 (2016). https://doi.org/10.1186/s40359-016-0135-2
Received: 17 May 2016
Accepted: 20 May 2016
Published: 01 June 2016
DOI: https://doi.org/10.1186/s40359-016-0135-2
BMC Psychology
ISSN: 2050-7283
This page has been archived and is no longer being updated regularly.
Leaning into the replication crisis: Why you should consider conducting replication research
As a student, you may have heard buzz about the replication crisis in psychology. Some of the more sensational headlines paint a bleak picture of modern research, and depending on whom you ask, some researchers are optimistic while others are demoralized. So what is going on, and is it really a crisis?
The dialogue around replication ignited in 2015 when Brian Nosek’s lab reported that after replicating 100 studies from three psychology journals, researchers were unable to reproduce a large portion of findings. This report was controversial because it called into question the validity of research shared in academic journals. Publication in high profile journals requires the research to be subjected to a rigorous peer-review process. At this point, it is assumed the conclusions shared are trustworthy and others can now replicate or build upon the work. Following the Nosek study, more labs began to conduct replications and a disturbing trend emerged: a large portion of studies across multiple disciplines in science failed the replication test.
Replication is vital to psychology because studying human behavior is messy. There are numerous extraneous variables that can result in bias if researchers are not vigilant. Replication helps verify that the presence of a behavior at one point in time is not due to chance. The report that the Open Science Collaboration (2015) put forth did not undermine the peer-review process per se; rather, it highlighted a problem within the research culture. Journals were more likely to publish innovative studies over replication studies. Following the trends of the journals, researchers who require publications in order to advance their careers are unlikely to conduct a replication. As a result, without continued investigation, the exploratory studies can be treated as established lines of research rather than fledgling inquiries.
In response to the replication crisis, more individuals have been embracing the movement toward transparency in research. The Open Science Framework (OSF) and the Society for the Improvement of Psychological Science (SIPS) have created opportunities for researchers to brainstorm means of strengthening research practices and provide avenues to share replication results. Based on these changes, I would argue the issue of replication was not a crisis, but an awakening for researchers who had become complacent about the consequences of the toxic elements of the research culture. Highlighting the issue resulted in dialogue and change. It is a perfect example of the dynamic nature of science and captures the essence of how a career in research can be intellectually stimulating, rewarding and sometimes frustrating.
From a student perspective, engaging in replication research is a useful tool to develop your own research skills. I have found that many students have misconceptions about how to conduct research. Some common behaviors include:
- Assuming their idea is unique, but not conducting a thorough literature search to determine what is established.
- Proposing studies that are too complicated or have design faults.
- Lack of awareness of ethics or the approval process needed to conduct experiments.
- Lack of experience in regard to data entry or statistical analysis.
- Desire to change practices mid-study to increase participant compliance.
The replication movement has presented a unique opportunity for undergraduate researchers to provide meaningful contributions to science by bolstering the evidence needed to substantiate exploratory findings. I teach a research seminar at Central Oregon Community College that requires teams to complete a replication study provided by the Collaborative Replication and Education Project (associated with OSF). Reviewers give feedback before and after data collection, identifying problematic areas and ensuring the study is an appropriate replication. Completed projects are shared on the website and exemplary studies are eligible for research badges. The process of replication requires students to slow down and analyze strategies. Over the course of the term, students' understanding of the process matures as groups question the choices of the original researchers. It is a high-impact, low-risk educational environment because students learn valuable lessons whether or not the replication is successful.
Replication studies may not offer rewards for professionals, but there are direct incentives for students. Former seminar students have been able to add research experience to resumes, which, in turn, has allowed them to secure competitive positions in labs upon transfer to four-year institutions. My students have also reported feeling more prepared for upper division courses and more confident in their abilities to conduct individual research.
If you would like to learn more about the replication movement or how you can begin a replication study, I suggest beginning with the reference below and visiting some of the websites of the organizations listed in this article.
Website for the Open Science Framework
Website for the Society for the Improvement of Psychological Science
References
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349 (6251), aac4716. DOI: 10.1126/science.aac4716
Scientists often fail when they try to replicate studies. This psychologist explains why.
by Julia Belluz
For a landmark collaborative study, published today in the journal Science, researchers tried to replicate 100 recent psychology studies from top journals to see if they’d get the same results as the original authors. Overwhelmingly, they failed.
Replication — the attempt to validate previous findings by reproducing experiments — is one of the foundational ideas behind the scientific method. It tells researchers, and those who use their studies (policymakers, patients, doctors), whether research results are actually reliable or whether they might be wrong.
In this case, about 36 percent of the replications showed an effect that was consistent with the original study. So the failure rate was more than 60 percent.
And it’s not the first time a high-profile replication effort returned concerning results — a dismal state of affairs that has led some prominent thinkers to estimate that most published scientific research findings are wrong.
But this latest study should not be read as more bad news in a distressing conversation about science’s failures, says Brian Nosek, the University of Virginia psychologist who led the effort. It is part of the Reproducibility Project: Psychology, one of many high-profile collaborative efforts to retest important research results across a range of scientific fields. The goal: to strengthen the foundation of the house of science. We talked to Nosek about the study; what follows is our conversation, lightly edited for length and clarity.
Julia Belluz: We talk a lot about the need to replicate and reproduce studies. But I think there’s little appreciation about what that actually means. Can you describe what it took to replicate 100 studies?
Brian Nosek: It is labor-intensive work. It’s easier than the original research — you don’t also have to generate the materials from nothing. But at the same time, there are challenges in understanding how another group did a study. The areas where it is a lot of work are in reviewing the methodology [the description of how the study was done] from the materials that are available, then trying to ascertain how they actually did the study. What is it that really happened?
The most interesting parts of developing these replications involved requesting original materials from the authors and comparing that against the described methodology, writing out a new methodology, and then sending that back to the original authors for their review, comments, and revisions. A lot of times in that process, researchers would say, “We actually did this thing or that thing.” It isn’t because they did something wrong, but because the norms of science are to be brief in describing the methodology.
JB: Does this mean scientists aren’t always doing a good job of writing detailed enough methodologies?
Each dot represents a study, and you can see the original study effect size versus replication effect size. The diagonal line represents the replication effect size equal to original effect size. Points below the dotted line were effects in the opposite direction of the original. ( Science )
BN: It would be great to have stronger norms about being more detailed with the methods in the paper. But even more than that, it would be great if the norm were to post procedural details as supplements in the paper. For a lot of papers, I don’t need to know those details if I’m not trying to replicate it. I’m just reading the paper, trying to learn about the outcomes. But for stuff that’s in my area — I need access to those details so I can really understand what they did. If I can rapidly get up to speed, I have a much better chance of approximating the results.
JB: Right now, there’s a tendency to think failed replications mean the original research was wrong. (We saw this with the recent discussion around the high-profile “worm wars” replication.) But as your work here shows, that’s really not necessarily the case. Can you explain that?
BN: That’s a really important point, and it applies to all research. If you have motivations or stakes in the outcome, if you have a lot of flexibility in how you analyze your data, what choices you make, political ideologies — all those things can have a subtle influence, maybe without even the intention [to game the results of the replication].
So pre-registration [putting the study design on an open database before running the study, so you can’t change the methods if you get results you don’t like] is an important feature of doing confirmatory analysis in research. That can apply to replication efforts, as well. If you’re going to reanalyze the data, or, in our case, where you’re doing a study with brand new data collection, the pre-registration process is a way to put your chips down.
JB: After helping run this massive experiment, do you have any advice for others?
BN: My main observation here is that reproducibility is hard. That’s for many reasons. Scientists are working on hard problems. They’re investigating things where we don’t know the answer. So the fact that things go wrong in the research process, meaning we don’t get to the right answer right away, is no surprise. That should be expected.
There are three reasons that a replication might get a negative result when the original got a positive result. One, the original is a false positive — the study falsely detected evidence for an effect by chance. Two, the replication is a false negative — the study falsely failed to detect evidence for an effect by chance. Three, there is a critical difference in the original and replication methodology that accounts for the difference.
JB: Can you give me an example?
BN: Imagine an original study that found a relationship between exercise and health. Researchers conclude that people who exercise more are healthier than people who do not. A replication team runs a very similar study and finds no relationship.
One and two [described above] are possibilities that one of the teams’ evidence is incorrect and the other evidence is more credible.
Three [described above] is the possibility that when the teams look closely, they realize that the original team did their study on only women and the replication team did their study on only men. Neither team realized that this might matter — the claim was that the exercise-health relationship was about people. Now that they see the results, they wonder if gender matters.
The key is that we don’t know for sure that gender matters. It could still be one or two. But we have a new hypothesis to test in a third study. And if confirmed, it would improve our understanding of the phenomenon. Was it the changes in the sample? The procedure? Being able to dig into the differences where you observe that is a way to get a better handle on the phenomenon. That’s just science doing science.
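To make the first two explanations concrete, here is a rough sketch of how often a replication would fail for each reason under simple assumptions. Every number is hypothetical: the prior share of true effects, the significance threshold and the statistical power of each study are invented for illustration and are not estimates from the Reproducibility Project. The third explanation, a critical difference in methodology, is not modeled.

```python
def replication_breakdown(prior_true, alpha, power_original, power_replication):
    """Split replication outcomes for originally significant findings.

    Simplification: any significant replication counts as a success, and
    methodological differences between the two studies are ignored.
    """
    # Among significant original results, the share reflecting a true effect.
    true_pos = prior_true * power_original
    false_pos = (1 - prior_true) * alpha
    share_true = true_pos / (true_pos + false_pos)

    # Reason one: the original was a false positive and the replication
    # correctly finds nothing.
    fail_original_false_positive = (1 - share_true) * (1 - alpha)
    # Reason two: the effect is real but the replication misses it.
    fail_replication_false_negative = share_true * (1 - power_replication)
    success = 1 - fail_original_false_positive - fail_replication_false_negative

    return {
        "success": success,
        "fail_original_false_positive": fail_original_false_positive,
        "fail_replication_false_negative": fail_replication_false_negative,
    }

# Hypothetical scenario: half of tested hypotheses are true, originals have
# 50% power, replications have 80% power, and alpha is 0.05.
breakdown = replication_breakdown(
    prior_true=0.5, alpha=0.05, power_original=0.5, power_replication=0.8
)
for name, value in breakdown.items():
    print(f"{name}: {value:.1%}")
```

Changing the inputs shifts the balance between the two failure types, which is the point of the sketch: a failed replication on its own does not say which study was wrong.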
JB: We’re hearing a lot about replication efforts these days. Is it more talk than action? Or if not, which country is leading the effort?
BN: I have no sense of the place that’s leading in funding. But the US is among the places where there’s the most progress. The NIH and NSF [National Institutes of Health and National Science Foundation] have been looking into supporting replication research. And the Netherlands has had a lot of conversations about this.
But it’s definitely [more popular now]. For me, it’s a question of research efficiency. If we only value innovation, we’re going to get a lot of great ideas and very little grounding in the stability of those ideas. The effort on improving reproducibility while paying attention to the fact that innovation is the primary driver of science will help us be better stewards of public funding in science and help science fulfill its promise. There aren’t better alternatives. We really need to get this right.
Replication is important for educational psychology: Recent developments and key issues
- CC BY-NC-ND 4.0
- Johns Hopkins University
- The University of Calgary
Replications in Psychology Research: How Often Do They Really Occur?
Affiliations.
- 1 Duke University.
- 2 University of Connecticut.
- 3 University of New Hampshire.
- PMID: 26168110
- DOI: 10.1177/1745691612460688
Recent controversies in psychology have spurred conversations about the nature and quality of psychological research. One topic receiving substantial attention is the role of replication in psychological science. Using the complete publication history of the 100 psychology journals with the highest 5-year impact factors, the current article provides an overview of replications in psychological research since 1900. This investigation revealed that roughly 1.6% of all psychology publications used the term replication in text. A more thorough analysis of 500 randomly selected articles revealed that only 68% of articles using the term replication were actual replications, resulting in an overall replication rate of 1.07%. Contrary to previous findings in other fields, this study found that the majority of replications in psychology journals reported similar findings to their original studies (i.e., they were successful replications). However, replications were significantly less likely to be successful when there was no overlap in authorship between the original and replicating articles. Moreover, despite numerous systemic biases, the rate at which replications are being published has increased in recent decades.
Keywords: content analysis; replication; research methodology.
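The overall replication rate in the abstract follows from multiplying its two reported proportions. The short check below uses the rounded figures as printed (1.6% and 68%); the product comes out near 1.09% rather than the reported 1.07%, a gap that is presumably due to rounding of the inputs in the abstract (an assumption on my part, since the unrounded counts are not given here).

```python
# Figures as reported in the abstract (rounded).
share_using_term = 0.016        # publications using the term "replication"
share_actual_replication = 0.68  # of those, the share that were actual replications

# Estimated share of all publications that are actual replications.
approx_rate = share_using_term * share_actual_replication
print(f"approximate overall replication rate: {approx_rate:.2%}")
```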
© The Author(s) 2012.
Similar articles
- Rewarding Replications: A Sure and Simple Way to Improve Psychological Science. Koole SL, Lakens D. Koole SL, et al. Perspect Psychol Sci. 2012 Nov;7(6):608-14. doi: 10.1177/1745691612462586. Perspect Psychol Sci. 2012. PMID: 26168120
- Is Psychological Science Self-Correcting? Citations Before and After Successful and Failed Replications. von Hippel PT. von Hippel PT. Perspect Psychol Sci. 2022 Nov;17(6):1556-1565. doi: 10.1177/17456916211072525. Epub 2022 Jun 17. Perspect Psychol Sci. 2022. PMID: 35713980 Free PMC article.
- The Psychology of Replication and Replication in Psychology. Francis G. Francis G. Perspect Psychol Sci. 2012 Nov;7(6):585-94. doi: 10.1177/1745691612459520. Perspect Psychol Sci. 2012. PMID: 26168115
- [Breast pathology: evaluation of the Portuguese scientific activity based on bibliometric indicators]. Donato HM, De Oliveira CF. Donato HM, et al. Acta Med Port. 2006 May-Jun;19(3):225-34. Epub 2006 Sep 7. Acta Med Port. 2006. PMID: 17234084 Review. Portuguese.
- Top-cited articles in endodontic journals. Fardi A, Kodonas K, Gogos C, Economides N. Fardi A, et al. J Endod. 2011 Sep;37(9):1183-90. doi: 10.1016/j.joen.2011.05.037. Epub 2011 Jul 20. J Endod. 2011. PMID: 21846531 Review.
- A guide for social science journal editors on easing into open science. Silverstein P, Elman C, Montoya A, McGillivray B, Pennington CR, Harrison CH, Steltenpohl CN, Röer JP, Corker KS, Charron LM, Elsherif M, Malicki M, Hayes-Harb R, Grinschgl S, Neal T, Evans TR, Karhulahti VM, Krenzer WLD, Belaus A, Moreau D, Burin DI, Chin E, Plomp E, Mayo-Wilson E, Lyle J, Adler JM, Bottesini JG, Lawson KM, Schmidt K, Reneau K, Vilhuber L, Waltman L, Gernsbacher MA, Plonski PE, Ghai S, Grant S, Christian TM, Ngiam W, Syed M. Silverstein P, et al. Res Integr Peer Rev. 2024 Feb 16;9(1):2. doi: 10.1186/s41073-023-00141-5. Res Integr Peer Rev. 2024. PMID: 38360805 Free PMC article.
- Risk of Psychosis Among Individuals Who Have Presented to Hospital With Self-harm: A Prospective Nationwide Register Study in Sweden. Bolhuis K, Ghirardi L, Kuja-Halkola R, Lång U, Cederlöf M, Metsala J, Corcoran P, O'Connor K, Dodd P, Larsson H, Kelleher I. Bolhuis K, et al. Schizophr Bull. 2024 Jul 27;50(4):881-890. doi: 10.1093/schbul/sbae002. Schizophr Bull. 2024. PMID: 38243843 Free PMC article.
- Preserving the Flame: The Past, Present, and Future of EJOP. Karl J. Karl J. Eur J Psychol. 2023 May 31;19(2):125-127. doi: 10.5964/ejop.11945. eCollection 2023 May. Eur J Psychol. 2023. PMID: 37731888 Free PMC article. No abstract available.
- Epidemiological characteristics and prevalence rates of research reproducibility across disciplines: A scoping review of articles published in 2018-2019. Cobey KD, Fehlmann CA, Christ Franco M, Ayala AP, Sikora L, Rice DB, Xu C, Ioannidis JPA, Lalu MM, Ménard A, Neitzel A, Nguyen B, Tsertsvadze N, Moher D. Cobey KD, et al. Elife. 2023 Jun 21;12:e78518. doi: 10.7554/eLife.78518. Elife. 2023. PMID: 37341380 Free PMC article. Review.
- Testing Replicability and Generalizability of the Time on Task Effect. Krämer RJ, Koch M, Levacher J, Schmitz F. Krämer RJ, et al. J Intell. 2023 Apr 28;11(5):82. doi: 10.3390/jintelligence11050082. J Intell. 2023. PMID: 37233332 Free PMC article.
Related information
- Cited in Books
LinkOut - more resources
Full text sources.
- Citation Manager
NCBI Literature Resources
MeSH PMC Bookshelf Disclaimer
The PubMed wordmark and PubMed logo are registered trademarks of the U.S. Department of Health and Human Services (HHS). Unauthorized use of these marks is strictly prohibited.
An official website of the United States government
The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.
The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.
- Publications
- Account settings
The PMC website is updating on October 15, 2024. Learn More or Try it out now .
- Advanced Search
- Journal List
- v.18(3); 2020 Mar
What is replication?
Brian A. Nosek
1 Center for Open Science, Charlottesville, Virginia, United States of America
2 University of Virginia, Charlottesville, Virginia, United States of America
Timothy M. Errington
Credibility of scientific claims is established with evidence for their replicability using new data. According to common understanding, replication is repeating a study’s procedure and observing whether the prior finding recurs. This definition is intuitive, easy to apply, and incorrect. We propose that replication is a study for which any outcome would be considered diagnostic evidence about a claim from prior research. This definition reduces emphasis on operational characteristics of the study and increases emphasis on the interpretation of possible outcomes. The purpose of replication is to advance theory by confronting existing understanding with new evidence. Ironically, the value of replication may be strongest when existing understanding is weakest. Successful replication provides evidence of generalizability across the conditions that inevitably differ from the original study; unsuccessful replication indicates that the reliability of the finding may be more constrained than recognized previously. Defining replication as a confrontation of current theoretical expectations clarifies its important, exciting, and generative role in scientific progress.
What is replication? This Perspective article proposes that the answer shifts the conception of replication from a boring, uncreative, housekeeping activity to an exciting, generative, vital contributor to research progress.
Introduction
Credibility of scientific claims is established with evidence for their replicability using new data [ 1 ]. This is distinct from retesting a claim using the same analyses and same data (usually referred to as reproducibility or computational reproducibility ) and using the same data with different analyses (usually referred to as robustness ). Recent attempts to systematically replicate published claims indicate surprisingly low success rates. For example, across 6 recent replication efforts of 190 claims in the social and behavioral sciences, 90 (47%) replicated successfully according to each study’s primary success criterion [ 2 ]. Likewise, a large-sample review of 18 candidate gene or candidate gene-by-interaction hypotheses for depression found no support for any of them [ 3 ], a particularly stunning result considering that more than 1,000 articles have investigated their effects. Replication challenges have spawned initiatives to improve research rigor and transparency such as preregistration and open data, materials, and code [ 4 – 6 ]. Simultaneously, failures-to-replicate have spurred debate about the meaning of replication and its implications for research credibility. Replications are inevitably different from the original studies. How do we decide whether something is a replication? The answer shifts the conception of replication from a boring, uncreative, housekeeping activity to an exciting, generative, vital contributor to research progress.
Replication reconsidered
According to common understanding, replication is repeating a study’s procedure and observing whether the prior finding recurs [ 7 ]. This definition of replication is intuitive, easy to apply, and incorrect.
The problem is this definition’s emphasis on repetition of the technical methods—the procedure, protocol, or manipulated and measured events. Why is that a problem? Imagine an original behavioral study was conducted in the United States in English. What if the replication is to be done in the Philippines with a Tagalog-speaking sample? To be a replication, must the materials be administered in English? With no revisions for the cultural context? If minor changes are allowed, then what counts as minor to still qualify as repeating the procedure? More broadly, it is not possible to recreate an earthquake, a supernova, the Pleistocene, or an election. If replication requires repeating the manipulated or measured events of the study, then it is not possible to conduct replications in observational research or research on past events.
The repetition of the study procedures is an appealing definition of replication because it often corresponds to what researchers do when conducting a replication—i.e., faithfully follow the original methods and procedures as closely as possible. But the reason for doing so is not because repeating procedures defines replication. Replications often repeat procedures because theories are too vague and methods too poorly understood to productively conduct replications and advance theoretical understanding otherwise [ 8 ].
Prior commentators have drawn distinctions between types of replication such as “direct” versus “conceptual” replication and argue in favor of valuing one over the other (e.g., [ 9 , 10 ]). By contrast, we argue that distinctions between “direct” and “conceptual” are at least irrelevant and possibly counterproductive for understanding replication and its role in advancing knowledge. Procedural definitions of replication are masks for underdeveloped theoretical expectations, and “conceptual replications” as they are identified in practice often fail to meet the criteria we develop here and deem essential for a test to qualify as a replication.
Replication redux
We propose an alternative definition for replication that is more inclusive of all research and more relevant for the role of replication in advancing knowledge. Replication is a study for which any outcome would be considered diagnostic evidence about a claim from prior research. This definition reduces emphasis on operational characteristics of the study and increases emphasis on the interpretation of possible outcomes.
To be a replication, 2 things must be true: outcomes consistent with a prior claim would increase confidence in the claim, and outcomes inconsistent with a prior claim would decrease confidence in the claim. The symmetry promotes replication as a mechanism for confronting prior claims with new evidence. Therefore, declaring that a study is a replication is a theoretical commitment. Replication provides the opportunity to test whether existing theories, hypotheses, or models are able to predict outcomes that have not yet been observed. Successful replications increase confidence in those models; unsuccessful replications decrease confidence and spur theoretical innovation to improve or discard the model. This does not imply that the magnitude of belief change is symmetrical for “successes” and “failures.” Prior and existing evidence inform the extent to which replication outcomes alter beliefs. However, as a theoretical commitment, replication does imply precommitment to taking all outcomes seriously.
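One way to make the two criteria concrete is a simple Bayesian update. This is an illustrative framing, not the authors' own method, and every number below is hypothetical. The sketch shows that a diagnostic study moves confidence up after a consistent outcome and down after an inconsistent one, and that the two shifts need not be equal in size.

```python
def update_confidence(prior: float, p_outcome_if_true: float, p_outcome_if_false: float) -> float:
    """Posterior probability that the claim is true after one outcome (Bayes' rule)."""
    numerator = prior * p_outcome_if_true
    denominator = numerator + (1.0 - prior) * p_outcome_if_false
    return numerator / denominator


# Hypothetical inputs, chosen only for illustration.
prior = 0.5                # confidence in the claim before the replication
p_success_if_true = 0.8    # chance of a consistent outcome if the claim is true
p_success_if_false = 0.05  # chance of a consistent outcome if the claim is false

# A consistent outcome raises confidence in the claim.
after_success = update_confidence(prior, p_success_if_true, p_success_if_false)

# An inconsistent outcome lowers it; its probabilities are the complements.
after_failure = update_confidence(prior, 1 - p_success_if_true, 1 - p_success_if_false)

print(f"after a consistent outcome:    {after_success:.3f}")
print(f"after an inconsistent outcome: {after_failure:.3f}")
```

If the two conditional probabilities were equal, neither outcome would change the prior, and under the definition above the study would not count as a replication.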
Because replication is defined based on theoretical expectations, not everyone will agree that one study is a replication of another. Moreover, it is not always possible to make precommitments to the diagnosticity of a study as a replication, often for the simple reason that study outcomes are already known. Deciding whether studies are replications after observing the outcomes can leverage post hoc reasoning biases to dismiss “failures” as nonreplications and “successes” as diagnostic tests of the claims, or the reverse if the observer wishes to discredit the claims. This can unproductively retard research progress by dismissing replication counterevidence. Simultaneously, replications can fail to meet their intended diagnostic aims because of error or malfunction in the procedure that is only identifiable after the fact. When there is uncertainty about the status of claims and the quality of methods, there is no easy solution to distinguishing between motivated and principled reasoning about evidence. Science’s most effective solution is to replicate, again.
At its best, science minimizes the impact of ideological commitments and reasoning biases by being an open, social enterprise. To achieve that, researchers should be rewarded for articulating their theories clearly and a priori so that they can be productively confronted with evidence [ 4 , 6 ]. Better theories are those that make it clear how they can be supported and challenged by replication. Repeated replication is often necessary to resolve confidence in a claim, and, invariably, researchers will have plenty to argue about even when replication and precommitment are normative practices.
Replication resolved
The purpose of replication is to advance theory by confronting existing understanding with new evidence. Ironically, the value of replication may be strongest when existing understanding is weakest. Theory advances in fits and starts with conceptual leaps, unexpected observations, and a patchwork of evidence. That is okay; it is fuzzy at the frontiers of knowledge. The dialogue between theory and evidence facilitates identification of contours, constraints, and expectations about the phenomena under study. Replicable evidence provides anchors for that iterative process. If evidence is replicable, then theory must eventually account for it, even if only to dismiss it as irrelevant because of invalidity of the methods. For example, the claims that there are more obese people in wealthier countries compared with poorer countries on average and that people in wealthier countries live longer than people in poorer countries on average could both be highly replicable. All theoretical perspectives about the relations between wealth, obesity, and longevity would have to account for those replicable claims.
There is no such thing as exact replication. We cannot reproduce an earthquake, era, or election, but replication is not about repeating historical events. Replication is about identifying the conditions sufficient for assessing prior claims. Replication can occur in observational research when the conditions presumed essential for observing the evidence recur, such as when a new seismic event has the characteristics deemed necessary and sufficient to observe an outcome predicted by a prior theory or when a new method for reassessing a fossil offers an independent test of existing claims about that fossil. Even in experimental research, original and replication studies inevitably differ in some aspects of the sample—or units—from which data are collected, the treatments that are administered, the outcomes that are measured, and the settings in which the studies are conducted [ 11 ].
Individual studies do not provide comprehensive or definitive evidence about all conditions for observing evidence about claims. The gaps are filled with theory. A single study examines only a subset of units, treatments, outcomes, and settings. The study was conducted in a particular climate, at particular times of day, at a particular point in history, with a particular measurement method, using particular assessments, with a particular sample. Rarely do researchers limit their inference to precisely those conditions. If they did, scientific claims would be historical claims because those precise conditions will never recur. If a claim is thought to reveal a regularity about the world, then it is inevitably generalizing to situations that have not yet been observed. The fundamental question is: of the innumerable variations in units, treatments, outcomes, and settings, which ones matter? Time-of-day for data collection may be expected to be irrelevant for a claim about personality and parenting or critical for a claim about circadian rhythms and inhibition.
When theories are too immature to make clear predictions, repetition of original procedures becomes very useful. Using the same procedures is an interim solution for not having clear theoretical specification of what is needed to produce evidence about a claim. And, using the same procedures reduces uncertainty about what qualifies as evidence “consistent with” earlier claims. Replication is not about the procedures per se, but using similar procedures reduces uncertainty in the universe of possible units, treatments, outcomes, and settings that could be important for the claim.
Because there is no exact replication, every replication test assesses generalizability to the new study’s unique conditions. However, every generalizability test is not a replication. Fig 1's left panel illustrates a discovery and conditions around it to which it is potentially generalizable. The generalizability space is large because of theoretical immaturity; there are many conditions in which the claim might be supported, but failures would not discredit the original claim. Fig 1's right panel illustrates a maturing understanding of the claim. The generalizability space has shrunk because some tests identified boundary conditions (gray tests), and the replicability space has increased because successful replications and generalizations (colored tests) have improved theoretical specification for when replicability is expected.
For underspecified theories, there is a larger space for which the claim may or may not be supported—the theory does not provide clear expectations. These are generalizability tests. Testing replicability is a subset of testing generalizability. As theory specification improves (moving from left panel to right panel), usually interactively with repeated testing, the generalizability and replicability space converge. Failures-to-replicate or generalize shrink the space (dotted circle shows original plausible space). Successful replications and generalizations expand the replicability space—i.e., broadening and strengthening commitments to replicability across units, treatments, outcomes, and settings.
Successful replication provides evidence of generalizability across the conditions that inevitably differ from the original study; unsuccessful replication indicates that the reliability of the finding may be more constrained than recognized previously. Repeatedly testing replicability and generalizability across units, treatments, outcomes, and settings facilitates improvement in theoretical specificity and future prediction.
Theoretical maturation is illustrated in Fig 2 . A progressive research program (the left path) succeeds in replicating findings across conditions presumed to be irrelevant and also matures the theoretical account to more clearly distinguish conditions for which the phenomenon is expected to be observed or not observed. This is illustrated by a shrinking generalizability space in which the theory does not make clear predictions. A degenerative research program (the right path) persistently fails to replicate the findings and progressively narrows the universe of conditions to which the claim could apply. This is illustrated by shrinking generalizability and replicability space because the theory must be constrained to ever narrowing conditions [ 12 ].
With progressive success (left path) theoretical expectations mature, clarifying when replicability is expected. Also, boundary conditions become clearer, reducing the potential generalizability space. A complete theoretical account eliminates generalizability space because the theoretical expectations are so clear and precise that all tests are replication tests. With repeated failures (right path) the generalizability and replicability space both shrink, eventually to a theory so weak that it makes no commitments to replicability.
This exposes an inevitable ambiguity in failures-to-replicate. Was the original evidence a false positive or the replication a false negative, or does the replication identify a boundary condition of the claim? We can never know for certain that earlier evidence was a false positive. It is always possible that it was “real,” and we cannot identify or recreate the conditions necessary to replicate successfully. But that does not mean that all claims are true or that science cannot be self-correcting. Accumulating failures-to-replicate could result in a much narrower but more precise set of circumstances in which evidence for the claim is replicable, or it may result in failure to ever establish conditions for replicability and relegate the claim to irrelevance.
The ambiguity between disconfirming an original claim or identifying a boundary condition also means that understanding whether or not a study is a replication can change due to accumulation of knowledge. For example, the famous experiment by Otto Loewi (1936 Nobel Prize in Physiology or Medicine) showed that the inhibitory factor “vagusstoff,” subsequently determined to be acetylcholine, was released from the vagus nerve of frogs, suggesting that neurotransmission was a chemical process. Much later, after his and others’ failures-to-replicate his original claim, a crucial theoretical insight identified that the time of year at which Loewi performed his experiment was critical to its success [ 13 ]. The original study was performed with so-called winter frogs. The replication attempts performed with summer frogs failed because of seasonal sensitivity of the frog heart to the unrecognized acetylcholine, making the effects of vagal stimulation far more difficult to demonstrate. With subsequent tests providing supporting evidence, the understanding of the claim improved. What had been perceived as replications were not anymore because new evidence demonstrated that they were not studying the same thing. The theoretical understanding evolved, and subsequent replications supported the revised claims. That is not a problem, that is progress.
Replication is rare
The term “conceptual replication” has been applied to studies that use different methods to test the same question as a prior study. This is a useful research activity for advancing understanding, but many studies with this label are not replications by our definition. Recall that “to be a replication, 2 things must be true: outcomes consistent with a prior claim would increase confidence in the claim, and outcomes inconsistent with a prior claim would decrease confidence in the claim.” Many “conceptual replications” meet the first criterion and fail the second. That is, they are not designed such that a failure to replicate would revise confidence in the original claim. Instead, “conceptual replications” are often generalizability tests. Failures are interpreted, at most, as identifying boundary conditions. A self-assessment of whether one is testing replicability or generalizability comes down to one question: would an outcome inconsistent with prior findings cause me to lose confidence in the theoretical claims? If no, then it is a generalizability test.
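The self-assessment described above can be written as a small decision rule. The function name, argument names, and the label for the third case are hypothetical; the source names only the "replication" and "generalizability test" outcomes.

```python
def classify_test(success_raises_confidence: bool, failure_lowers_confidence: bool) -> str:
    """Label a planned study using the two criteria quoted in the text."""
    if success_raises_confidence and failure_lowers_confidence:
        # Both outcomes are diagnostic of the prior claim.
        return "replication"
    if success_raises_confidence:
        # A failure would be read, at most, as a boundary condition.
        return "generalizability test"
    # Not covered by the source's two labels; this wording is an assumption.
    return "not diagnostic of the prior claim"


# A typical "conceptual replication" as described in the text:
# a success counts in favor of the claim, a failure does not count against it.
print(classify_test(success_raises_confidence=True, failure_lowers_confidence=False))
```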
Designing a replication with a different methodology requires understanding of the theory and methods so that any outcome is considered diagnostic evidence about the prior claim. In practice, this means that replication is often limited to relatively close adherence to original methods for topics in which theory and methodology are immature (a circumstance commonly called “direct” or “close” replication), because the similarity of methods serves as a stand-in for theoretical and measurement precision. In fact, conducting a replication of a prior claim with a different methodology can be considered a milestone for theoretical and methodological maturity.
Replication is characterized as the boring, rote, clean-up work of science. This misperception makes funders reluctant to fund it, journals reluctant to publish it, and institutions reluctant to reward it. The disincentives for replication are a likely contributor to existing challenges of credibility and replicability of published claims [ 14 ].
Defining replication as a confrontation of current theoretical expectations clarifies its important, exciting, and generative role in scientific progress. Single studies, whether they pursue novel ends or confront existing expectations, never definitively confirm or disconfirm theories. Theories make predictions; replications test those predictions. Outcomes from replications are fodder for refining, altering, or extending theory to generate new predictions. Replication is a central part of the iterative maturing cycle of description, prediction, and explanation. A shift in attitude that includes replication in funding, publication, and career opportunities will accelerate research progress.
Acknowledgments
We thank Alex Holcombe, Laura Scherer, Leonhard Held, and Don van Ravenzwaaij for comments on earlier versions of this paper, and we thank Anne Chestnut for graphic design support.
Funding Statement
This work was supported by grants from Arnold Ventures, John Templeton Foundation, Templeton World Charity Foundation, and Templeton Religion Trust. The funders had no role in the preparation of the manuscript or the decision to publish.
Provenance: Commissioned; not externally peer reviewed.
- 19 November 2018
- Correction 19 November 2018
Replication failures in psychology not due to differences in study populations
- Brian Owens
A large-scale effort to replicate results in psychology research has rebuffed claims that failures to reproduce social-science findings might be down to differences in study populations.
doi: https://doi.org/10.1038/d41586-018-07474-y
Updates & Corrections
Correction 19 November 2018 : An earlier version of this story included an incorrect reference for the reproducibility paper.
Klein, R. A. et al. Preprint at PsyArXiv https://doi.org/10.31234/osf.io/9654g (2018).
Tversky, A., & Kahneman, D. Science 211 , 453–458 (1981).
Inbar, Y., Pizarro, D., Knobe, J., & Bloom, P. Emotion 9 , 435-439 (2009).
Klein, R. A. et al. Soc. Psychol. 45 , 142–152 (2014).
COMMENTS
Why Replication Is Important in Psychology . When studies are replicated and achieve the same or similar results as the original study, it gives greater validity to the findings. If a researcher can replicate a study's results, it is more likely that those results can be generalized to the larger population.
Progressive research programs have successfully replicated experiments; degenerating ones have failed replicated experiments. Replication provides a means to distinguish which parts of psychology are good and progressive from those that are bad and degenerating. ... hence why replication is important in psychology generally (if not in each and ...
The crisis of confidence in psychology has prompted vigorous and persistent debate in the scientific community concerning the veracity of the findings of psychological experiments. This discussion has led to changes in psychology's approach to research, and several new initiatives have been developed, many with the aim of improving our findings. One key advancement is the marked increase in ...
Replication—an important, uncommon, and misunderstood practice—is gaining appreciation in psychology. Achieving replicability is important for making research progress. If findings are not replicable, then prediction and theory development are stifled. If findings are replicable, then interrogation of their meaning and validity can advance ...
The idea of replication is based on the premise that there are empirical regularities or universal laws to be replicated and verified, and the scientific method is adequate for doing it. Scientific truth, however, is not absolute but relative to time, context, and the method used. Time and context are inextricably intertwined in that time (e.g ...
But perhaps more important, 52.9% of replications were conducted by the same research team as had produced the replicated article (defined as having an overlap of at least one author, including replications from the same publication).
Conversely, a sensational research result that cannot be replicated provides information to stakeholders that may prevent unnecessary resource and opportunity costs. Replicability is therefore a cornerstone of the research endeavor in educational psychology. It tends to occur in one of two forms, direct or conceptual replications.
Unfortunately, psychology has had a tradition of avoiding replications. It has been argued that there is a greater impetus put on developing original studies than there is on replications of established work, so much so that it may even be considered risky to do so (OSC 2012). This is true across all aspects of the discipline, from journals, to funding agencies, to the ways in which scientific ...
This results in an enormous back-log of non-replicated research to contend with. ... (replication OR replicated OR replicate): Psychology Biological, Psychology, Psychology ... a personal investment in the outcome/line of research' or considered personal investment as harmful as 'it is also important that the research is designed and ...
Here's why it's so important. By Dakin Henderson Thursday, ... In psychology, an effort to replicate 100 peer-reviewed studies successfully reproduced the results for only 39. ... The research ...
Defining and Quantifying Replication Using P-values. In the original paper describing the Reproducibility Project: Psychology, a number of approaches to quantifying reproducibility were considered. The widely publicized 36% figure refers only to the percentage of study pairs that reported a statistically significant (P < 0.05) result in both the original and replication studies.
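The significance-based criterion described above can be sketched in a few lines of Python. This is only an illustration of the counting rule (both the original and the replication report P < 0.05), not the project's actual analysis code. The function name `both_significant_rate` and the p-values are hypothetical and invented for this example.

```python
def both_significant_rate(pairs, alpha=0.05):
    """Return the fraction of (original_p, replication_p) pairs
    in which both p-values fall below the threshold alpha."""
    if not pairs:
        raise ValueError("need at least one study pair")
    hits = sum(1 for p_orig, p_rep in pairs if p_orig < alpha and p_rep < alpha)
    return hits / len(pairs)


# Hypothetical p-values, invented for illustration only.
example_pairs = [(0.01, 0.03), (0.04, 0.20), (0.001, 0.049), (0.03, 0.50)]
print(both_significant_rate(example_pairs))  # 2 of 4 pairs qualify -> 0.5
```

Note that this rule treats a replication with P = 0.049 as a success and one with P = 0.051 as a failure, which is one reason other ways of quantifying reproducibility were also considered.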
Using the complete publication history of the 100 psychology journals with the highest 5-year impact factors, the current article provides an overview of replications in psychological research since 1900. This investigation revealed that roughly 1.6% of all psychology publications used the term replication in text.
Replication in research is important because it allows for the verification and validation of study findings, building confidence in their reliability and generalizability. It also fosters scientific progress by promoting the discovery of new evidence, expanding understanding, and challenging existing theories or claims. Updated on June 30, 2023.
Modern psychology is apparently in crisis and the prevailing view is that this partly reflects an inability to replicate past findings. If a crisis does exist, then it is some kind of 'chronic' crisis, as psychologists have been censuring themselves over replicability for decades. While the debate in psychology is not new, the lack of progress across the decades is disappointing. Recently ...
However, a successful replication does not guarantee that the original scientific results of a study were correct, nor does a single failed replication conclusively refute the original claims. A failure to replicate previous results can be due to any number of factors, including the discovery of an unknown effect, inherent variability in the system, inability to control complex variables ...
The dialogue around replication ignited in 2015 when Brian Nosek's lab reported that after replicating 100 studies from three psychology journals, researchers were unable to reproduce a large portion of findings. This report was controversial because it called into question the validity of research shared in academic journals.
It is part of the Reproducibility Project: Psychology, one of many high-profile collaborative efforts to retest important research results across a range of scientific fields. The goal: to ...
Why is replication in psychology important? Replication in psychology research is important because many variables affect the human behaviors, emotions and cognitive processes that psychologists research. Since scientific research relies on consistent data patterns to draw reasonable conclusions, it's important for researchers to collect ...
Abstract. Replication is a key activity in scientific endeavors. Yet explicit replications are rare in many fields, including education and psychology. In this article, we discuss the relevance ...
This investigation revealed that roughly 1.6% of all psychology publications used the term replication in text. A more thorough analysis of 500 randomly selected articles revealed that only 68% of articles using the term replication were actual replications, resulting in an overall replication rate of 1.07%.
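The overall rate quoted above is the product of two shares: the share of publications that mention replication and the share of those that are actual replications. A minimal sketch of that arithmetic follows, using the rounded figures from the text. The function name is hypothetical. With the rounded inputs the product comes to about 1.09%, slightly above the reported 1.07%; the difference presumably reflects unrounded values in the original analysis, which is an assumption on our part.

```python
def overall_replication_rate(share_mentioning, share_actual):
    """Multiply the share of articles mentioning replication by the
    share of those articles that are actual replications."""
    return share_mentioning * share_actual


# Rounded figures quoted in the text: 1.6% and 68%.
rate = overall_replication_rate(0.016, 0.68)
print(f"{rate:.2%}")  # about 1.09% with these rounded inputs
```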
According to common understanding, replication is repeating a study's procedure and observing whether the prior finding recurs. This definition is intuitive, easy to apply, and incorrect. We propose that replication is a study for which any outcome would be considered diagnostic evidence about a claim from prior research.
Brian Owens. A large-scale effort to replicate results in psychology research has rebuffed claims that failures to reproduce social-science findings might be down to differences in study ...
Ioannidis (2005): "Why Most Published Research Findings Are False". The replication crisis is an ongoing methodological crisis in which the results of many scientific studies are difficult or impossible to reproduce. Because the reproducibility of empirical results is an essential part of the scientific method, such failures undermine the credibility of theories building on them and ...