The wriggle room is fertile ground for psychologists to exploit the disjunction between belief and evidence that seems quite pervasive in psychology. As Francis remarked, “Contrary to its central role in other sciences, it appears that successful replication is sometimes not related to belief about an effect in experimental psychology. A high rate of successful replication is not sufficient to induce belief in an effect [8], nor is a high rate of successful replication necessary for belief [22].” The Bem [8] study documented “experimental evidence for anomalous retroactive influences on cognition and affect” or, in plain language, precognition. Using multiple tasks and nine experiments involving over 1,000 participants, Bem had, implausibly, demonstrated that the performance of participants reflected what happened after they had made their decision. For example, on a memory test, participants were more likely to remember words that they were later asked to practise, i.e. memory rehearsal seemingly worked backwards in time. In another task, participants had to select which of two curtains on a computer screen hid an erotic image, and they did so at a level significantly greater than chance, but not when the hidden images were less titillating. Furthermore, Bem and colleagues [7] later meta-analysed 90 previous studies to establish a significant effect size of 0.22.
Bem presents nine replications of a phenomenon and a large meta-analysis, yet we do not believe it, while other phenomena do not so readily replicate (e.g. bystander apathy [22]) but we do believe in them. Francis [23] bleakly concludes “The scientific method is supposed to be able to reveal truths about the world, and the reliability of empirical findings is supposed to be the final arbiter of science; but this method does not seem to work in experimental psychology as it is currently practiced.” Whether or not we believe in Bem’s precognition, social priming, or indeed any published psychological finding, researchers are operating within the methodological and statistical wriggle room. The task for psychologists is to view these phenomena like any other scientific question, i.e. as in need of explanation. If they can close down the wriggle room, then we might expect such curios and anomalies to evaporate in a cloud of nonsignificant results.
While some might view the disjunction between belief and evidence as ‘healthy skepticism’, others might describe it as resistance to evidence or even anti-science. A pertinent example comes from Lykken [37], who described a study in which people who see frogs in a Rorschach test – ‘frog responders’ – were more likely to have an eating disorder [48] – a finding interpreted as evidence of harboring oral impregnation fantasies and an unconscious belief in anal birth. Lykken asked 20 clinician colleagues to estimate the likelihood of this ‘cloacal theory of birth’ before and after seeing Sapolsky’s evidence. Beforehand, they reported a “…median value of 0.01, which can be interpreted to mean, roughly, ‘I don’t believe it’” and after being shown the confirmatory evidence “…the median unchanged at 0.01. I interpret this consensus to mean, roughly, ‘I still don’t believe it.’” (pp. 151–152). Lykken remarked that normally when a prediction is confirmed by experiment, we might expect “…a nontrivial increment in one’s confidence in that theory should result, especially when one’s prior confidence is low… [but that] this rule is wrong not only in a few exceptional instances but as it is routinely applied to the majority of experimental reports in the psychological literature” (p. 152). Often such claims give rise to a version of Sagan’s maxim that “Extraordinary claims require extraordinary evidence”. The remarkableness of a claim, however, is not necessarily relevant to either the type or the scale of evidence required. Instead of setting different criteria for the ordinary and extraordinary, we need to continue to close the wriggle room.
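Lykken’s expectation of a “nontrivial increment” in confidence can be made concrete with Bayes’ rule. The following is a minimal sketch rather than a reconstruction of anything Lykken computed: the prior of 0.01 is the median reported by his colleagues, but the two likelihood values are hypothetical numbers chosen purely for illustration.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule for a binary hypothesis given one piece of evidence."""
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1 - prior) * p_evidence_if_false
    return numerator / denominator

prior = 0.01  # median prior reported by Lykken's colleagues

# Hypothetical: the result is judged far more likely if the theory is true
# (0.8) than if it is false (0.05), so belief should rise noticeably.
updated = posterior(prior, 0.8, 0.05)

# Hypothetical: the result is judged equally likely either way (0.8 vs 0.8),
# i.e. the evidence is treated as non-diagnostic, so belief does not move.
unmoved = posterior(prior, 0.8, 0.8)

print(f"diagnostic evidence:     {prior} -> {updated:.3f}")
print(f"non-diagnostic evidence: {prior} -> {unmoved:.3f}")
```

Under these assumed numbers, an unchanged median of 0.01 corresponds to the second case: the clinicians behaved as though the confirmatory result was no more probable under the theory than without it.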
Beliefs and the failure to self-correct
“Scientists should not be in the business of simply ignoring literature that they do not like because it contests their view.” [30]
Taking this to the opposite extreme, some researchers may choose to ignore the findings of meta-analyses in favour of selected individual studies that accord more with their view. Giner-Sorolla [24] maintained that “…meta-analytic validation is not seen as necessary to proclaim an effect reliable. Textbooks, press reports, and narrative reviews often rest conclusions on single influential articles rather than insisting on a replication across independent labs and multiple contexts” (p. 564, my italics).
As Stroebe & Strack rightly point out, “Even multiple failures to replicate an established finding would not result in a rejection of the original hypothesis, if there are also multiple studies that supported that hypothesis.” [and] ‘believers’ “…will keep on believing, pointing at the successful replications and derogating the unsuccessful ones, whereas the nonbelievers will maintain their belief system drawing on the failed replications for support of their rejection of the original hypothesis.” (p. 64). Psychology rarely, if ever, proceeds with an unequivocal knock-out blow delivered by a negative finding or even a meta-analysis. Indeed, psychology often has more of the feel of trench warfare, where models and hypotheses are ultimately abandoned largely because researchers lose interest [26].
Jussim et al. [30] provide some interesting examples of precisely how social psychology does not seem to correct itself when big findings fail to replicate. If doubts are raised about an original finding then, as Jussim et al. point out, we might expect citations to reflect this debate and uncertainty, such that the original study and the unsuccessful replications would be fairly equally cited.
In a classic study, Darley & Gross [15] found that people applied a stereotype about social class when they saw a young girl taking a maths test after seeing her playing in either an affluent or a poor background. After obtaining the original materials and following the procedure carefully, Baron et al. [6] published two failed replications using more than twice as many participants. Not only did they fail to replicate the finding, but the evidence was in the opposite direction. Such findings ought to encourage debate, with relatively equal attention to the pro and con studies in the literature. Alas, no. Jussim et al. reported that “…since 1996, the original study has been cited 852 times, while the failed replications have been cited just 38 times (according to Google Scholar searches conducted on 9/11/15).”
This is not an unusual case, as Jussim et al. report several examples of failed replications not being cited, while original studies continue to be readily cited. The infamous and seminal study by Bargh and colleagues [5] showed that unconsciously priming people with an ‘elderly stereotype’ (unscrambling jumbled sentences that contained words like: old, lonely, bingo, wrinkle) makes them subsequently walk more slowly. However, Doyen et al. [16] failed to replicate the finding using more accurate measures of walking speed. Since 2013, Bargh et al. have been cited 900 times and Doyen et al. 192 times. Similarly, a meta-analysis of 88 studies by Jost et al. [29], showing that conservatism is a syndrome characterized by rigidity, dogmatism, prejudice, and fear, was not replicated by a larger, better-controlled meta-analysis conducted by Van Hiel and colleagues [57]. Since 2010, the former has been cited 1030 times while the latter has been cited a mere 60 times. Jussim et al. suggest “This pattern of ignoring correctives likely leads social psychology to overstate the extent to which evidence supports the original study’s conclusions…[] it behooves researchers to grapple with the full literature, not just the studies conducive to their preferred arguments”.
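The scale of the citation imbalance is easier to see as a ratio. The short sketch below simply divides the citation counts quoted above from Jussim et al.; the dictionary labels are shorthand introduced here, and the three counting windows (since 1996, 2013 and 2010) differ, so the ratios are not directly comparable with one another.

```python
# (citations of original, citations of failed replication), as quoted
# from Jussim et al. in the text above.
citations = {
    "Darley & Gross vs Baron et al.": (852, 38),
    "Bargh et al. vs Doyen et al.": (900, 192),
    "Jost et al. vs Van Hiel et al.": (1030, 60),
}

# Ratio of original-study citations to replication citations.
ratios = {
    pair: original / replication
    for pair, (original, replication) in citations.items()
}

for pair, ratio in ratios.items():
    print(f"{pair}: original cited {ratio:.1f} times as often")
```

If the literature reflected the debate evenly, each ratio would sit near 1; in all three cases the original study is cited several times more often than the work that failed to replicate it.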
Meta-analysis: rescue remedy or statistical alchemy?
Some view meta-analysis as the closest thing we have to a definitive approach for establishing the veracity and reliability of an effect. In the context of discussing social priming experiments, John Bargh [4] declared that “…In science the way to answer questions about replicability of effects is through statistical techniques such as meta-analysis”. Others are more skeptical: “Meta-analysis is a reasonable way to search for patterns in previously published research. It has serious limitations, however, as a method for confirming hypotheses and for establishing the replicability of experiments” (Hyman, 2010, p. 486). Meta-analysis is not a magic dust that we can sprinkle over primary literatures to elucidate necessary truths. Likewise, totemically accumulating replicated findings does not, in itself, necessarily prove anything (pace Popper). Does it matter if we replicate a finding once, twice, or 20 times? What ratio of positive to negative outcomes do we find acceptable? Answers or rules of thumb do not exist; in psychology, it often comes down to our beliefs.
This special issue of BMC Psychology contains 4 articles (Taylor & Munafo, [56]; Lakens, Hilgard & Staaks [34]; Coppens, Verkoeijen, Bouwmeester & Rikers, [13]; Coyne [14]) and in each, meta-analysis occupies a pivotal place. As shown by Taylor & Munafo (this issue), meta-analyses have proliferated, are highly cited and “…most worryingly, the perceived authority of the conclusions of a meta-analysis means that it has become possible to use a meta-analysis in the hope of having the final word in an academic debate.” As with all methods, meta-analysis has its own limitations, and retrospective validation via meta-analysis is not a substitute for prospective replication using adequately powered trials. Nonetheless, meta-analyses do have a substantive role to play in the reproducibility question.
Judging the weight of evidence is never straightforward, and whether a finding is sustained in psychology often reflects our beliefs almost as much as the evidence. Indeed, meta-analysis rightly or wrongly enables some ideas to persist despite a lack of support at the level of individual study or trial. This has certainly been argued in the use of meta-analyses to establish a case for psychic abilities, where Storm, Tressoldi & Di Risio [55] identify how “It distorts what scientists mean by confirmatory evidence. It confuses retrospective sanctification with prospective replicability.” (p. 489)
This is a kind of ‘free-lunch’ notion of meta-analysis. Feinstein [21] even stated that “meta-analysis is analogous to statistical alchemy for the 21st century…the main appeal is that it can convert existing things into something better. “Significance” can be attained statistically when small group sizes are pooled into big ones” (p. 71). Undoubtedly, the conclusions of meta-analyses may prove unreliable where small numbers of nonsignificant trials are pooled to produce significant effects [19]. Nonetheless, it is also quite feasible for a literature to contain a majority of negative outcomes and still produce a reliable, significant overall effect size (e.g. streptokinase: [35]).
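How pooling converts individually nonsignificant trials into a significant overall effect can be shown with a small inverse-variance calculation. This is a minimal sketch under stated assumptions: it uses a fixed-effect model with a normal approximation, and the five effect sizes and standard errors are hypothetical values invented for illustration, not data from any study cited here. A random-effects model, or a literature shaped by publication bias, could give a different answer.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def fixed_effect_pool(effects, std_errors):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical small trials: every effect is positive but imprecise.
effects = [0.25, 0.35, 0.30, 0.20, 0.40]
std_errors = [0.20, 0.22, 0.18, 0.21, 0.25]

# Each trial tested on its own.
trial_p = [two_sided_p(e / se) for e, se in zip(effects, std_errors)]

# All trials pooled.
pooled, pooled_se = fixed_effect_pool(effects, std_errors)
pooled_p = two_sided_p(pooled / pooled_se)

print("individual p-values:", [round(p, 3) for p in trial_p])
print("pooled effect:", round(pooled, 3), "pooled p:", round(pooled_p, 4))
```

With these assumed inputs, no single trial reaches p < 0.05, yet the pooled estimate does, because pooling shrinks the standard error while the effect estimate stays roughly the same. Whether that result counts as alchemy or as legitimate accumulation of evidence depends on whether the trials being pooled are an unbiased sample of the studies conducted.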
Two of the papers presented here (Lakens et al. this issue; Taylor & Munafo this issue) offer extremely good suggestions relating to some of these conflicts in meta-analytic findings. Lakens and colleagues offer 6 recommendations, including permitting others to “re-analyze the data to examine how sensitive the results are to subjective choices such as inclusion criteria” and enabling this by providing links to data files that permit such analysis. Currently, we also need to address data sharing in regular papers. Sampling papers published in one year in the top 50 high-impact journals, Alsheikh-Ali et al. [1] reported that a substantial proportion of papers published in high-impact journals “…are either not subject to any data availability policies, or do not adhere to the data availability instructions in their respective journals”. Such efforts for transparency are extremely welcome and indeed, echo the posting online of our interactive CBT for schizophrenia meta-analysis database (http://www.cbtinschizophrenia.com/), which has been used by others to test new hypotheses (e.g. [25]).
Taylor & Munafo (this issue) advise greater triangulation of evidence, in this particular instance by supplementing traditional meta-analysis with P-curve analysis [52]. In passing, Taylor & Munafo also mention “…adversarial collaboration, where primary study authors on both sides of a particular debate contribute to an agreed protocol and work together to interpret the results”. The version of adversarial collaboration proposed by Kahneman [31] urged scientists to engage in a “good-faith effort to conduct debates by carrying out joint research” (p. 729). More recently, he elaborated on this in the context of the furore over failed replications (Kahneman [32]). Coyne covers some aspects of this latest paper on replication etiquette and finds some of it wanting. It may, however, be possible to find some new adversarial middle ground, but this crucially depends upon psychologists being more open. Indeed, some aspects of adversarial collaboration could dovetail with Lakens et al.’s proposal regarding hosting relevant data on web platforms. In such a scenario, those holding opposing views could test their hypotheses in a public arena using a shared database.
In the context of adversarial collaboration, some uncertainty and difference of opinion exists about how we might accommodate the views of those being replicated. One possibility again requires openness: those who are replicated are asked to submit a review and, crucially, the review and the replicator’s responses are then published alongside the paper. Indeed, this happened with the paper of Coppens et al. (this issue). They replicated the ‘testing effect’ reported by Carpenter (2009) – that information which has been retrieved from memory is better recalled than that which has simply been studied. Their replications and meta-analysis partially replicate the original findings, and Carpenter was one of the reviewers whose review is available alongside the paper (along with the author responses). Indeed, from its initiation, BMC Psychology has published all reviews and responses to reviewers alongside published papers. This degree of openness is unusual in psychology journals, but it offers readers a glimpse into the process behind a replication (or any paper) and allows the person being replicated to contribute and comment on the replication, and to have their reply published in the same journal at the same time.
Ultimately, the issues that psychologists face over replication are as much about our beliefs, biases and openness as anything else. We are not dispassionate about the outcomes that we measure. Maybe because the substance of our spotlight is people, cognition and brains, we sometimes care too much about the ‘truths’ we choose to declare. They have implications. Similarly, we should not ignore the incentive structures and conflicts between the personal goals of psychologists and the goals of science. They have implications. Finally, the attitudes of psychologists to the transparency of our science need to change. They have implications.