This is a true story about how I tried to get a paper that I own retracted. The copyright of the paper in question is attributed to me. Several requests and demands for removal were issued, but the journal has not yet fully complied. Of course, this might be due to the fact that I did not write the paper, the paper is an incoherent mess of academic gibberish, and Prime Scholars is the publisher of the journal. This post tells my story. It ends with a call to action against Prime Scholars and similar outfits.
For those of you who are unfamiliar with Prime Scholars, this is what Wikipedia has to say about them:
Apparently, I can now count myself among the illustrious company of Hesse, Brontë, and Whitman. For I too, without my knowledge and against my wishes, am now a scholar who published a word soup essay in a Prime Scholars journal.
The paper that I never wrote
And now, the word vomit in question. The literary disaster is titled “Phenomena can be Characterized as General Patterns in Observations of Psychology Theories” and is attributed to a single author, Noah Van Dongen. The efficiency of the review process is astounding:
1 April 2022: submission (nice touch on April fools)
4 April 2022: editor assigned
18 April 2022: reviewed
25 April 2022: revised
2 May 2022: published
As for the content of the linguistic barf: just like the other papers in Issue 4 of Volume 8, it consists of a collection of English sentences that appear to approach syntactic correctness, though devoid of any clear meaning. The only thing it has going for it is that it is blessedly short. It is good to read that there were no conflicts of interest, but it’s too bad that they misspelled my name. In the Netherlands, our surnames can have prefixes, like “van” or “van der” (which translates to “from”), which are not capitalized. My surname is spelled “van Dongen”, not “Van Dongen”. Just pointing this out. However, considering the quality of the commentary, it is surprising they got so close to getting my personal information correct. Yes, this verbose vacuity is published as a commentary, though for the life of me I cannot figure out what it is supposed to be commenting on.
Trying to make sense of the senseless drivel, it seems the true authors took part of the introduction of a preprint (that I did write) and ran it through a thesaurus to avoid being flagged for plagiarism, creating what are known as tortured phrases. The paper they used as the mold appears to be an early version of Productive Explanation: A Framework for Evaluating Explanations in Psychological Science, which was posted on PsyArXiv on 13 April 2022.[3] For comparison, here are the first sentences of the original and the forgery, respectively:
In the wake of the replication crisis in psychological science, many psychologists have adopted practices that bolster the robustness and transparency of the scientific process, including preregistration (Chambers, 2013), data sharing (Wicherts et al., 2006), code sharing, and massive reproducibility studies (e.g., Aarts et al., 2015; Walters, 2020).
Right after the replication emergency in mental science, numerous clinicians have embraced rehearses that support the vigor and straightforwardness of the logical cycle, including preregistration, information sharing, code sharing, and enormous reproducibility studies.
I did not take the time to figure out which other sentences they used for the figurative butchery (or is ‘literal’ more appropriate here?) and what kind of procedure they used to end up with this collection of sentences. It looks like they just selected a certain part of the introduction, but I am not sure.
The process of getting the paper retracted and succeeding partially
As a true academic, I acted immediately, ten days later. My first attempt was emailing the journal to request the removal of the textual horror. This was on 8 December 2022.[4] I gave the journal ample time to respond (or I forgot about this problem due to other stuff that was going on at the moment). But, after three months without a reply, I contacted the legal team of the University of Amsterdam. They were very understanding and wanted to be of assistance.
On 11 April 2023, the UvA’s legal team emailed Acta Psychopathologica demanding the retraction of the atrocious article and threatening legal action if they did not comply. Acta Psychopathologica did not respond to this email either. On 20 April, the legal team sent a reminder, to which they also did not reply. To date, Acta Psychopathologica has not responded to any of our messages.
The UvA’s legal team had also advised me to report the identity theft to the Rijksdienst voor Identiteitsgegevens (RvIG; the Dutch government agency responsible for identity data), which I did on 11 April 2023. Conveniently, the RvIG has an online form for this. However, identity theft usually concerns the illegitimate use of your credit cards or passport. It was more than a bit awkward to describe how I had been impersonated to publish shoddy scientific work in (a website that calls itself a) journal. Nine days after I submitted the form, the RvIG contacted me. They were very understanding, but told me there was nothing they could do for me. They advised me to report the crime to the police.
Again, other responsibilities got in the way. About a month later, on 1 June 2023, I called the police. The central operator noted down the specifics of my predicament and told me that the department responsible for identity theft would contact me to make an appointment, which they did on Saturday 3 June 2023. As it turns out, reporting identity theft must be done in person at the police station. Maybe this is to make sure that it is actually you who is reporting the theft of your identity. On 13 June 2023, I went to my appointment to report the crime, which took about an hour. The officer taking my report was friendly and understanding. Actually, she was very understanding, considering the curiousness of the situation. Noteworthy about this experience is that she wrote the report from my (first-person) perspective, but in her own words. There are many new experiences I am gaining throughout this journey, and this was one of the strange ones. Reading something as if you said it, correct in terms of content but not phrased the way you would say it, is surreal to say the least. You are instantly aware of your own idiolect. Or at least, that is what I experienced.
A week later, I informed the UvA’s legal team of the police report. They promptly sent another email to Acta Psychopathologica requesting them again to remove the semantic puree, though this time adding that the crime had been reported to the police and that legal action would follow.
The UvA’s legal team also contacted the Editors-in-Chief of Acta Psychopathologica to request them to remove the language pit stain. They received two replies to this request, which can be summarized as: I did not actually do anything for this journal and I would like to be removed from the editorial board.
On 7 July 2023, the horrendous word swirl was no longer visible on the website. There are now only four papers in Volume 8, Issue 4, instead of five. However, the PDF version of the paper can still be reached, and I am still listed as an author on the Prime Scholars website.
What is next?
I’ve also come across this article in The Times Higher Education, which mentions legal actions being undertaken. I’m trying to find out if legal actions against Prime Scholars are indeed in the works. If so, I hope I can join them. If not, I want to get them started.
What you can do to help
- Check if you or people you know have papers attributed to them in a Prime Scholars journal.
- Contact me if you are also a victim of identity theft and want to join legal actions against Prime Scholars.
- Share this story and (ask people to) take action against Prime Scholars.
Generating academic articles that appear authentic has become much easier now that large language models like ChatGPT have arrived on the scene. I think it is safe to assume that we don’t want fake papers to start invading our academic corpora, devaluing our work and eroding the public’s trust in science. This does not seem to be a problem that will solve itself. We need to take active steps against these practices, and we need to act now!
One last thought about publishers
Don’t you think that ‘respectable’ academic publishers should accept some responsibility for this predicament we find ourselves in? If outfits like Prime Scholars are making a mockery of their business and start poisoning the well, should they not also undertake (legal) steps to protect their profession?
[1] Also, see this article on Retraction Watch.
[2] For the interested: a trifle is a layered dessert of English origin. In episode 9 of season 6 of Friends, Rachel tries to make this dessert for Thanksgiving and accidentally adds beef sautéed with peas and onions to the dessert. The result still sounds better than the paper in question.
[3] The astute reader has realized that this is 12 days after the paper was supposedly submitted to Acta Psychopathologica. Everybody else is now also aware thanks to this informative footnote.
[4] I know you Americans like to put the month in front of the day, but that just looks wrong.
As more people have
started to use Bayes Factors, we should not be surprised that misconceptions
about Bayes Factors have become common. A recent study shows that the
percentage of scientific articles that draw incorrect inferences based on
observed Bayes Factors is distressingly high (Wong et al., 2022), with 92% of
articles demonstrating at least one misconception of Bayes Factors. Here I will
review some of the most common misconceptions, and how to prevent them.
Confusing Bayes Factors with Posterior Odds.
One common criticism
by Bayesians of null hypothesis significance testing (NHST) is that NHST
quantifies the probability of the data (or more extreme data), given that the
null hypothesis is true, but that scientists should be interested in the
probability that the hypothesis is true, given the data. Cohen (1994) wrote:
What’s wrong with
NHST? Well, among many other things, it does not tell us what we want to know,
and we so much want to know what we want to know that, out of desperation, we
nevertheless believe that it does! What we want to know is “Given these data,
what is the probability that H0 is true?”
One might therefore
believe that Bayes factors tell us something about the probability that a
hypothesis is true, but this is incorrect. A Bayes factor quantifies how much we
should update our belief in one hypothesis. If this hypothesis was extremely
unlikely (e.g., the probability that people have telepathy) this hypothesis
might still be very unlikely, even after computing a large Bayes factor in a
single study demonstrating telepathy. If we believed the hypothesis that people
have telepathy was unlikely to be true (e.g., we thought it was 99.9% certain
telepathy was not true) evidence for telepathy might only increase our belief
in telepathy to the extent that we now believe it is 98% unlikely. The Bayes
factor only corresponds to our posterior belief if we were perfectly uncertain
about the hypothesis being true or not. If both hypotheses were equally likely,
and a Bayes factor indicates we should update our belief in such a way that the
alternative hypothesis is three times more likely than the null hypothesis,
only then would we end up believing the alternative hypothesis is exactly three
times more likely than the null hypothesis. One should therefore not conclude
that, for example, given a BF of 10, the alternative hypothesis is more likely
to be true than the null hypothesis. The correct claim is that people should
update their belief in the alternative hypothesis by a factor of 10.
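The update rule above can be made concrete in a few lines of code. This is a minimal sketch (the function name and numbers are my own, chosen to mirror the telepathy example): posterior odds are the prior odds multiplied by the Bayes factor, so a large BF10 need not make a hypothesis likely if its prior probability was very low.

```python
# A Bayes factor is a multiplier on prior odds, not a posterior probability.

def posterior_probability(prior_prob, bf10):
    """Posterior P(H1) after updating prior P(H1) by Bayes factor BF10."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * bf10
    return posterior_odds / (1 + posterior_odds)

# Telepathy example: we start out 99.9% certain telepathy is not real.
p = posterior_probability(0.001, 10)  # a single study yields BF10 = 10
print(round(p, 4))  # posterior P(telepathy) is still only about 0.01
```

Only when the prior probability is 0.5 (prior odds of 1) does the posterior odds equal the Bayes factor itself.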
Failing to interpret Bayes Factors as relative evidence.
One benefit of Bayes
factors that is often mentioned by Bayesians is that, unlike NHST, Bayes
factors can provide support for the null hypothesis, and thereby falsify
predictions. It is true that NHST can only reject the null hypothesis, although
it is important to add that in frequentist statistics equivalence tests can be
used to reject the alternative hypothesis, and therefore there is no need to
switch to Bayes factors to meaningfully interpret the results of
non-significant null hypothesis tests.
Bayes factors quantify
support for one hypothesis relative to another hypothesis. As with likelihood
ratios, it is possible that one hypothesis is supported more than another
hypothesis, while both hypotheses are actually false. It is incorrect to
interpret Bayes factors in an absolute manner, for example by stating that a
Bayes factor of 0.09 provides support for the null hypothesis. The correct
interpretation is that the Bayes factor provides relative support for H0
compared to H1. With a different alternative model, the Bayes factor would
change. As with a significant equivalence test, even a Bayes factor strongly
supporting H0 does not mean there is no effect at all – there could be a true,
but small, effect.
For example, after
Daryl Bem (2011) published 9 studies demonstrating support for pre-cognition
(conscious cognitive awareness of a future event that could not otherwise be
known) a team of Bayesian statisticians re-analyzed the studies, and concluded
“Out of the 10 critical tests, only one yields “substantial” evidence for H1,
whereas three yield “substantial” evidence in favor of H0. The results of the
remaining six tests provide evidence that is only “anecdotal”” (Wetzels et al., 2011). In a
reply, Bem, Utts, and Johnson (2011) argue that the set of studies provides
convincing evidence for the alternative hypothesis, if the Bayes factors are
computed as relative evidence between the null hypothesis and a more
realistically specified alternative hypothesis, where the effects of
pre-cognition are expected to be small. This back and forth illustrates how
Bayes factors are relative evidence, and a change in the alternative model
specification changes whether the null or the alternative hypothesis receives
relatively more support given the data.
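The dependence of the Bayes factor on the specification of the alternative model can be illustrated with a toy model (my own illustration, not the models used in the Bem debate): a point null versus a zero-centered normal prior on a mean, with known variance, for which the marginal likelihoods have closed forms. The same data can support H1 under a narrow prior and H0 under a wide prior.

```python
from math import exp, sqrt, pi

def normal_pdf(x, mean, var):
    """Density of a normal distribution with given mean and variance."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def bf10(xbar, n, sigma, tau):
    """BF10 for H1: mu ~ N(0, tau^2) versus H0: mu = 0, from a sample mean.

    Under H0 the sample mean is N(0, sigma^2/n); under H1 the prior on mu
    is integrated out, giving N(0, tau^2 + sigma^2/n).
    """
    se2 = sigma ** 2 / n
    m0 = normal_pdf(xbar, 0, se2)            # marginal likelihood under H0
    m1 = normal_pdf(xbar, 0, se2 + tau ** 2)  # marginal likelihood under H1
    return m1 / m0

# Identical data (mean 0.2, n = 100, sigma = 1), two alternative models:
print(round(bf10(0.2, 100, 1.0, 0.2), 2))  # narrow prior: BF10 > 1, supports H1
print(round(bf10(0.2, 100, 1.0, 2.0), 2))  # wide prior:   BF10 < 1, supports H0
```

Nothing about the data changed between the two calls; only the alternative model did, which is why a Bayes factor is always relative evidence.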
Not specifying the null and/or alternative model.
Given that Bayes
factors are relative evidence for or against one model compared to another
model, it might be surprising that many researchers fail to specify the alternative
model to begin with when reporting their analysis. And yet, in a systematic
review of how psychologists use Bayes factors, van de Schoot et al. (2017) found
that “31.1% of the articles did not even discuss the priors implemented”. Whereas
in a null hypothesis significance test researchers do not need to specify the
model that the test is based on, as the test is by definition a test against an
effect of 0, and the alternative model consists of any non-zero effect size (in
a two-sided test), this is not true when computing Bayes factors. The null
model when computing Bayes factors is often (but not necessarily) a point null
as in NHST, but the alternative model is only one of many possible alternative
hypotheses that a researcher could test against. It has become common to use
‘default’ priors, but as with any heuristic, defaults will most often give an
answer to a nonsensical question, and quickly become a form of mindless
statistics. When introducing Bayes factors as an alternative to frequentist
t-tests, Rouder et al. (2009) write:
This commitment to
specify judicious and reasoned alternatives places a burden on the analyst. We
have provided default settings appropriate to generic situations. Nonetheless,
these recommendations are just that and should not be used blindly. Moreover,
analysts can and should consider their goals and expectations when specifying
priors. Simply put, principled inference is a thoughtful process that cannot be
performed by rigid adherence to defaults.
The priors used when
computing a Bayes factor should therefore be both specified and justified.
Claims based on Bayes Factors do not require error control.
In a paper with the
provocative title “Optional stopping: No problem for Bayesians” Rouder (2014)
argues that “Researchers using Bayesian methods may employ optional stopping in
their own research and may provide Bayesian analysis of secondary data regardless
of the employed stopping rule.” A reader who only read the title and abstract
might conclude that Bayes factors are a wonderful solution to the error
inflation due to optional stopping in the frequentist framework, but this is
not correct (de Heide & Grünwald, 2017).
There is a big caveat
about the type of statistical inference that is unaffected by optional
stopping. Optional stopping is no problem for Bayesians if they refrain from
making a dichotomous claim about the presence or absence of an effect, or when
they refrain from drawing conclusions about a prediction being supported or
falsified. Rouder notes how “Even with optional stopping, a researcher can
interpret the posterior odds as updated beliefs about hypotheses in light of
data.” In other words, even after optional stopping, a Bayes factor tells
researchers how much they should update their belief in a hypothesis.
Importantly, when researchers make dichotomous claims based on Bayes factors
(e.g., “The effect did not differ significantly between the conditions, BF10 =
0.17”) then this claim can be correct, or an error, and error rates become a
relevant consideration, unlike when researchers simply present the Bayes factor
for readers to update their personal beliefs.
Bayesians disagree among themselves about whether Bayes factors should be the basis of dichotomous
claims, or not. Those who promote the use of Bayes factors to make claims often
refer to thresholds proposed by Jeffreys (1939), where a BF > 3 is
“substantial evidence”, and a BF > 10 is considered “strong evidence”. Some
journals, such as Nature Human Behavior, have the following requirement for
researchers who submit a Registered Report: “For inference by Bayes factors,
authors must be able to guarantee data collection until the Bayes factor is at
least 10 times in favour of the experimental hypothesis over the null
hypothesis (or vice versa).” When researchers decide to collect data until a
specific threshold is crossed to make a claim about a test, their claim can be
correct, or wrong, just as when p-values are the statistical quantity a claim
is based on. As both the Bayes factor and the p-value can be computed based on
the sample size and the t-value (Francis, 2016; Rouder et al., 2009), there is
nothing special about using Bayes factors as the basis of an ordinal claim. The
exact long-run error rates cannot be directly controlled when computing Bayes
factors, and the Type 1 and Type 2 error rates depend on the choice of the
prior and the choice of the cut-off used to decide to make a claim.
Simulation studies show that for commonly used priors and a BF > 3 cut-off
to make claims, the Type 1 error rate is somewhat smaller, but the Type 2 error
rate is considerably larger (Kelter, 2021).
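A rough Monte Carlo sketch shows what such an error rate looks like in practice. This uses a deliberately simple normal model with known variance (my own illustration, not the priors or tests Kelter examined): simulate many studies in which the null is true, and count how often a BF10 > 3 threshold would lead to an erroneous claim of evidence for H1.

```python
import random
from math import exp, sqrt, pi

def normal_pdf(x, mean, var):
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def bf10(xbar, n, sigma, tau):
    """BF10 for H1: mu ~ N(0, tau^2) versus H0: mu = 0, from a sample mean."""
    se2 = sigma ** 2 / n
    return normal_pdf(xbar, 0, se2 + tau ** 2) / normal_pdf(xbar, 0, se2)

random.seed(1)
n, sims, tau = 50, 2000, 1.0
hits = 0
for _ in range(sims):
    xbar = random.gauss(0, 1 / sqrt(n))   # sample mean when H0 is true
    if bf10(xbar, n, 1.0, tau) > 3:       # claim "evidence for H1" at BF > 3
        hits += 1
rate = hits / sims
print(rate)  # long-run rate of erroneous 'evidence for H1' claims under H0
```

The rate is not directly set by the researcher, and it shifts with the prior scale `tau`, the sample size, and the cut-off: error rates exist for Bayes-factor-based claims, they are just not controlled by design.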
To conclude this
section, whenever researchers make claims, they can make erroneous claims, and
error control should be a worthy goal. Error control is not a consideration
when researchers do not make ordinal claims (e.g., X is larger than Y, there is
a non-zero correlation between X and Y, etc). If Bayes factors are used to
quantify how much researchers should update personal beliefs in a hypothesis,
there is no need to consider error control, but researchers should also refrain
from making any ordinal claims based on Bayes factors in the results section or
the discussion section. Giving up error control also means giving up claims
about the presence or absence of effects.
Interpreting Bayes Factors as effect sizes.
Bayes factors are not
statements about the size of an effect. It is therefore not appropriate to
conclude that the effect size is small or large purely based on the Bayes
factor. Depending on the priors used when specifying the alternative and null
model, the same Bayes factor can be observed for very different effect size
estimates. The reverse is also true. The same effect size can correspond to
Bayes factors supporting the null or the alternative hypothesis, depending on
how the null model and the alternative model are specified. Researchers should
therefore always report and interpret effect size measures. Statements about the
size of effects should only be based on these effect size measures, and not on
the Bayes factor.
Any tool for
statistical inference will be misused, and the greater the adoption, the more
people will use the tool without proper training. Simplistic sales pitches for
Bayes factors (e.g., Bayes factors tell you the probability that your
hypothesis is true, Bayes factors do not require error control, you can use
‘default’ Bayes factors and do not have to think about your priors) contribute
to this misuse. When reviewing papers that report Bayes factors, check if the
authors use Bayes factors to draw correct inferences.
Bem, D. J. (2011).
Feeling the future: Experimental evidence for anomalous retroactive influences
on cognition and affect. Journal of Personality and Social Psychology, 100(3),
Bem, D. J., Utts, J.,
& Johnson, W. O. (2011). Must psychologists change the way they analyze
their data? Journal of Personality and Social Psychology, 101(4), 716–719.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.
de Heide, R., &
Grünwald, P. D. (2017). Why optional stopping is a problem for Bayesians.
arXiv:1708.08278 [Math, Stat]. https://arxiv.org/abs/1708.08278
Francis, G. (2016).
Equivalent statistics and data interpretation. Behavior Research Methods, 1–15.
Jeffreys, H. (1939).
Theory of probability (1st ed). Oxford University Press.
Kelter, R. (2021).
Analysis of type I and II error rates of Bayesian and frequentist parametric
and nonparametric two-sample hypothesis tests under preliminary assessment of
normality. Computational Statistics, 36(2), 1263–1288.
Rouder, J. N. (2014).
Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review,
Rouder, J. N.,
Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t
tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin
& Review, 16(2), 225–237. https://doi.org/10.3758/PBR.16.2.225
van de Schoot, R.,
Winter, S. D., Ryan, O., Zondervan-Zwijnenburg, M., & Depaoli, S. (2017). A
systematic review of Bayesian articles in psychology: The last 25 years.
Psychological Methods, 22(2), 217–239. https://doi.org/10.1037/met0000100
Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why
psychologists must change the way they analyze their data: The case of psi:
Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3),
Wong, T. K., Kiers,
H., & Tendeiro, J. (2022). On the Potential Mismatch Between the Function
of the Bayes Factor and Researchers’ Expectations. Collabra: Psychology, 8(1),
Together with my co-host Smriti Mehta, I’ve started a new podcast: Nullius in Verba. It’s a podcast about science – what it is, and what it could be. The introduction episode is up now, and new episodes will be released every other week starting this Friday!
You can subscribe by clicking the links below:
or by adding this RSS feed to your podcast player:
We will release episodes on Friday every other week (starting this Friday). Topics in the first episodes are inspired by aphorisms in Francis Bacon’s ‘Novum Organum’. The galleon in our logo comes from the title page of his 1620 book, where it is passing between the mythical Pillars of Hercules that stand on either side of the Strait of Gibraltar, which have been smashed through by Iberian sailors, opening a new world for exploration and marking the exit from the well-charted waters of the Mediterranean into the Atlantic Ocean. Bacon hoped that empirical investigation would similarly smash the old scientific ideas and lead to a greater understanding of nature and the world.
As we explain in the introduction, the title of the podcast comes from the motto of the Royal Society.
Our logo is set in typeface Kepler by Robert Slimbach. Our theme song is Newton’s Cradle by Grandbrothers. You see we are going all in on subtle science references.
We hope you’ll enjoy listening along as we discuss themes like confirmation bias, skepticism, eminence, the ‘itch to publish’ and in the first episode, the motivations to do science.
This blog post is based on the chapter “Critical Levels, Statistical Language, and Scientific Inference” by Irwin D. J. Bross (1971) in the proceedings of the symposium on the foundations of statistical inference held in 1970. Because the conference proceedings might be difficult to access, I am citing extensively from the original source. Irwin D. J. Bross [1921-2004] was a biostatistician at Roswell Park Cancer Institute in Buffalo up to 1983.
Irwin D. J. Bross
Criticizing the use of thresholds such as an alpha level of 0.05 to make dichotomous inferences is nothing new. Bross writes in 1971: “Of late the attacks on critical levels (and the statistical methods based on these levels) have become more frequent and more vehement.” I feel the same way, but it seems unlikely the vehemence of criticisms has been increasing for half a century. A more likely explanation is that some people, like Bross and myself, become increasingly annoyed by such criticisms.
Bross reflects on how very few justifications for the use of alpha levels exist in the literature, because “Elementary statistics texts are not equipped to go into the matter; advanced texts are too preoccupied with the latest and fanciest statistical techniques to have space for anything so elementary. Thus the justifications for critical levels that are commonly offered are flimsy, superficial, and badly outdated.” He notes how the use of critical values emerged in a time when statisticians had more practical experience, but that “Unfortunately, many of the theorists nowadays have lost touch with statistical practice and as a consequence, their work is mathematically sophisticated but scientifically very naive.”
Bross sets out to consider which justification can be given for the use of critical alpha levels. He would like such a justification to convince those who use statistical methods in their research, and statisticians who are familiar with statistical practice. He argues that the purpose of a medical researcher “is to reach scientific conclusions concerning the relative efficacy (or safety) of the drugs under test. From the investigator’s standpoint, he would like to make statements which are reliable and informative. From a somewhat broader standpoint, we can consider the larger communication network that exists in a given research area – the network which would connect a clinical pharmacologist with his colleagues and with the practicing physicians who might make use of his findings. Any realistic picture of scientific inference must take some account of the communication networks that exist in the sciences.”
It is rare to see biostatisticians explicitly embrace the pragmatic and social aspect of scientific inference. He highlights three main points about communication networks. “First, messages generate messages.” Colleagues might replicate a study, or build on it, or apply the knowledge. “A second key point is: discordant messages produce noise in the network”. “A third key point is: statistical methods are useful in controlling the noise in the network”. The critical level set by researchers controls the noise in the network. Too much noise in a network impedes scientific progress, because communication breaks down. He writes “Thus the specification of the critical levels […] has proved in practice to be an effective method for controlling the noise in communication networks.” Bross also notes that the critical alpha level in itself is not enough to reduce noise – it is just one component of a well-designed experiment that reduces noise in the network. Setting a sufficiently low alpha level is therefore one aspect that contributes to a system where people in the network can place some reliance on claims that are made because noise levels are not too high.
“This simple example serves to bring out several features of the usual critical level techniques which are often overlooked although they are important in practice. Clearly, if each investigator holds the proportion of false positive reports in his studies at under 5%, then the proportion of false positive reports from all of the studies carried out by the participating members of the network will be held at less than 5%. This property does not sound very impressive – it sounds like the sort of property one would expect any sensible statistical method to have. But it might be noted that most of the methods advocated by the theoreticians who object to critical levels lack this and other important properties which facilitate control of the noise level in the network.”
This point, common among all error statisticians, has repeatedly raised its head in response to suggestions to abandon statistical significance, or to stop interpreting p-values dichotomously. Of course, one can choose not to care about the rate at which researchers make erroneous claims, but it is important to realize the consequences of not caring about error rates. Of course, one can work towards a science where scientists no longer make claims, but generate knowledge through some other mechanism. But recent proposals to treat p-values as continuous measures of evidence (Muff et al., 2021; but see Lakens, 2022a) or to use estimation instead of hypothesis tests (Elkins et al., 2021; but see Lakens, 2022b) do not outline what such an alternative mode of knowledge generation would look like, or how researchers will be prevented from making claims about the presence or absence of effects.
Bross proposes an intriguing view on statistical inference where “statistics is used as a means of communication between the members of the network.” He views statistics not as a way to learn whether idealized probability distributions accurately reflect the empirical reality in infinitely repeated samples, but as a way to communicate assertions that are limited by the facts. He argues that specific ways of communicating only become widely used if they are effective. Here, I believe he fails to acknowledge that ineffective communication systems can also evolve, and it is possible that scientists en masse use techniques, not because they are efficient ways of communicating facts, but because they will lead to scientific publications. The idea that statistical inferences are a ‘mindless ritual’ has been proposed (Gigerenzer, 2018), and there is no doubt that many scientists simply imitate the practices they see. Furthermore, the replication crisis has shown that huge error rates in subfields of the scientific literature can exist for decades. The problems associated with these large error rates (e.g., failures to replicate findings, inaccurate effect size estimates) can sometimes only very slowly lead to a change in practice. So, arguing a practice survives because it works is risky. Whether current research practices are effective – or whether other practices would be more effective – requires empirical evidence. Randomized controlled trials seem a bridge too far to compare statistical approaches, but natural experiments by journals that abandon p-values support Bross’s argument to some extent. When the journal of Basic and Applied Psychology abandoned p-values, the consequence was that researchers claimed effects were present at a higher error rate than if claims had been limited by typical alpha level thresholds (Fricker et al., 2019).
Whether efficient or not, statistical language surrounding critical thresholds is in widespread use. Bross discusses how an alpha level of 5% or 1% is a convention. Like many linguistic conventions, the threshold of 5% is somewhat arbitrary, and reflects the influence of statisticians like Karl Pearson and Ronald Fisher. However, the threshold is not completely arbitrary. Bross asks us to imagine what would have happened had the alpha level of 0.001 been proposed, or an alpha level of 0.20. In both cases, he believes the convention would not have spread – in the first case because in many fields there are not sufficient resources to make claims at such a low error rate, and in the second case because few researchers would have found that alpha level a satisfactory quantification of ‘rare’ events. He, I think correctly, observes this means that the alpha level of 0.05 is not completely arbitrary, but that it reflects a quantification of ‘rare’ that researchers believe has sufficient practical value to be used in communication. Bross argues that the convention of a 5% alpha level spread because it sufficiently matched with what most scientists considered as an appropriate probability level to define ‘rare’ events, for example as when Fisher (1956) writes “Either an exceptionally rare chance has occurred, or the theory of random distribution is not true.”
Of course, one might follow Bross and ask “But is there any reason to single out one particular value, 5%, instead of some other value such as 3.98% or 7.13%?”. He writes: “Again, in conformity with the linguistic patterns in setting conventions it is natural to use a round number like 5%. Such round numbers serve to avoid any suggestion that the critical value has been gerrymandered or otherwise picked to prove a point in a particular study. From an abstract standpoint, it might seem more logical to allow a range of critical values rather than to choose one number but to do so would be to go contrary to the linguistic habits of fact-limited languages. Such languages tend to minimize the freedom of choice of the speaker in order to insure that a statement results from factual evidence and not from a little language-game played by the speaker.”
In a recent paper we made a similar point (Uygun Tunç et al., 2021): “The conventional use of an alpha level of 0.05 can also be explained by the requirement in methodological falsificationism that statistical decision procedures are specified before the data is analyzed (see Popper, 2002, sections 19-20; Lakatos, 1978, p. 23-28). If researchers are allowed to set the alpha level after looking at the data there is a possibility that confirmation bias (or more intentional falsification-deflecting strategies) influences the choice of an alpha level. An additional reason for a conventional alpha level of 0.05 is that before the rise of the internet it was difficult to transparently communicate the pre-specified alpha level for any individual test to peers. The use of a default alpha level therefore effectively functioned as a pre-specification. For a convention to work as a universal pre-specification, it must be accepted by nearly everyone, and be extremely resistant to change. If more than a single conventional alpha level exists, this introduces the risk that confirmation bias influences the choice of an alpha level.” Our thoughts seem to be very much aligned with those of Bross.
Bross continues and writes “Anyone familiar with certain areas of the scientific literature will be well aware of the need for curtailing language-games. Thus if there were no 5% level firmly established, then some persons would stretch the level to 6% or 7% to prove their point. Soon others would be stretching to 10% and 15% and the jargon would become meaningless. Whereas nowadays a phrase such as statistically significant difference provides some assurance that the results are not merely a manifestation of sampling variation, the phrase would mean very little if everyone played language-games. To be sure, there are always a few folks who fiddle with significance levels – who will switch from two-tailed to one-tailed tests or from one significance test to another in an effort to get positive results. However such gamesmanship is severely frowned upon and is rarely practiced by persons who are native speakers of fact-limited scientific languages – it is the mark of an amateur.”
We struggled with the idea that changing the alpha level (and especially increasing the alpha level) might confuse readers when ‘statistically significant’ no longer means ‘rejected with a 5% alpha level’ in our recent paper on justifying alpha levels (Maier & Lakens, 2022). We wrote “Finally, the use of a high alpha level might be missed if readers skim an article. We believe this can be avoided by having each scientific claim accompanied by the alpha level under which it was made. Scientists should be required to report their alpha levels prominently, usually in the abstract of an article alongside a summary of the main claim.” It might in general be an improvement if people write ‘we reject an effect size of X at an alpha level of 5%’, but this is especially true if researchers choose to deviate from the conventional 5% alpha level.
I like how Bross has an extremely pragmatic but still principled view on statistical inferences. He writes: “This means that we have to abandon the traditional prescriptive attitude and adopt the descriptive approach which is characteristic of the empirical sciences. If we do so, then we get a very different picture of what statistical and scientific inference is all about. It is very difficult, I believe, to get such a picture unless you have had some first hand experience as an independent investigator in a scientific study. You then learn that drawing conclusions from statistical data can be a traumatic experience. A statistical consultant who takes a detached view of things just does not feel this pain.” This point is made by other applied statisticians, and it is a really important one. There are consequences of statistical inferences that you only experience when you spend several years trying to answer a substantive research question. Without that experience, it is difficult to give practical recommendations about what researchers should want when using statistics.
He continues “What you want – and want desperately – is all the protection you can get against the “slings and arrows of outrageous fortune”. You want to say something informative and useful about the origins and nature of a disease or health hazard. But you do not want your statement to come back and haunt you for the rest of your life.” Of course, Bross should have said ‘One thing you might want’ (because now his statement is just another example of The Statistician’s Fallacy (Lakens, 2021)). But with this small amendment, I think there are quite a few scientists who want this from their statistical inferences. He writes “When you announce a new finding, you put your scientific reputation on the line. Your colleagues probably cannot remember all your achievements, but they will never forget any of your mistakes! Second thoughts like these produce an acute sense of insecurity.” Not all scientists might feel like this, I hope we are willing to forget some mistakes people make, and I fear the consequences of making too many incorrect claims for the reputation of a researcher are not as severe as Bross suggests[i]. But I think many fellow researchers will experience some fear that their findings will not hold up (at least until they have been replicated several times), and that hearing researchers failed to replicate a finding yields some negative affect.
I think Bross hits the nail on the head when it comes to thinking about justifications of the use of alpha levels as thresholds to make claims. The justification for this practice is social and pragmatic in nature, not statistical (cf. Uygun Tunç et al., 2021). If we want to evaluate whether current practices are useful or not, we have to abandon a prescriptive approach, and rely on a descriptive approach (Bross, p. 511). Anyone proposing an alternative to the use of alpha levels should not make prescriptive arguments, but provide descriptive data (or at least predictions) that highlight how their preferred approach to statistical inferences will improve communication between scientists.
Bross, I. D. (1971). Critical levels, statistical language and scientific inference. In Foundations of statistical inference (pp. 500–513). Holt, Rinehart and Winston.
Elkins, M. R., Pinto, R. Z., Verhagen, A., Grygorowicz, M., Söderlund, A., Guemann, M., Gómez-Conesa, A., Blanton, S., Brismée, J.-M., Ardern, C., Agarwal, S., Jette, A., Karstens, S., Harms, M., Verheyden, G., & Sheikh, U. (2021). Statistical inference through estimation: Recommendations from the International Society of Physiotherapy Journal Editors. Journal of Physiotherapy. https://doi.org/10.1016/j.jphys.2021.12.001
Fisher, R. A. (1956). Statistical methods and scientific inference (Vol. viii). Hafner Publishing Co.
Fricker, R. D., Burke, K., Han, X., & Woodall, W. H. (2019). Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban. The American Statistician, 73(sup1), 374–384. https://doi.org/10.1080/00031305.2018.1537892
Gigerenzer, G. (2018). Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science, 1(2), 198–218. https://doi.org/10.1177/2515245918771329
Lakens, D. (2021). The practical alternative to the p value is the correctly used p value. Perspectives on Psychological Science, 16(3), 639–648. https://doi.org/10.1177/1745691620958012
Lakens, D. (2022a). Why P values are not measures of evidence. Trends in Ecology & Evolution. https://doi.org/10.1016/j.tree.2021.12.006
Lakens, D. (2022b). Correspondence: Reward, but do not yet require, interval hypothesis tests. Journal of Physiotherapy, 68(3), 213–214. https://doi.org/10.1016/j.jphys.2022.06.004
Maier, M., & Lakens, D. (2022). Justify Your Alpha: A Primer on Two Practical Approaches. Advances in Methods and Practices in Psychological Science, 5(2), 25152459221080396. https://doi.org/10.1177/25152459221080396
Muff, S., Nilsen, E. B., O’Hara, R. B., & Nater, C. R. (2021). Rewriting results sections in the language of evidence. Trends in Ecology & Evolution. https://doi.org/10.1016/j.tree.2021.10.009
Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2021). The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests. PsyArXiv. https://doi.org/10.31234/osf.io/af9by
[i] I recently saw the bio of a social psychologist who has produced a depressingly large number of incorrect claims in the literature. His bio made no mention of this fact, but proudly boasted about the thousands of times he was cited, even though most citations were for work that did not survive the replication crisis. How much reputations should suffer is an intriguing question that I think too many scientists will never feel comfortable addressing.
In 1955 Tukey gave a dinner talk about the difference between decisions and conclusions at a meeting of the Section of Physical and Engineering Science of the American Statistical Association. The talk was published in 1960. The distinction relates directly to different goals researchers might have when they collect data. This blog is largely a summary of his paper.
Tukey was concerned about the tendency of decision theory to attempt the conquest of all of statistics. In hindsight, he needn’t have worried. In the social sciences, most statistics textbooks do not even discuss decision theory. His goal was to distinguish decisions from conclusions, and to carve out a space for ‘conclusion theory’ to complement decision theory.
In practice, making a decision means to ‘decide to act for the present as if’. Possible actions are defined, possible states of nature identified, and we make an inference about each state of nature. Decisions can be made even when we remain extremely uncertain about any ‘truth’. Indeed, in extreme cases we can even make decisions without access to any data. We might even decide to act as if two mutually exclusive states of nature are true! For example, we might buy a train ticket for a holiday three months from now, but also take out life insurance in case we die tomorrow.
Conclusions differ from decisions. First, conclusions are established without taking consequences into consideration. Second, conclusions are used to build up a ‘fairly well-established body of knowledge’. As Tukey writes: “A conclusion is a statement which is to be accepted as applicable to the conditions of an experiment or observation unless and until unusually strong evidence to the contrary arises.” A conclusion is not a decision on how to act in the present. Conclusions are to be accepted, and thereby incorporated into what Frick (1996) calls a ‘corpus of findings’. According to Tukey, conclusions are used to narrow down the number of working hypotheses still considered consistent with observations. Conclusions should be reached, not based on their consequences, but because of their lasting (but not everlasting, as conclusions can now and then be overturned by new evidence) contribution to scientific knowledge.
Tests of hypotheses
According to Tukey, a test of hypotheses can have two functions. The first function is as a decision procedure, and the second function is to reach a conclusion. In a decision procedure the goal is to choose a course of action given an acceptable risk. This risk can be high. For example, a researcher might decide not to pursue a research idea after a first study, designed to have 80% power for a smallest effect size of interest, yields a non-significant result. The error rate is at most 20%, but the researcher might have enough good research ideas to not care.
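To make this decision scenario concrete, here is a minimal sketch (my own illustrative numbers, not Tukey’s) of what such a design entails: how many participants per group are needed for 80% power at a smallest effect size of interest of d = 0.5, using a normal approximation to the two-sample t-test.

```python
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sided two-sample test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for two-sided alpha
    z_beta = norm.ppf(power)            # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

n = n_per_group(0.5)
print(n)  # 63 per group (the exact t-based answer is slightly larger)
```

With such a design, a non-significant result misses a true effect of d = 0.5 at most 20% of the time, which is the risk the researcher in the example accepts when deciding to drop the idea.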
The second function is to reach a conclusion. This is done, according to Tukey, by controlling the Type 1 and Type 2 error rate at ‘suitably low levels’ (Note: Tukey’s discussion of concluding that an effect is absent is hindered somewhat by the fact that equivalence tests were not yet widely established in 1955 – Hodges & Lehmann’s paper appeared in 1954). Low error rates, such as the conventional 5% or 1% alpha levels, are needed to draw conclusions that can enter the corpus of findings (even though some of these conclusions will turn out to be wrong, in the long run).
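Since Tukey could not yet rely on equivalence tests, it may help to sketch how concluding the absence of a meaningful effect works today. Below is a minimal two one-sided tests (TOST) sketch with hypothetical numbers, using a z-approximation.

```python
from scipy.stats import norm

def tost_z(estimate, se, delta, alpha=0.05):
    """Two one-sided z-tests against equivalence bounds -delta and +delta."""
    p_lower = 1 - norm.cdf((estimate + delta) / se)  # H0: mu <= -delta
    p_upper = norm.cdf((estimate - delta) / se)      # H0: mu >= +delta
    return max(p_lower, p_upper) < alpha, p_lower, p_upper

# Hypothetical study: estimated effect 0.05, SE 0.10, SESOI delta = 0.3
equivalent, p_lo, p_hi = tost_z(0.05, 0.10, 0.3)
print(equivalent)  # True: both one-sided tests reject, effect is negligible
```

Only if both one-sided tests reject can we conclude, with a controlled error rate, that any true effect is smaller than the smallest effect size of interest.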
Why would we need conclusions?
One might reasonably wonder if we need conclusions in science. Tukey also ponders this question in Appendix 2. He writes “Science, in the broadest sense, is both one of the most successful of human affairs, and one of the most decentralized. In principle, each of us puts his evidence (his observations, experimental or not, and their discussion) before all the others, and in due course an adequate consensus of opinion develops.” His argument for conclusions is not epistemological or statistical, but sociological. Tukey writes: “There are four types of difficulty, then, ranging from communication through assessment to mathematical treatment, each of which by itself will be sufficient, for a long time, to prevent the replacement, in science, of the system of conclusions by a system based more closely on today’s decision theory.” He notes how scientists can no longer get together in a single room (as was somewhat possible in the early decades of the Royal Society of London) to reach consensus about decisions. Therefore, they need to communicate conclusions, as “In order to replace conclusions as the basic means of communication, it would be necessary to rearrange and replan the entire fabric of science.”
I hadn’t read Tukey’s paper when we wrote our preprint “The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests”. In this preprint, we also discuss a sociological reason for the presence of dichotomous claims in science. We also ask: “Would it be possible to organize science in a way that relies less on tests of competing theories to arrive at intersubjectively established facts about phenomena?” and similarly conclude: “Such alternative approaches seem feasible if stakeholders agree on the research questions that need to be investigated, and methods to be utilized, and coordinate their research efforts”. We should add a citation to Tukey’s 1960 paper.
Is the goal of a study a conclusion, a decision, or both?
Tukey writes he “looks forward to the day when the history and status of tests of hypotheses will have been disentangled.” I think that in 2022 that day has not yet come. At the same time, Tukey admits in Appendix 1 that the two are sometimes intertwined.
A situation Tukey does not discuss, but that I think is especially difficult to disentangle, is a cumulative line of research. Although I would prefer to only build on an established corpus of findings, this is simply not possible. Not all conclusions in the current literature are reached with low error rates. This is true both for claims about the absence of an effect (which are rarely based on an equivalence test against a smallest effect size of interest with a low error rate) and for claims about the presence of an effect – not just because of p-hacking, but also because I might want to build on an exploratory finding from a previous study. In such cases, I would like to be able to conclude that the effects I build on are established findings, but more often than not, I have to decide these effects are worth building on. The same holds for choices about the design of a set of studies in a research line. I might decide to include a factor in a subsequent study, or drop it. These decisions would be based on conclusions with low error rates if I had the resources to collect large samples and perform replication studies, but at other times they are decisions about how to act in my next study that carry quite considerable risk.
We allow researchers to publish feasibility studies, pilot studies, and exploratory studies. We don’t require every study to be a Registered Report or Phase 3 trial. Not all information in the literature that we build on has been established with the rigor Tukey associates with conclusions. And the replication crisis has taught us that more conclusions from the past are later rejected than we might have expected based on the alpha levels reported in the original articles. And in some research areas, where data is scarce, we might need to accept that, if we want to learn anything, the conclusions will always be more tentative (and the error rates accepted in individual studies will be higher) than in research areas where data is abundant.
Even if decisions and conclusions cannot be completely disentangled, reflecting on their relative differences is very useful, as I think it can help us to clarify the goal we have when we collect data.
For a 2013 blog post by Justin Esarey, who found the distinction a bit less useful than I found it, see https://polmeth.org/blog/scientific-conclusions-versus-scientific-decisions-or-we%E2%80%99re-having-tukey-thanksgiving
Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1(4), 379–390. https://doi.org/10.1037/1082-989X.1.4.379
Tukey, J. W. (1960). Conclusions vs decisions. Technometrics, 2(4), 423–433.
Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2021). The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests. PsyArXiv. https://doi.org/10.31234/osf.io/af9by
Recently a new category of studies has started to appear in the psychological literature that provides the strongest support to date for a replication crisis in psychology: large scale collaborative replication studies where the authors of the original study are directly involved in the study. These replication studies have often provided conclusive demonstrations of the absence of any effect large enough to matter. Despite considerable attention for these extremely interesting projects, I don’t think the scientific community has fully appreciated what we have learned from these studies.
Three examples of Collaborative Author Involved Replication Studies
Vohs and colleagues (2021) performed a multi-lab replication study of the ego-depletion effect, which (deservedly) has become a poster child of non-replicable effects in psychology. The teams used different combinations of protocols, allowing the (unsuccessful) prediction to generalize across minor variations in how the experiment was operationalized. Across these conditions, a non-significant effect was observed of d = 0.06, 95%CI[-0.02;0.14]. Although the authors regrettably did not specify a smallest effect size of interest in their frequentist analyses, they mention “we pitted a point-null hypothesis, which states that the effect is absent, against an informed one-sided alternative hypothesis centered on a depletion effect (δ) of 0.30 with a standard deviation of 0.15” in their Bayesian analyses. Based on the confidence interval, we can reject effects of d = 0.3, and even d = 0.2, suggesting that we have extremely informative data concerning the absence of an effect most ego-depletion researchers would consider large enough to matter.
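As a sanity check, the claim that d = 0.2 and d = 0.3 can be rejected follows from the reported summary statistics alone. A small sketch (my own reconstruction from the published estimate and CI, not the authors’ analysis):

```python
from scipy.stats import norm

# Reported summary: d = 0.06, 95% CI [-0.02, 0.14]
d, ci_low, ci_high = 0.06, -0.02, 0.14
se = (ci_high - ci_low) / (2 * norm.ppf(0.975))  # recover the standard error

for sesoi in (0.3, 0.2):
    # One-sided test of H0: true effect >= sesoi
    p = norm.cdf((d - sesoi) / se)
    print(f"reject d = {sesoi}: p = {p:.1g}")  # both far below 0.05
```

Both candidate smallest effect sizes of interest can be rejected with very small p-values, which is what makes the data so informative about the absence of a meaningful effect.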
Morey et al. (2021) performed a multi-lab replication study of the Action-Sentence Compatibility Effect (Glenberg & Kaschak, 2002). I cited the original paper in my PhD thesis, and it was an important finding that I built on, so I was happy to join this project. As written in the replication study, the replication team, together with the original authors, “established and pre-registered ranges of effects on RT that we would deem (a) uninteresting and inconsistent with the ACE theory: less than 50 ms.” An effect between 50 ms and 100 ms was seen as inconsistent with the previous literature, but in line with predictions of the ACE effect. The replication study consisted (after exclusions) of 903 native English speakers and 375 non-native English speakers. The original study had used 44, 70, and 72 participants across 3 studies. The conclusion in the replication study was that the median ACE interactions were close to 0 and all within the range that we pre-specified as negligible and inconsistent with the existing ACE literature. There was little heterogeneity.
Last week, Many Labs 4 was published (Klein et al., 2022). This study was designed to examine the mortality salience effect (which I think deserves the same poster child status of a non-replicable effect in psychology, but which seems to have gotten less attention so far). Data from 1550 participants was collected across 17 labs, some of which performed the study with involvement of the original author, and some of which did not. Several variations of the analyses were preregistered, but none revealed the predicted effect, Hedges’ g = 0.07, 95% CI = [-0.03, 0.17] (for exclusion set 1). The authors did not provide a formal sample size justification based on a smallest effect size of interest, but in a sensitivity power analysis indicate they had 95% power for effect sizes of d = 0.18 to d = 0.21. If we assume all authors found effect sizes around d = 0.2 small enough to no longer support their predictions, we can see based on the confidence intervals that we can indeed exclude effect sizes large enough to matter. The mortality salience effect, even with involvement of the original authors, seems to be too small to matter. There was little heterogeneity in effect sizes (in part because of the absence of an effect).
These are just three examples (there are more, of which the multi-lab test of the facial feedback hypothesis by Coles et al., 2022, is worth highlighting), but they illustrate some interesting properties of collaborative author involved replication studies. Below, I discuss four strengths of these studies.
Four strengths of Collaborative Author Involved Replication Studies
1) The original authors are extensively involved in the design of the study. They sign off on the final design, and agree that the study is, with the knowledge they currently have, the best test of their prediction. This means the studies tell us something about the predictive validity of state of the art knowledge in a specific field. If the predictions these researchers make are not corroborated, the knowledge we have accumulated in these research areas is not reliable enough to make successful predictions.
2) The studies are not always direct replications, but the best possible test of the hypothesis, in the eyes of the researchers involved. Criticism of past replication studies has been that directly replicating a study performed many years ago is not always insightful, as the context has changed (even though Many Labs 5 found no support for this criticism). In this new category of collaborative author involved replication studies, the original authors are free to design the best possible test of their prediction. If these tests fail, we cannot attribute the failure to replicate to a ‘protective belt’ of auxiliary hypotheses that no longer hold. Of course, it is possible that the theory can be adjusted in a constructive manner after this unsuccessful prediction. But at this moment, these original authors do not have a sufficiently solid understanding of their research topic to predict whether an effect will be observed.
3) The other researchers involved in these projects often have extensive expertise in the content area. They are not just researchers interested in mechanistically performing a replication study on a topic they have little expertise in. Instead, many of the researchers are peers who have worked in a specific research area and published on the topic of the replication study, but who have collectively developed some doubts about the reliability of past claims, and have decided to spend some of their time replicating a previous finding.
4) The statistical analyses in these studies yield informative conclusions. The studies typically do not conclude the prediction was unsuccessful based on p > 0.05 in a small sample. In the most informative studies, original authors have explicitly specified a smallest effect size of interest, which makes it possible to perform an equivalence test, and statistically reject the presence of any effect deemed large enough to matter. In other cases, Bayesian hypothesis tests are performed which provide support for the null model, compared to the alternative model. This makes these replication studies severe tests of the predicted effect. In cases where original authors did not specify a smallest effect size of interest, the very large sample sizes allow readers to examine which effects can be rejected based on the observed confidence interval, and in all the studies discussed here, we can reject the presence of effects large enough to be considered meaningful. There is most likely not a PhD student in the world who would be willing to examine these effects, given the effect sizes that remain possible after these collaborative author involved replication studies. We can never conclude an effect is exactly zero, but that hardly matters – the effects are clearly too small to study.
The Steel Man for Replication Crisis Deniers
Given the reward structures in science, it is extremely rewarding for individual researchers to speak out against the status quo. Currently, the status quo is that the scientific community has accepted there is a replication crisis. Some people attempt to criticize this belief. This is important. All established beliefs in science should be open to criticism.
Most papers that aim to challenge the fact that many scientific domains have a surprising difficulty successfully replicating findings once believed reliable focus on the 100 studies in the Reproducibility Project: Psychology, which was started a decade ago and published in 2015. This project was incredibly successful in creating awareness of concerns around replicability, but it was not incredibly informative about how big the problem was.
In the conclusion of the RP:P, the authors wrote: “After this intensive effort to reproduce a sample of published psychological findings, how many of the effects have we established are true? Zero. And how many of the effects have we established are false? Zero. Is this a limitation of the project design? No. It is the reality of doing science, even if it is not appreciated in daily practice. Humans desire certainty, and science infrequently provides it. As much as we might wish it to be otherwise, a single study almost never provides definitive resolution for or against an effect and its explanation.” The RP:P was an important project, but it is no longer the project to criticize if you want to provide evidence against the presence of a replication crisis.
Since the start of the RP:P, other projects have aimed to complement our insights about replicability. Registered Replication Reports focused on single studies, replicated in much larger sample sizes, to reduce the probability of a Type 2 error. These studies often quite conclusively showed original studies did not replicate, and a surprisingly large number yielded findings not statistically different from 0, despite sample sizes much larger than psychologists would be able to collect in normal research lines. Many Labs studies focused on a smaller set of studies, replicated many times, sometimes with minor variations to examine the role of possible moderators proposed to explain failures to replicate, which were typically absent.
The collaborative author involved replications are the latest addition to this expanding literature that consistently shows great difficulties in replicating findings. I believe they currently make up the steel man for researchers motivated to cast doubt on the presence of a replication crisis. The fact that these large projects with direct involvement of the original authors cannot find support for the predicted effects is the strongest evidence to date that we have a problem replicating findings. Of course, these studies are complemented by Registered Replication Reports and Many Labs studies, and together they make up the Steel Man to argue against if you are a Replication Crisis Denier.
Coles, N. A., March, D. S., Marmolejo-Ramos, F., Larsen, J., Arinze, N. C., Ndukaihe, I., Willis, M., Francesco, F., Reggev, N., Mokady, A., Forscher, P. S., Hunter, J., Gwenaël, K., Yuvruk, E., Kapucu, A., Nagy, T., Hajdu, N., Tejada, J., Freitag, R., … Marozzi, M. (2022). A Multi-Lab Test of the Facial Feedback Hypothesis by The Many Smiles Collaboration. PsyArXiv. https://doi.org/10.31234/osf.io/cvpuw
Klein, R. A., Cook, C. L., Ebersole, C. R., Vitiello, C., Nosek, B. A., Hilgard, J., Ahn, P. H., Brady, A. J., Chartier, C. R., Christopherson, C. D., Clay, S., Collisson, B., Crawford, J. T., Cromar, R., Gardiner, G., Gosnell, C. L., Grahe, J., Hall, C., Howard, I., … Ratliff, K. A. (2022). Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement. Collabra: Psychology, 8(1), 35271. https://doi.org/10.1525/collabra.35271
Morey, R. D., Kaschak, M. P., Díez-Álamo, A. M., Glenberg, A. M., Zwaan, R. A., Lakens, D., Ibáñez, A., García, A., Gianelli, C., Jones, J. L., Madden, J., Alifano, F., Bergen, B., Bloxsom, N. G., Bub, D. N., Cai, Z. G., Chartier, C. R., Chatterjee, A., Conwell, E., … Ziv-Crispel, N. (2021). A pre-registered, multi-lab non-replication of the action-sentence compatibility effect (ACE). Psychonomic Bulletin & Review. https://doi.org/10.3758/s13423-021-01927-8
Vohs, K. D., Schmeichel, B. J., Lohmann, S., Gronau, Q. F., Finley, A. J., Ainsworth, S. E., Alquist, J. L., Baker, M. D., Brizi, A., Bunyi, A., Butschek, G. J., Campbell, C., Capaldi, J., Cau, C., Chambers, H., Chatzisarantis, N. L. D., Christensen, W. J., Clay, S. L., Curtis, J., … Albarracín, D. (2021). A Multisite Preregistered Paradigmatic Test of the Ego-Depletion Effect. Psychological Science, 32(10), 1566–1581. https://doi.org/10.1177/0956797621989733
In a recent paper Muff, Nilsen, O’Hara, and Nater (2021) propose to implement the recommendation “to regard P-values as what they are, namely, continuous measures of statistical evidence”. This is a surprising recommendation, given that p-values are not valid measures of evidence (Royall, 1997). The authors follow Bland (2015), who suggests that “It is preferable to think of the significance test probability as an index of the strength of evidence against the null hypothesis” and proposed verbal labels for p-values in specific ranges (i.e., p-values above 0.1 are ‘little to no evidence’, p-values between 0.1 and 0.05 are ‘weak evidence’, etc.). P-values are continuous, but the idea that they are continuous measures of ‘evidence’ has been criticized (e.g., Goodman & Royall, 1988). If the null hypothesis is true, p-values are uniformly distributed. This means it is just as likely to observe a p-value of 0.001 as it is to observe a p-value of 0.999. The interpretation of p = 0.001 as ‘strong evidence’ therefore cannot be defended just because the probability of observing this p-value is very small. After all, if the null hypothesis is true, the probability of observing p = 0.999 is exactly as small.
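The uniformity claim is easy to verify with a short simulation, sketched here for two-sided z-tests under a true null hypothesis:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2022)
z = rng.standard_normal(100_000)        # test statistics when H0 is true
p = 2 * (1 - norm.cdf(np.abs(z)))       # two-sided p-values

# Under H0 p-values are uniform: p <= 0.01 is exactly as likely as p >= 0.99
print(np.mean(p <= 0.01), np.mean(p >= 0.99))  # both close to 0.01
```

Very small p-values occur just as often as very large ones when the null hypothesis is true, which is why the rarity of a small p-value alone cannot make it ‘strong evidence’.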
The reason that small p-values can be used to guide us in the direction of true effects is not because they are rarely observed when the null-hypothesis is true, but because they are relatively less likely to be observed when the null hypothesis is true, than when the alternative hypothesis is true. For this reason, statisticians have argued that the concept of evidence is necessarily ‘relative’. We can quantify evidence in favor of one hypothesis over another hypothesis, based on the likelihood of observing data when the null hypothesis is true, compared to this probability when an alternative hypothesis is true. As Royall (1997, p. 8) explains: “The law of likelihood applies to pairs of hypotheses, telling when a given set of observations is evidence for one versus the other: hypothesis A is better supported than B if A implies a greater probability for the observations than B does. This law represents a concept of evidence that is essentially relative, one that does not apply to a single hypothesis, taken alone.” As Goodman and Royall (1988, p. 1569) write, “The p-value is not adequate for inference because the measurement of evidence requires at least three components: the observations, and two competing explanations for how they were produced.”
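As a minimal illustration of the law of likelihood (all numbers below are invented for illustration), we can compare how likely an observed sample mean is under two simple hypotheses, assuming a normal model with a known standard error:

```python
from scipy.stats import norm

# Illustrative numbers (not from the post): an observed mean of 0.4
# with a standard error of 0.2, comparing H0: mu = 0 vs H1: mu = 0.5.
observed_mean = 0.4
se = 0.2

# Likelihood of the observation under each simple hypothesis
lik_h0 = norm.pdf(observed_mean, loc=0.0, scale=se)
lik_h1 = norm.pdf(observed_mean, loc=0.5, scale=se)

# Law of likelihood: H1 is better supported if the ratio exceeds 1
likelihood_ratio = lik_h1 / lik_h0

print(f"LR(H1 vs H0) = {likelihood_ratio:.2f}")  # about 6.5
```

Note that the result is inherently relative: the same observation could be evidence against H1 when compared to a third hypothesis, which is exactly why evidence cannot be attached to a single hypothesis taken alone.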
The problem with interpreting p-values as evidence in the absence of a clearly defined alternative hypothesis is that, at best, they serve as proxies for evidence, not as a measure in which a specific p-value can be related to a specific strength of evidence. In some situations, such as when the null hypothesis is true, p-values are unrelated to evidence. When researchers examine a mix of hypotheses where the alternative hypothesis is sometimes true, p-values will be correlated with measures of evidence. However, this correlation can be quite weak (Krueger, 2001), and in general it is too weak for p-values to function as a valid measure of evidence in which p-values in a specific range can directly be associated with ‘strong’ or ‘weak’ evidence.
Why single p-values cannot be interpreted as the strength of evidence
The evidential value of a single p-value depends on the statistical power of the test (i.e., on the sample size in combination with the effect size of the alternative hypothesis). The statistical power expresses the probability of observing a p-value smaller than the alpha level if the alternative hypothesis is true. When the null hypothesis is true, statistical power is formally undefined, but in practice in a two-sided test α% of the observed p-values will fall below the alpha level, as p-values are uniformly distributed under the null hypothesis. The horizontal grey line in Figure 1 illustrates the expected p-value distribution for a two-sided independent t-test if the null hypothesis is true (or when the observed effect size Cohen’s d is 0). As every p-value is equally likely, p-values cannot quantify the strength of evidence against the null hypothesis.
Figure 1: P-value distributions for a statistical power of 0% (grey line), 50% (black curve) and 99% (dotted black curve).
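The uniformity of p-values under the null hypothesis (the grey line in Figure 1) is easy to check with a quick simulation; the sample size, number of simulations, and seed below are arbitrary choices:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_sims, n_per_group = 10_000, 30

# Simulate two-sided independent t-tests when the null is true
# (both groups drawn from the same standard normal distribution)
a = rng.standard_normal((n_sims, n_per_group))
b = rng.standard_normal((n_sims, n_per_group))
p = ttest_ind(a, b, axis=1).pvalue

# Under the null, p-values are uniform: every interval of width w
# contains roughly a proportion w of the observed p-values
print(np.mean(p < 0.05))                 # close to 0.05
print(np.mean((p > 0.90) & (p < 0.95)))  # also close to 0.05
```

The interval just below 0.05 and the interval from 0.90 to 0.95 contain about the same proportion of p-values, which is the sense in which a small p-value is not rare under the null in any absolute way.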
If the alternative hypothesis is true, the strength of evidence that corresponds to a p-value depends on the statistical power of the test. If power is 50%, we should expect that 50% of the observed p-values fall below the alpha level. The remaining p-values fall above the alpha level. The black curve in Figure 1 illustrates the p-value distribution for a test with a statistical power of 50% for an alpha level of 5%. A p-value of 0.168 is more likely when there is a true effect that is examined in a statistical test with 50% power than when the null hypothesis is true (as illustrated by the black curve being above the grey line at p = 0.168). In other words, a p-value of 0.168 is evidence for an alternative hypothesis examined with 50% power, compared to the null hypothesis.
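This claim can be checked numerically. Below is a sketch using a large-sample normal approximation (treating the t-test as a z-test, a simplifying assumption; for 50% power the noncentrality of the test statistic roughly equals the critical value):

```python
from scipy.stats import norm

def p_density_under_alt(p, delta):
    """Density of a two-sided z-test p-value when the true standardized
    effect (noncentrality) is delta; under the null (delta = 0) this
    density is exactly 1 everywhere (the uniform distribution)."""
    z = norm.isf(p / 2)  # test statistic corresponding to p
    return (norm.pdf(z - delta) + norm.pdf(z + delta)) / (2 * norm.pdf(z))

# 50% power at alpha = .05 corresponds to a noncentrality of roughly
# the two-sided critical value itself (about 1.96)
delta_50 = norm.isf(0.025)

# The density at p = 0.168 exceeds 1, so this p-value is (mildly)
# more likely under the alternative than under the null
print(f"{p_density_under_alt(0.168, delta_50):.2f}")  # about 1.10
```

The ratio is only slightly above 1, which already hints that a p-value of 0.168 is at best very weak relative evidence for this alternative.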
If an effect is examined in a test with 99% power (the dotted line in Figure 1) we would draw a different conclusion. With such high power p-values larger than the alpha level of 5% are rare (they occur only 1% of the time) and a p-value of 0.168 is much more likely to be observed when the null-hypothesis is true than when a hypothesis is examined with 99% power. Thus, a p-value of 0.168 is evidence against an alternative hypothesis examined with 99% power, compared to the null hypothesis.
Figure 1 illustrates that with 99% power even a ‘statistically significant’ p-value of 0.04 is evidence for the null hypothesis. The reason is that a p-value of 0.04 is more likely to be observed when the null hypothesis is true than when a hypothesis is tested with 99% power (i.e., the grey horizontal line at p = 0.04 is above the dotted black curve). This fact, which is often counterintuitive when first encountered, is known as the Lindley paradox, or the Jeffreys-Lindley paradox (for a discussion, see Spanos, 2013).
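This can be verified with a short calculation. The sketch below uses a large-sample normal approximation (treating the t-test as a z-test, a simplifying assumption), comparing the density of p = 0.04 under a 99%-power alternative to its uniform density of 1 under the null:

```python
from scipy.stats import norm

# Noncentrality for 99% power at alpha = .05 (normal approximation):
# the critical value plus the quantile corresponding to the power
delta = norm.isf(0.025) + norm.isf(0.01)

# Density of a two-sided p = 0.04 under this alternative, relative
# to the uniform density of 1 under the null hypothesis
z = norm.isf(0.04 / 2)
density = (norm.pdf(z - delta) + norm.pdf(z + delta)) / (2 * norm.pdf(z))

print(f"{density:.2f}")  # about 0.34: p = 0.04 favors the *null* here
```

A ratio of roughly one third means that, against this well-powered alternative, observing p = 0.04 is about three times more likely when the null hypothesis is true.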
Figure 1 illustrates that different p-values can correspond to the same relative evidence in favor of a specific alternative hypothesis, and that the same p-value can correspond to different levels of relative evidence. This is obviously undesirable if we want to use p-values as a measure of the strength of evidence. Therefore, it is incorrect to verbally label any p-value as providing ‘weak’, ‘moderate’, or ‘strong’ evidence against the null hypothesis, as depending on the alternative hypothesis a researcher is interested in, the level of evidence will differ (and the p-value could even correspond to evidence in favor of the null hypothesis).
All p-values smaller than 1 correspond to evidence for some non-zero effect
If the alternative hypothesis is not specified, any p-value smaller than 1 should be treated as at least some evidence (however small) for some alternative hypotheses. It is therefore not correct to follow the recommendation of the authors in their Table 2 to interpret p-values above 0.1 (e.g., a p-value of 0.168) as “no evidence” for a relationship. This also goes against the argument by Muff and colleagues that “the notion of (accumulated) evidence is the main concept behind meta-analyses”. Combining three studies with a p-value of 0.168 in a meta-analysis is enough to reject the null hypothesis based on p < 0.05 (see the forest plot in Figure 2). It thus seems ill-advised to follow their recommendation to describe a single study with p = 0.168 as ‘no evidence’ for a relationship.
Figure 2: Forest plot for a meta-analysis of three identical studies yielding p = 0.168.
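The meta-analytic point can be checked with a quick calculation. Below is a sketch using Stouffer's method (assuming all three studies found effects in the same direction), which approximates the fixed-effect result shown in the forest plot:

```python
import numpy as np
from scipy.stats import norm

# Three studies, each with a two-sided p = 0.168 in the same direction
p_values = np.array([0.168, 0.168, 0.168])

# Convert each two-sided p-value to a directional z-score
z = norm.isf(p_values / 2)

# Stouffer's method: combine with equal weights
z_combined = z.sum() / np.sqrt(len(p_values))
p_combined = 2 * norm.sf(z_combined)

print(f"combined p = {p_combined:.3f}")  # about 0.017, below 0.05
```

Three studies that each provide ‘no evidence’ under the proposed labels jointly reject the null hypothesis, which is hard to reconcile with the idea that evidence accumulates across studies.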
However, replacing the label of ‘no evidence’ with the label ‘at least some evidence for some hypotheses’ leads to practical problems when communicating the results of statistical tests. It seems generally undesirable to allow researchers to interpret any p-value smaller than 1 as ‘at least some evidence’ against the null hypothesis. This is the price one pays for not specifying an alternative hypothesis, and trying to interpret p-values from a null hypothesis significance test in an evidential manner. If we do not specify the alternative hypothesis, it becomes impossible to conclude there is evidence for the null hypothesis, and we cannot statistically falsify any hypothesis (Lakens, Scheel, et al., 2018). Some would argue that if you cannot falsify hypotheses, you have a bit of a problem (Popper, 1959).
Interpreting p-values as p-values
Instead of interpreting p-values as measures of the strength of evidence, we could consider a radical alternative: interpret p-values as p-values. This would, perhaps surprisingly, solve the main problem that Muff and colleagues aim to address, namely ‘black-or-white null-hypothesis significance testing with an arbitrary P-value cutoff’. The idea to interpret p-values as measures of evidence is most strongly tied to a Fisherian interpretation of p-values. An alternative frequentist statistical philosophy was developed by Neyman and Pearson (1933a), who proposed to use p-values to guide decisions about the null and alternative hypothesis by, in the long run, controlling the Type I and Type II error rate. Researchers specify an alpha level and design a study with sufficiently high statistical power, and reject (or fail to reject) the null hypothesis.
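In this approach the design stage does the heavy lifting: fix the alpha level and the desired power in advance, then determine the sample size. A minimal sketch using the normal approximation (the effect size d = 0.5, alpha, and power values are arbitrary examples; the exact t-based answer is one or two observations larger):

```python
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample
    test of a standardized effect d (normal approximation; exact
    t-based calculations give a slightly larger answer)."""
    z_alpha = norm.isf(alpha / 2)
    z_beta = norm.isf(1 - power)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# E.g., detecting d = 0.5 with 80% power at alpha = .05
print(n_per_group(0.5))  # 63 per group under this approximation
```

With the design fixed, the long-run Type I error rate is at most alpha and the Type II error rate at most 1 minus the power, regardless of the exact p-value observed in any single study.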
Neyman and Pearson never proposed to use hypothesis tests as binary yes/no test outcomes. First, Neyman and Pearson (1933b) leave open whether the states of the world are divided in two (‘accept’ and ‘reject’) or three regions, and write that a “region of doubt may be obtained by a further subdivision of the region of acceptance”. A useful way to move beyond a yes/no dichotomy in frequentist statistics is to test range predictions instead of limiting oneself to a null hypothesis significance test (Lakens, 2021). This implements the idea of Neyman and Pearson to introduce a region of doubt, and distinguishes inconclusive results (where neither the null hypothesis nor the alternative hypothesis can be rejected, and more data needs to be collected to draw a conclusion) from conclusive results (where either the null hypothesis or the alternative hypothesis can be rejected).
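Such a range prediction can be implemented with two one-sided tests (TOST) for equivalence alongside the regular test; the equivalence bounds and data below are invented for illustration:

```python
import numpy as np
from scipy import stats

def tost_and_nhst(a, b, low, high):
    """Return (p_nhst, p_tost) for the mean difference a - b, where
    low/high are the smallest raw differences deemed meaningful."""
    n1, n2 = len(a), len(b)
    diff = a.mean() - b.mean()
    # Pooled standard error (equal-variance sketch)
    sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    p_nhst = 2 * stats.t.sf(abs(diff / se), df)    # H0: diff = 0
    p_lower = stats.t.sf((diff - low) / se, df)    # H0: diff <= low
    p_upper = stats.t.cdf((diff - high) / se, df)  # H0: diff >= high
    return p_nhst, max(p_lower, p_upper)           # TOST: both must reject

# Invented data with a tiny true difference, and bounds of +/- 0.5
a = np.linspace(-1.0, 1.0, 101)
b = np.linspace(-0.95, 1.05, 101)
p_nhst, p_tost = tost_and_nhst(a, b, -0.5, 0.5)
# Not 'significant', yet meaningful effects CAN be rejected here
print(p_nhst > 0.05, p_tost < 0.05)
```

The three possible outcomes map directly onto the regions above: reject the null (NHST significant), reject all meaningful effects (TOST significant), or remain in the region of doubt (neither significant) and collect more data.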
In a Neyman-Pearson approach to hypothesis testing the act of rejecting a hypothesis comes with a maximum long run probability of doing so in error. As Hacking (1965) writes: “Rejection is not refutation. Plenty of rejections must be only tentative.” So when we reject the null model, we do so tentatively, aware of the fact we might have done so in error, and without necessarily believing the null model is false. For Neyman (1957, p. 13) inferential behavior is an: “act of will to behave in the future (perhaps until new experiments are performed) in a particular manner, conforming with the outcome of the experiment”. All knowledge in science is provisional.
Furthermore, it is important to remember that hypothesis tests reject a statistical hypothesis, but not a theoretical hypothesis. As Neyman (1960, p. 290) writes: “the frequency of correct conclusions regarding the statistical hypothesis tested may be in perfect agreement with the predictions of the power function, but not the frequency of correct conclusions regarding the primary hypothesis”. In other words, whether or not we can reject a statistical hypothesis in a specific experiment does not necessarily inform us about the truth of the theory. Decisions about the truthfulness of a theory requires a careful evaluation of the auxiliary hypotheses upon which the experimental procedure is built (Uygun Tunç & Tunç, 2021).
Neyman (1976) provides some reporting examples that reflect his philosophy on statistical inferences: “after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that “high” scores on “potential” and on “education” are indicative of better chances of success in the drive to home ownership”. An example of a shorter statement that Neyman provides reads: “As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population.”
A complete verbal description of the result of a Neyman-Pearson hypothesis test acknowledges two sources of uncertainty. First, the assumptions of the statistical test must be met (e.g., the data are normally distributed), or any deviations should be small enough to not have any substantial effect on the frequentist error rates. Second, conclusions are made “Without hoping to know whether each separate hypothesis is true or false” (Neyman & Pearson, 1933a). Any single conclusion can be wrong, and assuming the test assumptions are met, we make claims under a known maximum error rate (which is never zero). Future replication studies are needed to provide further insights about whether the current conclusion was erroneous or not.
After observing a p-value smaller than the alpha level, one can therefore conclude: “Until new data emerges that proves us wrong, we decide to act as if there is an effect, while acknowledging that the methodological procedure we base this decision on has a maximum error rate of alpha% (assuming the statistical assumptions are met), which we find acceptably low.” One can follow such a statement about the observed data with a theoretical inference, such as “assuming our auxiliary hypotheses hold, the result of this statistical test corroborates our theoretical hypothesis”. If a conclusive test result in an equivalence test is observed that allows a researcher to reject the presence of any effect large enough to be meaningful, the conclusion would be that the test result does not corroborate the theoretical hypothesis.
It is true that the common application of null hypothesis significance testing in science is based on an arbitrary threshold of 0.05 (Lakens, Adolfi, et al., 2018). There are surprisingly few attempts to provide researchers with practical approaches to determine an alpha level on more substantive grounds (but see Field et al., 2004; Kim & Choi, 2021; Maier & Lakens, 2021; Miller & Ulrich, 2019; Mudge et al., 2012). The issue seems difficult to resolve in practice, both because at least some scientists adopt a philosophy of science in which the goal of hypothesis tests is to establish a corpus of scientific claims (Frick, 1996), and because any continuous measure will be broken up by a threshold below which researchers are not expected to make a claim about a finding (e.g., a BF < 3, see Kass & Raftery, 1995, or a likelihood ratio lower than k = 8, see Royall, 2000). Although it is true that an alpha level of 0.05 is arbitrary, there are some pragmatic arguments in its favor (e.g., it is established, and it might be low enough to yield claims that are taken seriously, but high enough that other researchers can still attempt to refute the claim, see Uygun Tunç et al., 2021).
Is there really no agreement on best practices in sight?
One major impetus for the flawed proposal to interpret p-values as evidence by Muff and colleagues is that “no agreement on a way forward is in sight”. The statement that there is little agreement among statisticians is an oversimplification. I will go out on a limb and state some things I assume most statisticians agree on. First, there are multiple statistical tools one can use, and each tool has their own strengths and weaknesses. Second, there are different statistical philosophies, each with their own coherent logic, and researchers are free to analyze data from the perspective of one or multiple of these philosophies. Third, one should not misuse statistical tools, or apply them to attempt to answer questions the tool was not designed to answer.
It is true that there is variation in the preferences individuals have about which statistical tools should be used, and the statistical philosophies researchers should adopt. This should not be surprising. Individual researchers differ in which research questions they find interesting within a specific content domain, and similarly, they differ in which statistical questions they find interesting when analyzing data. Individual researchers differ in which approaches to science they adopt (e.g., a qualitative or a quantitative approach), and similarly, they differ in which approach to statistical inferences they adopt (e.g., a frequentist or Bayesian approach). Luckily, there is no reason to limit oneself to a single tool or philosophy, and if anything, the recommendation is to use multiple approaches to statistical inferences. It is not always interesting to ask what the p-value is when analyzing data, and it is often interesting to ask what the effect size is. Researchers can believe it is important for reliable knowledge generation to control error rates when making scientific claims, while at the same time believing that it is important to quantify relative evidence using likelihoods or Bayes factors (for example by presenting a Bayes factor alongside every p-value for a statistical test, Lakens et al., 2020).
Whatever approach to statistical inferences researchers choose to use, the approach should answer a meaningful statistical question (Hand, 1994), the approach to statistical inferences should be logically coherent, and the approach should be applied correctly. Despite the common statement in the literature that p-values can be interpreted as measures of evidence, the criticism against the coherence of this approach should make us pause. Given that coherent alternatives exist, such as likelihoods (Royall, 1997) or Bayes factors (Kass & Raftery, 1995), researchers should not follow the recommendation by Muff and colleagues to report p = 0.08 as ‘weak evidence’, p = 0.03 as ‘moderate evidence’, and p = 0.168 as ‘no evidence’.
Bland, M. (2015). An introduction to medical statistics (Fourth edition). Oxford University Press.
Field, S. A., Tyre, A. J., Jonzén, N., Rhodes, J. R., & Possingham, H. P. (2004). Minimizing the cost of environmental management decisions by optimizing statistical thresholds. Ecology Letters, 7(8), 669–675. https://doi.org/10.1111/j.1461-0248.2004.00625.x
Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1(4), 379–390. https://doi.org/10.1037/1082-989X.1.4.379
Goodman, S. N., & Royall, R. (1988). Evidence and scientific research. American Journal of Public Health, 78(12), 1568–1574.
Hand, D. J. (1994). Deconstructing Statistical Questions. Journal of the Royal Statistical Society. Series A (Statistics in Society), 157(3), 317–356. https://doi.org/10.2307/2983526
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.1080/01621459.1995.10476572
Kim, J. H., & Choi, I. (2021). Choosing the Level of Significance: A Decision-theoretic Approach. Abacus, 57(1), 27–71. https://doi.org/10.1111/abac.12172
Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56(1), 16–26. https://doi.org/10.1037//0003-066X.56.1.16
Lakens, D. (2021). The practical alternative to the p value is the correctly used p value. Perspectives on Psychological Science, 16(3), 639–648. https://doi.org/10.1177/1745691620958012
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., Baguley, T., Becker, R. B., Benning, S. D., Bradford, D. E., Buchanan, E. M., Caldwell, A. R., Calster, B., Carlsson, R., Chen, S.-C., Chung, B., Colling, L. J., Collins, G. S., Crook, Z., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x
Lakens, D., McLatchie, N., Isager, P. M., Scheel, A. M., & Dienes, Z. (2020). Improving Inferences About Null Effects With Bayes Factors and Equivalence Tests. The Journals of Gerontology: Series B, 75(1), 45–57. https://doi.org/10.1093/geronb/gby065
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963
Maier, M., & Lakens, D. (2021). Justify Your Alpha: A Primer on Two Practical Approaches. PsyArXiv. https://doi.org/10.31234/osf.io/ts4r6
Miller, J., & Ulrich, R. (2019). The quest for an optimal alpha. PLOS ONE, 14(1), e0208631. https://doi.org/10.1371/journal.pone.0208631
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734
Neyman, J. (1957). “Inductive Behavior” as a Basic Concept of Philosophy of Science. Revue de l’Institut International de Statistique / Review of the International Statistical Institute, 25(1/3), 7–22. https://doi.org/10.2307/1401671
Neyman, J. (1960). First course in probability and statistics. Holt, Rinehart and Winston.
Neyman, J. (1976). Tests of statistical hypotheses and their use in studies of natural phenomena. Communications in Statistics – Theory and Methods, 5(8), 737–751. https://doi.org/10.1080/03610927608827392
Neyman, J., & Pearson, E. S. (1933a). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 231(694–706), 289–337. https://doi.org/10.1098/rsta.1933.0009
Neyman, J., & Pearson, E. S. (1933b). The testing of statistical hypotheses in relation to probabilities a priori. Mathematical Proceedings of the Cambridge Philosophical Society, 29(04), 492–510. https://doi.org/10.1017/S030500410001152X
Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Chapman and Hall/CRC.
Royall, R. (2000). On the probability of observing misleading statistical evidence. Journal of the American Statistical Association, 95(451), 760–768.
Spanos, A. (2013). Who should be afraid of the Jeffreys-Lindley paradox? Philosophy of Science, 80(1), 73–93.
Uygun Tunç, D., & Tunç, M. N. (2021). A Falsificationist Treatment of Auxiliary Hypotheses in Social and Behavioral Sciences: Systematic Replications Framework. In Meta-Psychology. https://doi.org/10.31234/osf.io/pdm7y
Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2021). The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests. PsyArXiv. https://doi.org/10.31234/osf.io/af9by
During a recent workshop on Sample Size Justification an early career researcher asked me: “You recommend sequential analysis in your paper for when effect sizes are uncertain, where researchers collect data, analyze the data, stop when a test is significant, or continue data collection when a test is not significant, and, I don’t want to be rude, but isn’t this p-hacking?”
In linguistics there is a term for when children apply a rule they have learned to instances where it does not apply: overregularization. They learn ‘one cow, two cows’, and use the +s rule for plural where it is not appropriate, such as ‘one mouse, two mouses’ (instead of ‘two mice’). The early career researcher who asked me if sequential analysis was a form of p-hacking was also overregularizing. We teach young researchers that flexibly analyzing data inflates error rates, is called p-hacking, and is a very bad thing that was one of the causes of the replication crisis. So, they apply the rule ‘flexibility in the data analysis is a bad thing’ to cases where it does not apply, such as in the case of sequential analyses. Yes, sequential analyses give a lot of flexibility to stop data collection, but they do so while carefully controlling error rates, with the added bonus that they can increase the efficiency of data collection. This makes them a good thing, not p-hacking.
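The difference between naive optional stopping and a sequential design with corrected thresholds can be shown in a small simulation; the sketch below uses a one-sample z-test with one interim look, and the standard Pocock-corrected alpha of 0.0294 for two looks:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_sims, n_max, look = 20_000, 100, 50

# Simulate one-sample z-tests (known sd = 1) under the null, with an
# interim look at n = 50 and a final look at n = 100
x = rng.standard_normal((n_sims, n_max))
z1 = x[:, :look].sum(axis=1) / np.sqrt(look)
z2 = x.sum(axis=1) / np.sqrt(n_max)
p1, p2 = 2 * norm.sf(np.abs(z1)), 2 * norm.sf(np.abs(z2))

# Naive optional stopping: claim an effect if p < .05 at either look
naive = np.mean((p1 < 0.05) | (p2 < 0.05))
# Pocock-corrected sequential test: require p < .0294 at each look
pocock = np.mean((p1 < 0.0294) | (p2 < 0.0294))

print(f"naive: {naive:.3f}, Pocock: {pocock:.3f}")  # ~0.08 vs ~0.05
```

Stopping early is not the problem; stopping early at an uncorrected threshold is. The corrected procedure keeps the overall Type I error rate at the nominal 5% while still allowing data collection to end at the interim look.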
Children increasingly use correct language the longer they are immersed in it. Many researchers are not yet immersed in an academic environment where they see flexibility in the data analysis applied correctly. Many are scared to do things wrong, which risks becoming overly conservative, as the pendulum from ‘we are all p-hacking without realizing the consequences’ swings back too far to ‘all flexibility is p-hacking’. Therefore, I patiently explain during workshops that flexibility is not bad per se, but that making claims without controlling your error rate is problematic.
In a recent podcast episode of ‘Quantitude’ one of the hosts shared a similar experience 5 minutes into the episode. A young student remarked that flexibility during the data analysis was ‘unethical’. The remainder of the podcast episode on ‘researcher degrees of freedom’ discussed how flexibility is part of data analysis. They clearly state that p-hacking is problematic, and opportunistic motivations to perform analyses that give you what you want to find should be constrained. But they then criticized preregistration in ways many people on Twitter disagreed with. They talk about ‘high priests’ who want to ‘stop bad people from doing bad things’ which they find uncomfortable, and say ‘you can not preregister every contingency’. They remark they would be surprised if data could be analyzed without requiring any on the fly judgment.
Although the examples they gave were not very good1 it is of course true that researchers sometimes need to deviate from an analysis plan. Deviating from an analysis plan is not p-hacking. But when people talk about preregistration, we often see overregularization: “Preregistration requires specifying your analysis plan to prevent inflation of the Type 1 error rate, so deviating from a preregistration is not allowed.” The whole point of preregistration is to transparently allow other researchers to evaluate the severity of a test, both when you stick to the preregistered statistical analysis plan and when you deviate from it. Some researchers have sufficient experience with the research they do that they can preregister an analysis that does not require any deviations2, and then readers can see that the Type 1 error rate for the study is at the level specified before data collection. Other researchers will need to deviate from their analysis plan because they encounter unexpected data. Some deviations reduce the severity of the test by inflating the Type 1 error rate. But other deviations actually get you closer to the truth. We cannot know which is which. A reader needs to form their own judgment about this.
A final example of overregularization comes from a person who discussed a new study that they were preregistering with a junior colleague. They mentioned the possibility of including a covariate in an analysis but thought that was too exploratory to be included in the preregistration. The junior colleague remarked: “But now that we have thought about the analysis, we need to preregister it”. Again, we see an example of overregularization. If you want to control the Type 1 error rate in a test, preregister it, and follow the preregistered statistical analysis plan. But researchers can, and should, explore data to generate hypotheses about things that are going on in their data. You can preregister these, but you do not have to. Not exploring data could even be seen as research waste, as you are missing out on the opportunity to generate hypotheses that are informed by data. A case can be made that researchers should regularly include variables to explore (e.g., measures that are of general interest to peers in their field), as long as these do not interfere with the primary hypothesis test (and as long as these explorations are presented as such).
In the book “Reporting quantitative research in psychology: How to meet APA Style Journal Article Reporting Standards” by Cooper and colleagues from 2020, a very useful distinction is made between primary hypotheses, secondary hypotheses, and exploratory hypotheses. The primary hypotheses consist of the main tests you are designing the study for. The secondary hypotheses are also of interest when you design the study – but you might not have sufficient power to detect them. You did not design the study to test these hypotheses, and because the power for these tests might be low, you did not control the Type 2 error rate for secondary hypotheses. You can preregister secondary hypotheses to control the Type 1 error rate, as you know you will perform them, and if there are multiple secondary hypotheses, as Cooper et al (2020) remark, readers will expect “adjusted levels of statistical significance, or conservative post hoc means tests, when you conducted your secondary analysis”.
If you think of the possibility to analyze a covariate, but decide this is an exploratory analysis, you can decide to control neither the Type 1 error rate nor the Type 2 error rate. These are analyses, but not tests of a hypothesis, as any findings from these analyses have an unknown Type 1 error rate. Of course, that does not mean these analyses cannot be correct in what they reveal – we just have no way to know the long run probability that exploratory conclusions are wrong. Future tests of the hypotheses generated in exploratory analyses are needed. But as long as you follow Journal Article Reporting Standards and distinguish exploratory analyses, readers know what they are getting. Exploring is not p-hacking.
People in psychology are re-learning the basic rules of hypothesis testing in the wake of the replication crisis. But because they are not yet immersed in good research practices, the lack of experience means they are overregularizing simplistic rules to situations where they do not apply. Not all flexibility is p-hacking, preregistered studies do not prevent you from deviating from your analysis plan, and you do not need to preregister every possible test that you think of. A good cure for overregularization is reasoning from basic principles. Do not follow simple rules (or what you see in published articles) but make decisions based on an understanding of how to achieve your inferential goal. If the goal is to make claims with controlled error rates, prevent Type 1 error inflation, for example by correcting the alpha level where needed. If your goal is to explore data, feel free to do so, but know these explorations should be reported as such. When you design a study, follow the Journal Article Reporting Standards and distinguish tests with different inferential goals.
1 E.g., they discuss having to choose between Student’s t-test and Welch’s t-test, depending on whether Levene’s test indicates the assumption of homogeneity is violated, which is not best practice – just follow R, and use Welch’s t-test by default.
2 But this is rare – only 2 out of 27 preregistered studies in Psychological Science made no deviations. https://royalsocietypublishing.org/doi/full/10.1098/rsos.211037 We can probably do a bit better if we only preregister predictions when we really understand our manipulations and measures.
Many of the facts in this blog post come from the biography ‘Neyman’ by Constance Reid. I highly recommend reading this book if you find this blog interesting.
In recent years researchers have become increasingly interested in the relationship between eugenics and statistics, especially focusing on the lives of Francis Galton, Karl Pearson, and Ronald Fisher. Some have gone as far as to argue for a causal relationship between eugenics and frequentist statistics. For example, in a recent book ‘Bernoulli’s Fallacy’, Aubrey Clayton speculates that Fisher’s decision to reject prior probabilities and embrace a frequentist approach was “also at least partly political”. Rejecting prior probabilities, Clayton argues, makes science seem more ‘objective’, which would have helped Ronald Fisher and his predecessors to establish eugenics as a scientific discipline, despite the often-racist conclusions eugenicists reached in their work.
When I was asked to review an early version of Clayton’s book for Columbia University Press, I found the main narrative rather unconvincing, and the presented history of frequentist statistics too one-sided and biased. Authors who link statistics to problematic political views often do not mention equally important figures in the history of frequentist statistics who were in all ways the opposite of Ronald Fisher. In this blog post, I want to briefly discuss the work and life of Jerzy Neyman, for two reasons.
Jerzy Neyman (image from https://statistics.berkeley.edu/people/jerzy-neyman)
First, the focus on Fisher’s role in the history of frequentist statistics is surprising, given that the dominant approach to frequentist statistics used in many scientific disciplines is the Neyman-Pearson approach. If you have ever rejected a null hypothesis because a p-value was smaller than an alpha level, or if you have performed a power analysis, you have used the Neyman-Pearson approach to frequentist statistics, and not the Fisherian approach. Neyman and Fisher disagreed vehemently about their statistical philosophies (in 1961 Neyman published an article titled ‘Silver Jubilee of My Dispute with Fisher’), but it was Neyman’s philosophy that won out and became the default approach to hypothesis testing in most fields[i]. Anyone discussing the history of frequentist hypothesis testing should therefore seriously engage with the work of Jerzy Neyman and Egon Pearson. Their work was not in line with the views of Karl Pearson, Egon’s father, nor the views of Fisher. Indeed, it was a great source of satisfaction to Neyman that their seminal 1933 paper was presented to the Royal Society by Karl Pearson, who was hostile and skeptical of the work, and (as Neyman thought) reviewed by Fisher[ii], who strongly disagreed with their philosophy of statistics.
Second, Jerzy Neyman was also the opposite of Fisher in his political viewpoints. Instead of promoting eugenics, Neyman worked to improve the position of the less privileged throughout his life, teaching disadvantaged people in Poland and creating educational opportunities for Americans at UC Berkeley. He hired David Blackwell, who became the first Black tenured faculty member at UC Berkeley. This is important, because it falsifies the idea put forward by Clayton[iii] that frequentist statistics became the dominant approach in science because the most important scientists who worked on it wanted to pretend their dubious viewpoints were based on ‘objective’ scientific methods.
I think it is useful to broaden the discussion of the history of statistics beyond the work of Fisher and Karl Pearson, and credit the work of others[iv] who contributed in at least as important ways to the statistics we use today. I am continually surprised by how few people working outside of statistics even know the name of Jerzy Neyman, even though they regularly use his insights when testing hypotheses. In this blog, I will try to describe his work and life to add some balance to the history of statistics that most people seem to learn about. And more importantly, I hope Jerzy Neyman can be a positive role model for young frequentist statisticians, who might so far have only been educated about the life of Ronald Fisher.
Neyman’s personal life
Neyman was born in 1894 in Russia, but raised in Poland. After attending the gymnasium, he studied at the University of Kharkov. He initially tried to become an experimental physicist, but he was too clumsy with his hands, and switched to conceptual mathematics, completing his undergraduate degree in 1917 in politically tumultuous times. In 1919 he met his wife, and they married in 1920. Ten days later, because of the war between Russia and Poland, Neyman was imprisoned for a short time, and in 1921 he fled to a small village to avoid being arrested again, where he obtained food by teaching the children of farmers. He worked for the Agricultural Institute, and then at the University in Warsaw. He obtained his doctoral degree in 1924 at age 30. In September 1925 he was sent to London for a year to learn about the latest developments in statistics from Karl Pearson himself. It was there that he met Egon Pearson, Karl’s son, and a friendship and scientific collaboration started.
Neyman always spent a lot of time teaching, often at the expense of doing scientific work. He was involved in equal opportunity education in 1918 in Poland, teaching in dimly lit classrooms where the rag he used to wipe the blackboard would sometimes freeze. He always had a weak spot for intellectuals from ‘disadvantaged’ backgrounds. He and his wife were themselves very poor until he moved to UC Berkeley in 1938. In 1929, back in Poland, his wife became ill due to their bad living conditions, and the doctor who came to examine her was so struck by their miserable living conditions that he offered to let the couple stay in his house, for the same rent they were paying, while he visited France for 6 months. In his letters to Egon Pearson from this time, Neyman often complained that the struggle for existence took all his time and energy, and that he could not do any scientific work.
Even much later in his life, in 1978, he kept in mind that many people have very little money, and he called ahead to restaurants to make sure a dinner before a seminar would not cost too much for the students. It is perhaps no surprise that most of his students (and he had many) talk about Neyman with a lot of appreciation. He wasn’t perfect (for example, Erich Lehmann – one of Neyman’s students – remarked that he was no longer allowed to teach a class after his lecture notes, building on but extending the work by Neyman, became extremely popular – suggesting Neyman was no stranger to envy). But his students were extremely positive about the atmosphere he created in his lab. For example, job applicants were told around 1947 that “there is no discrimination on the basis of age, sex, or race … authors of joint papers are always listed alphabetically.”
Neyman himself often suffered discrimination, sometimes because of his difficulty mastering the English language, sometimes for being Polish (when a piece of clothing, an ermine wrap, was stolen from their room in Paris, the police responded “What can you expect – only Poles live there!”), sometimes because he did not believe in God, and sometimes because his wife was Russian and very emancipated (living independently in Paris as an artist). He was fiercely against discrimination. In 1933, as anti-Semitism was on the rise among students at the university where he worked in Poland, he complained in a letter to Egon Pearson that the students were behaving towards Jews as Americans did towards people of color. In 1941 at UC Berkeley he hired women at a time when it was not easy for a woman to get a job in mathematics.
In 1942, Neyman examined the possibility of hiring David Blackwell, a Black statistician, then still a student. Neyman met him in New York (so that Blackwell would not need to travel to Berkeley at his own expense) and considered Blackwell the best candidate for the job. The wife of a mathematics professor (who was born in the south of the US) learned about the possibility that a Black statistician might be hired and warned that she would not invite a Black man to her house, and there was enough concern about the effect the hire would have on the department that Neyman could not make an offer to Blackwell. He was able to get Blackwell to Berkeley in 1953 as a visiting professor, and offered him a tenured job in 1954, making David Blackwell the first Black tenured faculty member at the University of California, Berkeley. And Neyman did this even though Blackwell was a Bayesian[v] ;).
In 1963, Neyman travelled to the south of the US and for the first time directly experienced segregation. Back in Berkeley, he wrote a letter requesting contributions for the Southern Christian Leadership Conference (founded by Martin Luther King, Jr. and others); 4000 copies were printed and shared with colleagues at the university and friends around the country, which brought in more than $3000. He wrote a letter to his friend Harald Cramér saying that he believed Martin Luther King, Jr. deserved a Nobel Peace Prize (which Cramér forwarded to the chairman of the Nobel Committee, and which he believed might have contributed at least a tiny bit to the fact that Martin Luther King, Jr. was awarded the Nobel Prize a year later). Neyman also worked towards the establishment of a Special Scholarships Committee at UC Berkeley with the goal of providing educational opportunities to disadvantaged Americans.
Neyman was not a pacifist. In the Second World War he actively looked for ways to contribute to the war effort. He was involved in statistical models that computed the optimal spacing of bombs dropped by planes to clear a path across a beach of land mines. (When at a certain moment he needed specifics about the beach, a representative of the military who was not allowed to directly provide this information asked whether Neyman had ever been to the seashore in France; Neyman replied that he had been to Normandy, and the representative answered “Then use that beach!”.) But Neyman opposed the Vietnam war early and actively, despite the risk of losing the lucrative contracts the Statistical Laboratory had with the Department of Defense. In 1964 he joined a group of people who bought advertisements in local newspapers with a picture of a napalmed Vietnamese child and the quote “The American people will bluntly and plainly call it murder”.
A positive role model
It is important to know the history of a scientific discipline. Histories are complex, and we should resist overly simplistic narratives. If your teacher explains frequentist statistics to you, it is good if they highlight that someone like Fisher had questionable ideas about eugenics. But the early developments in frequentist statistics involved many researchers beyond Fisher, and, luckily, there are many more positive role-models that also deserve to be mentioned – such as Jerzy Neyman. Even though Neyman’s philosophy on statistical inferences forms the basis of how many scientists nowadays test hypotheses, his contributions and personal life are still often not discussed in histories of statistics – an oversight I hope the current blog post can somewhat mitigate. If you want to learn more about the history of statistics through Neyman’s personal life, I highly recommend the biography of Neyman by Constance Reid, which was the source for most of the content of this blog post.
[i] See Hacking, 1965: “The mature theory of Neyman and Pearson is very nearly the received theory on testing statistical hypotheses.”
[ii] It turns out, in the biography, that it was not Fisher, but A. C. Aitken, who reviewed the paper positively.
[iii] Clayton’s book seems to be mainly intended as an attempt to persuade readers to become a Bayesian, and not as an accurate analysis of the development of frequentist statistics.
[iv] William Gosset (or ‘Student’, from ‘Student’s t-test’), who was the main inspiration for the work by Neyman and Pearson, is another giant in frequentist statistics who does not in any way fit into the narrative that frequentist statistics is tied to eugenics, as his statistical work was motivated by applied research questions in the Guinness brewery. Gosset was a modest man – which is probably why he rarely receives the credit he is due.
[v] When asked about his attitude towards Bayesian statistics in 1979, he answered: “It does not interest me. I am interested in frequencies.” He did note that multiple legitimate approaches to statistics exist, and that the choice among them is largely a matter of personal taste. Neyman opposed subjective Bayesian statistics because its use could lead to bad decision procedures, but he was very positive about later work by Wald, which inspired Bayesian statistical decision theory.
At the first partially in-person scientific meeting I attended after the COVID-19 pandemic, the Perspectives on Scientific Error conference at the Lorentz Center in Leiden, the organizers asked Eric-Jan Wagenmakers and myself to engage in a discussion about p-values and Bayes factors. We each gave a 15-minute presentation to set up our arguments, centered around 3 questions: What is the goal of statistical inference? What is the advantage of your approach in a practical/applied context? And when do you think the other approach may be applicable?
What is the goal of statistical inference?
When browsing through the latest issue of Psychological Science, many of the titles of scientific articles make scientific claims. “Parents Fine-Tune Their Speech to Children’s Vocabulary Knowledge”, “Asymmetric Hedonic Contrast: Pain is More Contrast Dependent Than Pleasure”, “Beyond the Shape of Things: Infants Can Be Taught to Generalize Nouns by Objects’ Functions”, “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis”, or “Response Bias Reflects Individual Differences in Sensory Encoding”. These authors are telling you that if you take away one thing from the work they have been doing, it is a claim that some statistical relationship is present or absent. This approach to science, where researchers collect data to make scientific claims, is extremely common (we discuss this extensively in our preprint “The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests” by Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). It is not the only way to do science – there is purely descriptive work, or estimation, where researchers present data without making any claims beyond the observed data, so there is no single goal in statistical inference – but if you browse through scientific journals, you will see that a large percentage of published articles have the goal of making one or more scientific claims.
Claims can be correct or wrong. If scientists used a coin flip as their preferred methodological approach to make scientific claims, they would be right and wrong 50% of the time. This error rate is considered too high for scientific claims to be useful, and therefore scientists have developed somewhat more advanced methodological approaches to make claims. One such approach, widely used across scientific fields, is Neyman-Pearson hypothesis testing. If you have performed a statistical power analysis when designing a study, and if you think it would be problematic to p-hack when analyzing the data from your study, you have engaged in Neyman-Pearson hypothesis testing. The goal of Neyman-Pearson hypothesis testing is to control the maximum rate of incorrect scientific claims the scientific community collectively makes. For example, when authors write “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis” we could expect a study design where people specified a smallest effect size of interest, and statistically rejected the presence of any worthwhile effect of bilingual advantage in children’s executive functioning based on language status in an equivalence test. They would make such a claim with a pre-specified maximum Type 1 error rate, the alpha level, often set to 5%. Formally, the authors are saying: “We might be wrong, but we claim there is no meaningful effect here, and if all scientists collectively act as if we are correct about claims generated by this methodological procedure, we would be misled no more than alpha% of the time, which we deem acceptable, so let’s for the foreseeable future (until new data emerges that proves us wrong) assume our claim is correct”.
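The equivalence-testing logic described above can be sketched in a few lines of code. This is a minimal illustration of the two one-sided tests (TOST) procedure for a one-sample case, not the analysis from the meta-analysis mentioned in the title; the equivalence bounds and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def tost_one_sample(x, low, high):
    """Two one-sided tests (TOST): declare equivalence if the mean is
    significantly above `low` AND significantly below `high`."""
    n = len(x)
    m = np.mean(x)
    se = np.std(x, ddof=1) / np.sqrt(n)
    p_low = stats.t.sf((m - low) / se, n - 1)     # H0: mean <= low
    p_high = stats.t.cdf((m - high) / se, n - 1)  # H0: mean >= high
    return max(p_low, p_high)                     # reject if below alpha

# Hypothetical data with a true effect of zero; bounds of +/-0.3 define
# the smallest effect size of interest.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200)
p = tost_one_sample(x, -0.3, 0.3)
# If p is below the alpha level, we claim the absence of any effect we
# deem worthwhile, with at most an alpha% long-run error rate.
```

The key design choice is that the claim is tied to pre-specified bounds and a pre-specified alpha, so any scientist running the same procedure on the same data reaches the same claim.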
Discussion sections are often less formal, and researchers often violate the code of conduct for research integrity by selectively publishing only those results that confirm their predictions, which messes up many of the statistical conclusions we draw in science.
The process of claim making described above does not depend on an individual’s personal beliefs, unlike some Bayesian approaches. As Taper and Lele (2011) write: “It is not that we believe that Bayes’ rule or Bayesian mathematics is flawed, but that from the axiomatic foundational definition of probability Bayesianism is doomed to answer questions irrelevant to science. We do not care what you believe, we barely care what we believe, what we are interested in is what you can show.” This view is strongly based on the idea that the goal of statistical inference is the accumulation of correct scientific claims through methodological procedures that lead to the same claims by all scientists who evaluate the tests of these claims. Incorporating individual priors into statistical inferences, and making claims dependent on their prior belief, does not provide science with a methodological procedure that generates collectively established scientific claims. Bayes factors provide a useful and coherent approach to update individual beliefs, but they are not a useful tool to establish collectively agreed upon scientific claims.
What is the advantage of your approach in a practical/applied context?
A methodological procedure built around a Neyman-Pearson perspective works well in a science where scientists want to make claims, but where we want to prevent too many incorrect scientific claims. One attractive property of this methodological approach is that the scientific community can collectively agree upon the severity with which a claim has been tested. If we design a study with 99.9% power for the smallest effect size of interest and use a 0.1% alpha level, everyone agrees the risk of an erroneous claim is low. If you personally do not like the claim, several options for criticism are possible. First, you can argue that no matter how small the error rate was, errors still occur with their appropriate frequency, no matter how surprised we would be if they occurred to us (I am paraphrasing Fisher). Thus, you might want to run two or three replications, until the probability of an error has become so small that the scientific community no longer considers additional replication studies sensible, based on a cost-benefit analysis. Because it is practically very difficult to reach agreement on cost-benefit analyses, fields often resort to rules or regulations. Just as we could debate whether it is sensible to allow people to drive 138 kilometers per hour on some stretches of road, at some times of day, if they have a certain level of driving experience, such fine-grained discussions are too complex to practically implement, and instead speed limits of 50, 80, 100, and 130 are used (depending on location and time of day). Similarly, scientific organizations decide upon thresholds that certain subfields are expected to use (such as an alpha level of 0.000003 in physics to declare a discovery, or the 2-study rule of the FDA).
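To give a rough sense of what designing for 99.9% power at a 0.1% alpha level demands, here is a standard normal-approximation sample size calculation for a two-sample t-test. The formula is a textbook approximation, not anything from the discussion itself, and the effect size d = 0.5 is a hypothetical smallest effect size of interest.

```python
import math
from scipy import stats

def n_per_group(d, alpha, power):
    """Normal-approximation sample size per group for a two-sided
    two-sample t-test detecting standardized effect size d."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Designing with high severity (99.9% power, 0.1% alpha)...
n_strict = n_per_group(0.5, 0.001, 0.999)
# ...versus a conventional design (80% power, 5% alpha).
n_typical = n_per_group(0.5, 0.05, 0.80)
```

Running this shows the strict design needs roughly five times as many participants per group as the conventional one, which is why agreeing on thresholds involves a cost-benefit trade-off.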
Subjective Bayesian approaches can be used in practice to make scientific claims. For example, one can preregister that a claim will be made when a Bayes factor is larger than 10 or smaller than 0.1. This is done in practice, for example in Registered Reports in Nature Human Behaviour. The problem is that this methodological procedure does not in itself control the rate of erroneous claims. Some researchers have published frequentist analyses of Bayesian methodological decision rules (Note: Leonhard Held brought up these Bayesian/frequentist compromise methods as well – during coffee after our discussion, EJ and I agreed that we like these approaches, as they allow researchers to control frequentist error rates while interpreting the evidential value in the data – a win-win solution). This works by determining through simulations which test statistic should be used as a cut-off value to make claims. The process is often a bit laborious, but if you have the expertise and care about evidential interpretations of data, it is worth doing.
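The simulation-based calibration described above can be sketched as follows. This is a toy version: it assumes a one-sample t-test and uses the BIC approximation to the Bayes factor (Wagenmakers, 2007) rather than the default Bayes factors used in actual Registered Reports, and the sample size and number of simulations are arbitrary.

```python
import numpy as np

def bf10_bic(t, n):
    """BIC approximation to the Bayes factor for a one-sample t-test:
    BF01 ~ sqrt(n) * (1 + t^2 / (n - 1)) ** (-n / 2)."""
    bf01 = np.sqrt(n) * (1 + t**2 / (n - 1)) ** (-n / 2)
    return 1 / bf01

# Simulate many studies in which the null hypothesis is true, and count
# how often the rule "claim an effect when BF10 > 10" errs.
rng = np.random.default_rng(42)
n, n_sims = 50, 20000
x = rng.normal(0.0, 1.0, size=(n_sims, n))
t = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(n))
type1_rate = np.mean(bf10_bic(t, n) > 10)
# type1_rate is the frequentist Type 1 error rate of the Bayesian
# decision rule; one could likewise tune the BF cut-off until this
# rate matches a desired alpha level.
```

The same loop, run with a true effect present, would give the rule's power, which is the other half of the frequentist analysis of a Bayesian decision rule.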
In practice, an advantage of frequentist approaches is that criticism has to focus on the data and the experimental design, which can be resolved in additional experiments. In subjective Bayesian approaches, researchers can ignore the data and the experimental design, and instead waste time criticizing priors. For example, in a comment on Bem (2011), Wagenmakers and colleagues concluded that “We reanalyze Bem’s data with a default Bayesian t test and show that the evidence for psi is weak to nonexistent.” In a response, Bem, Utts, and Johnson stated “We argue that they have incorrectly selected an unrealistic prior distribution for their analysis and that a Bayesian analysis using a more reasonable distribution yields strong evidence in favor of the psi hypothesis.” I strongly expect that most reasonable people would agree more with the prior chosen by Bem and colleagues than with the prior chosen by Wagenmakers and colleagues (Note: In the discussion EJ agreed that in hindsight the prior in the main paper was not the best choice, but noted that the supplementary files included a sensitivity analysis demonstrating the conclusions were robust across a range of priors, and that the analysis by Bem et al. combined Bayes factors in a flawed way). More productively than discussing priors, data collected in direct replications since 2011 consistently lead to claims that there is no precognition effect. As Bem has not been able to successfully counter the claims based on data collected in these replication studies, we can currently collectively act as if Bem’s studies were all Type 1 errors (in part caused by extensive p-hacking).
When do you think the other approach may be applicable?
Even though, in the approach to science I have described here, Bayesian approaches based on individual beliefs are not useful for making collectively agreed-upon scientific claims, all scientists are Bayesians. First, we have to rely on our beliefs when we cannot collect sufficient data to repeatedly test a prediction. When data are scarce, we can’t use a methodological procedure that makes claims with low error rates. Second, we can benefit from prior information when we know we cannot be wrong. Incorrect priors can mislead, but if we know our priors are correct, even though this might be rare, we should use them. Finally, use individual beliefs when you are not interested in convincing others, but only want to guide individual actions where being right or wrong does not impact others. For example, you can use your personal beliefs when you decide which study to run next.
In practice, analyses based on p-values and Bayes factors will often agree. Indeed, one of the points of discussion in the rest of the day was that we have bigger problems than the choice between statistical paradigms. A study with a flawed sample size justification or a bad measure is flawed, regardless of how we analyze the data. Yet a good understanding of the value of the frequentist paradigm is important to be able to push back against problematic developments, such as researchers or journals who ignore the error rates of their claims, leading to scientific claims that are incorrect too often. Furthermore, a discussion of this topic helps us think about whether we actually want to pursue the goals that our statistical tools achieve, and whether we actually want to organize knowledge generation by making scientific claims that others have to accept or criticize (a point we develop further in Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). Yes, discussions about p-values and Bayes factors might in practice not have the biggest impact on improving our science, but it is still important and enjoyable to discuss these fundamental questions, and I’d like to thank EJ Wagenmakers and the audience for an extremely pleasant discussion.
In 2009 I attended a European Social Cognition Network meeting in Poland. I only remember one talk from that meeting: A short presentation in a nearly empty room. The presenter was a young PhD student – Stephane Doyen. He discussed two studies where he tried to replicate a well-known finding in social cognition research related to elderly priming, which had shown that people walked more slowly after being subliminally primed with elderly related words, compared to a control condition.
His presentation blew my mind. But it wasn’t because the studies failed to replicate – it was widely known in 2009 that these studies couldn’t be replicated. Indeed, around 2007, I had overheard two professors in a corridor discussing the problem that there were studies in the literature everyone knew would not replicate. And they used this exact study on elderly priming as one example. The best solution the two professors came up with to correct the scientific record was to establish an independent committee of experts that would have the explicit task of replicating studies and sharing their conclusions with the rest of the world. To me, this sounded like a great idea.
And yet, in this small conference room in Poland, there was this young PhD student, acting as if we didn’t need specially convened institutions of experts to inform the scientific community that a study could not be replicated. He just got up, told us about how he wasn’t able to replicate this study, and sat down.
It was heroic.
If you’re struggling to understand why on earth I thought this was heroic, then this post is for you. You might have entered science in a different time. The results of replication studies are no longer communicated only face to face when running into a colleague in the corridor, or at a conference. But I was impressed in 2009. I had never seen anyone give a talk in which the only message was that an original effect didn’t stand up to scrutiny. People sometimes presented successful replications. They presented null effects in lines of research where the absence of an effect was predicted in some (but not all) tests. But I’d never seen a talk where the main conclusion was just: “This doesn’t seem to be a thing”.
On 12 September 2011 I sent Stephane Doyen an email. “Did you ever manage to publish some of that work? I wondered what has happened to it.” Honestly, I didn’t really expect that he would manage to publish these studies. After all, I couldn’t remember ever having seen a paper in the literature that was just a replication. So I asked, even though I did not expect he would have been able to publish his findings.
Surprisingly enough, he responded that the study would soon appear in press. I wasn’t fully aware of new developments in the publication landscape, where Open Access journals such as PlosOne published articles as long as the work was methodologically solid, and the conclusions followed from the data. I shared this news with colleagues, and many people couldn’t wait to read the paper: An article, in print, reporting the failed replication of a study many people knew to be not replicable. The excitement was not about learning something new. The excitement was about seeing replication studies with a null effect appear in print.
Regrettably, not everyone was equally excited. The publication also led to extremely harsh online comments from the original researcher about the expertise of the authors (e.g., suggesting that findings can fail to replicate due to “Incompetent or ill-informed researchers”), and the quality of PlosOne (“which quite obviously does not receive the usual high scientific journal standards of peer-review scrutiny”). This type of response happened again, and again, and again. Another failed replication led to a letter by the original authors that circulated over email among eminent researchers in the area, was addressed to the original authors, and ended with “do yourself, your junior co-authors, and the rest of the scientific community a favor. Retract your paper.”
Some of the historical record of discussions among researchers between 2012 and 2015 survives online, in Twitter and Facebook discussions, and in blogs. But recently I have started to realize that most early career researchers don’t read about the replication crisis through these original materials, but through summaries, which don’t give the same impression as having lived through these times. It was weird to see established researchers argue that people performing replications lacked expertise. That null results were never informative. That thanks to dozens of conceptual replications, the original theoretical point would still hold up even if direct replications failed. As time went by, it became even weirder to see that none of the researchers whose work was not corroborated in replication studies ever published a preregistered replication study to silence the critics. And why were there even two sides to this debate? Although most people agreed there was room for improvement and that replications should play some role in improving psychological science, there was no agreement on how this should work. I remember being surprised that the field was only now thinking about how to perform and interpret replication studies, when we had been doing psychological research for more than a century.
I wanted to share this autobiographical memory, not just because I am getting old and nostalgic, but also because young researchers are most likely to learn about the replication crisis through summaries and high-level overviews. Summaries of history aren’t very good at communicating how confusing this time was when we lived through it. There was a lot of uncertainty, diversity in opinions, and lack of knowledge. And there were a lot of feelings involved. Most of those things don’t make it into written histories. This can make historical developments look cleaner and simpler than they actually were.
It might be difficult to understand why people got so upset about replication studies. After all, we live in a time where it is possible to publish a null result (e.g., in journals that only evaluate methodological rigor, but not novelty, journals that explicitly invite replication studies, and in Registered Reports). Don’t get me wrong: We still have a long way to go when it comes to funding, performing, and publishing replication studies, given their important role in establishing regularities, especially in fields that desire a reliable knowledge base. But perceptions about replication studies have changed in the last decade. Today, it is difficult to feel how unimaginable it used to be that researchers in psychology would share their results at a conference or in a scientific journal when they were not able to replicate the work by another researcher. I am sure it sometimes happened. But there was clearly a reason those professors I overheard in 2007 were suggesting to establish an independent committee to perform and publish studies of effects that were widely known to be not replicable.
As people started to talk about their experiences trying to replicate the work of others, the floodgates opened, and the scales fell from people’s eyes. Let me tell you that, from my personal experience, we didn’t call it a replication crisis for nothing. All of a sudden, many researchers who thought it was their own fault when they couldn’t replicate a finding started to realize the problem was systemic. It didn’t help that in those days it was difficult to communicate with people you didn’t already know. Twitter (which is most likely the medium through which you learned about this blog post) launched in 2006, but up to 2010 hardly any academics used the platform. Back then, it wasn’t easy to get information outside of the published literature. It’s difficult to express how it feels when you realize ‘it’s not me – it’s all of us’. Our environment influences which phenotypic traits express themselves. These experiences made me care about replication studies.
If you started in science when replications were at least somewhat more rewarded, it might be difficult to understand what people were making a fuss about in the past. It’s difficult to go back in time, but you can listen to the stories by people who lived through those times. Some highly relevant stories were shared after the recent multi-lab failed replication of ego-depletion (see tweets by Tom Carpenter and Dan Quintana). You can ask any older researcher at your department for similar stories, but do remember that it will be a lot more difficult to hear the stories of the people who left academia because most of their PhD consisted of failures to build on existing work.
If you want to try to feel what living through those times must have been like, consider this thought experiment. You attend a conference organized by a scientific society where all society members get to vote on who will be a board member next year. Before the votes are cast, the president of the society informs you that one of the candidates has been disqualified. The reason is that it has come to the society’s attention that this candidate selectively reported results from their research lines: The candidate submitted only those studies for publication that confirmed their predictions, and did not share studies with null results, even though those null results came from well-designed studies that tested sensible predictions. Most people in the audience, including yourself, were already aware of the fact that this person selectively reported their results. You knew publication bias was problematic from the moment you started to work in science, and the field had known it was problematic for centuries. Yet here you are, in a room at a conference, where this status quo is not accepted. All of a sudden, it feels like it is possible to actually do something about a problem that has made you feel uneasy ever since you started to work in academia.
You might live through a time where publication bias is no longer silently accepted as an unavoidable aspect of how scientists work, and if this happens, the field will likely have a very similar discussion as it did when it started to publish failed replication studies. And ten years later, a new generation will have been raised under different scientific norms and practices, where extreme publication bias is a thing of the past. It will be difficult to explain to them why this topic was a big deal a decade ago. But since you’re getting old and nostalgic yourself, you think that it’s useful to remind them, and you just might try to explain it to them in a 2 minute TikTok video.
History merely repeats itself. It has all been done before. Nothing under the sun is truly new.
Thanks to Farid Anvari, Ruben Arslan, Noah van Dongen, Patrick Forscher, Peder Isager, Andrea Kis, Max Maier, Anne Scheel, Leonid Tiokhin, and Duygu Uygun for discussing this blog post with me (and in general for providing such a stimulating social and academic environment in times of a pandemic).
If you have educational material that you think will do a better job at preventing p-value misconceptions than the material in my MOOC, join the p-value misconception eradication challenge by proposing an improvement to my current material in a new A/B test in my MOOC.
I launched a massive open online course, “Improving your statistical inferences”, in October 2016. So far around 47k students have enrolled, and the evaluations suggest it has been a useful resource for many researchers. The first week focuses on p-values: what they are, what they are not, and how to interpret them.
Arianne Herrera-Bennet was interested in whether an understanding of p-values was indeed “impervious to correction” as some statisticians believe (Haller & Krauss, 2002, p. 1) and collected data on accuracy rates on ‘pop quizzes’ between August 2017 and 2018 to examine if there was any improvement in p-value misconceptions that are commonly examined in the literature. The questions were asked at the beginning of the course, after relevant content was taught, and at the end of the course. As the figure below from the preprint shows, there was clear improvement, and accuracy rates were quite high for 5 items, and reasonable for 3 items.
We decided to perform a follow-up from September 2018 where we added an assignment to week one for half the students in an ongoing A/B test in the MOOC. In this new assignment, we didn’t just explain what p-values are (as in the first assignment in the module all students do) but we also tried to specifically explain common misconceptions, to explain what p-values are not. The manuscript is still in preparation, but there was additional improvement for at least some misconceptions. It seems we can develop educational material that prevents p-value misconceptions – but I am sure more can be done.
In my paper to appear in Perspectives on Psychological Science on “The practical alternative to the p-value is the correctly used p-value” I write:
“Looking at the deluge of papers published in the last half century that point out how researchers have consistently misunderstood p-values, I am left to wonder: Where is the innovative coordinated effort to create world class educational materials that can freely be used in statistical training to prevent such misunderstandings? It is nowadays relatively straightforward to create online apps where people can simulate studies and see the behavior of p values across studies, which can easily be combined with exercises that fit the knowledge level of bachelor and master students. The second point I want to make in this article is that a dedicated attempt to develop evidence based educational material in a cross-disciplinary team of statisticians, educational scientists, cognitive psychologists, and designers seems worth the effort if we really believe young scholars should understand p values. I do not think that the effort statisticians have made to complain about p-values is matched with a similar effort to improve the way researchers use p values and hypothesis tests. We really have not tried hard enough.”
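The kind of simulation the quoted passage describes is indeed straightforward to build. Below is a minimal sketch of the idea (my own illustration, not part of the MOOC materials; the one-sample test, sample size, and effect size are arbitrary choices) that runs many simulated studies and records each study’s p-value:

```python
import math
import random

def p_value_one_sample(sample, mu0=0.0):
    """Two-sided p-value for a one-sample test of the mean against mu0,
    using a normal approximation (reasonable for n of about 30 or more)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    z = (mean - mu0) / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_p_values(n_studies, n=50, true_mean=0.0, seed=7):
    """Run n_studies simulated studies of n observations each and
    return the p-value of every study."""
    rng = random.Random(seed)
    return [p_value_one_sample([rng.gauss(true_mean, 1) for _ in range(n)])
            for _ in range(n_studies)]

null_ps = simulate_p_values(2000, true_mean=0.0)    # null hypothesis is true
effect_ps = simulate_p_values(2000, true_mean=0.5)  # a true effect exists
print(sum(p < .05 for p in null_ps) / len(null_ps))
print(sum(p < .05 for p in effect_ps) / len(effect_ps))
```

This is exactly the behavior such apps let students discover for themselves: when the null hypothesis is true, p-values are uniformly distributed, so roughly 5% of studies come out significant; when a true effect exists, the distribution piles up near zero.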
If we honestly feel that misconceptions of p-values are a problem, and there are early indications that good education material can help, let’s try to do all we can to eradicate p-value misconceptions from this world.
We have collected enough data in the current A/B test. I am convinced the experimental condition adds some value to people’s understanding of p-values, so I think it would be best educational practice to stop presenting students with the control condition.
However, there might be educational material out there that does a much better job than the material I made at training away misconceptions. So instead of giving all students my own new assignment, I want to give anyone who thinks they can do an even better job the opportunity to demonstrate this. If you have educational material that you think will work even better than my current material, I will create a new experimental condition that contains your teaching material. Over time, we can see which material performs better, and work towards creating the best educational material to prevent misunderstandings of p-values that we can.
If you are interested in working on improving p-value education material, take a look at the first assignment in the module that all students do, and look at the new second assignment I have created to train away misconceptions (and the answers). Then, create (or adapt) educational material such that the assignment is similar in length and content. The learning goal should be to train away common p-value misconceptions – you can focus on any and all you want. If there are multiple people who are interested, we collectively vote on which material we should test first (but people are free to combine their efforts, and work together on one assignment). What I can offer is getting your material in front of between 300 and 900 students who enroll each week. Not all of them will start, and not all of them will do the assignments, but your material should reach at least several hundred learners a year, of whom around 40% have a master’s degree and 20% have a PhD – so you will be teaching fellow scientists (and beyond) to improve how they work.
I will incorporate this new assignment, and make it publicly available on my blog, as soon as it is done and decided on by all people who expressed interest in creating high quality teaching material. We can evaluate the performance by looking at the accuracy rates on test items. I look forward to seeing your material, and hope this can be a small step towards an increased effort in improving statistics education. We might have a long way to go to completely eradicate p-value misconceptions, but we can start.
On July 28, 2020, the first Dutch academic was judged to have violated the code of conduct for research integrity for p-hacking and optional stopping with the aim of improving the chances of obtaining a statistically significant result. I think this is a noteworthy event that marks a turning point in the way the scientific research community interprets research practices that up to a decade ago were widely practiced. The researcher in question violated scientific integrity in several other important ways, including withdrawing blood without ethical consent, and writing grant proposals in which studies and data were presented that had not been performed and collected. But here, I want to focus on the judgment about p-hacking and optional stopping.
When I studied at Leiden University from 1998 to 2002 and commuted by train from my hometown of Rotterdam, I would regularly smoke a cigar in the smoking compartment of the train during my commute. If I entered a train today and lit a cigar, the responses from my fellow commuters would be markedly different from those 20 years ago. They would probably display moral indignation, or call the train conductor, who would give me a fine. Times change.
When the report on the fraud case of Diederik Stapel came out, the three committees were surprised by a research culture that accepted “sloppy science”. But the report did not directly refer to these practices as violations of the code of conduct for research integrity. For example, on page 57 they wrote:
“In the recommendations, the Committees not only wish to focus on preventing or reducing fraud, but also on improving the research culture. The European Code refers to ‘minor misdemeanours’: some data massage, the omission of some unwelcome observations, ‘favourable’ rounding off or summarizing, etc. This kind of misconduct is not categorized under the ‘big three’ (fabrication, falsification, plagiarism) but it is, in principle, equally unacceptable and may, if not identified or corrected, easily lead to more serious breaches of standards of integrity.”
Compare this to the report by LOWI, the Dutch National Body for Scientific Integrity, on a researcher at Leiden University who was judged to have violated the code of conduct for research integrity for p-hacking and optional stopping (note that this is my translation from Dutch of point IV of the advice on page 17, and point V on page 4):
“The Board has rightly ruled that Petitioner has violated standards of academic integrity with regard to points 2 to 5 of the complaint.”
With this, LOWI has judged that the Scientific Integrity Committee of Leiden University (abbreviated as CWI in Dutch) ruled correctly with respect to the following:
“According to the CWI, the applicant also acted in violation of scientific integrity by incorrectly using statistical methods (p-hacking) by continuously conducting statistical tests during the course of an experiment and by supplementing the research population with the aim of improving the chances of obtaining a statistically significant result.”
As norms change, what we deemed a misdemeanor before is now simply classified as a violation of academic integrity. I am sure this is very upsetting for this researcher. We’ve seen similar responses in the past years, where single individuals suffered more than the average researcher for behaviors that many others performed as well. They might feel unfairly singled out. The only difference between this researcher at Leiden University and several others who performed identical behaviors was that someone in their environment took the 2018 Netherlands Code of Conduct for Research Integrity seriously when they read section 3.7, point 56:
Call attention to other researchers’ non-compliance with the standards as well as inadequate institutional responses to non-compliance, if there is sufficient reason for doing so.
When it comes to smoking, rules in The Netherlands are regulated through laws. You’d think this would mitigate confusion, insecurity, and negative emotions during a transition – but that would be wishful thinking. In The Netherlands the whole transition has been ongoing for close to two decades, from an initial law allowing a smoke-free working environment in 2004, to a completely smoke-free university campus in August 2020.
The code of conduct for research integrity is not governed by laws, and enforcement of the code of conduct for research integrity is typically not anyone’s full time job. We can therefore expect the change to be even more slow than the changes in what we feel is acceptable behavior when it comes to smoking. But there is notable change over time.
We see a shift from the “big three” types of misconduct (fabrication, falsification, plagiarism), and somewhat vague language of misdemeanors, that is “in principle” equally unacceptable, and might lead to more serious breaches of integrity, to a clear classification of p-hacking and optional stopping as violations of scientific integrity. Indeed, if you ask me, the ‘bigness’ of plagiarism pales compared to how publication bias and selective reporting distort scientific evidence.
Compare this to smoking laws in The Netherlands, where early on it was still allowed to create separate smoking rooms in buildings, while from August 2020 onwards all school and university terrain (i.e., the entire campus, inside and outside of the buildings) needs to be a smoke-free environment. Slowly but surely, what is seen as acceptable changes.
I do not consider myself to be an exceptionally big idiot – I would say I am pretty average on that dimension – but it did not occur to me how stupid it was to enter a smoke-filled train compartment and light up a cigar during my 30 minute commute around the year 2000. At home, I regularly smoked a pipe (a gift from my father). I still have it. Just looking at the tar stains now makes me doubt my own sanity.
This is despite the fact that the relation between smoking and cancer had been pretty well established since the 1960s. Similarly, when I did my PhD between 2005 and 2009 I was pretty oblivious to the error rate inflation due to optional stopping, despite the fact that one of the more important papers on this topic was published by Armitage, McPherson, and Rowe in 1969. I did realize that flexibility in analyzing data could not be good for the reliability of the findings we reported, but just like when I lit a cigar in the smoking compartment of the train, I failed to adequately understand how bad it was.
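The error rate inflation that Armitage and colleagues documented is easy to reproduce yourself. The sketch below (my own illustration; the normal-approximation test, sample sizes, and number of interim looks are arbitrary choices) simulates two identical groups, peeks at the accumulating data several times, and stops at the first significant result:

```python
import math
import random

def two_sided_p(sample_a, sample_b):
    """Two-sample z-test p-value (normal approximation, fine for n >= 20)."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(n_sims, max_n=100, looks=5, alpha=0.05, seed=1):
    """Proportion of null simulations that ever cross alpha when the
    accumulating data are tested `looks` times."""
    rng = random.Random(seed)
    step = max_n // looks
    hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(0, 1) for _ in range(max_n)]  # the null is true:
        b = [rng.gauss(0, 1) for _ in range(max_n)]  # no group difference
        for n in range(step, max_n + 1, step):
            if two_sided_p(a[:n], b[:n]) < alpha:
                hits += 1  # stop at the first significant interim look
                break
    return hits / n_sims

print("one look: ", false_positive_rate(2000, looks=1))
print("five looks:", false_positive_rate(2000, looks=5))
```

With a single look at the final sample, the false positive rate stays near the nominal 5%; with five interim looks it climbs to roughly the 14% Armitage, McPherson, and Rowe reported, even though the null hypothesis is true in every simulated study.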
When smoking laws became stricter, there was a lot of discussion in society. One might even say there was a strong polarization: on the one hand, newspaper articles appeared that claimed outlawing smoking in the train was ‘totalitarian’; on the other hand, some family members would no longer allow people to smoke inside their house, which led my parents (both smokers) to stop visiting them. Changing norms leads to conflict. People feel personally attacked, they become uncertain, and in the discussions that follow we see all opinions, ranging from how people should be free to do what they want, to how people who smoke should pay more for healthcare.
We’ve seen the same in scientific reform, although the discussion is more often along the lines of how smoking can’t be that bad if my 95 year old grandmother has been smoking a packet a day for 70 years and feels perfectly fine, to how alcohol use or lack of exercise are much bigger problems and why isn’t anyone talking about those.
But throughout all this discussion, norms just change. Even my parents stopped smoking inside their own home around a decade ago. The Dutch National Body for Scientific Integrity has classified p-hacking and optional stopping as violations of research integrity. Science is continuously improving, but change is slow. Someone once explained to me that correcting the course of science is like steering an oil tanker – any change in direction takes a while to be noticed. But when change happens, it’s worth standing still to reflect on it, and look at how far we’ve come.