As more people have started to use Bayes factors, we should not be surprised that misconceptions about Bayes factors have become common. A recent study shows that the percentage of scientific articles drawing incorrect inferences based on observed Bayes factors is distressingly high (Wong et al., 2022): 92% of the articles examined demonstrated at least one misconception about Bayes factors. Here I will review some of the most common misconceptions, and how to prevent them.
Misunderstanding 1:
Confusing Bayes Factors with Posterior Odds.
One common criticism
by Bayesians of null hypothesis significance testing (NHST) is that NHST
quantifies the probability of the data (or more extreme data), given that the
null hypothesis is true, but that scientists should be interested in the
probability that the hypothesis is true, given the data. Cohen (1994) wrote:
What’s wrong with
NHST? Well, among many other things, it does not tell us what we want to know,
and we so much want to know what we want to know that, out of desperation, we
nevertheless believe that it does! What we want to know is “Given these data,
what is the probability that H0 is true?”
One might therefore believe that Bayes factors tell us something about the probability that a hypothesis is true, but this is incorrect. A Bayes factor quantifies how much we should update our belief in one hypothesis relative to another. If a hypothesis was extremely unlikely to begin with (e.g., the hypothesis that people have telepathy), it might still be very unlikely even after a single study yields a large Bayes factor in favor of telepathy. If we believed it was 99.9% certain that telepathy does not exist, evidence for telepathy might only increase our belief in telepathy to the extent that we now believe it is 98% unlikely. The Bayes factor only corresponds to the posterior odds if the prior odds were 1, that is, if we considered both hypotheses equally likely to begin with. Only if both hypotheses were equally likely, and a Bayes factor indicates we should update our belief such that the alternative hypothesis is three times more likely than the null hypothesis, would we end up believing the alternative hypothesis is exactly three times more likely than the null hypothesis. One should therefore not conclude that, for example, given a BF of 10, the alternative hypothesis is more likely to be true than the null hypothesis. The correct claim is that people should update their belief in the alternative hypothesis by a factor of 10.
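The updating rule itself is simple: the posterior odds equal the Bayes factor multiplied by the prior odds. As a minimal sketch (in Python, with made-up numbers: a prior probability of 0.001 that telepathy exists, and a hypothetical Bayes factor of 20 from a single study), the telepathy example above works out as follows:

    # Posterior odds = Bayes factor x prior odds (telepathy example, hypothetical numbers)
    prior_prob_h1 = 0.001                              # telepathy considered 99.9% unlikely a priori
    prior_odds = prior_prob_h1 / (1 - prior_prob_h1)   # prior odds of H1 against H0
    bf10 = 20                                          # hypothetical 'large' Bayes factor from one study
    posterior_odds = bf10 * prior_odds
    posterior_prob_h1 = posterior_odds / (1 + posterior_odds)
    print(round(posterior_prob_h1, 3))                 # 0.02: telepathy is still about 98% unlikely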
Misunderstanding 2:
Failing to interpret Bayes Factors as relative evidence.
One benefit of Bayes
factors that is often mentioned by Bayesians is that, unlike NHST, Bayes
factors can provide support for the null hypothesis, and thereby falsify
predictions. It is true that NHST can only reject the null hypothesis, although
it is important to add that in frequentist statistics equivalence tests can be
used to reject the alternative hypothesis, and therefore there is no need to
switch to Bayes factors to meaningfully interpret the results of
non-significant null hypothesis tests.
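As a rough illustration of such an equivalence test, the minimal Python sketch below performs two one-sided tests (TOST) on simulated data, with an arbitrary equivalence bound of 0.5 raw units; the data and bound are not tied to any specific study:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, 100)   # simulated data in which the true effect is zero
    bound = 0.5                     # smallest effect size of interest (equivalence bounds: -0.5 to 0.5)

    # TOST: two one-sided t-tests against the lower and upper equivalence bound
    p_lower = stats.ttest_1samp(x, -bound, alternative='greater').pvalue
    p_upper = stats.ttest_1samp(x, bound, alternative='less').pvalue
    p_tost = max(p_lower, p_upper)  # equivalence is claimed only if both one-sided tests are significant
    print(p_tost)

If both one-sided tests are significant, effects at least as extreme as the equivalence bounds can be rejected, which is the frequentist way to reject the alternative hypothesis mentioned above.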
Bayes factors quantify
support for one hypothesis relative to another hypothesis. As with likelihood
ratios, it is possible that one hypothesis is supported more than another
hypothesis, while both hypotheses are actually false. It is incorrect to
interpret Bayes factors in an absolute manner, for example by stating that a
Bayes factor of 0.09 provides support for the null hypothesis. The correct
interpretation is that the Bayes factor provides relative support for H0
compared to H1. With a different alternative model, the Bayes factor would
change. As with a significant equivalence test, even a Bayes factor strongly supporting H0 does not mean there is no effect at all; there could be a true, but small, effect.
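For two point hypotheses the Bayes factor is simply the likelihood ratio, which makes the relative nature of the evidence easy to see. In the following made-up example (Python, using scipy), 50 heads are observed in 100 coin flips, and we compare the hypothesis that the coin lands heads 80% of the time against the hypothesis that it lands heads 90% of the time:

    from scipy import stats

    k, n = 50, 100                       # observed: 50 heads in 100 flips
    lik_h0 = stats.binom.pmf(k, n, 0.9)  # likelihood under H0: P(heads) = 0.9
    lik_h1 = stats.binom.pmf(k, n, 0.8)  # likelihood under H1: P(heads) = 0.8
    print(lik_h1 / lik_h0)               # enormous ratio favoring H1

The data favor the 80% hypothesis over the 90% hypothesis by an enormous margin, yet both hypotheses are clearly false for a coin that lands heads about half the time. Bayes factors behave the same way: strong relative support for one model is not evidence that this model is true in an absolute sense.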
For example, after Daryl Bem (2011) published nine studies demonstrating support for pre-cognition (conscious cognitive awareness of a future event that could not otherwise be known), a team of Bayesian statisticians re-analyzed the studies and concluded: “Out of the 10 critical tests, only one yields “substantial” evidence for H1, whereas three yield “substantial” evidence in favor of H0. The results of the remaining six tests provide evidence that is only “anecdotal”” (Wagenmakers et al., 2011). In a reply, Bem, Utts, and Johnson (2011) argue that the set of studies provides convincing evidence for the alternative hypothesis if the Bayes factors are computed as relative evidence between the null hypothesis and a more realistically specified alternative hypothesis, in which the effects of pre-cognition are expected to be small. This back and forth illustrates how Bayes factors are relative evidence, and how a change in the specification of the alternative model changes whether the null or the alternative hypothesis receives relatively more support given the data.
Misunderstanding 3:
Not specifying the null and/or alternative model.
Given that Bayes factors are relative evidence for or against one model compared to another model, it might be surprising that many researchers fail to specify the alternative model when reporting their analysis. And yet, in a systematic review of how psychologists use Bayes factors, van de Schoot et al. (2017) found that “31.1% of the articles did not even discuss the priors implemented”. Whereas in a null hypothesis significance test researchers do not need to specify the model the test is based on, as the test is by definition a test against an effect of 0, and the alternative model consists of any non-zero effect size (in a two-sided test), this is not true when computing Bayes factors. The null model when computing Bayes factors is often (but not necessarily) a point null, as in NHST, but the alternative model is only one of many possible alternative hypotheses that a researcher could test against. It has become common to use ‘default’ priors, but as with any heuristic, defaults will often give an answer to a nonsensical question, and quickly become a form of mindless statistics. When introducing Bayes factors as an alternative to frequentist t-tests, Rouder et al. (2009) write:
This commitment to
specify judicious and reasoned alternatives places a burden on the analyst. We
have provided default settings appropriate to generic situations. Nonetheless,
these recommendations are just that and should not be used blindly. Moreover,
analysts can and should consider their goals and expectations when specifying
priors. Simply put, principled inference is a thoughtful process that cannot be
performed by rigid adherence to defaults.
The priors used when
computing a Bayes factor should therefore be both specified and justified.
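To make the dependence on the specified alternative model concrete, the sketch below computes the Jeffreys-Zellner-Siow (JZS) Bayes factor for a one-sample t-test, following the integral expression in Rouder et al. (2009), for the same observed result under three different Cauchy prior scales. The t-value, sample size, and scales are arbitrary numbers chosen for illustration:

    import numpy as np
    from scipy import integrate

    def jzs_bf01(t, n, r=0.707):
        """BF01 for a one-sample t-test with a Cauchy(0, r) prior on the
        standardized effect size under H1 (Rouder et al., 2009)."""
        nu = n - 1
        # Marginal likelihood of the data under the point null (delta = 0)
        null_lik = (1 + t**2 / nu) ** (-(nu + 1) / 2)
        # Marginal likelihood under H1, integrating over the prior on g
        def integrand(g):
            return ((1 + n * g) ** -0.5
                    * (1 + t**2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                    * r / np.sqrt(2 * np.pi) * g ** -1.5 * np.exp(-r**2 / (2 * g)))
        alt_lik, _ = integrate.quad(integrand, 0, np.inf)
        return null_lik / alt_lik

    # Same observed result (t = 2.0, n = 50), three different alternative models:
    for scale in (0.1, 0.707, 1.4):
        print(scale, round(1 / jzs_bf01(t=2.0, n=50, r=scale), 2))  # BF10 = 1 / BF01

Running this sketch shows that the same t-value yields a different Bayes factor under each prior scale, which is exactly why the prior needs to be specified and justified.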
Misunderstanding 4:
Claims based on Bayes Factors do not require error control.
In a paper with the
provocative title “Optional stopping: No problem for Bayesians” Rouder (2014)
argues that “Researchers using Bayesian methods may employ optional stopping in
their own research and may provide Bayesian analysis of secondary data regardless
of the employed stopping rule.” A reader who looks only at the title and abstract might come to the conclusion that Bayes factors are a wonderful solution to the error inflation due to optional stopping in the frequentist framework, but this is not correct (de Heide & Grünwald, 2017).
There is a big caveat about the type of statistical inference that is unaffected by optional stopping. Optional stopping is no problem for Bayesians if they refrain from making a dichotomous claim about the presence or absence of an effect, or when they refrain from drawing conclusions about a prediction being supported or falsified. Rouder notes how “Even with optional stopping, a researcher can interpret the posterior odds as updated beliefs about hypotheses in light of data.” In other words, even after optional stopping, a Bayes factor tells researchers how much they should update their belief in a hypothesis. Importantly, when researchers make dichotomous claims based on Bayes factors (e.g., “The effect did not differ between the conditions, BF10 = 0.17”), this claim can be correct or an error, and error rates become a relevant consideration, unlike when researchers simply present the Bayes factor for readers to update their personal beliefs.
Bayesians disagree among themselves about whether Bayes factors should be the basis of dichotomous claims. Those who promote the use of Bayes factors to make claims often
refer to thresholds proposed by Jeffreys (1939), where a BF > 3 is
“substantial evidence”, and a BF > 10 is considered “strong evidence”. Some
journals, such as Nature Human Behaviour, have the following requirement for
researchers who submit a Registered Report: “For inference by Bayes factors,
authors must be able to guarantee data collection until the Bayes factor is at
least 10 times in favour of the experimental hypothesis over the null
hypothesis (or vice versa).” When researchers decide to collect data until a
specific threshold is crossed to make a claim about a test, their claim can be
correct, or wrong, just as when p-values are the statistical quantity a claim
is based on. As both the Bayes factor and the p-value can be computed based on
the sample size and the t-value (Francis, 2016; Rouder et al., 2009), there is
nothing special about using Bayes factors as the basis of an ordinal claim. The exact long-run error rates cannot be directly controlled when computing Bayes factors, and the Type 1 and Type 2 error rates depend on the choice of the prior and the choice of the cut-off used to decide to make a claim.
Simulation studies show that, for commonly used priors and a BF > 3 cut-off to make claims, the Type 1 error rate is somewhat smaller than that of the corresponding frequentist test, but the Type 2 error rate is considerably larger (Kelter, 2021).
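These error rates can be estimated by simulation. The sketch below (Python) is a rough illustration with arbitrary settings (n = 50 per simulated study, a true effect of d = 0.4 in the alternative scenario); it assumes the pingouin package is available (it is not mentioned in the text), which implements the Rouder et al. (2009) default Bayes factor:

    import numpy as np
    from scipy import stats
    import pingouin as pg  # assumed to be installed; provides the default JZS Bayes factor

    rng = np.random.default_rng(1)
    n, n_sims = 50, 2000

    def bf_and_p(sample):
        res = stats.ttest_1samp(sample, 0)
        # One-sample JZS Bayes factor with the default Cauchy prior scale of 0.707
        return pg.bayesfactor_ttest(res.statistic, n), res.pvalue

    null_results = [bf_and_p(rng.normal(0.0, 1, n)) for _ in range(n_sims)]  # H0 true (d = 0)
    alt_results = [bf_and_p(rng.normal(0.4, 1, n)) for _ in range(n_sims)]   # true effect of d = 0.4

    # Type 1 errors: claiming an effect when the null is true
    print("BF10 > 3 under H0: ", np.mean([bf > 3 for bf, p in null_results]))
    print("p < .05 under H0:  ", np.mean([p < .05 for bf, p in null_results]))
    # Type 2 errors: failing to claim an effect when a true effect exists
    print("BF10 <= 3 under H1:", np.mean([bf <= 3 for bf, p in alt_results]))
    print("p >= .05 under H1: ", np.mean([p >= .05 for bf, p in alt_results]))

In this simulation the BF10 > 3 rule produces fewer false positive claims but more missed effects than the p < .05 rule, in line with the trade-off described by Kelter (2021).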
To conclude this
section, whenever researchers make claims, they can make erroneous claims, and
error control should be a worthy goal. Error control is not a consideration
when researchers do not make ordinal claims (e.g., X is larger than Y, there is
a non-zero correlation between X and Y, etc.). If Bayes factors are used to
quantify how much researchers should update personal beliefs in a hypothesis,
there is no need to consider error control, but researchers should also refrain
from making any ordinal claims based on Bayes factors in the results section or
the discussion section. Giving up error control also means giving up claims
about the presence or absence of effects.
Misunderstanding 5:
Interpreting Bayes Factors as effect sizes.
Bayes factors are not
statements about the size of an effect. It is therefore not appropriate to
conclude that the effect size is small or large purely based on the Bayes
factor. Depending on the priors used when specifying the alternative and null
model, the same Bayes factor can be observed for very different effect size
estimates. The reverse is also true. The same effect size can correspond to
Bayes factors supporting the null or the alternative hypothesis, depending on
how the null model and the alternative model are specified. Researchers should therefore always report and interpret effect size measures. Statements about the size of effects should only be based on these effect size measures, and not on Bayes factors.
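As a rough illustration (Python, with made-up numbers; again assuming the pingouin package for the default Bayes factor), two studies with very different observed effect sizes can produce default Bayes factors of similar, modest size:

    import numpy as np
    import pingouin as pg  # assumed to be installed

    # Study A: small sample, large observed effect; Study B: large sample, small observed effect
    studies = {"A": (2.2, 20), "B": (2.6, 400)}  # (t-value, sample size), made-up numbers

    for label, (t, n) in studies.items():
        d = t / np.sqrt(n)                   # observed Cohen's d for a one-sample t-test
        bf10 = pg.bayesfactor_ttest(t, n)    # default Cauchy prior with scale 0.707
        print(label, "d =", round(d, 2), "BF10 =", round(bf10, 2))

In this example both Bayes factors are modest and of broadly similar size, even though the observed effect in study A is almost four times as large as in study B; only the reported effect sizes reveal that difference.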
Any tool for statistical inference will be misused, and the greater its adoption, the more people will use it without proper training. Simplistic sales pitches for Bayes factors (e.g., Bayes factors tell you the probability that your hypothesis is true, Bayes factors do not require error control, you can use ‘default’ Bayes factors and do not have to think about your priors) contribute to this misuse. When reviewing papers that report Bayes factors, check whether the authors use them to draw correct inferences.
Bem, D. J. (2011).
Feeling the future: Experimental evidence for anomalous retroactive influences
on cognition and affect. Journal of Personality and Social Psychology, 100(3),
407–425. https://doi.org/10.1037/a0021524
Bem, D. J., Utts, J.,
& Johnson, W. O. (2011). Must psychologists change the way they analyze
their data? Journal of Personality and Social Psychology, 101(4), 716–719.
https://doi.org/10.1037/a0024777
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997
de Heide, R., &
Grünwald, P. D. (2017). Why optional stopping is a problem for Bayesians.
arXiv:1708.08278 [Math, Stat]. https://arxiv.org/abs/1708.08278
Francis, G. (2016).
Equivalent statistics and data interpretation. Behavior Research Methods, 1–15.
https://doi.org/10.3758/s13428-016-0812-3
Jeffreys, H. (1939).
Theory of probability (1st ed). Oxford University Press.
Kelter, R. (2021).
Analysis of type I and II error rates of Bayesian and frequentist parametric
and nonparametric two-sample hypothesis tests under preliminary assessment of
normality. Computational Statistics, 36(2), 1263–1288.
https://doi.org/10.1007/s00180-020-01034-7
Rouder, J. N. (2014).
Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review,
21(2), 301–308.
Rouder, J. N.,
Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t
tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin
& Review, 16(2), 225–237. https://doi.org/10.3758/PBR.16.2.225
van de Schoot, R.,
Winter, S. D., Ryan, O., Zondervan-Zwijnenburg, M., & Depaoli, S. (2017). A
systematic review of Bayesian articles in psychology: The last 25 years.
Psychological Methods, 22(2), 217–239. https://doi.org/10.1037/met0000100
Wagenmakers, E.-J.,
Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why
psychologists must change the way they analyze their data: The case of psi:
Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3),
426–432. https://doi.org/10.1037/a0022790
Wong, T. K., Kiers, H., & Tendeiro, J. (2022). On the potential mismatch between the function of the Bayes factor and researchers’ expectations. Collabra: Psychology, 8(1), 36357. https://doi.org/10.1525/collabra.36357