Preventing common misconceptions about Bayes Factors

As more people have
started to use Bayes Factors, we should not be surprised that misconceptions
about Bayes Factors have become common. A recent study shows that the
percentage of scientific articles that draw incorrect inferences based on
observed Bayes Factors is distressingly high (Wong et al., 2022), with 92% of
articles demonstrating at least one misconception of Bayes Factors. Here I will
review some of the most common misconceptions, and how to prevent them.

Misunderstanding 1:
Confusing Bayes Factors with Posterior Odds.

One common criticism
by Bayesians of null hypothesis significance testing (NHST) is that NHST
quantifies the probability of the data (or more extreme data), given that the
null hypothesis is true, but that scientists should be interested in the
probability that the hypothesis is true, given the data. Cohen (1994) wrote:

What’s wrong with
NHST? Well, among many other things, it does not tell us what we want to know,
and we so much want to know what we want to know that, out of desperation, we
nevertheless believe that it does! What we want to know is “Given these data,
what is the probability that Ho is true?”

One might therefore
believe that Bayes factors tell us something about the probability that a
hypothesis is true, but this is incorrect. A Bayes factor quantifies how much we
should update our belief in one hypothesis. If this hypothesis was extremely
unlikely (e.g., the probability that people have telepathy) this hypothesis
might still be very unlikely, even after computing a large Bayes factor in a
single study demonstrating telepathy. If we believed the hypothesis that people
have telepathy was unlikely to be true (e.g., we thought it was 99.9% certain
telepathy was not true) evidence for telepathy might only increase our belief
in telepathy to the extent that we now believe it is 98% unlikely. The Bayes
factor only corresponds to our posterior belief if we were perfectly uncertain
about the hypothesis being true or not. If both hypotheses were equally likely,
and a Bayes factor indicates we should update our belief in such a way that the
alternative hypothesis is three times more likely than the null hypothesis,
only then would we end up believing the alternative hypothesis is exactly three
times more likely than the null hypothesis. One should therefore not conclude
that, for example, given a BF of 10, the alternative hypothesis is more likely
to be true than the null hypothesis. The correct claim is that people should
update their odds for the alternative hypothesis relative to the null hypothesis
by a factor of 10.
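
The relation between a Bayes factor and a posterior belief can be made concrete with a small calculation: the posterior odds are the Bayes factor multiplied by the prior odds. The sketch below is illustrative only. The function name `posterior_probability` and the Bayes factor of 20 are assumptions chosen to roughly match the telepathy example above, not values from any study.

```python
def posterior_probability(prior_probability: float, bayes_factor: float) -> float:
    """Update P(H1) with a Bayes factor BF10: posterior odds = BF10 * prior odds."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Equal prior probabilities: only here do the posterior odds equal the Bayes factor.
print(posterior_probability(0.5, 3))     # 0.75, i.e. posterior odds of 3 to 1

# A skeptical prior (99.9% sure telepathy is not real) and a hypothetical BF10 of 20.
print(posterior_probability(0.001, 20))  # roughly 0.02, so still about 98% unlikely
```

The same Bayes factor therefore leads to very different posterior beliefs depending on where one starts.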

Misunderstanding 2:
Failing to interpret Bayes Factors as relative evidence.

One benefit of Bayes
factors that is often mentioned by Bayesians is that, unlike NHST, Bayes
factors can provide support for the null hypothesis, and thereby falsify
predictions. It is true that NHST can only reject the null hypothesis, although
it is important to add that in frequentist statistics equivalence tests can be
used to reject the alternative hypothesis, and therefore there is no need to
switch to Bayes factors to meaningfully interpret the results of
non-significant null hypothesis tests.

Bayes factors quantify
support for one hypothesis relative to another hypothesis. As with likelihood
ratios, it is possible that one hypothesis is supported more than another
hypothesis, while both hypotheses are actually false. It is incorrect to
interpret Bayes factors in an absolute manner, for example by stating that a
Bayes factor of 0.09 provides support for the null hypothesis. The correct
interpretation is that the Bayes factor provides relative support for H0
compared to H1. With a different alternative model, the Bayes factor would
change. As with a significant equivalence test, even a Bayes factor strongly
supporting H0 does not mean there is no effect at all; there could be a true,
but small, effect.

For example, after
Daryl Bem (2011) published 9 studies demonstrating support for pre-cognition
(conscious cognitive awareness of a future event that could not otherwise be
known) a team of Bayesian statisticians re-analyzed the studies, and concluded
“Out of the 10 critical tests, only one yields “substantial” evidence for H1,
whereas three yield “substantial” evidence in favor of H0. The results of the
remaining six tests provide evidence that is only “anecdotal”” (Wagenmakers et al., 2011). In a
reply, Bem, Utts, and Johnson (2011) argue that the set of studies provides
convincing evidence for the alternative hypothesis, if the Bayes factors are
computed as relative evidence between the null hypothesis and a more
realistically specified alternative hypothesis, where the effects of
pre-cognition are expected to be small. This back and forth illustrates how
Bayes factors are relative evidence, and a change in the alternative model
specification changes whether the null or the alternative hypothesis receives
relatively more support given the data.
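
How strongly the alternative model drives the result can be seen in a deliberately simple setting. The sketch below assumes an effect estimate that is normally distributed with a known standard error, a point null of zero, and a normal prior centered on zero for the alternative. This is not the default Cauchy-based Bayes factor used in the reanalyses discussed above, and the function name `bf10_normal` and all numbers are hypothetical.

```python
import math


def bf10_normal(mean_estimate: float, standard_error: float, prior_sd: float) -> float:
    """BF10 for H0: effect = 0 versus H1: effect ~ Normal(0, prior_sd**2).

    Assumes the estimate is normally distributed around the true effect with a
    known standard error, so both marginal likelihoods are normal densities.
    """
    var_h0 = standard_error ** 2
    var_h1 = standard_error ** 2 + prior_sd ** 2
    log_bf = (0.5 * math.log(var_h0 / var_h1)
              + 0.5 * mean_estimate ** 2 * (1.0 / var_h0 - 1.0 / var_h1))
    return math.exp(log_bf)


# The same hypothetical data: an estimate of 0.1 with a standard error of 0.05.
bf_wide = bf10_normal(0.1, 0.05, prior_sd=1.0)    # H1 expects large effects
bf_narrow = bf10_normal(0.1, 0.05, prior_sd=0.1)  # H1 expects small effects

print(round(bf_wide, 2))    # about 0.37: relative support for H0
print(round(bf_narrow, 2))  # about 2.2: relative support for H1
```

With identical data, the wide alternative ends up less supported than the null, while the narrow alternative ends up more supported than the null.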

Misunderstanding 3:
Not specifying the null and/or alternative model.

Given that Bayes
factors are relative evidence for or against one model compared to another
model, it might be surprising that many researchers fail to specify the alternative
model to begin with when reporting their analysis. And yet, in a systematic
review of how psychologists use Bayes factors, van de Schoot et al. (2017) found
that “31.1% of the articles did not even discuss the priors implemented”. Where
in a null hypothesis significance test researchers do not need to specify the
model that the test is based on, as the test is by definition a test against an
effect of 0, and the alternative model consists of any non-zero effect size (in
a two-sided test), this is not true when computing Bayes factors. The null
model when computing Bayes factors is often (but not necessarily) a point null
as in NHST, but the alternative model is only one of many possible alternative
hypotheses that a researcher could test against. It has become common to use
‘default’ priors, but as with any heuristic, defaults will most often give an
answer to a nonsensical question, and quickly become a form of mindless
statistics. When introducing Bayes factors as an alternative to frequentist
t-tests, Rouder et al. (2009) write:

This commitment to
specify judicious and reasoned alternatives places a burden on the analyst. We
have provided default settings appropriate to generic situations. Nonetheless,
these recommendations are just that and should not be used blindly. Moreover,
analysts can and should consider their goals and expectations when specifying
priors. Simply put, principled inference is a thoughtful process that cannot be
performed by rigid adherence to defaults.

The priors used when
computing a Bayes factor should therefore be both specified and justified.

Misunderstanding 4:
Claims based on Bayes Factors do not require error control.

In a paper with the
provocative title “Optional stopping: No problem for Bayesians” Rouder (2014)
argues that “Researchers using Bayesian methods may employ optional stopping in
their own research and may provide Bayesian analysis of secondary data regardless
of the employed stopping rule.” A reader who merely reads the title and
abstract might come to the conclusion that Bayes factors are a wonderful
solution to the error inflation due to optional stopping in the frequentist
framework, but this is not correct (de Heide & Grünwald, 2017).

There is a big caveat
about the type of statistical inference that is unaffected by optional
stopping. Optional stopping is no problem for Bayesians if they refrain from
making a dichotomous claim about the presence or absence of an effect, or when
they refrain from drawing conclusions about a prediction being supported or
falsified. Rouder notes how “Even with optional stopping, a researcher can
interpret the posterior odds as updated beliefs about hypotheses in light of
data.” In other words, even after optional stopping, a Bayes factor tells
researchers how much they should update their belief in a hypothesis.
Importantly, when researchers make dichotomous claims based on Bayes factors
(e.g., “The effect did not differ significantly between the conditions, BF10 =
0.17”) then this claim can be correct, or an error, and error rates become a
relevant consideration, unlike when researchers simply present the Bayes factor
for readers to update their personal beliefs.

Bayesians disagree
among themselves about whether Bayes factors should be the basis of dichotomous
claims, or not. Those who promote the use of Bayes factors to make claims often
refer to thresholds proposed by Jeffreys (1939), where a BF > 3 is
“substantial evidence”, and a BF > 10 is considered “strong evidence”. Some
journals, such as Nature Human Behaviour, have the following requirement for
researchers who submit a Registered Report: “For inference by Bayes factors,
authors must be able to guarantee data collection until the Bayes factor is at
least 10 times in favour of the experimental hypothesis over the null
hypothesis (or vice versa).” When researchers decide to collect data until a
specific threshold is crossed to make a claim about a test, their claim can be
correct, or wrong, just as when p-values are the statistical quantity a claim
is based on. As both the Bayes factor and the p-value can be computed based on
the sample size and the t-value (Francis, 2016; Rouder et al., 2009), there is
nothing special about using Bayes factors as the basis of an ordinal claim. The
exact long run error rates cannot be directly controlled when computing Bayes
factors, and the Type 1 and Type 2 error rates depend on the choice of the
prior and the choice of the cut-off used to decide to make a claim.
Simulation studies show that for commonly used priors and a BF > 3 cut-off
to make claims the Type 1 error rate is somewhat smaller, but the Type 2 error
rate is considerably larger, than for the corresponding frequentist tests
(Kelter, 2021).
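
That claims based on a Bayes factor threshold have error rates, and that these depend on the stopping rule, can be checked by simulation. The following is a minimal sketch under strong assumptions: normal data with a known standard deviation of 1, a Normal(0, prior_sd**2) prior on the mean under H1 (not the default Cauchy prior), and a researcher who looks after every observation from `n_min` onward and claims an effect as soon as BF10 exceeds the threshold. All function names and settings are hypothetical.

```python
import math
import random


def bf10_normal(mean_estimate, standard_error, prior_sd):
    """BF10 for H0: mean = 0 versus H1: mean ~ Normal(0, prior_sd**2), known sd."""
    var_h0 = standard_error ** 2
    var_h1 = standard_error ** 2 + prior_sd ** 2
    log_bf = (0.5 * math.log(var_h0 / var_h1)
              + 0.5 * mean_estimate ** 2 * (1.0 / var_h0 - 1.0 / var_h1))
    return math.exp(log_bf)


def simulate_type1_rates(n_sims=2000, n_min=10, n_max=100,
                         threshold=3.0, prior_sd=1.0, seed=1):
    """Return (rate with optional stopping, rate with one look at n_max) under H0."""
    rng = random.Random(seed)
    crossed_at_any_look = 0
    crossed_at_final_look = 0
    for _ in range(n_sims):
        total = 0.0
        ever_crossed = False
        bf = 1.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)  # H0 is true: the population mean is 0
            if n >= n_min:
                bf = bf10_normal(total / n, 1.0 / math.sqrt(n), prior_sd)
                ever_crossed = ever_crossed or bf > threshold
        crossed_at_any_look += ever_crossed
        crossed_at_final_look += bf > threshold  # bf now holds the value at n_max
    return crossed_at_any_look / n_sims, crossed_at_final_look / n_sims


rate_optional, rate_fixed = simulate_type1_rates()
print("Claims of an effect under H0, optional stopping:", rate_optional)
print("Claims of an effect under H0, single look at n_max:", rate_fixed)
```

Because the same simulated data are used for both rules, the rate of erroneous claims with optional stopping can never be lower than with a single look. Changing `prior_sd` or `threshold` changes both rates, which is the point made above: the error rates are a consequence of the prior and the cut-off rather than something the researcher sets directly.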

To conclude this
section, whenever researchers make claims, they can make erroneous claims, and
error control should be a worthy goal. Error control is not a consideration
when researchers do not make ordinal claims (e.g., X is larger than Y, there is
a non-zero correlation between X and Y, etc). If Bayes factors are used to
quantify how much researchers should update personal beliefs in a hypothesis,
there is no need to consider error control, but researchers should also refrain
from making any ordinal claims based on Bayes factors in the results section or
the discussion section. Giving up error control also means giving up claims
about the presence or absence of effects.

Misunderstanding 5:
Interpreting Bayes Factors as effect sizes.

Bayes factors are not
statements about the size of an effect. It is therefore not appropriate to
conclude that the effect size is small or large purely based on the Bayes
factor. Depending on the priors used when specifying the alternative and null
model, the same Bayes factor can be observed for very different effect size
estimates. The reverse is also true. The same effect size can correspond to
Bayes factors supporting the null or the alternative hypothesis, depending on
how the null model and the alternative model are specified. Researchers should
therefore always report and interpret effect size measures. Statements about the
size of effects should only be based on these effect size measures, and not on
Bayes factors.
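
A small numerical illustration shows why a Bayes factor cannot stand in for an effect size. The sketch assumes a normally distributed effect estimate with a known standard error, a point null of zero, and a Normal(0, prior_sd**2) prior under the alternative. The function name `bf10_normal` and all values are hypothetical.

```python
import math


def bf10_normal(mean_estimate, standard_error, prior_sd):
    """BF10 for H0: effect = 0 versus H1: effect ~ Normal(0, prior_sd**2)."""
    var_h0 = standard_error ** 2
    var_h1 = standard_error ** 2 + prior_sd ** 2
    log_bf = (0.5 * math.log(var_h0 / var_h1)
              + 0.5 * mean_estimate ** 2 * (1.0 / var_h0 - 1.0 / var_h1))
    return math.exp(log_bf)


# Very different effect size estimates (0.1 versus 0.4), same Bayes factor,
# because precision and prior width differ between the two analyses.
bf_small_effect = bf10_normal(0.1, standard_error=0.05, prior_sd=0.1)
bf_large_effect = bf10_normal(0.4, standard_error=0.20, prior_sd=0.4)
print(round(bf_small_effect, 3), round(bf_large_effect, 3))

# The same effect size estimate (0.4), Bayes factors on opposite sides of 1,
# because only the alternative model differs.
bf_moderate_prior = bf10_normal(0.4, standard_error=0.20, prior_sd=0.4)
bf_very_wide_prior = bf10_normal(0.4, standard_error=0.20, prior_sd=5.0)
print(round(bf_moderate_prior, 2), round(bf_very_wide_prior, 2))
```

In this model the Bayes factor depends only on the ratio of the estimate to its standard error and on the ratio of the prior width to the standard error, so it carries no information about the size of the effect on its own.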

Any tool for
statistical inference will be misused, and the greater the adoption, the more
people will use a tool without proper training. Simplistic sales pitches for
Bayes factors (e.g., Bayes factors tell you the probability that your
hypothesis is true, Bayes factors do not require error control, you can use
‘default’ Bayes factors and do not have to think about your priors) contribute
to this misuse. When reviewing papers that report Bayes factors, check if the
authors use Bayes factors to draw correct inferences.

Bem, D. J. (2011).
Feeling the future: Experimental evidence for anomalous retroactive influences
on cognition and affect. Journal of Personality and Social Psychology, 100(3),
407–425. https://doi.org/10.1037/a0021524

Bem, D. J., Utts, J.,
& Johnson, W. O. (2011). Must psychologists change the way they analyze
their data? Journal of Personality and Social Psychology, 101(4), 716–719.
https://doi.org/10.1037/a0024777

Cohen, J. (1994). The
earth is round (p < .05). American Psychologist, 49(12), 997–1003.
https://doi.org/10.1037/0003-066X.49.12.997

de Heide, R., &
Grünwald, P. D. (2017). Why optional stopping is a problem for Bayesians.
arXiv:1708.08278 [Math, Stat]. https://arxiv.org/abs/1708.08278

Francis, G. (2016).
Equivalent statistics and data interpretation. Behavior Research Methods, 1–15.
https://doi.org/10.3758/s13428-016-0812-3

Jeffreys, H. (1939).
Theory of probability (1st ed). Oxford University Press.

Kelter, R. (2021).
Analysis of type I and II error rates of Bayesian and frequentist parametric
and nonparametric two-sample hypothesis tests under preliminary assessment of
normality. Computational Statistics, 36(2), 1263–1288.
https://doi.org/10.1007/s00180-020-01034-7

Rouder, J. N. (2014).
Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review,
21(2), 301–308.

Rouder, J. N.,
Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t
tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin
& Review, 16(2), 225–237. https://doi.org/10.3758/PBR.16.2.225

van de Schoot, R.,
Winter, S. D., Ryan, O., Zondervan-Zwijnenburg, M., & Depaoli, S. (2017). A
systematic review of Bayesian articles in psychology: The last 25 years.
Psychological Methods, 22(2), 217–239. https://doi.org/10.1037/met0000100

Wagenmakers, E.-J.,
Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why
psychologists must change the way they analyze their data: The case of psi:
Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3),
426–432. https://doi.org/10.1037/a0022790

Wong, T. K., Kiers,
H., & Tendeiro, J. (2022). On the Potential Mismatch Between the Function
of the Bayes Factor and Researchers’ Expectations. Collabra: Psychology, 8(1),
36357. https://doi.org/10.1525/collabra.36357

P-values vs. Bayes Factors

At the first partially in-person scientific meeting I am attending after the COVID-19 pandemic, the Perspectives on Scientific Error conference at the Lorentz Center in Leiden, the organizers asked Eric-Jan Wagenmakers and me to engage in a discussion about p-values and Bayes factors. We each gave a 15-minute presentation to set up our arguments, centered around 3 questions: What is the goal of statistical inference? What is the advantage of your approach in a practical/applied context? And when do you think the other approach may be applicable?

What is the goal of statistical inference?

When browsing through the latest issue of Psychological Science, many of the titles of scientific articles make scientific claims. “Parents Fine-Tune Their Speech to Children’s Vocabulary Knowledge”, “Asymmetric Hedonic Contrast: Pain is More Contrast Dependent Than Pleasure”, “Beyond the Shape of Things: Infants Can Be Taught to Generalize Nouns by Objects’ Functions”, “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis”, or “Response Bias Reflects Individual Differences in Sensory Encoding”. These authors are telling you that if you take away one thing from the work they have been doing, it is a claim that some statistical relationship is present or absent. This approach to science, where researchers collect data to make scientific claims, is extremely common (we discuss this extensively in our preprint “The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests” by Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). It is not the only way to do science: there is purely descriptive work, or estimation, where researchers present data without making any claims beyond the observed data, so there is never a single goal in statistical inferences. But if you browse through scientific journals, you will see that a large percentage of published articles have the goal to make one or more scientific claims.

Claims can be correct or wrong. If scientists used a coin flip as their preferred methodological approach to make scientific claims, they would be right and wrong 50% of the time. This error rate is considered too high to make scientific claims useful, and therefore scientists have developed somewhat more advanced methodological approaches to make claims. One such approach, widely used across scientific fields, is Neyman-Pearson hypothesis testing. If you have performed a statistical power analysis when designing a study, and if you think it would be problematic to p-hack when analyzing the data from your study, you engaged in Neyman-Pearson hypothesis testing. The goal of Neyman-Pearson hypothesis testing is to control the maximum number of incorrect scientific claims the scientific community collectively makes. For example, when authors write “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis” we could expect a study design where people specified a smallest effect size of interest, and statistically reject the presence of any worthwhile effect of bilingual advantage in children on executive functioning based on language status in an equivalence test. They would make such a claim with a pre-specified maximum Type 1 error rate, or the alpha level, often set to 5%. Formally, authors are saying “We might be wrong, but we claim there is no meaningful effect here, and if all scientists collectively act as if we are correct about claims generated by this methodological procedure, we would be misled no more than alpha% of the time, which we deem acceptable, so let’s for the foreseeable future (until new data emerges that proves us wrong) assume our claim is correct”. 
Discussion sections are often less formal, and researchers often violate the code of conduct for research integrity by selectively publishing only those results that confirm their predictions, which messes up many of the statistical conclusions we draw in science.

The process of claim making described above does not depend on an individual’s personal beliefs, unlike some Bayesian approaches. As Taper and Lele (2011) write: “It is not that we believe that Bayes’ rule or Bayesian mathematics is flawed, but that from the axiomatic foundational definition of probability Bayesianism is doomed to answer questions irrelevant to science. We do not care what you believe, we barely care what we believe, what we are interested in is what you can show.” This view is strongly based on the idea that the goal of statistical inference is the accumulation of correct scientific claims through methodological procedures that lead to the same claims by all scientists who evaluate the tests of these claims. Incorporating individual priors into statistical inferences, and making claims dependent on their prior belief, does not provide science with a methodological procedure that generates collectively established scientific claims. Bayes factors provide a useful and coherent approach to update individual beliefs, but they are not a useful tool to establish collectively agreed upon scientific claims.

A methodological procedure built around a Neyman-Pearson perspective works well in a science where scientists want to make claims, but we want to prevent too many incorrect scientific claims. One attractive property of this methodological approach to make scientific claims is that the scientific community can collectively agree upon the severity with which a claim has been tested. If we design a study with 99.9% power for the smallest effect size of interest and use a 0.1% alpha level, everyone agrees the risk of an erroneous claim is low. If you personally do not like the claim, several options for criticism are possible. First, you can argue that no matter how small the error rate was, errors still occur with their appropriate frequency, no matter how surprised we would be if they occur to us (I am paraphrasing Fisher). Thus, you might want to run two or three replications, until the probability of an error has become too small for the scientific community to consider it sensible to perform additional replication studies based on a cost-benefit analysis. Because it is practically very difficult to reach agreement on cost-benefit analyses, the field often resorts to rules or regulations. Although we could debate whether it is sensible to allow people to drive 138 kilometers per hour on some stretches of road at some time of the day if they have a certain level of driving experience, such discussions are currently too complex to practically implement, and instead, thresholds of 50, 80, 100, and 130 are used (depending on location and time of day). Similarly, scientific organizations decide upon thresholds that certain subfields are expected to use (such as an alpha level of 0.0000003 in physics to declare a discovery, or the two-study rule of the FDA).

Subjective Bayesian approaches can be used in practice to make scientific claims. For example, one can preregister that a claim will be made when a BF is larger than 10 or smaller than 0.1. This is done in practice, for example in Registered Reports in Nature Human Behaviour. The problem is that this methodological procedure does not in itself control the rate of erroneous claims. Some researchers have published frequentist analyses of Bayesian methodological decision rules (Note: Leonhard Held brought up these Bayesian/Frequentist compromise methods as well. During coffee after our discussion, EJ and I agreed that we like those approaches, as they allow researchers to control frequentist errors, while interpreting the evidential value in the data. It is a win-win solution). This works by determining through simulations which test statistic should be used as a cut-off value to make claims. The process is often a bit laborious, but if you have the expertise and care about evidential interpretations of data, do it.
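
The idea of determining a cut-off through simulation can be sketched as follows. This is a minimal illustration, not one of the published compromise methods: it assumes a fixed sample size, normal data with a known standard deviation of 1, and a Normal(0, prior_sd**2) prior on the mean under H1. The function names, sample size, and seeds are hypothetical.

```python
import math
import random


def bf10_normal(mean_estimate, standard_error, prior_sd):
    """BF10 for H0: mean = 0 versus H1: mean ~ Normal(0, prior_sd**2), known sd."""
    var_h0 = standard_error ** 2
    var_h1 = standard_error ** 2 + prior_sd ** 2
    log_bf = (0.5 * math.log(var_h0 / var_h1)
              + 0.5 * mean_estimate ** 2 * (1.0 / var_h0 - 1.0 / var_h1))
    return math.exp(log_bf)


def simulate_null_bfs(n_sims, n, prior_sd, seed):
    """Bayes factors for n_sims studies of size n in which H0 is true."""
    rng = random.Random(seed)
    standard_error = 1.0 / math.sqrt(n)
    # Under H0 the sample mean is distributed Normal(0, standard_error**2).
    return [bf10_normal(rng.gauss(0.0, standard_error), standard_error, prior_sd)
            for _ in range(n_sims)]


def calibrate_cutoff(alpha=0.05, n=50, prior_sd=1.0, n_sims=4000, seed=1):
    """BF10 cut-off that is exceeded in roughly alpha of the null simulations."""
    null_bfs = sorted(simulate_null_bfs(n_sims, n, prior_sd, seed))
    index = min(int(round((1.0 - alpha) * n_sims)), n_sims - 1)
    return null_bfs[index]


cutoff = calibrate_cutoff()

# Check the calibrated cut-off on a fresh set of simulated null studies.
fresh_bfs = simulate_null_bfs(n_sims=4000, n=50, prior_sd=1.0, seed=2)
type1_rate = sum(bf > cutoff for bf in fresh_bfs) / len(fresh_bfs)
print("Calibrated BF10 cut-off:", cutoff)
print("Type 1 error rate on fresh null simulations:", type1_rate)
```

The calibrated cut-off is specific to the sample size, the prior, and the alpha level used in the simulation, so it has to be recomputed whenever any of these changes.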

In practice, an advantage of frequentist approaches is that criticism has to focus on data and the experimental design, which can be resolved in additional experiments. In subjective Bayesian approaches, researchers can ignore the data and the experimental design, and instead waste time criticizing priors. For example, in a comment on Bem (2011) Wagenmakers and colleagues concluded that “We reanalyze Bem’s data with a default Bayesian t test and show that the evidence for psi is weak to nonexistent.” In a response, Bem, Utts, and Johnson stated “We argue that they have incorrectly selected an unrealistic prior distribution for their analysis and that a Bayesian analysis using a more reasonable distribution yields strong evidence in favor of the psi hypothesis.” I strongly expect that most reasonable people would agree more strongly with the prior chosen by Bem and colleagues, than the prior chosen by Wagenmakers and colleagues (Note: In the discussion EJ agreed he in hindsight did not believe the prior in the main paper was the best choice, but noted the supplementary files included a sensitivity analysis that demonstrated the conclusions were robust across a range of priors, and that the analysis by Bem et al. combined Bayes factors in a flawed approach). More productively than discussing priors, data collected in direct replications since 2011 consistently lead to claims that there is no precognition effect. As Bem has not been able to successfully counter the claims based on data collected in these replication studies, we can currently collectively act as if Bem’s studies were all Type 1 errors (in part caused by extensive p-hacking).

When do you think the other approach may be applicable?

Even when, in the approach to science I have described here, Bayesian approaches based on individual beliefs are not useful to make collectively agreed upon scientific claims, all scientists are Bayesians. First, we have to rely on our beliefs when we cannot collect sufficient data to repeatedly test a prediction. When data is scarce, we can’t use a methodological procedure that makes claims with low error rates. Second, we can benefit from prior information when we know we cannot be wrong. Incorrect priors can mislead, but if we know our priors are correct, even though this might be rare, use them. Finally, use individual beliefs when you are not interested in convincing others, but only want to guide individual actions where being right or wrong does not impact others. For example, you can use your personal beliefs when you decide which study to run next.

Conclusion

In practice, analyses based on p-values and Bayes factors will often agree. Indeed, one of the points of discussion in the rest of the day was how we have bigger problems than the choice between statistical paradigms. A study with a flawed sample size justification or a bad measure is flawed, regardless of how we analyze the data. Yet, a good understanding of the value of the frequentist paradigm is important to be able to push back against problematic developments, such as researchers or journals who ignore the error rates of their claims, leading to scientific claims that are incorrect too often. Furthermore, a discussion of this topic helps us think about whether we actually want to pursue the goals that our statistical tools achieve, and whether we actually want to organize knowledge generation by making scientific claims that others have to accept or criticize (a point we develop further in Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). Yes, discussions about p-values and Bayes factors might in practice not have the biggest impact on improving our science, but it is still important and enjoyable to discuss these fundamental questions, and I’d like to thank EJ Wagenmakers and the audience for an extremely pleasant discussion.