As more people have started to use Bayes factors, we should not be surprised
that misconceptions about Bayes factors have become common. A recent study shows
that the percentage of scientific articles that draw incorrect inferences based
on observed Bayes factors is distressingly high (Wong et al., 2022), with 92% of
articles demonstrating at least one misconception about Bayes factors. Here I
will review some of the most common misconceptions, and how to prevent them.

**Misunderstanding 1:
Confusing Bayes Factors with Posterior Odds.**

One common criticism by Bayesians of null hypothesis significance testing (NHST)
is that NHST quantifies the probability of the data (or more extreme data),
given that the null hypothesis is true, but that scientists should be interested
in the probability that the hypothesis is true, given the data. Cohen (1994)
wrote:

*What’s wrong with
NHST? Well, among many other things, it does not tell us what we want to know,
and we so much want to know what we want to know that, out of desperation, we
nevertheless believe that it does! What we want to know is “Given these data,
what is the probability that H0 is true?”*

One might therefore believe that Bayes factors tell us something about the
probability that a hypothesis is true, but this is incorrect. A Bayes factor
quantifies how much we should update our belief in one hypothesis relative to
another. If a hypothesis was extremely unlikely to begin with (e.g., the
hypothesis that people have telepathy), it might still be very unlikely, even
after computing a large Bayes factor in a single study demonstrating telepathy.
If we believed the hypothesis that people have telepathy was unlikely to be true
(e.g., we thought it was 99.9% certain telepathy was not true), evidence for
telepathy might only increase our belief in telepathy to the extent that we now
believe it is 98% unlikely. The Bayes factor only equals our posterior odds if
we were perfectly uncertain about the hypothesis being true or not. If both
hypotheses were equally likely, and a Bayes factor indicates we should update
our belief in such a way that the alternative hypothesis is three times more
likely than the null hypothesis, only then would we end up believing the
alternative hypothesis is exactly three times more likely than the null
hypothesis. One should therefore not conclude that, for example, given a BF of
10, the alternative hypothesis is more likely to be true than the null
hypothesis. The correct claim is that people should update their relative belief
in the alternative hypothesis (the odds of H1 over H0) by a factor of 10.
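The difference between a Bayes factor and posterior odds can be made concrete with a few lines of arithmetic. The sketch below is my own illustration, not taken from any of the cited papers: the function name `posterior_probability` and the prior probabilities are assumptions chosen to mirror the telepathy example, and it relies only on the identity that posterior odds equal the Bayes factor multiplied by the prior odds.

```python
def posterior_probability(prior_prob_h1: float, bf10: float) -> float:
    """Posterior probability of H1 after updating the prior odds by BF10."""
    prior_odds = prior_prob_h1 / (1.0 - prior_prob_h1)
    posterior_odds = bf10 * prior_odds  # posterior odds = BF10 x prior odds
    return posterior_odds / (1.0 + posterior_odds)


# Equal prior odds: BF10 = 3 makes H1 exactly three times as likely as H0.
print(posterior_probability(0.5, 3.0))     # 0.75

# Sceptical prior (0.1% for H1): even BF10 = 10 leaves H1 very unlikely.
print(posterior_probability(0.001, 10.0))  # about 0.0099
```

Under these assumptions, moving from a prior probability of 0.1% to a posterior probability of 2%, as in the telepathy example, would take a Bayes factor of roughly 20.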

**Misunderstanding 2:
Failing to interpret Bayes Factors as relative evidence.**

One benefit of Bayes factors that is often mentioned by Bayesians is that,
unlike NHST, Bayes factors can provide support for the null hypothesis, and
thereby falsify predictions. It is true that NHST can only reject the null
hypothesis, although it is important to add that in frequentist statistics
equivalence tests can be used to reject the alternative hypothesis, and
therefore there is no need to switch to Bayes factors to meaningfully interpret
the results of non-significant null hypothesis tests.

Bayes factors quantify support for one hypothesis relative to another
hypothesis. As with likelihood ratios, it is possible that one hypothesis is
supported more than another hypothesis, while both hypotheses are actually
false. It is incorrect to interpret Bayes factors in an absolute manner, for
example by stating that a Bayes factor (BF10) of 0.09 provides support for the
null hypothesis. The correct interpretation is that the Bayes factor provides
relative support for H0 compared to H1. With a different alternative model, the
Bayes factor would change. As with a significant equivalence test, even a Bayes
factor strongly supporting H0 does not mean there is no effect at all; there
could be a true, but small, effect.

For example, after Daryl Bem (2011) published 9 studies demonstrating support
for pre-cognition (conscious cognitive awareness of a future event that could
not otherwise be known), a team of Bayesian statisticians re-analyzed the
studies, and concluded “Out of the 10 critical tests, only one yields
‘substantial’ evidence for H1, whereas three yield ‘substantial’ evidence in
favor of H0. The results of the remaining six tests provide evidence that is
only ‘anecdotal’” (Wagenmakers et al., 2011). In a reply, Bem et al. (2011)
argue that the set of studies provides convincing evidence for the alternative
hypothesis if the Bayes factors are computed as relative evidence between the
null hypothesis and a more realistically specified alternative hypothesis, where
the effects of pre-cognition are expected to be small. This back and forth
illustrates how Bayes factors are relative evidence, and a change in the
alternative model specification changes whether the null or the alternative
hypothesis receives relatively more support given the data.
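To see how strongly the alternative model drives the result, consider a deliberately simplified sketch. It is not the analysis used in either of the papers above: I assume a normally distributed effect estimate with a known standard error, a point null, and an alternative that places a Normal(0, `prior_sd`) prior on the effect. The function names and the numbers (an estimate of 0.2 with a standard error of 0.1) are hypothetical.

```python
import math


def normal_pdf(x: float, sd: float) -> float:
    """Density of a zero-mean normal distribution with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def bf10_normal_prior(observed_effect: float, se: float, prior_sd: float) -> float:
    """BF10 for H0: effect = 0 against H1: effect ~ Normal(0, prior_sd)."""
    # Under H1 the estimate varies because of sampling error and the prior.
    marginal_h1 = normal_pdf(observed_effect, math.sqrt(se**2 + prior_sd**2))
    likelihood_h0 = normal_pdf(observed_effect, se)
    return marginal_h1 / likelihood_h0


observed_effect, se = 0.2, 0.1  # hypothetical estimate and standard error
for prior_sd in (0.2, 1.0, 5.0):
    bf = bf10_normal_prior(observed_effect, se, prior_sd)
    print(f"prior sd = {prior_sd}: BF10 = {bf:.2f}")
```

With these made-up numbers, the same data yield a BF10 of roughly 2.2 when the alternative expects small effects (prior sd = 0.2), and a BF10 of roughly 0.15 when the alternative allows very large effects (prior sd = 5). The data did not change; only the model they are compared against did.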

**Misunderstanding 3:
Not specifying the null and/or alternative model.**

Given that Bayes factors are relative evidence for or against one model compared
to another model, it might be surprising that many researchers fail to specify
the alternative model to begin with when reporting their analysis. And yet, in a
systematic review of how psychologists use Bayes factors, van de Schoot et al.
(2017) found that “31.1% of the articles did not even discuss the priors
implemented”. In a null hypothesis significance test, researchers do not need to
specify the model that the test is based on, as the test is by definition a test
against an effect of 0, and the alternative model consists of any non-zero
effect size (in a two-sided test). This is not true when computing Bayes
factors. The null model when computing Bayes factors is often (but not
necessarily) a point null as in NHST, but the alternative model is only one of
many possible alternative hypotheses that a researcher could test against. It
has become common to use ‘default’ priors, but as with any heuristic, defaults
will most often give an answer to a nonsensical question, and quickly become a
form of mindless statistics. When introducing Bayes factors as an alternative to
frequentist t-tests, Rouder et al. (2009) write:

*This commitment to specify judicious and reasoned alternatives places a burden
on the analyst. We have provided default settings appropriate to generic
situations. Nonetheless, these recommendations are just that and should not be
used blindly. Moreover, analysts can and should consider their goals and
expectations when specifying priors. Simply put, principled inference is a
thoughtful process that cannot be performed by rigid adherence to defaults.*

The priors used when computing a Bayes factor should therefore be both specified
and justified.

**Misunderstanding 4:
Claims based on Bayes Factors do not require error control.**

In a paper with the provocative title “Optional stopping: No problem for
Bayesians”, Rouder (2014) argues that “Researchers using Bayesian methods may
employ optional stopping in their own research and may provide Bayesian analysis
of secondary data regardless of the employed stopping rule.” A reader who merely
reads the title and abstract might come to the conclusion that Bayes factors are
a wonderful solution to the error inflation due to optional stopping in the
frequentist framework, but this is not correct (de Heide & Grünwald, 2017).

There is a big caveat about the type of statistical inference that is unaffected
by optional stopping. Optional stopping is no problem for Bayesians if they
refrain from making a dichotomous claim about the presence or absence of an
effect, or when they refrain from drawing conclusions about a prediction being
supported or falsified. Rouder notes how “Even with optional stopping, a
researcher can interpret the posterior odds as updated beliefs about hypotheses
in light of data.” In other words, even after optional stopping, a Bayes factor
tells researchers how much they should update their belief in a hypothesis.
Importantly, when researchers make dichotomous claims based on Bayes factors
(e.g., “The effect did not differ significantly between the conditions, BF10 =
0.17”) then this claim can be correct, or an error, and error rates become a
relevant consideration, unlike when researchers simply present the Bayes factor
for readers to update their personal beliefs.

Bayesians disagree among themselves about whether Bayes factors should be the
basis of dichotomous claims, or not. Those who promote the use of Bayes factors
to make claims often refer to thresholds proposed by Jeffreys (1939), where a
BF > 3 is “substantial evidence”, and a BF > 10 is considered “strong evidence”.
Some journals, such as Nature Human Behaviour, have the following requirement
for researchers who submit a Registered Report: “For inference by Bayes factors,
authors must be able to guarantee data collection until the Bayes factor is at
least 10 times in favour of the experimental hypothesis over the null hypothesis
(or vice versa).” When researchers decide to collect data until a specific
threshold is crossed to make a claim about a test, their claim can be correct,
or wrong, just as when p-values are the statistical quantity a claim is based
on. As both the Bayes factor and the p-value can be computed based on the sample
size and the t-value (Francis, 2016; Rouder et al., 2009), there is nothing
special about using Bayes factors as the basis of an ordinal claim. The exact
long run error rates cannot be directly controlled when computing Bayes factors,
and the Type 1 and Type 2 error rates depend on the choice of the prior and the
choice of the cut-off used to decide to make a claim. Simulation studies show
that for commonly used priors and a BF > 3 cut-off to make claims, the Type 1
error rate is somewhat smaller, but the Type 2 error rate is considerably
larger, than for the corresponding frequentist tests (Kelter, 2021).
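That claims based on a Bayes factor threshold have error rates can be checked with a small simulation. The sketch below is only an illustration under strong assumptions of my own, and it does not reproduce the simulations in Kelter (2021): data are normal with a known standard deviation of 1, the alternative is a Normal(0, 1) prior on the mean rather than a default prior from any software package, and the names `bf10_normal_prior` and `false_positive_rates` as well as the design values (looks from n = 10 to n = 200, a threshold of 3) are hypothetical.

```python
import math
import random


def bf10_normal_prior(sample_mean: float, n: int, prior_sd: float = 1.0) -> float:
    """BF10 for H0: mean = 0 vs H1: mean ~ Normal(0, prior_sd); data sd is 1."""
    v0 = 1.0 / n           # variance of the sample mean under H0
    v1 = v0 + prior_sd**2  # marginal variance of the sample mean under H1
    return math.sqrt(v0 / v1) * math.exp(0.5 * sample_mean**2 * (1.0 / v0 - 1.0 / v1))


def false_positive_rates(n_sims=2000, n_min=10, n_max=200, threshold=3.0, seed=1):
    """Simulate under H0; return (fixed_n_rate, optional_stopping_rate)."""
    rng = random.Random(seed)
    fixed_hits = sequential_hits = 0
    for _ in range(n_sims):
        total, crossed = 0.0, False
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)  # H0 is true: the population mean is 0
            if n >= n_min and not crossed:
                crossed = bf10_normal_prior(total / n, n) >= threshold
        # Fixed design: one test at n_max. Optional stopping: claim at any crossing.
        fixed_hits += bf10_normal_prior(total / n_max, n_max) >= threshold
        sequential_hits += crossed
    return fixed_hits / n_sims, sequential_hits / n_sims


fixed_rate, sequential_rate = false_positive_rates()
print(f"Type 1 error rate, single test at n_max: {fixed_rate:.3f}")
print(f"Type 1 error rate, stop when BF10 >= 3:  {sequential_rate:.3f}")
```

Because both rates are computed on the same simulated datasets, the rate under optional stopping can never be lower than the rate for a single test at the maximum sample size. Changing `prior_sd` or `threshold` changes both rates, which is the sense in which error rates depend on the prior and the cut-off.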

To conclude this section, whenever researchers make claims, they can make
erroneous claims, and error control should be a worthy goal. Error control is
not a consideration when researchers do not make ordinal claims (e.g., X is
larger than Y, there is a non-zero correlation between X and Y, etc.). If Bayes
factors are used to quantify how much researchers should update personal beliefs
in a hypothesis, there is no need to consider error control, but researchers
should also refrain from making any ordinal claims based on Bayes factors in the
results section or the discussion section. Giving up error control also means
giving up claims about the presence or absence of effects.

**Misunderstanding 5:
Interpreting Bayes Factors as effect sizes.**

Bayes factors are not statements about the size of an effect. It is therefore
not appropriate to conclude that the effect size is small or large purely based
on the Bayes factor. Depending on the priors used when specifying the
alternative and null model, the same Bayes factor can be observed for very
different effect size estimates. The reverse is also true. The same effect size
can correspond to Bayes factors supporting the null or the alternative
hypothesis, depending on how the null model and the alternative model are
specified. Researchers should therefore always report and interpret effect size
measures. Statements about the size of effects should only be based on these
effect size measures, and not on Bayes factors.

Any tool for statistical inferences will be misused, and the greater the
adoption, the more people will use a tool without proper training. Simplistic
sales pitches for Bayes factors (e.g., Bayes factors tell you the probability
that your hypothesis is true, Bayes factors do not require error control, you
can use ‘default’ Bayes factors and do not have to think about your priors)
contribute to this misuse. When reviewing papers that report Bayes factors,
check if the authors use Bayes factors to draw correct inferences.

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous
retroactive influences on cognition and affect. Journal of Personality and
Social Psychology, 100(3), 407–425. https://doi.org/10.1037/a0021524

Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way
they analyze their data? Journal of Personality and Social Psychology, 101(4),
716–719. https://doi.org/10.1037/a0024777

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12),
997–1003. https://doi.org/10.1037/0003-066X.49.12.997

de Heide, R., & Grünwald, P. D. (2017). Why optional stopping is a problem for
Bayesians. arXiv:1708.08278 [Math, Stat]. https://arxiv.org/abs/1708.08278

Francis, G. (2016). Equivalent statistics and data interpretation. Behavior
Research Methods, 1–15. https://doi.org/10.3758/s13428-016-0812-3

Jeffreys, H. (1939). Theory of probability (1st ed). Oxford University Press.

Kelter, R. (2021). Analysis of type I and II error rates of Bayesian and
frequentist parametric and nonparametric two-sample hypothesis tests under
preliminary assessment of normality. Computational Statistics, 36(2), 1263–1288.
https://doi.org/10.1007/s00180-020-01034-7

Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psychonomic
Bulletin & Review, 21(2), 301–308.

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009).
Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic
Bulletin & Review, 16(2), 225–237. https://doi.org/10.3758/PBR.16.2.225

van de Schoot, R., Winter, S. D., Ryan, O., Zondervan-Zwijnenburg, M., &
Depaoli, S. (2017). A systematic review of Bayesian articles in psychology: The
last 25 years. Psychological Methods, 22(2), 217–239.
https://doi.org/10.1037/met0000100

Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011).
Why psychologists must change the way they analyze their data: The case of psi:
Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3),
426–432. https://doi.org/10.1037/a0022790

Wong, T. K., Kiers, H., & Tendeiro, J. (2022). On the Potential Mismatch Between
the Function of the Bayes Factor and Researchers’ Expectations. Collabra:
Psychology, 8(1), 36357. https://doi.org/10.1525/collabra.36357