P-values vs. Bayes Factors

At the first partially in-person scientific meeting I attended after the COVID-19 pandemic, the Perspectives on Scientific Error conference at the Lorentz Center in Leiden, the organizers asked Eric-Jan Wagenmakers and me to engage in a discussion about p-values and Bayes factors. We each gave a 15-minute presentation to set up our arguments, centered around three questions: What is the goal of statistical inference? What is the advantage of your approach in a practical/applied context? And when do you think the other approach may be applicable?

 

What is the goal of statistical inference?

 

When browsing through the latest issue of Psychological Science, many of the titles of scientific articles make scientific claims. “Parents Fine-Tune Their Speech to Children’s Vocabulary Knowledge”, “Asymmetric Hedonic Contrast: Pain is More Contrast Dependent Than Pleasure”, “Beyond the Shape of Things: Infants Can Be Taught to Generalize Nouns by Objects’ Functions”, “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis”, or “Response Bias Reflects Individual Differences in Sensory Encoding”. These authors are telling you that if you take away one thing from the work they have been doing, it is a claim that some statistical relationship is present or absent. This approach to science, where researchers collect data to make scientific claims, is extremely common (we discuss this extensively in our preprint “The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests” by Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). It is not the only way to do science – there is purely descriptive work, or estimation, where researchers present data without making any claims beyond the observed data, so there is never a single goal in statistical inference – but if you browse through scientific journals, you will see that a large percentage of published articles have the goal to make one or more scientific claims.

 

Claims can be correct or wrong. If scientists used a coin flip as their preferred methodological approach to make scientific claims, they would be right and wrong 50% of the time. This error rate is considered too high for scientific claims to be useful, and therefore scientists have developed somewhat more advanced methodological approaches to make claims. One such approach, widely used across scientific fields, is Neyman-Pearson hypothesis testing. If you have performed a statistical power analysis when designing a study, and if you think it would be problematic to p-hack when analyzing the data from your study, you have engaged in Neyman-Pearson hypothesis testing. The goal of Neyman-Pearson hypothesis testing is to control the maximum rate of incorrect scientific claims the scientific community collectively makes. For example, when authors write “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis” we could expect a study design where people specified a smallest effect size of interest, and statistically rejected the presence of any worthwhile effect of bilingual advantage in children’s executive functioning based on language status in an equivalence test. They would make such a claim with a pre-specified maximum Type 1 error rate, or alpha level, often set to 5%. Formally, authors are saying “We might be wrong, but we claim there is no meaningful effect here, and if all scientists collectively act as if we are correct about claims generated by this methodological procedure, we would be misled no more than alpha% of the time, which we deem acceptable, so let’s for the foreseeable future (until new data emerges that proves us wrong) assume our claim is correct”. Discussion sections are often less formal, and researchers often violate the code of conduct for research integrity by selectively publishing only those results that confirm their predictions, which messes up many of the statistical conclusions we draw in science.

 

The process of claim making described above does not depend on an individual’s personal beliefs, unlike some Bayesian approaches. As Taper and Lele (2011) write: “It is not that we believe that Bayes’ rule or Bayesian mathematics is flawed, but that from the axiomatic foundational definition of probability Bayesianism is doomed to answer questions irrelevant to science. We do not care what you believe, we barely care what we believe, what we are interested in is what you can show.” This view is strongly based on the idea that the goal of statistical inference is the accumulation of correct scientific claims through methodological procedures that lead to the same claims by all scientists who evaluate the tests of these claims. Incorporating individual priors into statistical inferences, and making claims dependent on their prior belief, does not provide science with a methodological procedure that generates collectively established scientific claims. Bayes factors provide a useful and coherent approach to update individual beliefs, but they are not a useful tool to establish collectively agreed upon scientific claims.

 

What is the advantage of your approach in a practical/applied context?

 

A methodological procedure built around a Neyman-Pearson perspective works well in a science where scientists want to make claims, but where we want to prevent too many incorrect scientific claims. One attractive property of this methodological approach to making scientific claims is that the scientific community can collectively agree upon the severity with which a claim has been tested. If we design a study with 99.9% power for the smallest effect size of interest and use a 0.1% alpha level, everyone agrees the risk of an erroneous claim is low. If you personally do not like the claim, several options for criticism are possible. First, you can argue that no matter how small the error rate was, errors still occur with their appropriate frequency, no matter how surprised we would be if they occur to us (I am paraphrasing Fisher). Thus, you might want to run two or three replications, until the probability of an error has become too small for the scientific community to consider it sensible to perform additional replication studies, based on a cost-benefit analysis. Because it is practically very difficult to reach agreement on cost-benefit analyses, the field often resorts to rules or regulations. Just as we can debate whether it is sensible to allow people to drive 138 kilometers per hour on some stretches of road, at some times of the day, if they have a certain level of driving experience, such discussions are currently too complex to practically implement, and instead speed limits of 50, 80, 100, and 130 kilometers per hour are used (depending on location and time of day). Similarly, scientific organizations decide upon thresholds that certain subfields are expected to use (such as an alpha level of 0.0000003 in physics to declare a discovery, or the two-study rule of the FDA).

 

Subjective Bayesian approaches can be used in practice to make scientific claims. For example, one can preregister that a claim will be made when a BF > 10 or a BF < 0.1. This is done in practice, for example in Registered Reports in Nature Human Behaviour. The problem is that this methodological procedure does not in itself control the rate of erroneous claims. Some researchers have published frequentist analyses of Bayesian methodological decision rules (Note: Leonard Held brought up these Bayesian/frequentist compromise methods as well – during coffee after our discussion, EJ and I agreed that we like those approaches, as they allow researchers to control frequentist errors while interpreting the evidential value in the data – it is a win-win solution). This works by determining through simulations which test statistic should be used as a cut-off value to make claims. The process is often a bit laborious, but if you have the expertise and care about evidential interpretations of data, do it.
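To give an idea of what such a frequentist calibration of a Bayesian decision rule involves, here is a minimal sketch. It assumes the BayesFactor R package with its default Cauchy prior; the design (n = 50 per group, a true effect size of 0) and the BF > 10 threshold are choices I made purely for illustration – an actual calibration would be tailored to your own design and decision rule.

```r
# Sketch: estimate the long-run rate of erroneous 'there is an effect' claims
# made with a BF > 10 decision rule when the null hypothesis is true.
library(BayesFactor)

set.seed(42)
nsim <- 1000
bf <- numeric(nsim)
for (i in seq_len(nsim)) {
  x <- rnorm(50)  # control group, true effect size 0
  y <- rnorm(50)  # treatment group, true effect size 0
  bf[i] <- extractBF(ttestBF(x = x, y = y))$bf
}

# Frequentist Type 1 error rate of this Bayesian decision rule
mean(bf > 10)
```

By repeating such simulations for different sample sizes or cut-off values, you can select a Bayes factor threshold (or a sample size) that keeps the rate of erroneous claims below your desired alpha level.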

 

In practice, an advantage of frequentist approaches is that criticism has to focus on the data and the experimental design, which can be resolved in additional experiments. In subjective Bayesian approaches, researchers can ignore the data and the experimental design, and instead waste time criticizing priors. For example, in a comment on Bem (2011) Wagenmakers and colleagues concluded that “We reanalyze Bem’s data with a default Bayesian t test and show that the evidence for psi is weak to nonexistent.” In a response, Bem, Utts, and Johnson stated “We argue that they have incorrectly selected an unrealistic prior distribution for their analysis and that a Bayesian analysis using a more reasonable distribution yields strong evidence in favor of the psi hypothesis.” I strongly expect that most reasonable people would agree more with the prior chosen by Bem and colleagues than with the prior chosen by Wagenmakers and colleagues (Note: In the discussion EJ agreed that, in hindsight, the prior in the main paper was not the best choice, but noted the supplementary files included a sensitivity analysis that demonstrated the conclusions were robust across a range of priors, and that the analysis by Bem et al. combined Bayes factors in a flawed way). More productive than discussing priors is the observation that data collected in direct replications since 2011 consistently lead to claims that there is no precognition effect. As Bem has not been able to successfully counter the claims based on the data collected in these replication studies, we can currently collectively act as if Bem’s studies were all Type 1 errors (in part caused by extensive p-hacking).

 

When do you think the other approach may be applicable?

 

Even though, in the approach to science I have described here, Bayesian approaches based on individual beliefs are not useful to make collectively agreed upon scientific claims, there are situations where all scientists are Bayesians. First, we have to rely on our beliefs when we cannot collect sufficient data to repeatedly test a prediction. When data is scarce, we can’t use a methodological procedure that makes claims with low error rates. Second, we can benefit from prior information when we know we cannot be wrong. Incorrect priors can mislead, but if we know our priors are correct, even though this might be rare, use them. Finally, use individual beliefs when you are not interested in convincing others, but only want to guide individual actions where being right or wrong does not impact others. For example, you can use your personal beliefs when you decide which study to run next.

 

Conclusion

 

In practice, analyses based on p-values and Bayes factors will often agree. Indeed, one of the points of discussion during the rest of the day was that we have bigger problems than the choice between statistical paradigms. A study with a flawed sample size justification or a bad measure is flawed, regardless of how we analyze the data. Yet, a good understanding of the value of the frequentist paradigm is important to be able to push back against problematic developments, such as researchers or journals who ignore the error rates of their claims, leading to a rate of incorrect scientific claims that is too high. Furthermore, a discussion of this topic helps us think about whether we actually want to pursue the goals that our statistical tools achieve, and whether we actually want to organize knowledge generation by making scientific claims that others have to accept or criticize (a point we develop further in Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). Yes, discussions about p-values and Bayes factors might in practice not have the biggest impact on improving our science, but it is still important and enjoyable to discuss these fundamental questions, and I’d like to thank EJ Wagenmakers and the audience for an extremely pleasant discussion.

What’s a family in family-wise error control?

When you perform multiple comparisons in a study, you need to adjust your alpha level to control the error rate across these comparisons. It is generally recommended to control the family-wise error rate, but there is some confusion about what a ‘family’ is. As Bretz, Hothorn, & Westfall (2011) write in their excellent book “Multiple Comparisons Using R” on page 15: “The appropriate choice of null hypotheses being of primary interest is a controversial question. That is, it is not always clear which set of hypotheses should constitute the family H1,…,Hm. This topic has often been in dispute and there is no general consensus.” In one of the best papers on controlling for multiple comparisons out there, Bender & Lange (2001) write: “Unfortunately, there is no simple and unique answer to when it is appropriate to control which error rate. Different persons may have different but nevertheless reasonable opinions. In addition to the problem of deciding which error rate should be under control, it has to be defined first which tests of a study belong to one experiment.” The Wikipedia page on family-wise error rate is a mess.

I will be honest: I have never understood this confusion about what a family of tests is when controlling the family-wise error rate. At least not in a Neyman-Pearson approach to hypothesis testing, where the goal is to use data to make decisions about how to act. Neyman (1957) calls his approach inductive behavior. The outcome of an experiment leads one to take different possible actions, which can be either practical (e.g., implement a new procedure, abandon a research line) or scientific (e.g., claim there is or is no effect). From an error-statistical approach (Mayo, 2018), inflated Type 1 error rates mean that it has become very likely that you will be able to claim support for your hypothesis, even when the hypothesis is wrong. This reduces the severity of the test. To prevent this, we need to control our error rate at the level of our claim.
One reason the issue of family-wise error rates might remain vague, is that researchers are often vague about their claims. We do not specify our hypotheses unambiguously, and therefore this issue remains unclear. To be honest, I suspect another reason there is a continuing debate about whether and how to lower the alpha level to control for multiple comparisons in some disciplines is that 1) there are a surprisingly large number of papers written on this topic that argue you do not need to control for multiple comparisons, which are 2) cited a huge number of times giving rise to the feeling that surely they must have a point. Regrettably, the main reason these papers are written is because there are people who don’t think a Neyman-Pearson approach to hypothesis testing is a good idea, and the main reason these papers are cited is because doing so is convenient for researchers who want to publish statistically significant results, as they can justify why they are not lowering their alpha level, making that p = 0.02 in one of three tests really ‘significant’. All papers that argue against the need to control for multiple comparisons when testing hypotheses are wrong.  Yes, their existence and massive citation counts frustrate me. It is fine not to test a hypothesis, but when you do, and you make a claim based on a test, you need to control your error rates. 

But let’s get back to our first problem, which we can solve by making the claims people need to control Type 1 error rates for less vague. Lisa DeBruine and I recently proposed machine readable hypothesis tests to remove any ambiguity in the tests we will perform to examine statistical predictions, and when we will consider a claim corroborated or falsified. In this post, I am going to use our R package ‘scienceverse’ to clarify what constitutes a family of tests when controlling the family-wise error rate.

An example of formalizing family-wise error control

Let’s assume we collect data from 100 participants in a control and treatment condition. We collect 3 dependent variables (dv1, dv2, and dv3). In the population there is no difference between groups on any of these three variables (the true effect size is 0). We will analyze the three dv’s in independent t-tests. This requires specifying our alpha level, and thus deciding whether we need to correct for multiple comparisons. How we control error rates depends on the claim we want to make.
We might want to act as if (or claim that) our treatment works if there is a difference between the treatment and control conditions on any of the three variables. In scienceverse terms, this means we consider the prediction corroborated when the p-value of the first t-test is smaller than the alpha level, the p-value of the second t-test is smaller than the alpha level, or the p-value of the third t-test is smaller than the alpha level. In the scienceverse code, we specify a criterion for each test (a p-value smaller than the alpha level, p.value < alpha_level) and conclude the hypothesis is corroborated if any of these criteria is met (“p_t_1 | p_t_2 | p_t_3”).
We could also want to make three different predictions. Instead of one hypothesis (“something will happen”) we have three different hypotheses, and predict there will be an effect on dv1, dv2, and dv3. The criterion for each t-test is the same, but we now have three hypotheses to evaluate (H1, H2, and H3). Each of these claims can be corroborated, or not.
Scienceverse allows you to specify your hypothesis tests unambiguously (for code used in this blog, see the bottom of the post). It also allows you to simulate a dataset, which we can use to examine Type 1 errors by simulating data where no true effects exist. Finally, scienceverse allows you to run the pre-specified analyses on the (simulated) data, and will automatically create a report that summarizes which hypotheses were corroborated (which is useful when checking whether the conclusions in a manuscript indeed follow from the preregistered analyses, or not). The output for a single simulated dataset, for the scenario where we will interpret any effect on the three dv’s as support for the hypothesis, looks like this:

Evaluation of Statistical Hypotheses

12 March, 2020

Simulating Null Effects Postregistration

Results

Hypothesis 1: H1

Something will happen

  • p_t_1 is confirmed if analysis ttest_1 yields p.value<0.05

    The result was p.value = 0.452 (FALSE)

  • p_t_2 is confirmed if analysis ttest_2 yields p.value<0.05

    The result was p.value = 0.21 (FALSE)

  • p_t_3 is confirmed if analysis ttest_3 yields p.value<0.05

    The result was p.value = 0.02 (TRUE)

Corroboration ( TRUE )

The hypothesis is corroborated if anything is significant.

 p_t_1 | p_t_2 | p_t_3 

Falsification ( FALSE )

The hypothesis is falsified if nothing is significant.

 !p_t_1 & !p_t_2 & !p_t_3 

All criteria were met for corroboration.

We see the hypothesis that ‘something will happen’ is corroborated, because there was a significant difference on dv3 – even though this was a Type 1 error, since we simulated data with a true effect size of 0 – and any difference was taken as support for the prediction. With a 5% alpha level, we will observe 1-(1-0.05)^3 = 14.26% Type 1 errors in the long run. This Type 1 error inflation can be prevented by lowering the alpha level, for example by a Bonferroni correction (0.05/3), after which the expected Type 1 error rate is 4.92% (see Bretz et al., 2011, for more advanced techniques to control error rates). When we examine the report for the second scenario, where each dv tests a unique hypothesis, we get the following output from scienceverse:

Evaluation of Statistical Hypotheses

12 March, 2020

Simulating Null Effects Postregistration

Results

Hypothesis 1: H1

dv1 will show an effect

  • p_t_1 is confirmed if analysis ttest_1 yields p.value<0.05

    The result was p.value = 0.452 (FALSE)

Corroboration ( FALSE )

The hypothesis is corroborated if dv1 is significant.

 p_t_1 

Falsification ( TRUE )

The hypothesis is falsified if dv1 is not significant.

 !p_t_1 

All criteria were met for falsification.

Hypothesis 2: H2

dv2 will show an effect

  • p_t_2 is confirmed if analysis ttest_2 yields p.value<0.05

    The result was p.value = 0.21 (FALSE)

Corroboration ( FALSE )

The hypothesis is corroborated if dv2 is significant.

 p_t_2 

Falsification ( TRUE )

The hypothesis is falsified if dv2 is not significant.

 !p_t_2 

All criteria were met for falsification.

Hypothesis 3: H3

dv3 will show an effect

  • p_t_3 is confirmed if analysis ttest_3 yields p.value<0.05

    The result was p.value = 0.02 (TRUE)

Corroboration ( TRUE )

The hypothesis is corroborated if dv3 is significant.

 p_t_3 

Falsification ( FALSE )

The hypothesis is falsified if dv3 is not significant.

 !p_t_3 

All criteria were met for corroboration.

We now see that two hypotheses were falsified (yes, yes, I know you should not use p > 0.05 to falsify a prediction in real life, and this part of the example is formally wrong so I don’t also have to explain equivalence testing to readers not familiar with it – if that is you, read this, and know scienceverse will allow you to specify an equivalence test as the criterion to falsify a prediction, see the example here). The third hypothesis is corroborated, even though, as above, this is a Type 1 error.

It might seem that the second approach, specifying each dv as its own hypothesis, is the way to go if you do not want to lower the alpha level to control for multiple comparisons. But take a look at the report of the study you have performed. You have made 3 predictions, of which 1 was corroborated. That is not an impressive success rate. Sure, mixed results happen, and you should interpret results not just based on the p-value (but on the strength of the experimental design, assumptions about power, your prior, the strength of the theory, etc.), but if these predictions were derived from the same theory, this set of results is not particularly impressive. Since researchers can never selectively report only those results that ‘work’, because this would be a violation of the code of research integrity, we should always be able to see the meager track record of predictions. If you don’t feel ready to make a specific prediction (and run the risk of sullying your track record), either do unplanned exploratory tests, and do not make claims based on their results, or preregister all possible tests you can think of, and massively lower your alpha level to control error rates (for example, genome-wide association studies sometimes use an alpha level of 5 × 10⁻⁸ to control the Type 1 error rate).

Hopefully, specifying our hypotheses (and what would corroborate them) transparently by using scienceverse makes it clear what happens in the long run in both scenarios. In the long run, both the first scenario, if we use an alpha level of 0.05/3 instead of 0.05, and the second scenario, with an alpha level of 0.05 for each individual hypothesis, will lead to the same end result: not more than 5% of our claims will be wrong, if the null hypothesis is true. In the first scenario, we are making one claim in an experiment, and in the second we make three. In the second scenario we will end up with more false claims in an absolute sense, but the relative number of false claims is the same in both scenarios. And that’s exactly the goal of family-wise error control.
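The scienceverse code for these simulations is at the bottom of the original post; as a rough stand-alone check of the numbers above, here is my own base R sketch (the group size of 50 is arbitrary and does not affect the error rates). It simulates many studies with three t-tests on null data and counts erroneous claims under both scenarios.

```r
set.seed(123)
nsim <- 10000
# p-values for three independent t-tests per study, all on null data (d = 0)
p <- matrix(replicate(nsim * 3, t.test(rnorm(50), rnorm(50))$p.value), ncol = 3)

# Scenario 1: one claim per study ('something will happen')
mean(rowSums(p < 0.05) > 0)     # ~0.14 without correction, matching 1 - (1 - 0.05)^3
false_1 <- sum(rowSums(p < 0.05 / 3) > 0)
false_1 / nsim                  # ~0.05 of the nsim claims are wrong with Bonferroni

# Scenario 2: three separate claims per study, each tested at alpha = 0.05
false_2 <- sum(p < 0.05)
false_2 / (nsim * 3)            # ~0.05 of the 3 * nsim claims are wrong

# More false claims in an absolute sense in scenario 2, same relative rate
c(false_1, false_2)
```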
References
Bender, R., & Lange, S. (2001). Adjusting for multiple testing—When and how? Journal of Clinical Epidemiology, 54(4), 343–349.
Bretz, F., Hothorn, T., & Westfall, P. H. (2011). Multiple comparisons using R. CRC Press.
Mayo, D. G. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge University Press.
Neyman, J. (1957). “Inductive Behavior” as a Basic Concept of Philosophy of Science. Revue de l’Institut International de Statistique / Review of the International Statistical Institute, 25(1/3), 7. https://doi.org/10.2307/1401671

Thanks to Lisa DeBruine for feedback on an earlier draft of this blog post.


Improving Education about P-values

A recent paper in AMPPS points out that many textbooks for introduction to psychology courses incorrectly explain p-values. There are dozens, if not hundreds, of papers that point out problems in how people understand p-values. If we don’t do anything about it, there will be dozens of articles like this in the next decades as well. So let’s do something about it.
When I made my first MOOC three years ago I spent some time thinking about how to explain what a p-value is clearly (you can see my video here). Some years later I realized that if you want to prevent misunderstandings of p-values, you should also explicitly train people about what p-values are not. Now, I think that training away misconceptions is just as important as explaining the correct interpretation of a p-value. Based on a blog post I made a new assignment for my MOOC. In the last year Arianne Herrera-Bennett (@ariannechb) performed an A/B test in my MOOC ‘Improving Your Statistical Inferences’. Half of the learners received this new assignment, explicitly aimed at training away misconceptions. The results are in her PhD thesis that she will defend on the 27th of September, 2019, but one of the main conclusions in the study is that it is possible to substantially reduce common misconceptions about p-values by educating people about them. This is a hopeful message.
I tried to keep the assignment as short as possible, and it is still 20 pages. Let that sink in for a moment. How much space does education about p-values take up in your study material? How much space would you need to prevent misunderstandings? And how often would you need to repeat the same material across the years? If we honestly believe misunderstandings of p-values are a problem, then why don’t we educate people well enough to prevent them? The fact that people do not understand p-values is not their mistake – it is ours.
In my own MOOC I needed 7 pages to explain what p-value distributions look like, how they are a function of power, why p-values are uniformly distributed when the null is true, and what Lindley’s paradox is. But when I tried to clearly explain common misconceptions, I needed a lot more words. Before you want to blame that poor p-value, let me tell you that I strongly believe the problem of misconceptions is not limited to p-values: Probability is just not intuitive. It might always take more time to explain ways you can misunderstand something, than to teach the correct way to understand something.
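As a tiny illustration of that first point, the sketch below simulates p-values under the null and under the alternative (the choices of n = 50 per group and d = 0.5 are mine, purely for illustration) and shows the uniform versus right-skewed distributions.

```r
set.seed(1)
p_null <- replicate(10000, t.test(rnorm(50), rnorm(50))$p.value)              # d = 0
p_alt  <- replicate(10000, t.test(rnorm(50), rnorm(50, mean = 0.5))$p.value)  # d = 0.5

hist(p_null, breaks = 20)  # roughly uniform between 0 and 1 when the null is true
hist(p_alt, breaks = 20)   # piles up near 0; how strongly depends on the power
```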
In a recent pre-print I wrote on p-values, I reflect on the bad job we have been doing at teaching others about p-values. I write:
If anyone seriously believes the misunderstanding of p-values lies at the heart of reproducibility issues in science, why are we not investing more effort to make sure misunderstandings of p-values are resolved before young scholars perform their first research project? Although I am sympathetic to statisticians who think all the information researchers need to educate themselves on this topic is already available, as an experimental psychologist who works at a Human-Technology Interaction department this reminds me too much of the engineer who argues all the information to understand the copy machine is available in the user manual. In essence, the problems we have with how p-values are used is a human factors problem (Tryon, 2001). The challenge is to get researchers to improve the way they work.
Looking at the deluge of papers published in the last half century that point out how researchers have consistently misunderstood p-values, I am left to wonder: Where is the innovative coordinated effort to create world class educational materials that can freely be used in statistical training to prevent such misunderstandings? It is nowadays relatively straightforward to create online apps where people can simulate studies and see the behavior of p-values across studies, which can easily be combined with exercises that fit the knowledge level of bachelor and master students. The second point I want to make in this article is that a dedicated attempt to develop evidence based educational material in a cross-disciplinary team of statisticians, educational scientists, cognitive psychologists, and designers seems worth the effort if we really believe young scholars should understand p-values. I do not think that the effort statisticians have made to complain about p-values is matched with a similar effort to improve the way researchers use p-values and hypothesis tests. We really have not tried hard enough.
So how about we get serious about solving this problem? Let’s get together and make a dent in this decades-old problem. Let’s try hard enough.
A good place to start might be to take stock of good ways to educate people about p-values that already exist, and then all together see how we can improve them.
I have uploaded my lecture about p-values to YouTube, and my assignment to train away misconceptions is available online as a Google Doc (the answers and feedback is here).
This is just my current approach to teaching p-values. I am sure there are many other approaches (and it might turn out that watching several videos, each explaining p-values in slightly different ways, is an even better way to educate people than having only one video). If anyone wants to improve this material (or replace it by better material) I am willing to open up my online MOOC for anyone who wants to do an A/B test of any good idea, so you can collect data from hundreds of students each year. I’m more than happy to collect best practices in p-value education – if you have anything you think (or have empirically shown) works well, send it my way – and make it openly available. Educators, pedagogists, statisticians, cognitive psychologists, software engineers, and designers interested in improving educational materials should find a place to come together. I know there are organizations that exist to improve statistics education (but have no good information about what they do, or which one would be best to join given my goals), and if you work for such an organization and are interested in taking p-value education to the next level, I’m more than happy to spread this message in my network and work with you.
If we really consider the misinterpretation of p-values to be one of the more serious problems underlying the lack of replicability of scientific findings, we need to seriously reflect on whether we have done enough to prevent misunderstandings. Treating it as a human factors problem might illuminate ways in which statistics education and statistical software can be improved. Let’s beat swords into ploughshares, and turn papers complaining about how people misunderstand p-values into papers that examine how we can improve education about p-values.

Justify Your Alpha by Decreasing Alpha Levels as a Function of the Sample Size

A preprint (“Justify Your Alpha: A Primer on Two Practical Approaches”) that extends and improves the ideas in this blog post is available at: https://psyarxiv.com/ts4r6  
 
Testing whether observed data should surprise us, under the assumption that some model of the data is true, is a widely used procedure in psychological science. Tests against a null model, or against the smallest effect size of interest for an equivalence test, can guide your decisions to continue or abandon research lines. Seeing whether a p-value is smaller than an alpha level is rarely the only thing you want to do, but especially early on in experimental research lines where you can randomly assign participants to conditions, it can be a useful thing.

Regrettably, this procedure is often performed rather mindlessly. To do Neyman-Pearson hypothesis testing well, you should carefully think about the error rates you find acceptable. How often do you want to miss the smallest effect size you care about, if it is really there? And how often do you want to say there is an effect, but actually be wrong? It is important to justify your error rates when designing an experiment. In this post I will provide one justification for setting the alpha level (something we have recommended as making more sense than defaulting to a single fixed alpha level).

Papers explaining how to justify your alpha level are very rare (for an example, see Mudge, Baker, Edge, & Houlahan, 2012). Here I want to discuss one of the least known, but easiest suggestions on how to justify alpha levels in the literature, proposed by Good. The idea is simple, and has been supported by many statisticians in the last 80 years: Lower the alpha level as a function of your sample size.

The idea behind this recommendation is most extensively discussed in a book by Leamer (1978, p. 92). He writes:

The rule of thumb quite popular now, that is, setting the significance level arbitrarily to .05, is shown to be deficient in the sense that from every reasonable viewpoint the significance level should be a decreasing function of sample size.

Leamer (you can download his book for free) correctly notes that this behavior, an alpha level that is a decreasing function of the sample size, makes sense from both a Bayesian and a Neyman-Pearson perspective. Let me explain.

Imagine a researcher who performs a study that has 99.9% power to detect the smallest effect size the researcher is interested in, based on a test with an alpha level of 0.05. Such a study also has 99.8% power when using an alpha level of 0.03. Feel free to follow along here, by setting the sample size to 204, the effect size to 0.5, alpha or p-value (upper limit) to 0.05, and the p-value (lower limit) to 0.03.

We see that if the alternative hypothesis is true only 0.1% of the observed studies will, in the long run, observe a p-value between 0.03 and 0.05. When the null-hypothesis is true 2% of the studies will, in the long run, observe a p-value between 0.03 and 0.05. Note how this makes p-values between 0.03 and 0.05 more likely when there is no true effect, than when there is an effect. This is known as Lindley’s paradox (and I explain this in more detail in Assignment 1 in my MOOC, which you can also do here).
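If you want to check these numbers yourself without the app, a quick sketch in base R (assuming an independent-samples t-test with 204 participants per group, as in the setup above) is:

```r
# Power at two alpha levels for d = 0.5, n = 204 per group
power_05 <- power.t.test(n = 204, delta = 0.5, sd = 1, sig.level = 0.05)$power
power_03 <- power.t.test(n = 204, delta = 0.5, sd = 1, sig.level = 0.03)$power

power_05 - power_03  # ~0.001: probability of 0.03 < p < 0.05 when the alternative is true
0.05 - 0.03          # 0.02: probability of 0.03 < p < 0.05 when the null is true (p is uniform)
```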

Although you can argue that you are still making a Type 1 error at most 5% of the time in the above situation, I think it makes sense to acknowledge there is something weird about having a Type 1 error of 5% when you have a Type 2 error of 0.1% (again, see Mudge, Baker, Edge, & Houlahan, 2012, who suggest balancing error rates). To me, it makes sense to design a study where error rates are more balanced, and a significant effect is declared for p-values more likely to occur when the alternative model is true than when the null model is true.
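As I understand the proposal by Mudge and colleagues, it formalizes this by choosing the alpha level that minimizes the combined error rates. A rough sketch of that idea for this design (my own code, using an unweighted average of the Type 1 and Type 2 error rate; their paper also allows weighting the errors and the prior probabilities of the hypotheses):

```r
# Find the alpha level that minimizes the average of alpha and beta
# for n = 204 per group and a smallest effect size of interest of d = 0.5.
combined_error <- function(alpha, n = 204, d = 0.5) {
  beta <- 1 - power.t.test(n = n, delta = d, sd = 1, sig.level = alpha)$power
  (alpha + beta) / 2
}
optimize(combined_error, interval = c(1e-6, 0.2))$minimum  # an alpha well below 0.05
```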

Because power increases as the sample size increases, and because Lindley’s paradox (Lindley, 1957, see also Cousins, 2017) can be prevented by lowering the alpha level sufficiently, the idea to lower the significance level as a function of the sample is very reasonable. But how?

Zellner (1971) discusses how the critical value for a frequentist hypothesis test approaches a limit as the sample size increases (i.e., a critical value of 1.96 for p = 0.05 in a two-sided test) whereas the critical value for a Bayes factor increases as the sample size increases (see also Rouder, Speckman, Sun, Morey, & Iverson, 2009). This difference lies at the heart of Lindley’s paradox, and under certain assumptions comes down to a factor of √n. As Zellner (1971, footnote 19, page 304) writes (K01 is the formula for the Bayes factor):

If a sampling theorist were to adjust his significance level upward as n grows larger, which seems reasonable, zα would grow with n and tend to counteract somewhat the influence of the √n factor in the expression for K01.

Jeffreys (1939) discusses Neyman and Pearson’s work and writes:

We should therefore get the best result, with any distribution of α, by some form that makes the ratio of the critical value to the standard error increase with n. It appears then that whatever the distribution may be, the use of a fixed P limit cannot be the one that will make the smallest number of mistakes.

He discusses the issue more in Appendix B, where he compared his own test (Bayes factors) against Neyman-Pearson decision procedures, and he notes that:

In spite of the difference in principle between my tests and those based on the P integrals, and the omission of the latter to give the increase of the critical values for large n, dictated essentially by the fact that in testing a small departure found from a large number of observations we are selecting a value out of a long range and should allow for selection, it appears that there is not much difference in the practical recommendations. Users of these tests speak of the 5 per cent. point in much the same way as I should speak of the K = 10 point, and of the 1 per cent. point as I should speak of the K = 10⁻¹ point; and for moderate numbers of observations the points are not very different. At large numbers of observations there is a difference, since the tests based on the integral would sometimes assert significance at departures that would actually give K > 1. Thus there may be opposite decisions in such cases. But they will be very rare.

So even though, according to Jeffreys, extremely different conclusions from Bayes factors and frequentist tests will be rare, the difference becomes noticeable as the sample size grows.

This brings us to Good’s (1982) easy solution. His paper is basically just a single page (I’d love something akin to a Comments, Conjectures, and Conclusions format in Meta-Psychology! – note that Good himself was the editor of this section, which started with ‘Please be succinct but lucid and interesting’, and it reads just like a blog post).

He also explains the rationale in Good (1992):

‘…we have empirical evidence that sensible P values are related to weights of evidence and, therefore, that P values are not entirely without merit. The real objection to P values is not that they usually are utter nonsense, but rather that they can be highly misleading, especially if the value of N is not also taken into account and is large.’

Based on the observation by Jeffreys (1939) that, under specific circumstances, the Bayes factor against the null-hypothesis is approximately inversely proportional to √N, Good (1982) suggests a standardized p-value to bring p-values in closer relationship with weights of evidence:

pstan = p × √(N/100)

This formula standardizes the p-value to the evidence against the null hypothesis that would be found if the pstan-value were the tail area probability observed in a sample of 100 participants (I think the formula is only intended for between designs – I would appreciate anyone weighing in in the comments on whether it can be extended to within designs). When the sample size is 100, the p-value and pstan are identical. But for larger sample sizes pstan is larger than p. For example, a p = .05 observed in a sample size of 500 would have a pstan of 0.11, which is not enough to reject the null-hypothesis in favor of the alternative. Good (1988) demonstrates great insight when he writes: ‘I guess that standardized p-values will not become standard before the year 2000.’


Good doesn’t give a lot of examples of how standardized p-values should be used in practice, but I guess it makes things easier to think about a standardized alpha level (even though the logic is the same, just as you can either double the p-value or halve the alpha level when you are correcting for two comparisons with a Bonferroni correction). So instead of an alpha level of 0.05, we can think of a standardized alpha level:

αstan = α × √(100/N)

Again, with 100 participants α and αstan are the same, but as the sample size increases above 100, the alpha level becomes smaller. For example, an α = .05 for a sample size of 500 would correspond to an αstan of 0.02236.

So one way to justify your alpha level is by using a decreasing alpha level as the sample size increases. I for one have always thought it was rather nonsensical to use an alpha level of 0.05 in all meta-analyses (especially when testing a meta-analytic effect size based on thousands of participants against zero), or in large collaborative research projects such as Many Labs, where analyses are performed on very large samples. If you have thousands of participants, you have extremely high power for most effect sizes original studies could have detected in a significance test. With such a low Type 2 error rate, why keep the Type 1 error rate fixed at 5%, which is so much larger than the Type 2 error rate in these analyses? It just doesn’t make any sense to me. Alpha levels in meta-analyses or large-scale data analyses should be lowered as a function of the sample size. In case you are wondering: an alpha level of .005 would be used when the sample size is 10,000.
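As a quick numerical check of the values mentioned above, based on the formulas as reconstructed here (a short sketch; the function names are my own):

```r
p_stan     <- function(p, n) p * sqrt(n / 100)      # standardized p-value (Good, 1982)
alpha_stan <- function(alpha, n) alpha * sqrt(100 / n)  # standardized alpha level

p_stan(0.05, 500)        # ~0.11: p = .05 in a sample of 500 is no longer convincing
alpha_stan(0.05, 500)    # 0.02236
alpha_stan(0.05, 10000)  # 0.005
```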

When designing a study based on a specific smallest effect size of interest, where you desire decent power (e.g., 90%), we run into a small challenge, because in the power analysis we now have two unknowns: the sample size (which is a function of the power, effect size, and alpha), and the standardized alpha level (which is a function of the sample size). Luckily, this is nothing that some R-fu can’t solve with some iterative power calculations. [R code to calculate the standardized alpha level, and perform an iterative power analysis, is at the bottom of the post]
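Since the R code from the bottom of the original post is not reproduced here, the following is my own sketch of such an iterative power analysis. It assumes a two-sided independent-samples t-test, and that N in Good's formula refers to the total sample size across both groups; both are assumptions on my part.

```r
# Iteratively find the per-group sample size and the standardized alpha level
# for a smallest effect size of interest d and a desired power.
iterative_alpha <- function(d, power = 0.9, alpha = 0.05, max_iter = 100) {
  a <- alpha
  for (i in seq_len(max_iter)) {
    # sample size per group needed for the desired power at the current alpha
    n_per_group <- ceiling(power.t.test(delta = d, sd = 1, sig.level = a,
                                        power = power)$n)
    # standardized alpha for the total sample size (assumed: both groups combined)
    a_new <- alpha * sqrt(100 / (2 * n_per_group))
    if (abs(a_new - a) < 1e-8) break
    a <- a_new
  }
  list(n_per_group = n_per_group, alpha_stan = a)
}

iterative_alpha(d = 0.5, power = 0.9)
```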

When we wrote Justify Your Alpha (I recommend downloading the original draft before peer review, because it has more words and more interesting references), one of the criticisms I heard most often was that we gave no solutions for how to justify your alpha. I hope this post makes it clear that statisticians have argued that the alpha level should not be a fixed value ever since it was invented. There are already some solutions available in the literature. I like Good’s approach because it is simple. In my experience, people like simple solutions. It might not be a full-fledged decision theoretical cost-benefit analysis, but it beats using a fixed alpha level. I recently used it in a submission for a Registered Report. At the same time, I think it has never been used in practice, so I look forward to any comments, conjectures, and conclusions you might have.

References

Good, I. J. (1982). C140. Standardized tail-area probabilities. Journal of Statistical Computation and Simulation, 16(1), 65–66. https://doi.org/10.1080/00949658208810607
Good, I. J. (1988). The interface between statistics and philosophy of science. Statistical Science, 3(4), 386–397.
Good, I. J. (1992). The Bayes/Non-Bayes Compromise: A Brief Review. Journal of the American Statistical Association, 87(419), 597. https://doi.org/10.2307/2290192
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x
Leamer, E. E. (1978). Specification Searches: Ad Hoc Inference with Nonexperimental Data (1 edition). New York usw.: Wiley.
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. https://doi.org/10.3758/PBR.16.2.225
Zellner, A. (1971). An introduction to Bayesian inference in econometrics. New York: Wiley.



 

Equivalence Testing and the Second Generation P-Value

Recently Blume, D’Agostino McGowan, Dupont, & Greevy (2018) published an article titled “Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses”. As it happens, I would greatly appreciate more rigor, reproducibility, and transparency in statistical analyses, so my interest was piqued. On Twitter I saw a slide promising an updated version of the p-value that can support null-hypotheses, takes practical significance into account, has a straightforward interpretation, and ideally never needs adjustments for multiple comparisons. It sounded like someone had found the goose that lays the golden eggs.

Upon reading the manuscript, I noticed the statistic is surprisingly similar to equivalence testing, which I’ve written about recently and created an R package for (Lakens, 2017). The second generation p-value (SGPV) relies on specifying an equivalence range of values around the null-hypothesis that are practically equivalent to zero (e.g., 0 ± 0.3). If the estimation interval falls completely within the equivalence range, the SGPV is 1. If the confidence interval lies completely outside of the equivalence range, the SGPV is 0. Otherwise the SGPV is a value between 0 and 1 that expresses the overlap of the confidence interval with the equivalence range, divided by the total width of the confidence interval.
Testing whether the confidence interval falls completely within the equivalence bounds is equivalent to the two one-sided tests (TOST) procedure, where the data is tested against the lower equivalence bound in the first one-sided test, and against the upper equivalence bound in the second one-sided test. If both tests allow you to reject an effect as extreme or more extreme than the equivalence bound, you can reject the presence of an effect large enough to be meaningful, and conclude the observed effect is practically equivalent to zero. You can also simply check if a 90% confidence interval falls completely within the equivalence bounds. Note that testing whether the 95% confidence interval falls completely outside of the equivalence range is known as a minimum-effect test (Murphy, Myors, & Wolach, 2014).
So together with my collaborator Marie Delacre, I compared the two approaches, to truly understand how second generation p-values accomplish what they are advertised to do, and what they could contribute to our statistical toolbox.
To examine the relation between the TOST p-value and the SGPV we can calculate both statistics across a range of observed effect sizes. In Figure 1 p-values are plotted for the TOST procedure and the SGPV. The statistics are calculated for hypothetical one-sample t-tests for all means that can be observed in studies ranging from 140 to 150 (on the x-axis). The equivalence range is set to 145 ± 2 (i.e., an equivalence range from 143 to 147), the observed standard deviation is assumed to be 2, and the sample size is 100. The SGPV treats the equivalence range as the null-hypothesis, while the TOST procedure treats the values outside of the equivalence range as the null-hypothesis. For ease of comparison we can reverse the SGPV (by calculating 1-SGPV), which is used in the plot below.
 
 
Figure 1: Comparison of p-values from TOST (black line) and 1-SGPV (dotted grey line) across a range of observed sample means (x-axis) tested against a mean of 145 in a one-sample t-test with a sample size of 30 and a standard deviation of 2.
It is clear that the SGPV and the p-value from TOST are very closely related. The situation in Figure 1 is not an exception – in our pre-print we describe how the SGPV and the p-value from the TOST procedure are always directly related when confidence intervals are symmetrical. You can play around with this Shiny app to confirm this for yourself: http://shiny.ieis.tue.nl/TOST_vs_SGPV/.
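For readers who prefer code over the app, here is a rough sketch of my own (not the code from the pre-print) that computes both statistics for a one-sample scenario like the one above. It assumes the SGPV is calculated on a 95% confidence interval, and applies the small sample correction as I understand it from Blume et al. (2018): when the interval is more than twice as wide as the equivalence range, the overlap is divided by twice the range rather than by the interval width.

```r
# TOST p-value and SGPV for a one-sample test with equivalence range 145 +/- 2,
# sd = 2, and n = 100 (the values used in the text above).
tost_vs_sgpv <- function(m, mu = 145, sd = 2, n = 100, eq = 2, alpha = 0.05) {
  se <- sd / sqrt(n)
  df <- n - 1
  lo <- mu - eq
  hi <- mu + eq

  # TOST: two one-sided tests; the larger p-value is the TOST p-value
  p_lower <- pt((m - lo) / se, df, lower.tail = FALSE)  # test against the lower bound
  p_upper <- pt((m - hi) / se, df, lower.tail = TRUE)   # test against the upper bound
  p_tost  <- max(p_lower, p_upper)

  # SGPV: proportion of the 95% CI that lies inside the equivalence range,
  # with the correction factor when the CI is more than twice as wide as the range
  ci <- m + c(-1, 1) * qt(1 - alpha / 2, df) * se
  overlap  <- max(0, min(ci[2], hi) - max(ci[1], lo))
  ci_width <- ci[2] - ci[1]
  sgpv <- overlap / ci_width * max(ci_width / (2 * (hi - lo)), 1)

  # (the figures in this post plot 1 - SGPV for easier comparison with the TOST p-value)
  c(p_tost = p_tost, sgpv = sgpv)
}

tost_vs_sgpv(m = 146.8)  # an observed mean close to the upper equivalence bound
```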
There are 3 situations where the p-value from the TOST procedure and the SGPV are not directly related. The SGPV is 1 when the confidence interval falls completely within the equivalence bounds. P-values from the TOST procedure continue to differentiate and will for example distinguish between a p = 0.048 and p = 0.002. The same happens when the SGPV is 0 (and p-values fall between 0.975 and 1).
The third situation where the TOST and the SGPV differ is when the ‘small sample correction’ is at play in the SGPV. This “correction” kicks in whenever the confidence interval is wider than the equivalence range. However, it is not a correction in the typical sense of the word, since the SGPV is not adjusted to any ‘correct’ value. When the normal calculation would be ‘misleading’ (i.e., the SGPV would be small, which normally would suggest support for the alternative hypothesis, when all values in the equivalence range are also supported), the SGPV is set to 0.5, which according to Blume and colleagues signals that the SGPV is ‘uninformative’. In all three situations the p-value from equivalence tests distinguishes between scenarios where the SGPV yields the same result.
We can examine this situation by calculating the SGPV and performing the TOST for a situation where sample sizes are small and the equivalence range is narrow, such that the CI is more than twice as large as the equivalence range.
 
Figure 2: Comparison of p-values from TOST (black line) and SGPV (dotted grey line) across a range of observed sample means (x-axis). Because the sample size is small (n = 10) and the CI is more than twice as wide as the equivalence range (set to -0.4 to 0.4), the SGPV is set to 0.5 (horizontal light grey line) across a range of observed means.
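With the sketch shown earlier, this situation can be reproduced with something like the following call (the standard deviation of 2 is my own assumption to make the CI wide enough; the exact values behind the figure may differ):

```r
# n = 10, equivalence range -0.4 to 0.4, testing against mu = 0
tost_vs_sgpv(m = 0.2, mu = 0, sd = 2, n = 10, eq = 0.4)
# The CI is more than twice as wide as the equivalence range, so the SGPV is 0.5,
# while the TOST p-value still varies with the observed mean.
```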

The main novelty of the SGPV is that it is meant to be used as a descriptive statistic. However, we show that the SGPV is difficult to interpret when confidence intervals are asymmetric, and when the ‘small sample correction’ is operating. For an extreme example, see Figure 3, where the SGPVs are plotted for a correlation (where confidence intervals are asymmetric).

Figure 3: Comparison of p-values from TOST (black line) and 1-SGPV (dotted grey curve) across a range of observed sample correlations (x-axis) tested against equivalence bounds of r = 0.4 and r = 0.8 with n = 10 and an alpha of 0.05.
Even under ideal circumstances, the SGPV is mainly meaningful when it is either 1, 0, or inconclusive (see all examples in Blume et al., 2018). But to categorize your results into one of these three outcomes you don’t need to calculate anything – you can just look at whether the confidence interval falls inside, outside, or overlaps with the equivalence bound, and thus the SGPV loses its value as a descriptive statistic. 
When discussing the lack of a need for error correction, Blume and colleagues compare the SGPV to null-hypothesis tests. However, the more meaningful comparison is with the TOST procedure, and given the direct relationship, not correcting for multiple comparisons will inflate the probability of concluding the absence of a meaningful effect in exactly the same way as when calculating p-values for an equivalence test. Equivalence tests provide an easier and more formal way to control both the Type 1 error rate (by setting the alpha level) and the Type 2 error rate (by performing an a-priori power analysis, see Lakens, Scheel, & Isager, 2018).
Conclusion
There are strong similarities between p-values from the TOST procedure and the SGPV, and in all situations where the statistics yield different results, the behavior of the p-value from the TOST procedure is more consistent and easier to interpret. More details can be found in our pre-print (where you can also leave comments or suggestions for improvement using hypothes.is). Our comparisons show that when proposing alternatives to null-hypothesis tests, it is important to compare new proposals to already existing procedures. We believe equivalence tests achieve the goals of the second generation p-value while allowing users to more easily control error rates, and while yielding more consistent statistical outcomes.


References
Blume, J. D., D’Agostino McGowan, L., Dupont, W. D., & Greevy, R. A. (2018). Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses. PLOS ONE, 13(3), e0188299. https://doi.org/10.1371/journal.pone.0188299
Lakens, D. (2017). Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science, 2515245918770963. https://doi.org/10.1177/2515245918770963.
Murphy, K. R., Myors, B., & Wolach, A. H. (2014). Statistical power analysis: a simple and general model for traditional and modern hypothesis tests (Fourth edition). New York: Routledge, Taylor & Francis Group.