When you perform multiple comparisons in a study, you need to control your alpha level for multiple comparisons. It is generally recommended to control the family-wise error rate, but there is some confusion about what a ‘family’ is. As Bretz, Hothorn, & Westfall (2011) write in their excellent book *Multiple Comparisons Using R* on page 15: “The appropriate choice of null hypotheses being of primary interest is a controversial question. That is, it is not always clear which set of hypotheses should constitute the family *H*₁, …, *H*ₘ. This topic has often been in dispute and there is no general consensus.” In one of the best papers on controlling for multiple comparisons out there, Bender & Lange (2001) write: “Unfortunately, there is no simple and unique answer to when it is appropriate to control which error rate. Different persons may have different but nevertheless reasonable opinions. In addition to the problem of deciding which error rate should be under control, it has to be defined first which tests of a study belong to one experiment.” The Wikipedia page on family-wise error rate is a mess.

Neyman (1957) argued that the outcome of statistical tests should guide our *inductive behavior*. The outcome of an experiment leads one to take different possible actions, which can be either practical (e.g., implement a new procedure, abandon a research line) or scientific (e.g., claim there is or is no effect). From an error-statistical approach (Mayo, 2018), inflated Type 1 error rates mean that it has become very likely that you will be able to claim support for your hypothesis, even when the hypothesis is wrong. This reduces the severity of the test. To prevent this, we need to control our error rate *at the level of our claim*.

When a claim rests on any one of several tests being significant, the error rate for that claim inflates: is a *p* = 0.02 in one of three tests really ‘significant’?

**All papers that argue against the need to control for multiple comparisons when testing hypotheses are wrong.** Yes, their existence and massive citation counts frustrate me. It is fine not to test a hypothesis, but when you do, and you make a claim based on a test, you need to control your error rates.

But let’s get back to our first problem, which we can solve by making the claims people need to control Type 1 error rates for less vague. Lisa DeBruine and I recently proposed machine-readable hypothesis tests to remove any ambiguity about the tests we will perform to examine statistical predictions, and about when we will consider a claim corroborated or falsified. In this post, I will use our R package ‘scienceverse’ to clarify what constitutes a family of tests when controlling the family-wise error rate.

**An example of formalizing family-wise error control**

Imagine we collect data on three dependent variables (dv1, dv2, and dv3) and analyze each with a *t*-test. This requires specifying our alpha level, and thus deciding whether we need to correct for multiple comparisons. How we control error rates depends on the claim we want to make.

In the first scenario, we make a single claim: ‘something will happen’. This claim is corroborated when the *p*-value of the first *t*-test is smaller than the alpha level, the *p*-value of the second *t*-test is smaller than the alpha level, or the *p*-value of the third *t*-test is smaller than the alpha level. In the scienceverse code, we specify a criterion for each test (a *p*-value smaller than the alpha level, *p.value < alpha_level*) and conclude the hypothesis is corroborated if any of these criteria is met (*“p_t_1 | p_t_2 | p_t_3”*).
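The evaluation logic for this first scenario can be sketched in a few lines of base R. This is only an illustration of the disjunction, not the scienceverse implementation; the *p*-values are the ones from the simulated report below.

```r
# p-values of the three t-tests from the simulated null study
p_values <- c(p_t_1 = 0.452, p_t_2 = 0.21, p_t_3 = 0.02)
alpha_level <- 0.05

# One criterion per test: p.value < alpha_level
criteria <- p_values < alpha_level

# Scenario 1: corroborated if any criterion is met ("p_t_1 | p_t_2 | p_t_3"),
# falsified if none is met ("!p_t_1 & !p_t_2 & !p_t_3")
corroborated <- any(criteria)
falsified    <- all(!criteria)

corroborated  # TRUE: the third test is significant
falsified     # FALSE
```

Because the claim is a disjunction over three tests, a single significant result anywhere in the family corroborates it.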

In the second scenario, each individual *t*-test is the same, but we now have three separate hypotheses to evaluate (H1, H2, and H3), one per dependent variable. Each of these claims can be corroborated, or not. For the first scenario, scienceverse generates the following report:

# Evaluation of Statistical Hypotheses

#### 12 March, 2020

### Simulating Null Effects Postregistration

## Results

### Hypothesis 1: H1

Something will happen

`p_t_1`

is confirmed if analysis ttest_1 yields `p.value<0.05`

The result was p.value = 0.452 (FALSE)

`p_t_2`

is confirmed if analysis ttest_2 yields `p.value<0.05`

The result was p.value = 0.21 (FALSE)

`p_t_3`

is confirmed if analysis ttest_3 yields `p.value<0.05`

The result was p.value = 0.02 (TRUE)

#### Corroboration ( TRUE )

The hypothesis is corroborated if anything is significant.

` p_t_1 | p_t_2 | p_t_3 `

#### Falsification ( FALSE )

The hypothesis is falsified if nothing is significant.

` !p_t_1 & !p_t_2 & !p_t_3 `

**All criteria were met for corroboration.**

We see the hypothesis that ‘something will happen’ is corroborated, because there was a significant difference on dv3 – even though this was a Type 1 error, since we simulated data with a true effect size of 0 – and any difference was taken as support for the prediction. With a 5% alpha level, we will observe 1-(1-0.05)^3 = 14.26% Type 1 errors in the long run. This Type 1 error inflation can be prevented by lowering the alpha level, for example by a Bonferroni correction (0.05/3), after which the expected Type 1 error rate is 4.92% (see Bretz et al., 2011, for more advanced techniques to control error rates). When we examine the report for the second scenario, where each dv tests a unique hypothesis, we get the following output from scienceverse:
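The numbers above are easy to verify with a quick check in R (independent of scienceverse):

```r
alpha_level <- 0.05
k <- 3  # number of tests in the family

# Probability of at least one Type 1 error across three independent tests
1 - (1 - alpha_level)^k      # 0.142625, i.e., 14.26%

# With a Bonferroni-corrected alpha level of 0.05/3 per test
1 - (1 - alpha_level / k)^k  # 0.0492, i.e., 4.92%
```

The Bonferroni correction slightly overcorrects (4.92% rather than exactly 5%), which is part of why the more advanced procedures in Bretz et al. (2011) can be preferable.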

# Evaluation of Statistical Hypotheses

#### 12 March, 2020

### Simulating Null Effects Postregistration

## Results

### Hypothesis 1: H1

dv1 will show an effect

`p_t_1`

is confirmed if analysis ttest_1 yields `p.value<0.05`

The result was p.value = 0.452 (FALSE)

#### Corroboration ( FALSE )

The hypothesis is corroborated if dv1 is significant.

` p_t_1 `

#### Falsification ( TRUE )

The hypothesis is falsified if dv1 is not significant.

` !p_t_1 `

**All criteria were met for falsification.**

### Hypothesis 2: H2

dv2 will show an effect

`p_t_2`

is confirmed if analysis ttest_2 yields `p.value<0.05`

The result was p.value = 0.21 (FALSE)

#### Corroboration ( FALSE )

The hypothesis is corroborated if dv2 is significant.

` p_t_2 `

#### Falsification ( TRUE )

The hypothesis is falsified if dv2 is not significant.

` !p_t_2 `

**All criteria were met for falsification.**

### Hypothesis 3: H3

dv3 will show an effect

`p_t_3`

is confirmed if analysis ttest_3 yields `p.value<0.05`

The result was p.value = 0.02 (TRUE)

#### Corroboration ( TRUE )

The hypothesis is corroborated if dv3 is significant.

` p_t_3 `

#### Falsification ( FALSE )

The hypothesis is falsified if dv3 is not significant.

` !p_t_3 `

**All criteria were met for corroboration.**

We now see that two hypotheses were falsified (yes, yes, I know you should not use *p* > 0.05 to falsify a prediction in real life; I keep this part of the example formally wrong so that I don’t also have to explain equivalence testing to readers not familiar with it – if that is you, read this, and know that scienceverse will allow you to specify an equivalence test as the criterion to falsify a prediction; see the example here). The third hypothesis is corroborated, even though, as above, this is a Type 1 error.

It might seem that the second approach, specifying each dv as its own hypothesis, is the way to go if you do not want to lower the alpha level to control for multiple comparisons. But take a look at the report of the study you have performed. You have made three predictions, of which one was corroborated. That is not an impressive success rate. Sure, mixed results happen, and you should interpret results not just based on the *p*-value (but on the strength of the experimental design, assumptions about power, your prior, the strength of the theory, etc.), but if these predictions were derived from the same theory, this set of results is not particularly impressive. Since researchers can never selectively report only those results that ‘work’, because this would be a violation of the code of research integrity, we should always be able to see the meager track record of predictions.

If you don’t feel ready to make specific predictions (and run the risk of sullying your track record), either do unplanned exploratory tests, and do not make claims based on their results, or preregister all possible tests you can think of, and massively lower your alpha level to control error rates (for example, genome-wide association studies sometimes use an alpha level of 5 × 10⁻⁸ to control the Type 1 error rate).

**References**

Bender, R., & Lange, S. (2001). Adjusting for multiple testing—when and how? *Journal of Clinical Epidemiology*, *54*(4), 343–349.

Bretz, F., Hothorn, T., & Westfall, P. (2011). *Multiple comparisons using R*. CRC Press.

Mayo, D. G. (2018). *Statistical inference as severe testing: How to get beyond the statistics wars*. Cambridge University Press.

Neyman, J. (1957). “Inductive behavior” as a basic concept of philosophy of science. *Revue de l’Institut International de Statistique / Review of the International Statistical Institute*, *25*(1/3), 7–22. https://doi.org/10.2307/1401671

*Thanks to Lisa DeBruine for feedback on an earlier draft of this blog post.*