What’s a family in family-wise error control?

When you perform multiple comparisons in a study, you need to adjust your alpha level accordingly. It is generally recommended to control the family-wise error rate, but there is some confusion about what a ‘family’ is. As Bretz, Hothorn, & Westfall (2011) write in their excellent book “Multiple Comparisons Using R” on page 15: “The appropriate choice of null hypotheses being of primary interest is a controversial question. That is, it is not always clear which set of hypotheses should constitute the family H1,…,Hm. This topic has often been in dispute and there is no general consensus.” In one of the best papers on controlling for multiple comparisons out there, Bender & Lange (2001) write: “Unfortunately, there is no simple and unique answer to when it is appropriate to control which error rate. Different persons may have different but nevertheless reasonable opinions. In addition to the problem of deciding which error rate should be under control, it has to be defined first which tests of a study belong to one experiment.” The Wikipedia page on family-wise error rate is a mess.

I will be honest: I have never understood this confusion about what a family of tests is when controlling the family-wise error rate. At least not in a Neyman-Pearson approach to hypothesis testing, where the goal is to use data to make decisions about how to act. Neyman (1957) calls his approach inductive behavior. The outcome of an experiment leads one to take different possible actions, which can be either practical (e.g., implement a new procedure, abandon a research line) or scientific (e.g., claim there is an effect, or that there is no effect). From an error-statistical approach (Mayo, 2018), inflated Type 1 error rates mean that it has become very likely that you will be able to claim support for your hypothesis, even when the hypothesis is wrong. This reduces the severity of the test. To prevent this, we need to control our error rate at the level of our claim.
One reason the issue of family-wise error rates might remain vague is that researchers are often vague about their claims. We do not specify our hypotheses unambiguously, and therefore this issue remains unclear. To be honest, I suspect another reason there is a continuing debate about whether and how to lower the alpha level to control for multiple comparisons in some disciplines is that 1) there are a surprisingly large number of papers written on this topic that argue you do not need to control for multiple comparisons, which are 2) cited a huge number of times, giving rise to the feeling that surely they must have a point. Regrettably, the main reason these papers are written is that there are people who don’t think a Neyman-Pearson approach to hypothesis testing is a good idea, and the main reason these papers are cited is that doing so is convenient for researchers who want to publish statistically significant results, as they can justify why they are not lowering their alpha level, making that p = 0.02 in one of three tests really ‘significant’. All papers that argue against the need to control for multiple comparisons when testing hypotheses are wrong. Yes, their existence and massive citation counts frustrate me. It is fine not to test a hypothesis, but when you do, and you make a claim based on a test, you need to control your error rates.

But let’s get back to our first problem, which we can solve by making the claims people need to control Type 1 error rates for less vague. Lisa DeBruine and I recently proposed machine-readable hypothesis tests to remove any ambiguity in the tests we will perform to examine statistical predictions, and when we will consider a claim corroborated or falsified. In this post, I am going to use our R package ‘scienceverse’ to clarify what constitutes a family of tests when controlling the family-wise error rate.

An example of formalizing family-wise error control

Let’s assume we collect data from 100 participants in a control and treatment condition. We collect 3 dependent variables (dv1, dv2, and dv3). In the population there is no difference between groups on any of these three variables (the true effect size is 0). We will analyze the three dv’s in independent t-tests. This requires specifying our alpha level, and thus deciding whether we need to correct for multiple comparisons. How we control error rates depends on the claim we want to make.
We might want to act as if (or claim that) our treatment works if there is a difference between the treatment and control conditions on any of the three variables. In scienceverse terms, this means we consider the prediction corroborated when the p-value of the first t-test is smaller than the alpha level, the p-value of the second t-test is smaller than the alpha level, or the p-value of the third t-test is smaller than the alpha level. In the scienceverse code, we specify a criterion for each test (a p-value smaller than the alpha level, p.value < alpha_level) and conclude the hypothesis is corroborated if any of these criteria is met (“p_t_1 | p_t_2 | p_t_3”).
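The logic of this disjunctive criterion can be sketched outside of scienceverse. Here is a minimal Python illustration (not scienceverse code; variable names are mine), using the p-values that appear in the simulated report later in this post:

```python
# Scenario 1: one hypothesis, corroborated if ANY of the three
# t-tests is significant. p-values taken from the simulated report.
alpha_level = 0.05
p_values = {"p_t_1": 0.452, "p_t_2": 0.21, "p_t_3": 0.02}

# Each individual criterion: p.value < alpha_level
criteria = {name: p < alpha_level for name, p in p_values.items()}

# Corroboration rule: p_t_1 | p_t_2 | p_t_3
corroborated = any(criteria.values())

# Falsification rule: !p_t_1 & !p_t_2 & !p_t_3
falsified = not any(criteria.values())
```

With these values only p_t_3 meets its criterion, so the single overarching hypothesis counts as corroborated.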
We could also want to make three different predictions. Instead of one hypothesis (“something will happen”) we have three different hypotheses, and predict there will be an effect on dv1, dv2, and dv3. The criterion for each t-test is the same, but we now have three hypotheses to evaluate (H1, H2, and H3). Each of these claims can be corroborated, or not.
Scienceverse allows you to specify your hypothesis tests unambiguously (for code used in this blog, see the bottom of the post). It also allows you to simulate a dataset, which we can use to examine Type 1 errors by simulating data where no true effects exist. Finally, scienceverse allows you to run the pre-specified analyses on the (simulated) data, and will automatically create a report that summarizes which hypotheses were corroborated (which is useful when checking if the conclusions in a manuscript indeed follow from the preregistered analyses, or not). The output for a single simulated dataset for the scenario where we will interpret any effect on the three dv’s as support for the hypothesis looks like this:

Evaluation of Statistical Hypotheses

12 March, 2020

Simulating Null Effects Postregistration

Results

Hypothesis 1: H1

Something will happen

  • p_t_1 is confirmed if analysis ttest_1 yields p.value<0.05

    The result was p.value = 0.452 (FALSE)

  • p_t_2 is confirmed if analysis ttest_2 yields p.value<0.05

    The result was p.value = 0.21 (FALSE)

  • p_t_3 is confirmed if analysis ttest_3 yields p.value<0.05

    The result was p.value = 0.02 (TRUE)

Corroboration ( TRUE )

The hypothesis is corroborated if anything is significant.

 p_t_1 | p_t_2 | p_t_3 

Falsification ( FALSE )

The hypothesis is falsified if nothing is significant.

 !p_t_1 & !p_t_2 & !p_t_3 

All criteria were met for corroboration.

We see the hypothesis that ‘something will happen’ is corroborated, because there was a significant difference on dv3 – even though this was a Type 1 error, since we simulated data with a true effect size of 0 – and any difference was taken as support for the prediction. With a 5% alpha level, we will observe 1-(1-0.05)^3 = 14.26% Type 1 errors in the long run. This Type 1 error inflation can be prevented by lowering the alpha level, for example by a Bonferroni correction (0.05/3), after which the expected Type 1 error rate is 4.92% (see Bretz et al., 2011, for more advanced techniques to control error rates). When we examine the report for the second scenario, where each dv tests a unique hypothesis, we get the following output from scienceverse:
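These percentages are easy to verify. A quick Python check (my own sketch, not scienceverse output):

```python
# Probability of at least one Type 1 error across three independent
# tests of true null hypotheses, at an unadjusted alpha of 0.05.
alpha, k = 0.05, 3
inflated = 1 - (1 - alpha) ** k        # 0.142625, the 14.26% mentioned above

# The same calculation after a Bonferroni correction (alpha / 3).
corrected = 1 - (1 - alpha / k) ** k   # about 0.0492, i.e. 4.92%
```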

Evaluation of Statistical Hypotheses

12 March, 2020

Simulating Null Effects Postregistration

Results

Hypothesis 1: H1

dv1 will show an effect

  • p_t_1 is confirmed if analysis ttest_1 yields p.value<0.05

    The result was p.value = 0.452 (FALSE)

Corroboration ( FALSE )

The hypothesis is corroborated if dv1 is significant.

 p_t_1 

Falsification ( TRUE )

The hypothesis is falsified if dv1 is not significant.

 !p_t_1 

All criteria were met for falsification.

Hypothesis 2: H2

dv2 will show an effect

  • p_t_2 is confirmed if analysis ttest_2 yields p.value<0.05

    The result was p.value = 0.21 (FALSE)

Corroboration ( FALSE )

The hypothesis is corroborated if dv2 is significant.

 p_t_2 

Falsification ( TRUE )

The hypothesis is falsified if dv2 is not significant.

 !p_t_2 

All criteria were met for falsification.

Hypothesis 3: H3

dv3 will show an effect

  • p_t_3 is confirmed if analysis ttest_3 yields p.value<0.05

    The result was p.value = 0.02 (TRUE)

Corroboration ( TRUE )

The hypothesis is corroborated if dv3 is significant.

 p_t_3 

Falsification ( FALSE )

The hypothesis is falsified if dv3 is not significant.

 !p_t_3 

All criteria were met for corroboration.

We now see that two hypotheses were falsified (yes, yes, I know you should not use p > 0.05 to falsify a prediction in real life, and this part of the example is formally wrong so I don’t also have to explain equivalence testing to readers not familiar with it – if that is you, read this, and know scienceverse will allow you to specify an equivalence test as the criterion to falsify a prediction, see the example here). The third hypothesis is corroborated, even though, as above, this is a Type 1 error.

It might seem that the second approach, specifying each dv as its own hypothesis, is the way to go if you do not want to lower the alpha level to control for multiple comparisons. But take a look at the report of the study you have performed. You have made 3 predictions, of which 1 was corroborated. That is not an impressive success rate. Sure, mixed results happen, and you should interpret results not just based on the p-value (but on the strength of the experimental design, assumptions about power, your prior, the strength of the theory, etc.), but if these predictions were derived from the same theory, this set of results is not particularly impressive. Since researchers can never selectively report only those results that ‘work’ because this would be a violation of the code of research integrity, we should always be able to see the meager track record of predictions. If you don’t feel ready to make a specific prediction (and run the risk of sullying your track record), either do unplanned exploratory tests, and do not make claims based on their results, or preregister all possible tests you can think of, and massively lower your alpha level to control error rates (for example, genome-wide association studies sometimes use an alpha level of 5 × 10⁻⁸ to control the Type 1 error rate).

Hopefully, specifying our hypotheses (and what would corroborate them) transparently by using scienceverse makes it clear what happens in the long run in both scenarios. In the long run, both the first scenario, if we would use an alpha level of 0.05/3 instead of 0.05, and the second scenario, with an alpha level of 0.05 for each individual hypothesis, will lead to the same end result: Not more than 5% of our claims will be wrong, if the null hypothesis is true. In the first scenario, we are making one claim in an experiment, and in the second we make three. In the second scenario we will end up with more false claims in an absolute sense, but the relative number of false claims is the same in both scenarios. And that’s exactly the goal of family-wise error control.
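The long-run equivalence of the two scenarios can be demonstrated with a short simulation. This Python sketch (my own, not scienceverse) uses the fact that under a true null hypothesis the p-values of these t-tests are uniformly distributed:

```python
import numpy as np

rng = np.random.default_rng(2020)
n_sims, alpha = 100_000, 0.05
# Three null p-values per simulated study; under the null, p ~ Uniform(0, 1).
p = rng.uniform(size=(n_sims, 3))

# Scenario 1: one claim per study ("something happened"),
# tested at a Bonferroni-corrected alpha of 0.05/3.
scenario1_error_rate = (p < alpha / 3).any(axis=1).mean()

# Scenario 2: three separate claims per study, each tested at 0.05.
scenario2_error_rate = (p < alpha).mean()

# Both rates land close to 5%: the relative proportion of wrong claims
# is controlled in each scenario, even though scenario 2 yields three
# times as many claims (and thus more false claims in absolute terms).
```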
References
Bender, R., & Lange, S. (2001). Adjusting for multiple testing—When and how? Journal of Clinical Epidemiology, 54(4), 343–349.
Bretz, F., Hothorn, T., & Westfall, P. H. (2011). Multiple comparisons using R. CRC Press.
Mayo, D. G. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge University Press.
Neyman, J. (1957). “Inductive Behavior” as a Basic Concept of Philosophy of Science. Revue de l’Institut International de Statistique / Review of the International Statistical Institute, 25(1/3), 7. https://doi.org/10.2307/1401671

Thanks to Lisa DeBruine for feedback on an earlier draft of this blog post.


Review of "The Generalizability Crisis" by Tal Yarkoni

A response to this blog by Tal Yarkoni is here.
In a recent preprint titled “The Generalizability Crisis”, Tal Yarkoni examines whether the current practice of how psychologists generalize from studies to theories is problematic. He writes: “The question taken up in this paper is whether or not the tendency to generalize psychology findings far beyond the circumstances in which they were originally established is defensible. The case I lay out in the next few sections is that it is not, and that unsupported generalization lies at the root of many of the methodological and sociological challenges currently affecting psychological science.” We had a long twitter discussion about the paper, and then read it in our reading group. In this review, I try to make my thoughts about the paper clear in one place, which might be useful if we want to continue to discuss whether there is a generalizability crisis, or not.

First, I agree with Yarkoni that almost all the proposals he makes in the section “Where to go from here?” are good suggestions. I don’t think they follow logically from his points about generalizability, as I detail below, but they are nevertheless solid suggestions a researcher should consider. Second, I agree that there are research lines in psychology where modelling more things as random factors will be productive, and a forceful manifesto (even if it is slightly less practical than similar earlier papers) might be a wake-up call for people who had ignored this issue until now.

Beyond these two points of agreement, I found the main thesis in his article largely unconvincing. I don’t think there is a generalizability crisis, but the article is a nice illustration of why philosophers like Popper abandoned the idea of an inductive science. When Yarkoni concludes that “A direct implication of the arguments laid out above is that a huge proportion of the quantitative inferences drawn in the published psychology literature are so inductively weak as to be at best questionable and at worst utterly insensible.” I am primarily surprised he believes induction is a defensible philosophy of science. There is a very brief discussion of views by Popper, Meehl, and Mayo on page 19, but their work on testing theories is proposed as a solution that is probably not feasible – which is peculiar, because these authors would probably disagree with most of the points made by Yarkoni, and I would expect at least somewhere in the paper a discussion comparing induction against the deductive approach (especially since the deductive approach is arguably the dominant approach in psychology, and therefore none of the generalizability issues raised by Yarkoni are a big concern). Because I believe the article starts from a faulty position (scientists are not concerned with induction, but use deductive approaches) and because Yarkoni provides no empirical support for any of his claims that generalizability has led to huge problems (such as incredibly high Type 1 error rates), I remain unconvinced there is anything remotely close to the generalizability crisis he so evocatively argues for. The topic addressed by Yarkoni is very broad. It probably needs a book-length treatment to do it justice. My review is already way too long, and I did not get into the finer details of the argument. But I hope this review helps to point out the parts of the manuscript where I feel important arguments lack a solid foundation, and where issues that deserve to be discussed are ignored.

Point 1: “Fast” and “slow” approaches need some grounding in philosophy of science.

Early in the introduction, Yarkoni says there is a “fast” and “slow” approach of drawing general conclusions from specific observations. Whenever people use words that don’t exactly describe what they mean, putting them in quotation marks is generally not a good idea. The “fast” and “slow” approaches he describes are not, I believe upon closer examination, two approaches “of drawing general conclusions from specific observations”.

The difference is actually between induction (the “slow” approach of generalizing from single observations to general observations) and deduction, as proposed by, for example, Popper. As Popper writes: “According to the view that will be put forward here, the method of critically testing theories, and selecting them according to the results of tests, always proceeds on the following lines. From a new idea, put up tentatively, and not yet justified in any way—an anticipation, a hypothesis, a theoretical system, or what you will—conclusions are drawn by means of logical deduction.”

Yarkoni incorrectly suggests that “upon observing that a particular set of subjects rated a particular set of vignettes as more morally objectionable when primed with a particular set of cleanliness-related words than with a particular set of neutral words, one might draw the extremely broad conclusion that ‘cleanliness reduces the severity of moral judgments’”. This reverses the scientific process as proposed by Popper, which is (as several people have argued, see below) the dominant approach to knowledge generation in psychology. The authors are not concluding that “cleanliness reduces the severity of moral judgments” from their data. This would be induction. Instead, they are positing that “cleanliness reduces the severity of moral judgments”, they collected data and performed an empirical test, and found their hypothesis was corroborated. In other words, the hypothesis came first. It is not derived from the data – the hypothesis is what led them to collect the data.

Yarkoni deviates from what is arguably the common approach in psychological science, and suggests induction might actually work: “Eventually, if the effect is shown to hold when systematically varying a large number of other experimental factors, one may even earn the right to summarize the results of a few hundred studies by stating that “cleanliness reduces the severity of moral judgments””. This approach to science flies right in the face of Popper (1959/2002, p. 10), who says: “I never assume that we can argue from the truth of singular statements to the truth of theories. I never assume that by force of ‘verified’ conclusions, theories can be established as ‘true’, or even as merely ‘probable’.” Similarly, Lakatos (1978, p. 2) writes: “One can today easily demonstrate that there can be no valid derivation of a law of nature from any finite number of facts; but we still keep reading about scientific theories being proved from facts. Why this stubborn resistance to elementary logic?” I am personally on the side of Popper and Lakatos, but regardless of my preferences, Yarkoni needs to provide some argument his inductive approach to science has any possibility of being a success, preferably by embedding his views in some philosophy of science. I would also greatly welcome learning why Popper and Lakatos are wrong. Such an argument, which would overthrow the dominant model of knowledge generation in psychology, could be impactful, although a-priori I doubt it will be very successful.

Point 2: Titles are not evidence for psychologists’ tendency to generalize too quickly.

This is a minor point, but I think a good illustration of the weakness of some of the main arguments that are made in the paper. On the second page, Yarkoni argues that “the vast majority of psychological scientists have long operated under a regime of (extremely) fast generalization”. I don’t know about the vast majority of scientists, but Yarkoni himself is definitely using fast generalization. He looked through a single journal, and found 3 titles that made general statements (e.g., “Inspiration Encourages Belief in God”). When I downloaded and read this article, I noticed the discussion contains a ‘constraint on generalizability’ statement, following Simons et al. (2017). The authors wrote: “We identify two possible constraints on generality. First, we tested our ideas only in American and Korean samples. Second, we found that inspiring events that encourage feelings of personal insignificance may undermine these effects.” Is Yarkoni not happy with these two sentences clearly limiting the generalizability in the discussion?

For me, this observation raised serious concerns about the statement Yarkoni makes that, simply from the titles of scientific articles, we can make a statement about whether authors make ‘fast’ or ‘slow’ generalizations. One reason is that Yarkoni examined titles from a scientific article that adheres to the publication manual of the APA. In the section on titles, the APA states: “A title should summarize the main idea of the manuscript simply and, if possible, with style. It should be a concise statement of the main topic and should identify the variables or theoretical issues under investigation and the relationship between them. An example of a good title is “Effect of Transformed Letters on Reading Speed.””. To me, it seems the authors are simply following the APA publication manual. I do not think their choice for a title provides us with any insight whatsoever about the tendency of authors to have a preference for ‘fast’ generalization. Again, this might be a minor point, but I found this an illustrative example of the strength of arguments in other places (see the next point for the most important example). Yarkoni needs to make a case that scientists are overgeneralizing, for there to be a generalizability crisis – but he does so unconvincingly. I sincerely doubt researchers expect their findings to generalize to all possible situations mentioned in the title, I doubt scientists believe titles are the place to accurately summarize limits of generalizability, and I doubt Yarkoni has made a strong point that psychologists overgeneralize based on this section. More empirical work would be needed to build a convincing case (e.g., code how researchers actually generalize their findings in a random selection of 250 articles, taking into account Gricean communication norms (especially the cooperative principle) in scientific articles).

Point 3: Theories and tests are not perfectly aligned in deductive approaches.

After explaining that psychologists use statistics to test predictions based on experiments that are operationalizations of verbal theories, Yarkoni notes: “From a generalizability standpoint, then, the key question is how closely the verbal and quantitative expressions of one’s hypothesis align with each other.”

Yarkoni writes: “When a researcher verbally expresses a particular hypothesis, she is implicitly defining a set of admissible observations containing all of the hypothetical situations in which some measurement could be taken that would inform that hypothesis. If the researcher subsequently asserts that a particular statistical procedure provides a suitable test of the verbal hypothesis, she is making the tacit but critical assumption that the universe of admissible observations implicitly defined by the chosen statistical procedure (in concert with the experimental design, measurement model, etc.) is well aligned with the one implicitly defined by the qualitative hypothesis. Should a discrepancy between the two be discovered, the researcher will then face a choice between (a) working to resolve the discrepancy in some way (i.e., by modifying either the verbal statement of the hypothesis or the quantitative procedure(s) meant to provide an operational parallel); or (b) giving up on the link between the two and accepting that the statistical procedure does not inform the verbal hypothesis in a meaningful way.”

I highlighted what I think is the critical point in bold font. To generalize from a single observation to a general theory through induction, the sample and the test should represent the general theory. This is why Yarkoni is arguing that there has to be a direct correspondence between the theoretical model, and the statistical test. This is true in induction.

If I want to generalize beyond my direct observations, which are rarely sampled randomly from all possible factors that might impact my estimate, I need to account for uncertainty in the things I have not observed. As Yarkoni clearly explains, one does this by adding random factors to a model. He writes (p. 7) “Each additional random factor one adds to a model licenses generalization over a corresponding population of potential measurements, expanding the scope of inference beyond only those measurements that were actually obtained. However, adding random factors to one’s model also typically increases the uncertainty with which the fixed effects of interest are estimated”. You don’t need to read Popper to see the problem here – if you want to generalize to all possible random factors, there are so many of them, you will never be able to overcome the uncertainty and learn anything. This is why inductive approaches to science have largely been abandoned. As Yarkoni accurately summarizes based on a large multi-lab study on verbal overshadowing by Alogna et al.: “given very conservative background assumptions, the massive Alogna et al. study—an initiative that drew on the efforts of dozens of researchers around the world—does not tell us much about the general phenomenon of verbal overshadowing. Under more realistic assumptions, it tells us essentially nothing.” This is also why Yarkoni’s first practical recommendation on how to move forward is to not solve the problem, but to do something else: “One perfectly reasonable course of action when faced with the difficulty of extracting meaningful, widely generalizable conclusions from effects that are inherently complex and highly variable is to opt out of the enterprise entirely.”

This is exactly the reason Popper (among others) rejected induction, and proposed a deductive approach. Why isn’t the alignment between theories and tests raised by Yarkoni a problem for the deductive approach proposed by Popper, Meehl, and Mayo? The reason is that the theory is tentatively posited as true, but in no way believed to be a complete representation of reality. This is an important difference. Yarkoni relies on an inductive approach, and thus the test needs to be aligned with the theory, and the theory defines “a set of admissible observations containing all of the hypothetical situations in which some measurement could be taken that would inform that hypothesis.” For deductive approaches, this is not true.

For philosophers of science like Popper and Lakatos, a theory is not a complete description of reality. Lakatos writes about theories: “Each of them, at any stage of its development, has unsolved problems and undigested anomalies. All theories, in this sense, are born refuted and die refuted.” Lakatos gives the example that Newton’s Principia could not even explain the motion of the moon when it was published. The main point here: All theories are wrong. The fact that all theories (or models) are wrong should not be surprising. Box’s quote “All models are wrong, some are useful” is perhaps best known, but I prefer Box (1976) on parsimony: “Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William Ockham (1285-1349) he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity (Ockham’s knife).” He follows this up by stating “Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.”

In a deductive approach, the goal of a theoretical model is to make useful predictions. I doubt anyone believes that any of the models they are currently working on is complete. Some researchers might follow an instrumentalist philosophy of science, and don’t expect their theories to be anything more than useful tools. Lakatos’s (1978) main contribution to philosophy of science was to develop a way to deal with our incorrect theories, admitting that all of them need adjustment, but noting that some adjustments lead to progressive research lines, and others to degenerative research lines.

In a deductive model, it is perfectly fine to posit a theory that eating ice-cream makes people happy, without assuming this holds for all flavors, across all cultures, at all temperatures, and is irrespective of the amount of ice-cream eaten previously, and many other factors. After all, it is just a tentative model that we hope is simple enough to be useful, and that we expect to become more complex as we move forward. As we increase our understanding of food preferences, we might be able to modify our theory, so that it is still simple, but also allows us to predict the fact that eggnog and bacon-flavoured ice-cream do not increase happiness (on average). The most important thing is that our theory is tentative, and posited to allow us to make good predictions. As long as the theory is useful, and we have no alternatives to replace it with, the theory will continue to be used – without any expectation that it will generalize to all possible situations. As Box (1976) writes: “Matters of fact can lead to a tentative theory. Deductions from this tentative theory may be found to be discrepant with certain known or specially acquired facts. These discrepancies can then induce a modified, or in some cases a different, theory.” A discussion of this large gap between Yarkoni and deductive approaches proposed by Popper and Meehl, where Yarkoni thinks theories and tests need to align, and deductive approaches see theories as tentative and wrong, should be included, I think.

Point 4: The dismissal of risky predictions is far from convincing (and generalizability is typically a means to risky predictions, not a goal in itself).

If we read Popper (but also on the statistical side the work of Neyman) we see induction as a possible goal in science is clearly rejected. Yarkoni mentions deductive approaches briefly in his section on adopting better standards, in the sub-section on making riskier predictions. I intuitively expected this section to be crucial – after all, it finally turns to those scholars who would vehemently disagree with most of Yarkoni’s arguments in the preceding sections – but I found this part rather disappointing. Strangely enough, Yarkoni simply proposes predictions as a possible solution – but since the deductive approach goes directly against the inductive approach proposed by Yarkoni, it seems very weird to just mention risky predictions as one possible solution, when it is actually a completely opposite approach that rejects most of what Yarkoni argues for. Yarkoni does not seem to believe that the deductive mode proposed by Popper, Meehl, and Mayo, a hypothesis testing approach that is arguably the dominant approach in most of psychology (Cortina & Dunlap, 1997; Dienes, 2008; Hacking, 1965), has a lot of potential. The reason he doubts severe tests of predictions will be useful is that “in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding” (Yarkoni, p. 19). This could be resolved if risky predictions were possible, which Yarkoni doubts.

Yarkoni’s criticism of the possibility of severe tests is regrettably weak. Yarkoni says that “Unfortunately, in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding.” From his references (Cohen, Lykken, Meehl) we can see he refers to the crud factor, or the idea that the null hypothesis is always false. As we recently pointed out in a review paper on crud (Orben & Lakens, 2019), Meehl and Lykken disagreed about the definition of the crud factor, the evidence of crud in some datasets cannot be generalized to all studies in psychology, and “The lack of conceptual debate and empirical research about the crud factor has been noted by critics who disagree with how some scientists treat the crud factor as an “axiom that needs no testing” (Mulaik, Raju, & Harshman, 1997).” Altogether, I am very unconvinced that this cursory reference to crud makes a convincing point that “there are pervasive and typically very plausible competing explanations for almost every finding”. Risky predictions seem possible, to me, and demonstrating the generalizability of findings is actually one way to perform a severe test.

When Yarkoni discusses risky predictions, he sticks to risky quantitative predictions. As explained in Lakens (2020), “Making very narrow range predictions is a way to make it statistically likely to falsify your prediction if it is wrong. But the severity of a test is determined by all characteristics of a study that increases the capability of a prediction to be wrong, if it is wrong. For example, by predicting you will only observe a statistically significant difference from zero in a hypothesis test if a very specific set of experimental conditions is met that all follow from a single theory, it is possible to make theoretically risky predictions.” I think the reason most psychologists perform studies that demonstrate the generalizability of their findings has nothing to do with a desire to inductively build a theory from all these single observations. They show the findings generalize because it increases the severity of their tests. In other words, according to this deductive approach, generalizability is not a goal in itself, but it follows from the goal to perform severe tests. It is unclear to me why Yarkoni does not think that approaches such as triangulation (Munafò & Smith, 2018) are severe tests. I think these approaches are the driving force behind many of the more successful theories in social psychology (e.g., social identity theory), and they work fine.

Generalization as a means to severely test a prediction is common, and one of the goals of direct replications (generalizing to new samples) and conceptual replications (generalizing to different procedures). Yarkoni might disagree with me that generalization serves severity, and not vice versa. But then what is missing from the paper is a solid argument for why people would want to generalize to begin with, assuming at least a decent number of them do not believe in induction. The inherent conflict between deductive approaches and induction is also not explained in a satisfactory manner.

Point 5: Why care about statistical inferences, if these do not relate to sweeping verbal conclusions?

If we ignore all previous points, we can still read Yarkoni’s paper as a call to introduce more random factors in our experiments. This nicely complements recent calls to vary all factors you do not think should change the conclusions you draw (Baribault et al., 2018), and classic papers on random effects (Barr et al., 2013; Clark, 1969; Cornfield & Tukey, 1956).

Yarkoni generalizes from the fact that most scientists model subjects as a random factor, and then asks why scientists generalize to all sorts of other factors that were not in their models. He asks “Why not simply model all experimental factors, including subjects, as fixed effects”. It might be worth noting in the paper that sometimes researchers model subjects as fixed effects. For example, Fujisaki and Nishida (2009) write: “Participants were the two authors and five paid volunteers” and nowhere in their analyses do they assume there is any meaningful or important variation across individuals. In many perception studies, an eye is an eye, and an ear is an ear – whether from the author, or a random participant dragged into the lab from the corridor.

In other research areas, we do model individuals as a random factor. Yarkoni says we model subjects as a random factor because: “The reason we model subjects as random effects is not that such a practice is objectively better, but rather, that this specification more closely aligns the meaning of the quantitative inference with the meaning of the qualitative hypothesis we’re interested in evaluating”. I disagree. I think we model certain factors as random effects because we have a high prior that these factors influence the effect, and leaving them out of the model would reduce the strength of our prediction. Leaving them out reduces the probability a test will show we are wrong, if we are wrong. It impacts the severity of the test. Whether or not we need to model factors (e.g., temperature, the experimenter, or day of the week) as random factors, because not doing so reduces the severity of a test, is a subjective judgment. Research fields need to decide this for themselves. It is very well possible that more random factors are generally needed, but I don’t know how many, and I doubt the situation will ever be as severe as the ‘generalizability crisis’ suggests. If it is as severe as Yarkoni suggests, some empirical demonstrations would be nice. Clark (1973) showed his language-as-fixed-effect fallacy using real data. Barr et al. (2013) similarly made their point based on real data. I currently do not find the theoretical point very strong, but real data might convince me otherwise.
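To make the modeling choice under discussion concrete, here is a minimal sketch of fitting subjects as a random factor in Python’s statsmodels. Everything here is made up for illustration: the design (30 subjects, 20 trials), the true condition effect of 0.5, and the sizes of the subject-level and trial-level noise are all hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2020)
n_sub, n_trial = 30, 20  # hypothetical design: 30 subjects, 20 trials each

sub = np.repeat(np.arange(n_sub), n_trial)      # subject id per trial
cond = np.tile([0, 1], n_sub * n_trial // 2)    # within-subject condition
sub_intercept = rng.normal(0, 1, n_sub)         # subjects vary: a random factor

# True condition effect of 0.5, plus subject-level and trial-level noise
y = 0.5 * cond + sub_intercept[sub] + rng.normal(0, 1, n_sub * n_trial)
df = pd.DataFrame({"y": y, "cond": cond, "sub": sub})

# Model subjects as a random (not fixed) factor: random intercept per subject
model = smf.mixedlm("y ~ cond", df, groups=df["sub"]).fit()
print(model.params["cond"])  # estimate of the condition effect
```

The same logic extends to stimuli, labs, or any other factor one has a high prior will influence the effect: adding it to `groups` (or a crossed random-effects structure) makes the test of the prediction stricter.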

The issue of including random factors is discussed in a more complete, and importantly, more applicable manner in Barr et al. (2013). Yarkoni remains vague on which random factors should be included and which should not, and just recommends ‘more expansive’ models. I have no idea when this is done satisfactorily. This is a problem with extreme arguments like the one Yarkoni puts forward. It is fine in theory to argue your test should align with whatever you want to generalize to, but in practice, it is impossible. And in the end, statistics is just a reasonably limited toolset that tries to steer people somewhat in the right direction. The discussion in Barr et al. (2013), which includes trade-offs between model convergence (which Yarkoni too easily dismisses as solved by modern computational power – it is not solved) and including all possible factors, and interactions between all possible factors, is a bit more pragmatic. Similarly, Cornfield and Tukey (1956) more pragmatically list options ranging from ignoring factors altogether, to randomizing them, or including them as a factor, and note: “Each of these attitudes is appropriate in its place. In every experiment there are many variables which could enter, and one of the great skills of the experimenter lies in leaving out only inessential ones.” Just as pragmatically, Clark (1973) writes: “The wide-spread capitulation to the language-as-fixed-effect fallacy, though alarming, has probably not been disastrous. In the older established areas, most experienced investigators have acquired a good feel for what will replicate on a new language sample and what will not. They then design their experiments accordingly.” As always, it is easy to argue for extremes in theory, but this is generally uninteresting for an applied researcher.
It would be great if Yarkoni could provide something a bit more pragmatic about what to do in practice than his current recommendation to fit “more expansive models” – and provide some indication of where to stop, or at least suggestions for what an empirical research program would look like that tells us where to stop, and why. In some ways, Yarkoni’s point generalizes the argument that most findings in psychology do not generalize to non-WEIRD populations (Henrich et al., 2010), and it has the same weakness. WEIRD is a nice acronym, but it is just a fairly arbitrary collection of five factors that might limit generalizability. The WEIRD acronym functions more as a reminder that boundary conditions exist, but it does not allow us to predict when they exist, or when they matter enough to be included in our theories. Currently, there is a gap between the factors that in theory could matter and the factors that we should in practice incorporate. Maybe it is my pragmatic nature, but without such a discussion, I think the paper offers relatively little progress compared to previous discussions about generalizability (of which there are plenty).

Conclusion

A large part of Yarkoni’s argument is based on the premise that theories and tests should be closely aligned, while in a deductive approach based on severe tests of predictions, models are seen as simple, tentative, and wrong, and this is not considered a problem. Yarkoni does not convincingly argue that researchers want to generalize extremely broadly (although I agree papers would benefit from including Constraints on Generality statements as proposed by Simons and colleagues (2017), but mainly because this improves falsifiability, not because it improves induction), and even if there is a tendency to overclaim in articles, I do not think this leads to an inferential crisis. Previous authors have made many of the same points, but in a more pragmatic manner (e.g., Barr et al., 2013; Clark, 1973). Yarkoni fails to provide any insight into where the balance between generalizing to everything and generalizing to the factors that matter should lie, nor does he provide an evaluation of how far off this balance research areas are. It is easy to argue any specific approach to science will not work in theory – but it is much more difficult to convincingly argue it does not work in practice. Until Yarkoni does the latter convincingly, I don’t think the generalizability crisis as he sketches it is something that will keep me up at night.

References

Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., Ravenzwaaij, D. van, White, C. N., Boeck, P. D., & Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences, 115(11), 2607–2612. https://doi.org/10.1073/pnas.1708285114

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3). https://doi.org/10.1016/j.jml.2012.11.001

Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799. https://doi.org/10/gdm28w

Clark, H. H. (1969). Linguistic processes in deductive reasoning. Psychological Review, 76(4), 387–404. https://doi.org/10.1037/h0027578

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4), 335–359.

Cornfield, J., & Tukey, J. W. (1956). Average Values of Mean Squares in Factorials. The Annals of Mathematical Statistics, 27(4), 907–949. https://doi.org/10.1214/aoms/1177728067

Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2(2), 161.

Dienes, Z. (2008). Understanding psychology as a science: An introduction to scientific and statistical inference. Palgrave Macmillan.

Fujisaki, W., & Nishida, S. (2009). Audio–tactile superiority over visuo–tactile and audio–visual combinations in the temporal resolution of synchrony perception. Experimental Brain Research, 198(2), 245–259. https://doi.org/10.1007/s00221-009-1870-x

Hacking, I. (1965). Logic of Statistical Inference. Cambridge University Press.

Henrich, J., Heine, S. J., & Norenzayan, A. (2010). Most people are not WEIRD. Nature, 466(7302), 29.

Lakens, D. (2020). The Value of Preregistration for Psychological Science: A Conceptual Analysis. Japanese Psychological Review. https://doi.org/10.31234/osf.io/jbh4w

Munafò, M. R., & Smith, G. D. (2018). Robust research needs many lines of evidence. Nature, 553(7689), 399–401. https://doi.org/10.1038/d41586-018-01023-3

Orben, A., & Lakens, D. (2019). Crud (Re)defined. https://doi.org/10.31234/osf.io/96dpy

Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017). Constraints on Generality (COG): A Proposed Addition to All Empirical Papers. Perspectives on Psychological Science, 12(6), 1123–1128. https://doi.org/10.1177/1745691617708630

Observed Type 1 Error Rates (Why Statistical Models are Not Reality)

“In the long run we are all dead.” – John Maynard Keynes
When we perform hypothesis tests in a Neyman-Pearson framework we want to make decisions while controlling the rate at which we make errors. We do this in part by setting an alpha level that guarantees we will not say there is an effect when there is no effect more than α% of the time, in the long run.
I like my statistics applied. And in practice I don’t do an infinite number of studies. As Keynes astutely observed, I will be dead before then. So when I control the error rate for my studies, what is a realistic Type 1 error rate I will observe in the ‘somewhat longer run’?
Let’s assume you publish a paper that contains only a single p-value. Let’s also assume the true effect size is 0, so the null hypothesis is true. Your test will return a p-value smaller than your alpha level (and this would be a Type 1 error) or not. With a single study, you don’t have the granularity to talk about a 5% error rate.

In experimental psychology, 30 seems to be a reasonable average for the number of p-values reported in a single paper (http://doi.org/10.1371/journal.pone.0127872). Let’s assume you perform 30 tests in a single paper and every time the null is true (even though this is often unlikely in a real paper). In the long run, with an alpha level of 0.05 we can expect that 30 * 0.05 = 1.5 p-values will be significant. But in any real set of 30 p-values you cannot observe half a significant result, so you will observe 0, 1, 2, 3, 4, 5, or even more Type 1 errors, which equals 0%, 3.33%, 6.67%, 10%, 13.33%, 16.67%, or even more. We can plot the frequency of these observed Type 1 error rates for 1 million simulated sets of 30 tests.

Each of these error rates occurs with a certain frequency. 21.5% of the time, you will not make any Type 1 errors. 12.7% of the time, you will make 3 Type 1 errors in 30 tests. The average over thousands of papers reporting 30 tests will be a Type 1 error rate of 5%, but no single set of studies is average.
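These frequencies follow directly from the binomial distribution, so no simulation is strictly needed. A quick check in Python (the post’s own simulations were presumably done in R) reproduces the 21.5% and 12.7% figures:

```python
from scipy.stats import binom

n_tests, alpha = 30, 0.05  # 30 tests per paper, all true nulls

# Probability of observing exactly k Type 1 errors in 30 tests
p0 = binom.pmf(0, n_tests, alpha)  # no Type 1 errors at all
p3 = binom.pmf(3, n_tests, alpha)  # three Type 1 errors (a 10% observed rate)
print(round(p0, 3), round(p3, 3))  # 0.215 0.127
```

The long-run mean of this distribution is exactly 0.05, but as the plot shows, any single set of 30 tests lands on one of the discrete values around it.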

Now maybe a single paper with 30 tests is not ‘long runnerish’ enough. What we really want to control the Type 1 error rate of is the literature, past, present, and future. Except, we will never read the literature. So let’s assume we are interested in a meta-analysis worth of 200 studies that examine a topic where the true effect size is 0 for each test. We can plot the frequency of Type 1 error rates for 1 million sets of 200 tests.
 


Now things start to look a bit more like what you would expect. The Type 1 error rate you will observe in your set of 200 tests is close to 5%. However, it is almost exactly as likely that the observed Type 1 error rate is 4.5%. 90% of the distribution of observed error rates will lie between 0.025 and 0.075. So, even in a ‘somewhat longrunnish’ 200 tests, the observed Type 1 error rate will rarely be exactly 5%, and it might be more useful to think of it as lying between 2.5% and 7.5%.
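Again the binomial distribution gives these numbers directly; a short Python sketch:

```python
from scipy.stats import binom

n_tests, alpha = 200, 0.05  # 200 tests, all true nulls

# Observing 10/200 (5%) and 9/200 (4.5%) errors is almost equally likely
p_10 = binom.pmf(10, n_tests, alpha)
p_9 = binom.pmf(9, n_tests, alpha)

# 90% of observed Type 1 error rates fall between these bounds
lower = binom.ppf(0.05, n_tests, alpha) / n_tests  # 0.025
upper = binom.ppf(0.95, n_tests, alpha) / n_tests  # 0.075
print(p_10, p_9, lower, upper)
```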

Statistical models are not reality.

A 5% error rate exists only in the abstract world of infinite repetitions, and you will not live long enough to perform an infinite number of studies. In practice, if you (or a group of researchers examining a specific question) do real research, the error rates will be somewhere around 5%. Everything computed from samples drawn from a larger population shows variation – error rates are no exception.
When we quantify things, there is a tendency to get lost in digits. But in practice, the level of random noise we can reasonably expect quickly overwhelms everything beyond the third digit after the decimal. I know we can compute the alpha level after a Pocock correction for two looks at the data in sequential analyses as 0.0294. But this is not the level of granularity that we should have in mind when we think of the error rate we will observe in real lines of research. When we control our error rates, we do so with the goal to end up somewhere reasonably low, after a decent number of hypotheses have been tested. Whether we end up observing 2.5% Type 1 errors or 7.5% errors: potato, potahto.
This does not mean we should stop quantifying numbers precisely when they can be quantified precisely, but we should realize what we get from the statistical procedures we use. We don’t get a 5% Type 1 error rate in any real set of studies we will actually perform. Statistical inferences guide us roughly to where we would ideally like to end up. By all means calculate exact numbers where you can. Strictly adhere to hard thresholds to prevent yourself from fooling yourself too often. But maybe in 2020 we can learn to appreciate that statistical inferences are always a bit messy. Do the best you reasonably can, but don’t expect perfection. In 2020, and in statistics.

Code
For a related paper on alpha levels that in practical situations cannot be 5%, see https://psyarxiv.com/erwvk/ by Casper Albers.

Do You Really Want to Test a Hypothesis?

I’ve uploaded one of my favorite lectures in my new MOOC “Improving Your Statistical Questions” to YouTube. It asks the question whether you really want to test a hypothesis. A hypothesis test is a very specific tool to answer a very specific question. I like hypothesis tests, because in experimental psychology it is common to perform lines of research where you can design a bunch of studies that test simple predictions about the presence or absence of differences on some measure. I think they have a role to play in science. I also think hypothesis testing is widely overused. As we are starting to do hypothesis tests better (e.g., by preregistering our predictions and controlling our error rates in more severe tests) I predict many people will start to feel a bit squeamish as they become aware that doing hypothesis tests as they were originally designed to be used isn’t really what they want in their research. One of the often overlooked gains of teaching people how to do something well is that they finally realize they actually don’t want to do it.
The lecture “Do You Really Want to Test a Hypothesis” aims to explain which question a hypothesis test asks, and discusses when a hypothesis test answers a question you are interested in. It is very easy to say what not to do, or to point out what is wrong with statistical tools. Statistical tools are very limited, even under ideal circumstances. It’s more difficult to say what you can do. If you follow my work, you know that this latter question is what I spend my time on. Instead of telling you optional stopping can’t be done because it is p-hacking, I explain how you can do it correctly through sequential analysis. Instead of telling you it is wrong to conclude the absence of an effect from p > 0.05, I explain how to use equivalence testing. Instead of telling you p-values are the devil, I explain how they answer a question you might be interested in when used well. Instead of saying preregistration is redundant, I explain from which philosophy of science preregistration derives its value. And instead of saying we should abandon hypothesis tests, I try to explain in this video how to use them wisely. This is all part of my ongoing #JustifyEverything educational tour. I think it is a reasonable expectation that researchers should be able to answer at least a simple ‘why’ question when you ask why they use a specific tool, or use a tool in a specific manner.
This might help to move beyond the simplistic discussions I often see about these topics. If you ask me whether I prefer frequentist or Bayesian statistics, or confirmatory or exploratory research, I am most likely to respond ‘mu’ (see Wikipedia). It is tempting to think about these topics in a polarized either-or mindset – but then you would miss asking the real questions. When would any approach give you meaningful insights? Just as not every hypothesis test is an answer to a meaningful question, not every exploratory study will provide interesting insights. The most important question to ask yourself when you plan a study is: when will the tools you use lead to interesting insights? In the second week of my MOOC I discuss when effects in hypothesis tests could be deemed meaningful, but the same question applies to exploratory or descriptive research. Not all exploration is interesting, and we don’t want to simply describe every property of the world. Again, it is easy to dismiss any approach to knowledge generation, but it is so much more interesting to think about which tools will lead to interesting insights. And above all, realize that in most research lines, researchers will have a diverse set of questions they want to answer given practical limitations, and they will need to rely on a diverse set of tools, limitations and all.
In this lecture I try to explain the three limitations of hypothesis tests, and the very specific question they try to answer. If you like to think about how to improve your statistical questions, you might be interested in enrolling in my free MOOC “Improving Your Statistical Questions”.

Improving Your Statistical Questions

Three years after launching my first massive open online course (MOOC) ‘Improving Your Statistical Inferences’ on Coursera, today I am happy to announce a second completely free online course called ‘Improving Your Statistical Questions’. My first course is a collection of lessons about statistics and methods that we commonly use, but that I wish I had known how to use better when I was taking my first steps into empirical research. My new course is a collection of lessons about statistics and methods that we do not yet commonly use, but that I wish we would start using to improve the questions we ask. Where the first course tries to get people up to speed on commonly accepted best practices, my new course tries to educate researchers about better practices. Most of the modules cover topics in which there have been recent developments, or at least increasing awareness, over the last 5 years.

About a year ago, I wrote on this blog: If I ever make a follow up to my current MOOC, I will call it ‘Improving Your Statistical Questions’. The more I learn about how people use statistics, the more I believe the main problem is not how people interpret the numbers they get from statistical tests. The real issue is which statistical questions researchers ask from their data. If you approach a statistician to get help with the data analysis, most of their time will be spent asking you ‘but what is your question?’. I hope this course helps to take a step back, reflect on this question, and get some practical advice on how to answer it.
There are 5 modules, with 15 videos and 13 assignments that provide hands-on explanations of how to use the insights from the lectures in your own research. The first module discusses the different questions you might want to ask. Only one of these is a hypothesis test, and I examine in detail whether you really want to test a hypothesis, or are simply going through the motions of a statistical ritual. I also discuss why NHST is often not a very risky prediction, and why range predictions are a more exciting question to ask (if you can). Module 2 focuses on falsification in practice and theory, including a lecture and assignments on how to determine the smallest effect size of interest in the studies you perform. I also share my favorite colloquium question, for whenever you dozed off and wake up at the end only to find no one else is asking a question: you can always raise your hand and ask ‘so, what would falsify your hypothesis?’ Module 3 discusses the importance of justifying error rates, gives a more detailed discussion of power analysis (following up on the ‘sample size justification’ lecture in MOOC1), and includes a lecture on the many uses of learning how to simulate data. Module 4 moves beyond single studies, and asks what you can expect from lines of research, how to perform a meta-analysis, and why the scientific literature does not look like reality (and how you can detect, and prevent contributing to, a biased literature). I was tempted to add this to MOOC1, but I am happy I didn’t, as there has been a lot of exciting work on bias detection that is now part of the lecture. The last module covers three different topics I think are important: computational reproducibility, philosophy of science (this video would also have been a good first lecture, but I don’t want to scare people away!), and maybe my favorite lecture in the MOOC, on scientific integrity in practice. All are accompanied by assignments, and the assignments are where the real learning happens.
If after this course some people feel more comfortable to abandon hypothesis testing and just describe their data, make their predictions a bit more falsifiable, design more informative studies, publish sets of studies that look a bit more like reality, and make their work more computationally reproducible, I’ll be very happy.
The content of this MOOC is based on over 40 workshops and talks I have given in the 3 years since my previous MOOC came out, testing this material on live audiences. It comes with some of the pressure a recording artist might feel for a second album when the first was somewhat successful. As my first MOOC hits 30k enrolled learners (many of whom engage with only a small part of the content, but with thousands of people taking in a lot of the material), I hope the new course comes close and lives up to expectations.
I’m very grateful to Chelsea Parlett Pelleriti who checked all assignments for statistical errors or incorrect statements, and provided feedback that made every exercise in this MOOC better. If you need a statistics editor, you can find her at: https://cmparlettpelleriti.github.io/TheChatistician.html. Special thanks to Tim de Jonge who populated the Coursera environment as a student assistant, and Sascha Prudon for recording and editing the videos. Thanks to Uri Simonsohn for feedback on Assignment 2.1, Lars Penke for suggesting the SESOI example in lecture 2.2, Lisa DeBruine for co-developing Assignment 2.4, Joe Hilgard for the PET-PEESE code in assignment 4.3, Matti Heino for the GRIM test example in lecture 4.3, and Michelle Nuijten for feedback on assignment 4.4. Thanks to Seth Green, Russ Zack and Xu Fei at Code Ocean for help in using their platform to make it possible to run the R code online. I am extremely grateful for all alpha testers who provided feedback on early versions of the assignments: Daniel Dunleavy, Robert Gorsch, Emma Henderson, Martine Jansen, Niklas Johannes, Kristin Jankowsky, Cian McGinley, Robert Görsch, Chris Noone, Alex Riina, Burak Tunca, Laura Vowels, and Lara Warmelink, as well as the beta-testers who gave feedback on the material on Coursera: Johannes Breuer, Marie Delacre, Fabienne Ennigkeit, Marton L. Gy, and Sebastian Skejø. Finally, thanks to my wife for buying me six new shirts because ‘your audience has expectations’ (and for accepting how I worked through the summer holiday to complete this MOOC).
All material in the MOOC is shared with a CC-BY-NC-SA license, and you can access all material in the MOOC for free (and use it in your own education). Improving Your Statistical Questions is available from today. I hope you enjoy it!

Requiring high-powered studies from scientists with resource constraints

This blog post is now included in the paper “Sample size justification” available at PsyArXiv. 

Underpowered studies make it very difficult to learn something useful from the studies you perform. Low power means you have a high probability of finding non-significant results, even when there is a true effect. Hypothesis tests with high rates of false negatives (concluding there is nothing, when there is something) become a malfunctioning tool. Low power is even more problematic combined with publication bias (shiny app). After repeated warnings over at least half a century, high quality journals are starting to ask authors who rely on hypothesis tests to provide a sample size justification based on statistical power.
The first time researchers use power analysis software, they typically think they are making a mistake, because the sample sizes required to achieve high power for hypothesized effects are much larger than the sample sizes they collected in the past. After double checking their calculations, and realizing the numbers are correct, a common response is that there is no way they are able to collect this number of observations.
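A sketch of what such a first power analysis looks like, using Python’s statsmodels (the effect size of d = 0.4 and the ‘traditional’ 20 participants per group are made-up but plausible numbers):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power of a 'traditional' study: 20 participants per group, d = 0.4
power_small = analysis.power(effect_size=0.4, nobs1=20, alpha=0.05)

# Sample size per group needed for 90% power for the same effect
n_needed = analysis.solve_power(effect_size=0.4, alpha=0.05, power=0.9)
print(power_small, n_needed)  # low power; well over 100 per group
```

Seeing that a design you have run for years has power in the twenties, while the required sample size runs into the hundreds, is exactly the moment of disbelief described above.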
Published articles on power analysis rarely tell researchers what they should do if they are hired on a 4 year PhD project where the norm is to perform between 4 and 10 studies that can cost at most 1000 euro each, learn about power analysis, and realize there is absolutely no way they will have the time and resources to perform high-powered studies, given that an effect size estimate from an unbiased registered report suggests the effect they are examining is half as large as they were led to believe based on a published meta-analysis from 2010. Facing a job market that under the best circumstances is a nontransparent marathon for uncertainty-fetishists, the prospect of high quality journals rejecting your work due to the lack of a solid sample size justification is not pleasant.
The reason that published articles do not guide you towards practical solutions for a lack of resources is that there are no solutions for a lack of resources. Regrettably, the mathematics do not care how small the participant payment budget is that you have available. This is not to say that you cannot improve your current practices by reading up on best practices to increase the efficiency of data collection. Let me give you an overview of some things that you should immediately implement if you use hypothesis tests and data collection is costly.
1) Use directional tests where relevant. Just following statements such as ‘we predict X is larger than Y’ up with a logically consistent test of that claim (e.g., a one-sided t-test) will easily give you an increase of 10% power in any well-designed study. If you feel you need to give effects in both directions a non-zero probability, then at least use lopsided tests.
2) Use sequential analysis whenever possible. It’s like optional stopping, but then without the questionable inflation of the false positive rate. The efficiency gains are so great that, if you complain about the recent push towards larger sample sizes without already having incorporated sequential analyses, I will have a hard time taking you seriously.
3) Increase your alpha level. Oh yes, I am serious. Contrary to what you might believe, the recommendation to use an alpha level of 0.05 was not the sixth of the ten commandments – it is nothing more than, as Fisher calls it, a ‘convenient convention’. As we wrote in our Justify Your Alpha paper as an argument to not require an alpha level of 0.005: “without (1) increased funding, (2) a reward system that values large-scale collaboration and (3) clear recommendations for how to evaluate research with sample size constraints, lowering the significance threshold could adversely affect the breadth of research questions examined.” If you *have* to make a decision, and the data you can feasibly collect is limited, take a moment to think about how problematic Type 1 and Type 2 error rates are, and maybe minimize combined error rates instead of rigidly using a 5% alpha level.
4) Use within designs where possible. Especially when measurements are strongly correlated, this can lead to a substantial increase in power.
5) If you read this blog or follow me on Twitter, you’ll already know about 1-4, so let’s take a look at a very sensible paper by Allison, Allison, Faith, Paultre, & Pi-Sunyer from 1997: Power and money: Designing statistically powerful studies while minimizing financial costs (link). They discuss I) better ways to screen participants for studies where participants need to be screened before participation, II) assigning participants unequally to conditions (if the control condition is much cheaper than the experimental condition, for example), III) using multiple measurements to increase measurement reliability (or use well-validated measures, if I may add), and IV) smart use of (preregistered, I’d recommend) covariates.
6) If you are really brave, you might want to use Bayesian statistics with informed priors, instead of hypothesis tests. Regrettably, almost all approaches to statistical inferences become very limited when the number of observations is small. If you are very confident in your predictions (and your peers agree), incorporating prior information will give you a benefit. For a discussion of the benefits and risks of such an approach, see this paper by van de Schoot and colleagues.
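For points 1 and 4 above, the efficiency gains are easy to verify; a sketch with hypothetical numbers (d = 0.5, and an assumed correlation of .7 between the repeated measures):

```python
from statsmodels.stats.power import TTestIndPower, TTestPower

d = 0.5  # hypothetical between-subject effect size

# 1) Directional test: same study, same data, more power
two_sided = TTestIndPower().power(effect_size=d, nobs1=64, alpha=0.05,
                                  alternative='two-sided')
one_sided = TTestIndPower().power(effect_size=d, nobs1=64, alpha=0.05,
                                  alternative='larger')

# 4) Within design: with correlated measures the required N drops sharply
r = 0.7                        # assumed correlation between the two measures
dz = d / (2 * (1 - r)) ** 0.5  # corresponding within-subject effect size
n_within = TTestPower().solve_power(effect_size=dz, alpha=0.05, power=0.8)
n_between = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.8)
print(one_sided - two_sided, n_within, n_between)
```

Under these assumptions the one-sided test gains several percentage points of power for free, and the within design needs far fewer participants in total than the two-group design.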
Now if you care about efficiency, you might already have incorporated all these things. There is no way to further improve the statistical power of your tests, and for all plausible estimates of the effect sizes you can expect, or the smallest effect size you would be interested in, statistical power is low. Now what should you do?
What to do if best practices in study design won’t save you?
The first thing to realize is that you should not look to statistics to save you. There are no secret tricks or magical solutions. Highly informative experiments require a large number of observations. So what should we do then? The solutions below are, regrettably, a lot more work than making a small change to the design of your study. But it is about time we start to take them seriously. This is a list of the solutions I see – there is no doubt more we can and should do, so by all means, let me know your suggestions on Twitter or in the comments.
1) Ask for a lot more money in your grant proposals.
Some grant organizations distribute funds to be awarded as a function of how much money is requested. If you need more money to collect informative data, ask for it. Obviously grants are incredibly difficult to get, but if you ask for money, include a budget that acknowledges that data collection is not as cheap as you hoped it was some years ago. In my experience, psychologists often ask for much less money to collect data than other scientists do. Increasing the requested funds for participant payment by a factor of 10 is often reasonable, given the requirements of journals to provide a solid sample size justification, and the more realistic effect size estimates that are emerging from preregistered studies.
2) Improve management.
If the implicit or explicit goals that you should meet are still the same now as they were 5 years ago, and you did not receive a miraculous increase in money and time to do research, then an update of the evaluation criteria is long overdue. I sincerely hope your manager is capable of this, but some ‘upward management’ might be needed. In the coda of Lakens & Evers (2014) we wrote “All else being equal, a researcher running properly powered studies will clearly contribute more to cumulative science than a researcher running underpowered studies, and if researchers take their science seriously, it should be the former who is rewarded in tenure systems and reward procedures, not the latter.” and “We believe reliable research should be facilitated above all else, and doing so clearly requires an immediate and irrevocable change from current evaluation practices in academia that mainly focus on quantity.” After publishing this paper, and despite the fact I was an ECR on a tenure track, I thought it would be at least principled if I sent this coda to the head of my own department. He replied that the things we wrote made perfect sense, instituted a recommendation to aim for 90% power in studies our department intends to publish, and has since then tried to make sure quality, and not quantity, is used in evaluations within the faculty (as you might have guessed, I am not on the job market, nor do I ever hope to be).
3) Change what is expected from PhD students.
When I did my PhD, there was the assumption that you performed enough research in the 4 years you were employed as a full-time researcher to write a thesis with 3 to 5 empirical chapters (with some chapters having multiple studies). These studies were ideally published, but at least publishable. If we consider it important for PhD students to produce multiple publishable scientific articles during their PhDs, this will greatly limit the types of research they can do. Instead of evaluating PhD students based on their publications, we can see the PhD as a time when researchers learn the skills to become an independent researcher, and evaluate them not in terms of publishable units, but in terms of clearly identifiable skills. I personally doubt data collection is particularly educational after the 20th participant, and I would probably prefer to hire a post-doc who had well-developed skills in programming and statistics, and who had read the literature broadly, over someone who used that time to collect participants 21 to 200. If we make it easier for PhD students to demonstrate their skill level (which should include at least one well-written article, I personally think), we can evaluate what they have learned in a more sensible manner than we do now. Currently, differences in the resources PhD students have at their disposal are a huge confound when we try to judge their skills based on their resume. Researchers at rich universities obviously have more resources – it should not be difficult to develop tools that allow us to judge the skills of people in a way where resources are much less of a confound.
4) Think about the questions we collectively want answered, instead of the questions we can individually answer.
Our society has some serious issues that psychologists can help address. These questions are incredibly complex. I have long lost faith in the idea that a bottom-up organized scientific discipline that rewards individual scientists will manage to generate reliable and useful knowledge that can help to solve these societal issues. For some of these questions we need well-coordinated research lines where hundreds of scholars work together, pool their resources and skills, and collectively pursue answers to these important questions. And if we are going to limit ourselves in our research to the questions we can answer in our own small labs, these big societal challenges are not going to be solved. Call me a pessimist. There is a reason we resort to forming unions and organizations that have the goal of collectively coordinating what we do. If you greatly dislike team science, don’t worry – there will always be options to make scientific contributions by yourself. But right now, there are almost no ways for scientists who want to pursue huge challenges in large, well-organized collectives of hundreds or thousands of scholars (for a recent exception that proves my rule by remaining unfunded, see the Psychological Science Accelerator). If you honestly believe your research question is important enough to be answered, then get together with everyone who also thinks so, and pursue the answer collectively. Doing so should, eventually (I know science funders are slow), also be more convincing when you ask for more resources to do the research (as in point 1).
If you are upset that as a science we lost the blissful ignorance surrounding statistical power, and are requiring researchers to design informative studies, which hits substantially harder in some research fields than in others: I feel your pain. I have argued against universally lower alpha levels for you, and have tried to write accessible statistics papers that make you more efficient without increasing sample sizes. But if you are in a research field where even best practices in designing studies will not allow you to perform informative studies, then you need to accept the statistical reality you are in. I have already written too long a blog post, even though I could keep going on about this. My main suggestions are to ask for more money, get better management, change what we expect from PhD students, and self-organize – but there is much more we can do, so do let me know your top suggestions. This will be one of the many challenges our generation faces, but if we manage to address it, it will lead to a much better science.

Justify Your Alpha by Minimizing or Balancing Error Rates

A preprint (“Justify Your Alpha: A Primer on Two Practical Approaches”) that extends the ideas in this blog post is available at: https://psyarxiv.com/ts4r6

In 1957 Neyman wrote: “it appears desirable to determine the level of significance in accordance with quite a few circumstances that vary from one particular problem to the next.” Despite this good advice, social scientists developed the norm to always use an alpha level of 0.05 as a threshold when making predictions. In this blog post I will explain how you can set the alpha level so that it minimizes the combined Type 1 and Type 2 error rates (thus efficiently making decisions), or balance Type 1 and Type 2 error rates. You can use this approach to justify your alpha level, and guide your thoughts about how to design studies more efficiently.

Neyman (1933) provides an example of the reasoning process he believed researchers should go through. He explains how a researcher might have derived an important hypothesis that H0 is true (there is no effect), and will not want to ‘throw it aside too lightly’. The researcher would choose a low alpha level (e.g., 0.01). In another line of research, an experimenter might be interested in detecting factors that would lead to the modification of a standard law, where the “importance of finding some new line of development here outweighs any loss due to a certain waste of effort in starting on a false trail”, and Neyman suggests setting the alpha level to, for example, 0.1.

Which is worse? A Type 1 Error or a Type 2 Error?

As you perform lines of research the data you collect are used as a guide to continue or abandon a hypothesis, to use one paradigm or another. One goal of well-designed experiments is to control the error rates as you make these decisions, so that you do not fool yourself too often in the long run.

Many researchers implicitly assume that Type 1 errors are more problematic than Type 2 errors. Cohen (1988) suggested a Type 2 error rate of 20%, and hence to aim for 80% power, but wrote “.20 is chosen with the idea that the general relative seriousness of these two kinds of errors is of the order of .20/.05, i.e., that Type I errors are of the order of four times as serious as Type II errors. This .80 desired power convention is offered with the hope that it will be ignored whenever an investigator can find a basis in his substantive concerns in his specific research investigation to choose a value ad hoc”. More recently, researchers have argued that false negatives constitute a much more serious problem in science (Fiedler, Kutzner, & Krueger, 2012). I always ask my 3rd year bachelor students: What do you think? Is a Type 1 error in your next study worse than a Type 2 error?

Last year I listened to someone who decided whether new therapies would be covered by the German healthcare system. She discussed Eye Movement Desensitization and Reprocessing (EMDR) therapy. I knew that the evidence that the therapy worked was very weak. As the talk started, I hoped they had decided not to cover EMDR. They did, and the researcher convinced me this was a good decision. She said that, although no strong enough evidence was available that it works, the costs of the therapy (which can be done behind a computer) are very low, it was applied in settings where no really good alternatives were available (e.g., inside prisons), and the risk of negative consequences was basically zero. They were aware of the fact that there was a very high probability that EMDR was a Type 1 error, but compared to the cost of a Type 2 error, it was still better to accept the treatment. Another of my favorite examples comes from Field et al. (2004), who performed a cost-benefit analysis on whether to intervene when examining whether a koala population is declining, and showed the alpha should be set at 1 (one should always assume a decline is occurring and intervene).

Making these decisions is difficult – but it is better to think about them than to end up with error rates that do not reflect the errors you actually want to make. As Miller and Ulrich (2019) describe, the long-run error rates you actually make depend on several unknown factors, such as the true effect size, and the prior probability that the null hypothesis is true. Despite these unknowns, you can design studies that have good error rates for an effect size you are interested in, given some sample size you are planning to collect. Let’s see how.

Balancing or minimizing error rates

Mudge, Baker, Edge, and Houlahan (2012) explain how researchers might want to minimize the total combined error rate. If both Type 1 and Type 2 errors are costly, then it makes sense to optimally reduce both errors as you do studies. This would make decision making overall most efficient. You choose an alpha level that, when used in the power analysis, leads to the lowest combined error rate. For example, with a 5% alpha and 80% power, the combined error rate is 5 + 20 = 25%, and if power is 99% and the alpha is 5%, the combined error rate is 1 + 5 = 6%. Mudge and colleagues show that increasing or decreasing the alpha level can lower the combined error rate. This is one of the approaches we mentioned in our ‘Justify Your Alpha’ paper from 2018.

When we wrote ‘Justify Your Alpha’ we knew it would be a lot of work to actually develop methods that people can use. For months, I would occasionally revisit the code Mudge and colleagues used in their paper, which is an adaptation of the pwr library in R, but the code was too complex and I could not get to the bottom of how it worked. After leaving this aside for some months, during which I improved my R skills, some days ago I took a long shower and suddenly realized that I did not need to understand the code by Mudge and colleagues. Instead of getting their code to work, I could write my own code from scratch. Such realizations are my justification for taking showers that are longer than is environmentally friendly.

If you want to balance or minimize error rates, the tricky thing is that the alpha level you set determines the Type 1 error rate, but through its influence on the statistical power, it also influences the Type 2 error rate. So I wrote a function that examines the range of possible alpha levels (from 0 to 1) and either minimizes the total error (Type 1 + Type 2) or minimizes the difference between the Type 1 and Type 2 error rates, balancing the error rates. It then returns the alpha (Type 1 error rate) and the beta (Type 2 error rate). You can enter any analytic power function that normally works in R and outputs the calculated power.

Minimizing Error Rates

Below is the version of the optimal_alpha function used in this blog. Yes, I am defining a function inside another function and this could all look a lot prettier – but it works for now. I plan to clean up the code when I archive my blog posts on how to justify alpha levels in a journal, and will make an R package when I do.
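The R function itself is embedded in the blog post. As a rough, runnable stand-in, here is a Python sketch of the same logic. Everything in it is my own, not the blog's actual code: the names (power_two_sample_t, the grid search) are hypothetical, and it uses a normal approximation to the t-test power instead of R's exact noncentral t, so treat it as an illustration of the approach.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_two_sample_t(alpha, d, n):
    """Approximate power of a two-sided independent-groups t-test,
    using the normal approximation to the noncentral t distribution
    (accurate for largish n; R's pwr.t.test uses the exact t)."""
    ncp = d * (n / 2) ** 0.5               # noncentrality, n per group
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(ncp - z_crit) + Z.cdf(-ncp - z_crit)

def optimal_alpha(power_function, error="minimize", costT1T2=1, prior_H1H0=1):
    """Grid-search the alpha in (0, 0.5) that minimizes the weighted sum of
    the two error rates (error="minimize"), or that makes the weighted
    error rates as close to equal as possible (error="balance")."""
    best_score, best = float("inf"), None
    for i in range(1, 5000):
        a = i / 10000                      # alpha = 0.0001, 0.0002, ..., 0.4999
        beta = 1 - power_function(a)
        w1, w2 = costT1T2 * a, prior_H1H0 * beta
        score = abs(w1 - w2) if error == "balance" else w1 + w2
        if score < best_score:
            best_score, best = score, {"alpha": a, "beta": beta}
    return best

res_min = optimal_alpha(lambda a: power_two_sample_t(a, d=0.5, n=100))
res_bal = optimal_alpha(lambda a: power_two_sample_t(a, d=0.5, n=100),
                        error="balance")
```

For d = 0.5 with 100 participants per group, the minimize objective lands near an alpha of 0.05 and a beta of 0.06, close to the exact t-based values reported below (the small differences come from the normal approximation).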


The code requires you to specify the power function (in a way that the code returns the power, hence the $power at the end) for your test, where the significance level is a variable ‘x’. In this power function you specify the effect size (such as the smallest effect size you are interested in) and the sample size. In my experience, sometimes the sample size is determined by factors outside the control of the researcher. For example, you are working with existing data, or you are studying a sample that is limited in size (e.g., all students in a school). Other times, people have a maximum sample size they can feasibly collect, and accept the error rates that follow from this feasibility limitation. If your sample size is not limited, you can increase the sample size until you are happy with the error rates.

The code calculates the Type 2 error (1-power) across a range of alpha values. For example, we want to calculate the optimal alpha level for an independent t-test. Assume our smallest effect size of interest is d = 0.5, and we are planning to collect 100 participants in each group. We would normally calculate power as follows:

pwr.t.test(d = 0.5, n = 100, sig.level = 0.05, type = 'two.sample', alternative = 'two.sided')$power

This analysis tells us that we have 94% power with a 5% alpha level for our smallest effect size of interest, d = 0.5, when we collect 100 participants in each condition.
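As a sanity check on that 94% figure, the same power calculation can be approximated with nothing but the Python standard library. This is my own sketch using the normal approximation to the t distribution, so it is only accurate to about two decimals here:

```python
from statistics import NormalDist

# Normal approximation to pwr.t.test(d = 0.5, n = 100, sig.level = 0.05):
# power of a two-sided test with alpha = .05, d = 0.5, 100 per group.
Z = NormalDist()
ncp = 0.5 * (100 / 2) ** 0.5        # noncentrality parameter, about 3.54
z_crit = Z.inv_cdf(1 - 0.05 / 2)    # two-sided critical value, about 1.96
power = Z.cdf(ncp - z_crit) + Z.cdf(-ncp - z_crit)  # about 0.94
```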

If we want to minimize our total error rates, we would enter this function in our optimal_alpha function (while replacing the sig.level argument with ‘x’ instead of 0.05, because we are varying the value to determine the lowest combined error rate).

res = optimal_alpha(power_function = "pwr.t.test(d=0.5, n=100, sig.level = x, type='two.sample', alternative='two.sided')$power")

res$alpha
## [1] 0.05101728
res$beta
## [1] 0.05853977

We see that an alpha level of 0.051 slightly improved the combined error rate, since it will lead to a Type 2 error rate of 0.059 for a smallest effect size of interest of d = 0.5. The combined error rate is 0.11. For comparison, lowering the alpha level to 0.005 would lead to a much larger combined error rate of 0.25.
What would happen if we had decided to collect 200 participants per group, or only 50? With 200 participants per group we would have more than 99% power for d = 0.5, and relatively speaking, a 5% Type 1 error rate with a 1% Type 2 error rate is slightly out of balance. In the age of big data, researchers nevertheless use such suboptimal error rates all the time, due to the mindless choice of an alpha level of 0.05. When power is high, the combined error rate can be smaller if the alpha level is lowered. If we just replace 100 by 200 in the function above, we see the combined Type 1 and Type 2 error rate is lowest if we set the alpha level to 0.00866. If you collect large amounts of data, you should really consider lowering your alpha level.

If the maximum sample size we were willing to collect was 50 per group, the optimal alpha level to reduce the combined Type 1 and Type 2 error rates is 0.13. This means that we would have a 13% probability of deciding there is an effect when the null hypothesis is true. This is quite high! However, if we had used a 5% Type 1 error rate, the power would have been 69.69%, with a 30.31% Type 2 error rate, while the Type 2 error rate is ‘only’ 16.56% after increasing the alpha level to 0.13. We increase the Type 1 error rate by 8% to reduce the Type 2 error rate by 13.75%. This increases the overall efficiency of the decisions we make.
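A sketch of these two scenarios in Python (using a normal-approximation power function of my own rather than the blog's R code, so the optima differ slightly from the exact t-based values above) confirms the pattern: large samples push the optimal alpha down, small samples push it up.

```python
from statistics import NormalDist

Z = NormalDist()

def combined_error(alpha, d, n):
    # Type 1 + Type 2 error rate for a two-sided independent t-test,
    # using the normal approximation to the noncentral t distribution.
    ncp = d * (n / 2) ** 0.5
    z_crit = Z.inv_cdf(1 - alpha / 2)
    power = Z.cdf(ncp - z_crit) + Z.cdf(-ncp - z_crit)
    return alpha + (1 - power)

alphas = [i / 10000 for i in range(1, 5000)]
best_n200 = min(alphas, key=lambda a: combined_error(a, d=0.5, n=200))  # near 0.008
best_n50 = min(alphas, key=lambda a: combined_error(a, d=0.5, n=50))    # near 0.13
```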

This example relies on the pwr.t.test function in R, but any power function can be used. For example, the code to minimize the combined error rates for the power analysis for an equivalence test would be:

res = optimal_alpha(power_function = "powerTOSTtwo(alpha=x, N=200, low_eqbound_d=-0.4, high_eqbound_d=0.4)")

Balancing Error Rates

You can choose to minimize the combined error rates, but you can also decide that it makes most sense to you to balance the error rates. For example, you think a Type 1 error is just as problematic as a Type 2 error, and therefore, you want to design a study that has balanced error rates for a smallest effect size of interest (e.g., a 5% Type 1 error rate and a 5% Type 2 error rate). Whether to minimize error rates or balance them can be specified in an additional argument in the function. The default is to minimize, but by adding error = "balance" an alpha level is given so that the Type 1 error rate equals the Type 2 error rate.

res = optimal_alpha(power_function = "pwr.t.test(d=0.5, n=100, sig.level = x, type='two.sample', alternative='two.sided')$power", error = "balance")

res$alpha
## [1] 0.05488516
res$beta
## [1] 0.05488402

Repeating our earlier example, the alpha level is 0.055, such that the Type 2 error rate, given the smallest effect size of interest and the sample size, is also 0.055. I feel that even though this does not minimize the overall error rates, it is a justification strategy for your alpha level that often makes sense. If both Type 1 and Type 2 errors are equally problematic, we design a study where we are just as likely to make either mistake, for the effect size we care about.

Relative costs and prior probabilities

So far we have assumed a Type 1 error and a Type 2 error are equally problematic. But you might believe Cohen (1988) was right, and Type 1 errors are exactly 4 times as bad as Type 2 errors. Or you might think they are twice as problematic, or 10 times as problematic. However you weigh them, as explained by Mudge et al. (2012) and Miller and Ulrich (2019), you should incorporate those weights into your decisions.

The function has another optional argument, costT1T2, that allows you to specify the relative cost of Type 1 versus Type 2 errors. By default this is set to 1, but you can set it to 4 (or any other value) such that Type 1 errors are 4 times as costly as Type 2 errors. This will change the weight of Type 1 errors compared to Type 2 errors, and thus also the choice of the best alpha level.

res = optimal_alpha(power_function = "pwr.t.test(d=0.5, n=100, sig.level = x, type='two.sample', alternative='two.sided')$power", error = "minimal", costT1T2 = 4)

res$alpha
## [1] 0.01918735
res$beta
## [1] 0.1211773

Now, the alpha level that minimizes the weighted Type 1 and Type 2 error rates is 0.019.

Similarly, you can take into account the prior probabilities that either the null hypothesis is true (and you will observe a Type 1 error), or that the alternative hypothesis is true (and you will observe a Type 2 error). By incorporating these expectations, you can minimize or balance error rates in the long run (assuming your priors are correct). Priors can be specified using the prior_H1H0 argument, which by default is 1 (H1 and H0 are equally likely). Setting it to 4 means you think the alternative hypothesis (and hence, Type 2 errors) is 4 times more likely than the null hypothesis (and Type 1 errors).

res = optimal_alpha(power_function = "pwr.t.test(d=0.5, n=100, sig.level = x, type='two.sample', alternative='two.sided')$power", error = "minimal", prior_H1H0 = 2)

res$alpha
## [1] 0.07901679
res$beta
## [1] 0.03875676

If you think H1 is twice as likely to be true as H0, you need to worry less about Type 1 errors, and now the alpha that minimizes the weighted error rates is 0.079. It is always difficult to decide upon priors (unless you are Omniscient Jones), but even if you ignore them, you are implicitly deciding that H1 and H0 are equally plausible.
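Both the cost and the prior arguments amount to putting a weight on each error rate before optimizing. A small Python sketch (my own normal-approximation power function and names, not the blog's R code) makes the weighted objective explicit:

```python
from statistics import NormalDist

Z = NormalDist()

def weighted_error(alpha, d, n, costT1T2=1, prior_H1H0=1):
    # costT1T2 weights the Type 1 error rate; prior_H1H0 weights the
    # Type 2 error rate. Power uses the normal approximation to the
    # two-sided independent-groups t-test.
    ncp = d * (n / 2) ** 0.5
    z_crit = Z.inv_cdf(1 - alpha / 2)
    beta = 1 - (Z.cdf(ncp - z_crit) + Z.cdf(-ncp - z_crit))
    return costT1T2 * alpha + prior_H1H0 * beta

alphas = [i / 10000 for i in range(1, 5000)]
# Type 1 errors 4 times as costly: minimize 4*alpha + beta
a_cost4 = min(alphas, key=lambda a: weighted_error(a, 0.5, 100, costT1T2=4))
# H1 twice as likely as H0: minimize alpha + 2*beta
a_prior2 = min(alphas, key=lambda a: weighted_error(a, 0.5, 100, prior_H1H0=2))
```

Minimizing 4·alpha + beta gives an optimal alpha near 0.019, and minimizing alpha + 2·beta gives an optimal alpha near 0.079, matching the two examples above to within the approximation error.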

Conclusion

You can’t abandon a practice without an alternative. Minimizing the combined error rate, or balancing error rates, provides two alternatives to the normative practice of setting the alpha level to 5%. Together with the approach of reducing the alpha level as a function of the sample size, I invite you to explore ways to set error rates based on something other than convention. A downside of abandoning mindless statistics is that you need to think about difficult questions. How much more negative is a Type 1 error than a Type 2 error? Do you have any idea about the prior probabilities? And what is the smallest effect size of interest? Answering these questions is difficult, but considering them is important for any study you design. The experiments you run might very well be more informative, and more efficient. So give it a try.

References
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed). Hillsdale, N.J: L. Erlbaum Associates.
Fiedler, K., Kutzner, F., & Krueger, J. I. (2012). The Long Way From α-Error Control to Validity Proper: Problems With a Short-Sighted False-Positive Debate. Perspectives on Psychological Science, 7(6), 661–669. https://doi.org/10.1177/1745691612462587
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x
Miller, J., & Ulrich, R. (2019). The quest for an optimal alpha. PLOS ONE, 14(1), e0208631. https://doi.org/10.1371/journal.pone.0208631
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734

The New Heuristics

You can derive the age of a researcher based on the sample size they were told to use in a two independent group design. When I started my PhD, this number was 15, and when I ended, it was 20. This tells you I did my PhD between 2005 and 2010. If your number was 10, you have been in science much longer than I have, and if your number is 50, good luck with the final chapter of your PhD.
All these numbers are only sporadically the sample size you really need. As with a clock stuck at 9:30 in the morning, heuristics are sometimes right, but most often wrong. I think we rely way too often on heuristics for all sorts of important decisions we make when we do research. You can easily test whether you rely on a heuristic, or whether you can actually justify a decision you make. Ask yourself: Why?
I vividly remember talking to a researcher in 2012, a time when it started to become clear that many of the heuristics we relied on were wrong, and there was a lot of uncertainty about what good research practices looked like. She said: ‘I just want somebody to tell me what to do’. As psychologists, we work in a science where the answer to almost every research question is ‘it depends’. It should not be a surprise the same holds for how you design a study. For example, Neyman & Pearson (1933) perfectly illustrate how a statistician can explain the choices that need to be made, but in the end, only the researcher can make the final decision.
Due to a lack of training, most researchers do not have the skills to make these decisions. They need help, but do not even always have access to someone who can help them. It is therefore not surprising that articles and books that explain how to use a useful tool provide some heuristics to get researchers started. An excellent example of this is Cohen’s classic work on power analysis. Although you need to think about the statistical power you want, as a heuristic, a minimum power of 80% is recommended. Let’s take a look at how Cohen (1988) introduces this benchmark.
It is rarely ignored. Note that we have a meta-heuristic here. Cohen argues a Type 1 error is 4 times as serious as a Type 2 error, and the Type 1 error is at 5%. Why? According to Fisher (1935) because it is a ‘convenient convention’. We are building a science on heuristics built on heuristics.
There has been a lot of discussion about how we need to improve psychological science in practice, and what good research practices look like. In my view, we will not have real progress when we replace old heuristics by new heuristics. People regularly complain to me about people who use what I would like to call ‘The New Heuristics’ (instead of The New Statistics), or ask me to help them write a rebuttal to a reviewer who is too rigidly applying a new heuristic. Let me give some recent examples.
People who used optional stopping in the past, and have learned this is p-hacking, think you cannot look at the data as they come in (you can, when done correctly, using sequential analyses; see Lakens, 2014). People make directional predictions, but test them with two-sided tests (even when you can pre-register your directional prediction). They think you need 250 participants (as an editor of a flagship journal claimed), even though there is no magical number that leads to high enough accuracy. They think you always need to justify sample sizes based on a power analysis (as a reviewer of a grant proposal claimed when rejecting the proposal), even though there are many ways to justify sample sizes. They argue meta-analysis is not a ‘valid technique’ only because the meta-analytic estimate can be biased (ignoring that meta-analyses have many uses, including the analysis of heterogeneity, and that all tests can be biased). They think all research should be preregistered or published as Registered Reports, even when the main benefit (preventing inflation of error rates for hypothesis tests due to flexibility in the data analysis) is not relevant for all the research psychologists do. They think p-values are invalid and should be removed from scientific articles, even though in well-designed, controlled experiments they might be the outcome of interest, especially early on in new research lines. I could go on.
Change is like a pendulum, swinging from one side to the other of a multi-dimensional space. People might be too loose, or too strict, too risky, or too risk-averse, too sexy, or too boring. When there is a response to newly identified problems, we often see people overreacting. If you can’t justify your decisions, you will just be pushed from one extreme on one of these dimensions to the opposite extreme. What you need is the weight of a solid justification to be able to resist being pulled in the direction of whatever you perceive to be the current norm. Learning The New Heuristics (for example setting the alpha level to 0.005 instead of 0.05) is not an improvement – it is just a change.
If we teach people The New Heuristics, we will get lost in the Bog of Meaningless Discussions About Why These New Norms Do Not Apply To Me. This is a waste of time. From a good justification it logically follows whether something applies to you or not. Don’t discuss heuristics – discuss justifications.
‘Why’ questions come at different levels. Surface level ‘why’ questions are explicitly left to the researcher – no one else can answer them. Why are you collecting 50 participants in each group? Why are you aiming for 80% power? Why are you using an alpha level of 5%? Why are you using this prior when calculating a Bayes factor? Why are you assuming equal variances and using Student’s t-test instead of Welch’s t-test? Part of the problem I am addressing here is that we do not discuss which questions are up to the researcher, and which are questions on a deeper level that you can simply accept without needing to provide a justification in your paper. This makes it relatively easy for researchers to pretend some ‘why’ questions are on a deeper level, and can be assumed without having to be justified. A field needs a continuing discussion about what we expect researchers to justify in their papers (for example by developing improved and detailed reporting guidelines). This will be an interesting discussion to have. For now, let’s limit ourselves to surface level questions that were always left up to researchers to justify (even though some researchers might not know any better than using a heuristic). In the spirit of the name of this blog, let’s focus on 20% of the problems that will improve 80% of what we do.
My new motto is ‘Justify Everything’ (it also works as a hashtag: #JustifyEverything). Your first response will be that this is not possible. You will think this is too much to ask. This is because you think that you will have to be able to justify everything. But that is not my view on good science. You do not have the time to learn enough to be able to justify all the choices you need to make when doing science. Instead, you could be working in a team of as many people as you need so that within your research team, there is someone who can give an answer if I ask you ‘Why?’. As a rule of thumb, a large enough research team in psychology has between 50 and 500 researchers, because that is how many people you need to make sure one of the researchers is able to justify why research teams in psychology need between 50 and 500 researchers.
Until we have transitioned into a more collaborative psychological science, we will be limited in how much and how well we can justify our decisions in our scientific articles. But we will be able to improve. Many journals are starting to require sample size justifications, which is a great example of what I am advocating for. Expert peer reviewers can help by pointing out where heuristics are used, but justifications are possible (preferably in open peer review, so that the entire community can learn). The internet makes it easier than ever before to ask other people for help and advice. And as with anything in a job as difficult as science, just get started. The #Justify20% hashtag will work just as well for now.

Statistics Predicted a Healthier Medieval London Following the Black Death


The Black Death, a pandemic at its height in Europe during the mid-14th century, was a virulent killer. It was so effective that it wiped out approximately one third of Europe’s population. Recent studies have shown that the elderly and the sick were most susceptible. But was the Black Death a “smart” killer?

A recent PLOS ONE study indicates that the Black Death's virulence may have shaped genetic variation in the surviving human population by selecting against frailty, leaving subsequent outbreaks of the plague less deadly. By comparing survival rates and mortality risks in pre-Black Death and post-Black Death samples of a London population, alongside extrinsic factors such as differences in diet between the two groups, the researcher found that in London, on average, people lived longer following the plague than they did before it, despite repeated plague outbreaks. In other words, in terms of genetic variation, the Black Death positively affected the health of the surviving population.

To uncover differences in the health of medieval Londoners, Dr. Sharon DeWitte of the University of South Carolina examined 464 pre-Black Death individuals from three cemeteries and 133 post-Black Death individuals from one. She chose a diverse range of samples for a comprehensive view of the population, including both the rich and the poor, and women and children, but targeted one geographic location: London.

The ages-at-death of the samples were determined by calculating best estimates—in statistics these are called point estimates—based on particular indicators of age found on the skeletons’ hip and skull bones. Individuals’ ages were then evaluated against those in the Anthropological Database of Odense University, a pre-existing database comprising the Smithsonian’s Terry Collection and prior age-at-death data from 17th-century Danish parish records.

After estimating how old these individuals were when they died and comparing the age indicators against the Odense reference tool, the author conducted statistical analyses on the data to examine what the ages-at-death could tell us about the proportion of pre- and post-Black Death medieval Londoners who lived to a ripe old age, as well as the likelihood of death.

Survivorship was estimated using the Kaplan-Meier estimator, a standard nonparametric method for estimating a survival function, the probability of living beyond a given age, from observed lifetimes; here it was applied separately to the pre-Black Death and post-Black Death samples. The calculated differences were significant: in particular, the proportion of adults who lived beyond the age of 50 was much greater in the post-Black Death group than in the pre-Black Death group.
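To make the method concrete, here is a minimal Kaplan-Meier sketch in pure Python. This is an illustration of the estimator only, not the paper's code, and the ages and censoring flags in the usage example are made up.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function for right-censored data.

    times:  observed times (e.g., ages at death or at last observation)
    events: 1 if the death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) pairs at each death time.
    """
    data = sorted(zip(times, events))
    n = len(data)
    survival = 1.0
    curve = []
    at_risk = n
    i = 0
    while i < n:
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations tied at time t.
        while i < n and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Made-up example: five observed deaths, no censoring.
print(kaplan_meier([1, 2, 3, 4, 5], [1, 1, 1, 1, 1]))
```

Comparing two such curves (one per burial period) is what the survival plot below visualizes.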

Age-at-death Distributions


In the pre-Black Death group, death was most likely to occur between the ages of 10 and 19, as seen above.

The Kaplan-Meier survival plot below shows how the chances of survival, which decrease with age, differ between the pre-Black Death and post-Black Death groups.

Survival Functions


As the survival plot indicates, post-Black Death Londoners lived longer than their pre-Black Death predecessors.

Finally, Dr. DeWitte estimated the risk of mortality by fitting the age data to the Gompertz hazard model, which captures the typical pattern of exponentially increasing mortality risk with age. She found that, overall, post-Black Death Londoners faced lower risks of mortality than their pre-Black Death counterparts.
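The Gompertz model itself is compact: the hazard (instantaneous mortality risk) grows exponentially with age, h(x) = alpha * exp(beta * x). A minimal sketch in Python follows; the parameter values in the usage example are arbitrary placeholders, not estimates fitted to the paper's data.

```python
import math


def gompertz_hazard(age, alpha, beta):
    """Gompertz hazard: mortality risk rising exponentially with age."""
    return alpha * math.exp(beta * age)


def gompertz_survival(age, alpha, beta):
    """Survival function implied by the Gompertz hazard:
    S(x) = exp(-(alpha/beta) * (exp(beta * x) - 1))."""
    return math.exp(-(alpha / beta) * math.expm1(beta * age))


# Hypothetical parameters, for illustration only.
alpha, beta = 0.001, 0.08
print(gompertz_hazard(60, alpha, beta), gompertz_survival(60, alpha, beta))
```

Comparing fitted alpha and beta between the pre- and post-Black Death samples is what lets one say which group faced higher mortality risk at a given age.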

To make long and complicated methodology short, these analyses indicate that post-Black Death Londoners appear to have lived longer than pre-Black Death Londoners. The author estimates that the general population of London enjoyed a period of about 200 years of improved survivorship, based on these results.

The virulent killer, the Black Death, may have helped select for a healthier London by influencing genetic variation, at least in the short term. However, to better understand the improved quality of life of post-Black Death London, the author suggests further study to disentangle two major factors: the selectivity of the Black Death, coupled with improvements in lifestyle for post-Black Death individuals. For example, the massive depopulation in Europe resulted in increased wages for workers and improvements to diet following the plague, which also likely improved health for medieval Londoners. By unraveling intrinsic, biological changes in genetic variation from outside extrinsic factors like improvements in diet, it may be possible to better understand the aftermath of one of the most devastating killers in infectious disease history.

The EveryONE blog has more on the medieval killer here.

Citation: DeWitte SN (2014) Mortality Risk and Survival in the Aftermath of the Medieval Black Death. PLoS ONE 9(5): e96513. doi:10.1371/journal.pone.0096513

Image 1: The Black Death from Simple Wikipedia

Image 2: pone.0096513

Image 3: pone.0096513

The post Statistics Predicted a Healthier Medieval London Following the Black Death appeared first on EveryONE.

Biking the Distance… In 30 Minutes or Less: The Impact of Cost and Location on Urban Bike Share Systems


Those of us who commute to the PLOS San Francisco office have noticed the emergence of bike share stations cropping up along the San Francisco Bay and on the city’s main drag. And we’re not alone here in San Francisco: the picture above is from the New York City Department of Transportation’s bike share. Around the world, bike share systems, which aim to make bicycles available on a short-term basis to anyone, have experienced massive growth as cities work to decrease gas emissions and encourage people to stay active. However, not everyone is ready to forgo the convenience of four wheels for two just yet. To understand why more people haven’t made the switch from cars to bike share systems, the author of a recently published PLOS ONE paper delved into possible factors affecting our willingness to don a helmet and cycle the distance.

Using publicly available data from Washington DC and Boston, Dr. Jurdak, an Australian researcher, conducted a series of statistical analyses designed to examine the impact of bike share system pricing and neighborhood layout on potential bikers. It turns out cost is a major factor for commuters and tourists alike, but distance is not. Although the analyses showed a bias toward shorter trips, peaking at around 6 minutes and averaging 13 minutes per trip, the likelihood of a trip dropped off sharply right around the 30-minute mark.

Why the decline at around 30 minutes? In both Boston and Washington DC, trips under 30 minutes incur no additional cost in the bike share pricing system. Registered users, typically commuters, pay an initial registration fee but enjoy a grace period for all trips completed in under 30 minutes; trips extending beyond 30 minutes incur additional fees. In other words, bike share users try to maximize the distance biked and time spent without incurring any additional cost, a behavior researchers have labeled 'cost sensitivity.'
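A grace-period fee schedule like this is a simple step function of trip duration. The sketch below is purely illustrative: the fee amount and block length are hypothetical, not the actual Boston or Washington DC rate schedules.

```python
def trip_fee(minutes, grace=30, block=30, fee_per_block=4.0):
    """Hypothetical bike share fee: free within the grace period,
    then a flat fee per additional (partial) block of minutes."""
    if minutes <= grace:
        return 0.0
    extra = minutes - grace
    blocks = -(-extra // block)  # ceiling division without math.ceil
    return blocks * fee_per_block


# A 29-minute trip is free; a 31-minute trip jumps to the first fee tier.
print(trip_fee(29), trip_fee(31))
```

The discontinuity at 30 minutes is exactly what shows up in the trip-duration data as a sharp drop in the likelihood of longer trips.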

Statistical analyses also demonstrated the same cost sensitivity in casual users, or those who do not have a monthly or annual membership, and who likely use the bike share system for tourism. However, instead of noting a decline in the likelihood of trips around 30 minutes, Dr. Jurdak found a decline for casual users at around 60 minutes (another price point).

On the other hand, despite sensitivity to cost, bikers appeared less dissuaded from bike trips based on neighborhood layout. Although stations in Boston were on average much closer to other nearby stations than in Washington DC, in general, the trip distribution for both cities was remarkably similar. Perhaps not surprisingly, the most popular routes taken in both Boston and Washington DC were relatively flat.

To encourage more people to cut down on car usage and grab a rental bike, Dr. Jurdak recommends that cities consider incentivizing their constituents with what they care about: cost. Modified prices for bike rental during peak hours may decrease car traffic on congested roads, and extended grace periods for biking difficult topography, such as up a steep San Francisco hill, might encourage us to bike even as the clock ticks toward 30 minutes and the added fee that follows. As cities look to evolve public transportation systems and increase responsible urban mobility, and as city dwellers look for cost-effective ways to get around, bike share programs continue to offer healthy solutions for all, even at 30 minutes or less.

For more on the effects bike share systems are having around the world, check out another recent PLOS ONE paper and the researchers’ blog post on bike webs, visualizations of bike share schemes.

Citations:

Jurdak R (2013) The Impact of Cost and Network Topology on Urban Mobility: A Study of Public Bicycle Usage in 2 U.S. Cities. PLoS ONE 8(11): e79396. doi:10.1371/journal.pone.0079396

Image 1: Citi Bike Launch by New York City Department of Transportation

Policy Exceptions in RoMEO

Readers of this blog will have noticed the occasional notification of new exceptions that have been added to RoMEO.

But what are these exceptions and why are they important?

RoMEO has traditionally focussed on the general policies of publishers, those that cover the majority of their journal titles. However, some titles may have a different embargo period or use a Creative Commons licence. Although we have tried to impart this information in the General Conditions field, this has become cumbersome and still requires users to work out for themselves which embargo period applies to their journal of interest.

We started adding exceptions in November 2011, and are continuing the process as they are identified and we clarify the policy exceptions with publishers.

Some exceptions cover only one journal title; others cover many more.

To date, RoMEO lists a total of 59 exceptions from 25 publishers. We are still working through publishers we have identified as having possible exceptions and hope to add more in the future.

A list of the Exceptions added so far:

  • Akademie Ved Ceske Republiky, Knihovna
    • Knihy a dejiny [6/3/12]
  • American Medical Association
    • JAMA  [17/11/11]
  • American Society for Microbiology
    • mBio [26/4/12]
  • ASIS&T
    • Bulletin – [17/11/11]
    •  JASIS&T – [17/11/11]
  • American Society of Clinical Oncology
    • JCO [29/11/11]
    • JOP [29/11/11]
  • BMJ Publishing Group
    • BMJ [30/1/12]
    • BMj Open [18/4/12]
  • ediPUCRS
    • Analise [18/4/12]
    • BELT [18/4/12]
  • EDP Sciences
    • EDJ [26/4/12]
    • Creative Commons Attribution Non-Commercial [26/4/12]
  • Institut Français d’Etudes Andines (IFEA)
    • Bulletin de l’IFEA [23/3/12]
  • Laboratório Nacional de Energia e Geologia
    • Corrosão e Protecção de Materiais [13/12/11]
  • MIT Press
    • STM [17/11/11]
    •  Arts and Humanities [17/11/11]
    •  Economics [17/11/11]
  • Oxford University Press
    • Policy A – [16/11/11]
    • Policy A1 – [15/11/11]
    • Policy B – [16/11/11]
    • Policy B1 – [15/11/11]
    • Policy C –  [16/11/11]
    • Policy D – [15/11/11]
    • Policy E – [16/11/11]
    • Policy F – [15/11/11]
    • Policy G – [15/11/11]
    • Policy H – [15/11/11]
    • Policy I – [15/11/11]
    • Policy J – [15/11/11]
    • Policy K – [15/11/11]
    • Policy L – [15/11/11]
    • Policy M – [15/11/11]
    • Policy N – [16/11/11]
    • Policy O – [15/11/11]
    • Policy P [12/9/12]
    • Policy Q [12/9/12]
  •  Pion
    • i-Perception [10/5/12]
    • Perception [10/5/12]
  • Royal Society
    • Open Biology [19/7/12]
  • Taylor & Francis
    • SSH Titles [5/12/11]
    • STM Titles [5/12/11]
  • Taylor & Francis (Psychology Press)
    • STM Titles [5/12/11]
  • Taylor & Francis (Routledge)
    • LIS Titles [1/12/11]
    • SSH Titles [1/12/11]
    • STM Titles [1/12/11]
  • Universidad de Murcia [14/9/12]
    • Glosas Didacticas
  • Universidade de Brasilia
    • Attribution Non-Commercial  [17/9/12]
    • Attribution Non-Commercial No-Derivatives  [17/9/12]
    • Attribution Non-Commercial Share-Alike  [17/9/12]
  • Universite Paris 3, Institut des Hautes Etudes de l’Amérique Latine (IHEAL) [3/1/12]
    • Cahiers des Ameriques Latines
  • University of Chicago Press
    • Publications of the Astronomical Society of the Pacific [17/11/11]
  • Università degli Studi di Milano (University of Milan)
    • Attribution [17/4/12]
    • Share Alike [17/4/12]
    • Enthymema [17/4/12]
  • University of Texas Press
    • Cinema Journal [17/11/11]
  • Vittorio Klostermann
    • ZfBB [29/11/11]
  • Wildlife Society
    • Journal of Wildlife Management [18/4/12]