Tukey on Decisions and Conclusions

In 1955 Tukey gave a dinner talk about the difference between decisions and conclusions at a meeting of the Section of Physical and Engineering Science of the American Statistical Association. The talk was published in 1960. The distinction relates directly to different goals researchers might have when they collect data. This blog is largely a summary of his paper.


Tukey was concerned about the ‘tendency of decision theory to attempt to conquest all of statistics’. In hindsight, he needn’t have worried. In the social sciences, most statistics textbooks do not even discuss decision theory. His goal was to distinguish decisions from conclusions, and to carve out a space for ‘conclusion theory’ to complement decision theory.


In practice, making a decision means to ‘decide to act for the present as if’. Possible actions are defined, possible states of nature identified, and we make an inference about each state of nature. Decisions can be made even when we remain extremely uncertain about any ‘truth’. Indeed, in extreme cases we can even make decisions without access to any data. We might even decide to act as if two mutually exclusive states of nature are true! For example, we might buy a train ticket for a holiday three months from now, but also take out life insurance in case we die tomorrow.   


Conclusions differ from decisions. First, conclusions are established without taking consequences into consideration. Second, conclusions are used to build up a ‘fairly well-established body of knowledge’. As Tukey writes: “A conclusion is a statement which is to be accepted as applicable to the conditions of an experiment or observation unless and until unusually strong evidence to the contrary arises.” A conclusion is not a decision on how to act in the present. Conclusions are to be accepted, and thereby incorporated into what Frick (1996) calls a ‘corpus of findings’. According to Tukey, conclusions are used to narrow down the number of working hypotheses still considered consistent with observations. Conclusions should be reached, not based on their consequences, but because of their lasting (but not everlasting, as conclusions can now and then be overturned by new evidence) contribution to scientific knowledge.


Tests of hypotheses


According to Tukey, a test of hypotheses can have two functions. The first function is as a decision procedure, and the second function is to reach a conclusion. In a decision procedure the goal is to choose a course of action given an acceptable risk. This risk can be high. For example, a researcher might decide not to pursue a research idea after a first study, designed to have 80% power for a smallest effect size of interest, yields a non-significant result. The Type 2 error rate is at most 20% (as long as the true effect is at least as large as the smallest effect size of interest), but the researcher might have enough good research ideas to not care.
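To make the 80% power example concrete, here is a minimal sketch of the arithmetic. It assumes a two-sided, two-sample test with a normal approximation, and a smallest effect size of interest of d = 0.5. Both choices, and the function names, are illustrative assumptions and do not come from Tukey's paper.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for a standardized effect d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the z-statistic if the true standardized effect is d
    ncp = d * sqrt(n_per_group / 2)
    # Probability that the statistic falls beyond either critical value
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)


def n_for_power(d, target_power=0.80, alpha=0.05):
    """Smallest per-group sample size that reaches the target power."""
    n = 2
    while power_two_sample(d, n, alpha) < target_power:
        n += 1
    return n


sesoi = 0.5  # hypothetical smallest effect size of interest (Cohen's d)
n = n_for_power(sesoi)
type2_at_sesoi = 1 - power_two_sample(sesoi, n)
type2_if_larger = 1 - power_two_sample(0.8, n)  # a true effect above the SESOI

print(f"n per group: {n}")
print(f"Type 2 error rate at the SESOI: {type2_at_sesoi:.3f}")
print(f"Type 2 error rate if the true effect is d = 0.8: {type2_if_larger:.3f}")
```

The Type 2 error rate is 'at most' 20% because power only goes up if the true effect is larger than the smallest effect size of interest the study was designed for.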


The second function is to reach a conclusion. This is done, according to Tukey, by controlling the Type 1 and Type 2 error rates at ‘suitably low levels’ (Note: Tukey’s discussion of concluding an effect is absent is hindered somewhat by the fact that equivalence tests were not yet widely established in 1955 – Hodges & Lehmann’s paper appeared in 1954). Low error rates, such as the conventions to use a 5% or 1% alpha level, are needed to draw conclusions that can enter the corpus of findings (even though some of these conclusions will turn out to be wrong, in the long run).
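What 'wrong, in the long run' means for the Type 1 error rate can be illustrated with a small simulation. The sketch below assumes the simplest possible setting (a one-sample z-test with known standard deviation, and a null hypothesis that is exactly true in every study). The number of studies and the sample size are arbitrary, hypothetical values.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def long_run_false_positive_rate(alpha, n_studies=10_000, n_per_study=20, seed=1):
    """Fraction of simulated studies that claim an effect when the null is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_studies):
        # Data are drawn from a population with a true mean of 0 and sd of 1
        sample = [rng.gauss(0, 1) for _ in range(n_per_study)]
        z_stat = mean(sample) * sqrt(n_per_study)  # known sd = 1
        if abs(z_stat) > z_crit:
            false_positives += 1
    return false_positives / n_studies


for alpha in (0.05, 0.01):
    rate = long_run_false_positive_rate(alpha)
    print(f"alpha = {alpha}: proportion of false conclusions = {rate:.3f}")
```

Under these assumptions the proportion of erroneous conclusions stays close to the chosen alpha level, which is the sense in which the error rate is controlled but never zero.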


Why would we need conclusions?


One might reasonably wonder if we need conclusions in science. Tukey also ponders this question in Appendix 2. He writes “Science, in the broadest sense, is both one of the most successful of human affairs, and one of the most decentralized. In principle, each of us puts his evidence (his observations, experimental or not, and their discussion) before all the others, and in due course an adequate consensus of opinion develops.” He argues not for an epistemological reason, nor for a statistical reason, but for a sociological reason. Tukey writes: “There are four types of difficulty, then, ranging from communication through assessment to mathematical treatment, each of which by itself will be sufficient, for a long time, to prevent the replacement, in science, of the system of conclusions by a system based more closely on today’s decision theory.” He notes how scientists can no longer get together in a single room (as was somewhat possible in the early decades of the Royal Society of London) to reach consensus about decisions. Therefore, they need to communicate conclusions, as “In order to replace conclusions as the basic means of communication, it would be necessary to rearrange and replan the entire fabric of science.”


I hadn’t read Tukey’s paper when we wrote our preprint “The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests”. In this preprint, we also discuss a sociological reason for the presence of dichotomous claims in science. We also ask: “Would it be possible to organize science in a way that relies less on tests of competing theories to arrive at intersubjectively established facts about phenomena?” and similarly conclude: “Such alternative approaches seem feasible if stakeholders agree on the research questions that need to be investigated, and methods to be utilized, and coordinate their research efforts”.  We should add a citation to Tukey’s 1960 paper.


Is the goal of a study a conclusion, a decision, or both?


Tukey writes he “looks forward to the day when the history and status of tests of hypotheses will have been disentangled.” I think that in 2022 that day has not yet come. At the same time, Tukey admits in Appendix 1 that the two are sometimes intertwined.


A situation Tukey does not discuss, but that I think is especially difficult to disentangle, is a cumulative line of research. Although I would prefer to only build on an established corpus of findings, this is simply not possible. Not all conclusions in the current literature are reached with low error rates. This is true both for claims about the absence of an effect (which are rarely based on an equivalence test against a smallest effect size of interest with a low error rate) and for claims about the presence of an effect, not just because of p-hacking, but also because I might want to build on an exploratory finding from a previous study. In such cases, I would like to be able to conclude the effects I build on are established findings, but more often than not, I have to decide these effects are worth building on. The same holds for choices about the design of a set of studies in a research line. I might decide to include a factor in a subsequent study, or drop it. These decisions are based on conclusions with low error rates when I have the resources to collect large samples and perform replication studies, but other times they involve decisions about how to act in my next study with quite considerable risk.


We allow researchers to publish feasibility studies, pilot studies, and exploratory studies. We don’t require every study to be a Registered Report or Phase 3 trial. Not all information in the literature that we build on has been established with the rigor Tukey associates with conclusions. And the replication crisis has taught us that more conclusions from the past are later rejected than we might have thought based on the alpha levels reported in the original articles. And in some research areas, where data is scarce, we might need to accept that, if we want to learn anything, the conclusions will always be more tentative (and the error rates accepted in individual studies will be higher) than in research areas where data is abundant.


Even if decisions and conclusions cannot be completely disentangled, reflecting on their relative differences is very useful, as I think it can help us to clarify the goal we have when we collect data.


For a 2013 blog post by Justin Esarey, who found the distinction a bit less useful than I found it, see https://polmeth.org/blog/scientific-conclusions-versus-scientific-decisions-or-we%E2%80%99re-having-tukey-thanksgiving



Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1(4), 379–390. https://doi.org/10.1037/1082-989X.1.4.379

Tukey, J. W. (1960). Conclusions vs decisions. Technometrics, 2(4), 423–433.

Uygun Tunç, D., Tunç, M. N., & Lakens, D. (2021). The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests. PsyArXiv. https://doi.org/10.31234/osf.io/af9by






Review of "The Generalizability Crisis" by Tal Yarkoni

A response to this blog by Tal Yarkoni is here.
In a recent preprint titled “The Generalizability Crisis”, Tal Yarkoni examines whether the current practice of how psychologists generalize from studies to theories is problematic. He writes: “The question taken up in this paper is whether or not the tendency to generalize psychology findings far beyond the circumstances in which they were originally established is defensible. The case I lay out in the next few sections is that it is not, and that unsupported generalization lies at the root of many of the methodological and sociological challenges currently affecting psychological science.” We had a long Twitter discussion about the paper, and then read it in our reading group. In this review, I try to make my thoughts about the paper clear in one place, which might be useful if we want to continue to discuss whether there is a generalizability crisis, or not.

First, I agree with Yarkoni that almost all the proposals he makes in the section “Where to go from here?” are good suggestions. I don’t think they follow logically from his points about generalizability, as I detail below, but they are nevertheless solid suggestions a researcher should consider. Second, I agree that there are research lines in psychology where modelling more things as random factors will be productive, and a forceful manifesto (even if it is slightly less practical than similar earlier papers) might be a wake-up call for people who had ignored this issue until now.

Beyond these two points of agreement, I found the main thesis in his article largely unconvincing. I don’t think there is a generalizability crisis, but the article is a nice illustration of why philosophers like Popper abandoned the idea of an inductive science. When Yarkoni concludes that “A direct implication of the arguments laid out above is that a huge proportion of the quantitative inferences drawn in the published psychology literature are so inductively weak as to be at best questionable and at worst utterly insensible.”, I am primarily surprised he believes induction is a defensible philosophy of science. There is a very brief discussion of views by Popper, Meehl, and Mayo on page 19, but their work on testing theories is proposed as a probably not feasible solution – which is peculiar, because these authors would probably disagree with most of the points made by Yarkoni, and I would expect at least somewhere in the paper a discussion comparing induction against the deductive approach (especially since the deductive approach is arguably the dominant approach in psychology, and therefore none of the generalizability issues raised by Yarkoni are a big concern). Because I believe the article starts from a faulty position (scientists are not concerned with induction, but use deductive approaches) and because Yarkoni provides no empirical support for any of his claims that generalizability has led to huge problems (such as incredibly high Type 1 error rates), I remain unconvinced there is anything remotely close to the generalizability crisis he so evocatively argues for. The topic addressed by Yarkoni is very broad. It probably needs a book length treatment to do it justice. My review is already way too long, and I did not get into the finer details of the argument. But I hope this review helps to point out the parts of the manuscript where I feel important arguments lack a solid foundation, and where issues that deserve to be discussed are ignored.

Point 1: “Fast” and “slow” approaches need some grounding in philosophy of science.

Early in the introduction, Yarkoni says there is a “fast” and “slow” approach of drawing general conclusions from specific observations. Whenever people use words that don’t exactly describe what they mean, putting them in quotation marks is generally not a good idea. The “fast” and “slow” approaches he describes are not, I believe upon closer examination, two approaches “of drawing general conclusions from specific observations”.

The difference is actually between induction (the “slow” approach of generalizing from single observations to general observations) and deduction, as proposed by for example Popper. As Popper writes “According to the view that will be put forward here, the method of critically testing theories, and selecting them according to the results of tests, always proceeds on the following lines. From a new idea, put up tentatively, and not yet justified in any way—an anticipation, a hypothesis, a theoretical system, or what you will—conclusions are drawn by means of logical deduction.”

Yarkoni incorrectly suggests that “upon observing that a particular set of subjects rated a particular set of vignettes as more morally objectionable when primed with a particular set of cleanliness-related words than with a particular set of neutral words, one might draw the extremely broad conclusion that ‘cleanliness reduces the severity of moral judgments’”. This reverses the scientific process as proposed by Popper, which is (as several people have argued, see below) the dominant approach to knowledge generation in psychology. The authors are not concluding that “cleanliness reduces the severity of moral judgments” from their data. This would be induction. Instead, they posited that “cleanliness reduces the severity of moral judgments”, collected data and performed an empirical test, and found their hypothesis was corroborated. In other words, the hypothesis came first. It is not derived from the data – the hypothesis is what led them to collect the data.

Yarkoni deviates from what is arguably the common approach in psychological science, and suggests induction might actually work: “Eventually, if the effect is shown to hold when systematically varying a large number of other experimental factors, one may even earn the right to summarize the results of a few hundred studies by stating that ‘cleanliness reduces the severity of moral judgments’”. This approach to science flies right in the face of Popper (1959/2002, p. 10), who says: “I never assume that we can argue from the truth of singular statements to the truth of theories. I never assume that by force of ‘verified’ conclusions, theories can be established as ‘true’, or even as merely ‘probable’.” Similarly, Lakatos (1978, p. 2) writes: “One can today easily demonstrate that there can be no valid derivation of a law of nature from any finite number of facts; but we still keep reading about scientific theories being proved from facts. Why this stubborn resistance to elementary logic?” I am personally on the side of Popper and Lakatos, but regardless of my preferences, Yarkoni needs to provide some argument that his inductive approach to science has any possibility of being a success, preferably by embedding his views in some philosophy of science. I would also greatly welcome learning why Popper and Lakatos are wrong. Such an argument, which would overthrow the dominant model of knowledge generation in psychology, could be impactful, although a priori I doubt it will be very successful.

Point 2: Titles are not evidence for psychologists’ tendency to generalize too quickly.

This is a minor point, but I think a good illustration of the weakness of some of the main arguments that are made in the paper. On the second page, Yarkoni argues that “the vast majority of psychological scientists have long operated under a regime of (extremely) fast generalization”. I don’t know about the vast majority of scientists, but Yarkoni himself is definitely using fast generalization. He looked through a single journal, and found 3 titles that made general statements (e.g., “Inspiration Encourages Belief in God”). When I downloaded and read this article, I noticed the discussion contains a ‘constraint on generalizability’ statement, following Simons et al. (2017). The authors wrote: “We identify two possible constraints on generality. First, we tested our ideas only in American and Korean samples. Second, we found that inspiring events that encourage feelings of personal insignificance may undermine these effects.” Is Yarkoni not happy with these two sentences clearly limiting the generalizability in the discussion?

For me, this observation raised serious concerns about the statement Yarkoni makes that, simply from the titles of scientific articles, we can make a statement about whether authors make ‘fast’ or ‘slow’ generalizations. One reason is that Yarkoni examined titles from a scientific journal that adheres to the publication manual of the APA. In the section on titles, the APA states: “A title should summarize the main idea of the manuscript simply and, if possible, with style. It should be a concise statement of the main topic and should identify the variables or theoretical issues under investigation and the relationship between them. An example of a good title is ‘Effect of Transformed Letters on Reading Speed.’” To me, it seems the authors are simply following the APA publication manual. I do not think their choice for a title provides us with any insight whatsoever about the tendency of authors to have a preference for ‘fast’ generalization. Again, this might be a minor point, but I found this an illustrative example of the strength of arguments in other places (see the next point for the most important example). Yarkoni needs to make a case that scientists are overgeneralizing, for there to be a generalizability crisis – but he does so unconvincingly. I sincerely doubt researchers expect their findings to generalize to all possible situations mentioned in the title, I doubt scientists believe titles are the place to accurately summarize limits of generalizability, and I doubt Yarkoni has made a strong point that psychologists overgeneralize based on this section. More empirical work would be needed to build a convincing case (e.g., code how researchers actually generalize their findings in a random selection of 250 articles, taking into account Gricean communication norms (especially the cooperative principle) in scientific articles).

Point 3: Theories and tests are not perfectly aligned in deductive approaches.

After explaining that psychologists use statistics to test predictions based on experiments that are operationalizations of verbal theories, Yarkoni notes: “From a generalizability standpoint, then, the key question is how closely the verbal and quantitative expressions of one’s hypothesis align with each other.”

Yarkoni writes: “When a researcher verbally expresses a particular hypothesis, she is implicitly defining a set of admissible observations containing all of the hypothetical situations in which some measurement could be taken that would inform that hypothesis. If the researcher subsequently asserts that a particular statistical procedure provides a suitable test of the verbal hypothesis, she is making the tacit but critical assumption that the universe of admissible observations implicitly defined by the chosen statistical procedure (in concert with the experimental design, measurement model, etc.) is well aligned with the one implicitly defined by the qualitative hypothesis. Should a discrepancy between the two be discovered, the researcher will then face a choice between (a) working to resolve the discrepancy in some way (i.e., by modifying either the verbal statement of the hypothesis or the quantitative procedure(s) meant to provide an operational parallel); or (b) giving up on the link between the two and accepting that the statistical procedure does not inform the verbal hypothesis in a meaningful way.”

I highlighted what I think is the critical point in a bold font. To generalize from a single observation to a general theory through induction, the sample and the test should represent the general theory. This is why Yarkoni is arguing that there has to be a direct correspondence between the theoretical model, and the statistical test. This is true in induction.

If I want to generalize beyond my direct observations, which are rarely sampled randomly from all possible factors that might impact my estimate, I need to account for uncertainty in the things I have not observed. As Yarkoni clearly explains, one does this by adding random factors to a model. He writes (p. 7) “Each additional random factor one adds to a model licenses generalization over a corresponding population of potential measurements, expanding the scope of inference beyond only those measurements that were actually obtained. However, adding random factors to one’s model also typically increases the uncertainty with which the fixed effects of interest are estimated”. You don’t need to read Popper to see the problem here – if you want to generalize to all possible random factors, there are so many of them, you will never be able to overcome the uncertainty and learn anything. This is why inductive approaches to science have largely been abandoned. As Yarkoni accurately summarizes based on a large multi-lab study on verbal overshadowing by Alogna et al.: “given very conservative background assumptions, the massive Alogna et al. study – an initiative that drew on the efforts of dozens of researchers around the world – does not tell us much about the general phenomenon of verbal overshadowing. Under more realistic assumptions, it tells us essentially nothing.” This is also why Yarkoni’s first practical recommendation on how to move forward is to not solve the problem, but to do something else: “One perfectly reasonable course of action when faced with the difficulty of extracting meaningful, widely generalizable conclusions from effects that are inherently complex and highly variable is to opt out of the enterprise entirely.”
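The claim that each added random factor increases the uncertainty of the fixed effect can be made concrete with a deliberately simplified sketch. It assumes a design in which every stimulus is shown to its own independent group of subjects, that each stimulus has its own true effect drawn from a normal distribution with standard deviation tau, and that residual noise has standard deviation sigma. The function names and all numbers are hypothetical and only serve as an illustration.

```python
from math import sqrt


def se_stimuli_fixed(sigma, n_subj, n_stim):
    """Standard error of the mean effect when inference is limited to the stimuli used."""
    return sigma / sqrt(n_subj * n_stim)


def se_stimuli_random(sigma, tau, n_subj, n_stim):
    """Standard error when generalizing to a population of stimuli.

    The between-stimulus variance tau**2 is divided only by the number of
    stimuli, so collecting more subjects cannot reduce that part.
    """
    return sqrt(tau ** 2 / n_stim + sigma ** 2 / (n_subj * n_stim))


sigma = 1.0   # hypothetical residual sd of a single observation
tau = 0.3     # hypothetical sd of the true effect across stimuli
n_stim = 10   # number of stimuli in the study

for n_subj in (50, 500, 5000):
    fixed = se_stimuli_fixed(sigma, n_subj, n_stim)
    rand = se_stimuli_random(sigma, tau, n_subj, n_stim)
    print(f"subjects per stimulus = {n_subj:>4}: "
          f"SE (fixed) = {fixed:.4f}, SE (random) = {rand:.4f}")

# Lower bound on the random-effects standard error, however many subjects are tested
print(f"floor set by stimulus variation: {tau / sqrt(n_stim):.4f}")
```

Under these assumptions the standard error with stimuli treated as random never drops below tau divided by the square root of the number of stimuli. Every further random factor adds a term of this kind, which is why generalizing over all conceivable factors quickly becomes uninformative.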

This is exactly the reason Popper (among others) rejected induction, and proposed a deductive approach. Why isn’t the alignment between theories and tests raised by Yarkoni a problem for the deductive approach proposed by Popper, Meehl, and Mayo? The reason is that the theory is tentatively posited as true, but in no way believed to be a complete representation of reality. This is an important difference. Yarkoni relies on an inductive approach, and thus the test needs to be aligned with the theory, and the theory defines “a set of admissible observations containing all of the hypothetical situations in which some measurement could be taken that would inform that hypothesis.” For deductive approaches, this is not true.

For philosophers of science like Popper and Lakatos, a theory is not a complete description of reality. Lakatos writes about theories: “Each of them, at any stage of its development, has unsolved problems and undigested anomalies. All theories, in this sense, are born refuted and die refuted.” Lakatos gives the example that Newton’s Principia could not even explain the motion of the moon when it was published. The main point here: All theories are wrong. The fact that all theories (or models) are wrong should not be surprising. Box’s quote “All models are wrong, some are useful” is perhaps best known, but I prefer Box (1976) on parsimony: “Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William Ockham (1285-1349) he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity (Ockham’s knife).” He follows this up by stating “Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.”

In a deductive approach, the goal of a theoretical model is to make useful predictions. I doubt anyone believes that any of the models they are currently working on is complete. Some researchers might follow an instrumentalist philosophy of science, and don’t expect their theories to be anything more than useful tools. Lakatos’s (1978) main contribution to philosophy of science was to develop a way to deal with our incorrect theories, admitting that all need adjustment, but that some adjustments lead to progressive research lines, and others to degenerative research lines.

In a deductive model, it is perfectly fine to posit a theory that eating ice-cream makes people happy, without assuming this holds for all flavors, across all cultures, at all temperatures, and is irrespective of the amount of ice-cream eaten previously, and many other factors. After all, it is just a tentative model that we hope is simple enough to be useful, and that we expect to become more complex as we move forward. As we increase our understanding of food preferences, we might be able to modify our theory, so that it is still simple, but also allows us to predict the fact that eggnog and bacon flavoured ice-cream do not increase happiness (on average). The most important thing is that our theory is tentative, and posited to allow us to make good predictions. As long as the theory is useful, and we have no alternatives to replace it with, the theory will continue to be used – without any expectation that it will generalize to all possible situations. As Box (1976) writes: “Matters of fact can lead to a tentative theory. Deductions from this tentative theory may be found to be discrepant with certain known or specially acquired facts. These discrepancies can then induce a modified, or in some cases a different, theory.” A discussion of this large gap between Yarkoni and deductive approaches proposed by Popper and Meehl, where Yarkoni thinks theories and tests need to align, and deductive approaches see theories as tentative and wrong, should be included, I think.

Point 4: The dismissal of risky predictions is far from convincing (and generalizability is typically a means to risky predictions, not a goal in itself).

If we read Popper (but also on the statistical side the work of Neyman) we see that induction as a possible goal in science is clearly rejected. Yarkoni mentions deductive approaches briefly in his section on adopting better standards, in the sub-section on making riskier predictions. I intuitively expected this section to be crucial – after all, it finally turns to those scholars who would vehemently disagree with most of Yarkoni’s arguments in the preceding sections – but I found this part rather disappointing. Strangely enough, Yarkoni simply proposes predictions as a possible solution – but since the deductive approach goes directly against the inductive approach proposed by Yarkoni, it seems very weird to just mention risky predictions as one possible solution, when it is actually a completely opposite approach that rejects most of what Yarkoni argues for. Yarkoni does not seem to believe that the deductive mode proposed by Popper, Meehl, and Mayo, a hypothesis testing approach that is arguably the dominant approach in most of psychology (Cortina & Dunlap, 1997; Dienes, 2008; Hacking, 1965), has a lot of potential. The reason he doubts severe tests of predictions will be useful is that “in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding” (Yarkoni, p. 19). This could be resolved if risky predictions were possible, which Yarkoni doubts.

Yarkoni’s criticism of the possibility of severe tests is regrettably weak. Yarkoni says that “Unfortunately, in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding.” From his references (Cohen, Lykken, Meehl) we can see he refers to the crud factor, or the idea that the null hypothesis is always false. As we recently pointed out in a review paper on crud (Orben & Lakens, 2019), Meehl and Lykken disagreed about the definition of the crud factor, the evidence of crud in some datasets cannot be generalized to all studies in psychology, and “The lack of conceptual debate and empirical research about the crud factor has been noted by critics who disagree with how some scientists treat the crud factor as an ‘axiom that needs no testing’ (Mulaik, Raju, & Harshman, 1997).” Altogether, I am very unconvinced that this cursory reference to crud makes a convincing point that “there are pervasive and typically very plausible competing explanations for almost every finding”. Risky predictions seem possible, to me, and demonstrating the generalizability of findings is actually one way to perform a severe test.

When Yarkoni discusses risky predictions, he sticks to risky quantitative predictions. As explained in Lakens (2020), “Making very narrow range predictions is a way to make it statistically likely to falsify your prediction if it is wrong. But the severity of a test is determined by all characteristics of a study that increases the capability of a prediction to be wrong, if it is wrong. For example, by predicting you will only observe a statistically significant difference from zero in a hypothesis test if a very specific set of experimental conditions is met that all follow from a single theory, it is possible to make theoretically risky predictions.” I think the reason most psychologists perform studies that demonstrate the generalizability of their findings has nothing to do with their desire to inductively build a theory from all these single observations. They show the findings generalize, because it increases the severity of their tests. In other words, according to this deductive approach, generalizability is not a goal in itself, but follows from the goal to perform severe tests. It is unclear to me why Yarkoni does not think that approaches such as triangulation (Munafò & Smith, 2018) are severe tests. I think these approaches are the driving force behind many of the more successful theories in social psychology (e.g., social identity theory), and it works fine.

Generalization as a means to severely test a prediction is common, and one of the goals of direct replications (generalizing to new samples) and conceptual replications (generalizing to different procedures). Yarkoni might disagree with me that generalization serves severity, not vice versa. But then what is missing from the paper is a solid argument why people would want to generalize to begin with, assuming at least a decent number of them do not believe in induction. The inherent conflict between the deductive approaches and induction is also not explained in a satisfactory manner.

Point 5: Why care about statistical inferences, if these do not relate to sweeping verbal conclusions?

If we ignore all previous points, we can still read Yarkoni’s paper as a call to introduce more random factors in our experiments. This nicely complements recent calls to vary all factors you do not think should change the conclusions you draw (Baribault et al., 2018), and classic papers on random effects (Barr et al., 2013; Clark, 1973; Cornfield & Tukey, 1956).

Yarkoni generalizes from the fact that most scientists model subjects as a random factor, and then asks why scientists generalize to all sorts of other factors that were not in their models. He asks “Why not simply model all experimental factors, including subjects, as fixed effects”. It might be worth noting in the paper that sometimes researchers model subjects as fixed effects. For example, Fujisaki and Nishida (2009) write: “Participants were the two authors and five paid volunteers” and nowhere in their analyses do they assume there is any meaningful or important variation across individuals. In many perception studies, an eye is an eye, and an ear is an ear – whether from the author, or a random participant dragged into the lab from the corridor.

In other research areas, we do model individuals as a random factor. Yarkoni says we model subjects as a random factor because: “The reason we model subjects as random effects is not that such a practice is objectively better, but rather, that this specification more closely aligns the meaning of the quantitative inference with the meaning of the qualitative hypothesis we’re interested in evaluating”. I disagree. I think we model certain factors as random effects because we have a high prior these factors influence the effect, and leaving them out of the model would reduce the strength of our prediction. Leaving them out reduces the probability a test will show we are wrong, if we are wrong. It impacts the severity of the test. Whether or not we need to model factors (e.g., temperature, the experimenter, or day of the week) as random factors because not doing so reduces the severity of a test is a subjective judgment. Research fields need to decide for themselves. It is very well possible more random factors are generally needed, but I don’t know how many, and doubt it will ever be as severe as the ‘generalizability crisis’ suggests. If it is as severe as Yarkoni suggests, some empirical demonstrations of this would be nice. Clark (1973) showed his language-as-fixed-effect fallacy using real data. Barr et al. (2013) similarly made their point based on real data. I currently do not find the theoretical point very strong, but real data might convince me otherwise.
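To make concrete what is at stake when a factor that matters is left out of the model, here is a minimal simulation sketch in Python. It is not real data, so it does not replace the empirical demonstrations asked for above. Every number in it is an assumption chosen for illustration: 20 subjects, 10 different items per condition, residual noise with a standard deviation of 1, no true condition effect, and an analysis that averages over items and runs a one-sample t-test on the subject-level differences. The function names are hypothetical.

```python
import math
import random
import statistics

T_CRIT_DF19 = 2.093  # two-sided 5% critical t value for 19 degrees of freedom


def one_sample_t(values):
    """Return the one-sample t statistic against a mean of zero."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))


def false_positive_rate(item_sd, n_sims=500, n_subjects=20, n_items=10, seed=1):
    """Share of simulated null studies with a 'significant' by-subject t-test.

    Each condition uses its own sample of items. There is no true condition
    effect; items only differ randomly (standard deviation item_sd).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Random item effects, drawn once per study and shared by all subjects.
        items_a = [rng.gauss(0, item_sd) for _ in range(n_items)]
        items_b = [rng.gauss(0, item_sd) for _ in range(n_items)]
        diffs = []
        for _ in range(n_subjects):
            mean_a = statistics.mean(i + rng.gauss(0, 1) for i in items_a)
            mean_b = statistics.mean(i + rng.gauss(0, 1) for i in items_b)
            diffs.append(mean_a - mean_b)
        # Analysis that ignores item variability: test subject-level differences.
        if abs(one_sample_t(diffs)) > T_CRIT_DF19:
            hits += 1
    return hits / n_sims


rate_no_item_variance = false_positive_rate(item_sd=0.0)
rate_item_variance = false_positive_rate(item_sd=1.0)
print(rate_no_item_variance, rate_item_variance)
```

Under these assumptions the first rate should stay close to the nominal 5%, while the second should end up far above it, because the accidental difference between the two item samples is treated as if it were a condition effect. How large item variability is in a given research area is exactly the empirical question that a simulation like this cannot answer.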

The issue of including random factors is discussed in a more complete, and importantly, applicable, manner in Barr et al. (2013). Yarkoni remains vague on which random factors should be included and which not, and just recommends ‘more expansive’ models. I have no idea when this is done satisfactorily. This is a problem with extreme arguments like the one Yarkoni puts forward. It is fine in theory to argue your test should align with whatever you want to generalize to, but in practice, it is impossible. And in the end, statistics is just a reasonably limited toolset that tries to steer people somewhat in the right direction. The discussion in Barr et al. (2013), which includes trade-offs between converging models (which Yarkoni too easily dismisses as solved by modern computational power – it is not solved) and including all possible factors, and interactions between all possible factors, is a bit more pragmatic. Similarly, Cornfield & Tukey (1956) more pragmatically list options ranging from ignoring factors altogether, to randomizing them, or including them as a factor, and note “Each of these attitudes is appropriate in its place. In every experiment there are many variables which could enter, and one of the great skills of the experimenter lies in leaving out only inessential ones.” Just as pragmatically, Clark (1973) writes: “The wide-spread capitulation to the language-as-fixed-effect fallacy, though alarming, has probably not been disastrous. In the older established areas, most experienced investigators have acquired a good feel for what will replicate on a new language sample and what will not. They then design their experiments accordingly.” As always, it is easy to argue for extremes in theory, but this is generally uninteresting for an applied researcher.
It would be great if Yarkoni could provide something a bit more pragmatic about what to do in practice than his current recommendation about fitting “more expansive models” – and provide some indication where to stop, or at least suggestions for what an empirical research program would look like that tells us where to stop, and why. In some ways, Yarkoni’s point generalizes the argument that most findings in psychology do not generalize to non-WEIRD populations (Henrich et al., 2010), and it has the same weakness. WEIRD is a nice acronym, but it is just a completely random collection of 5 factors that might limit generalizability. The WEIRD acronym functions more as a nice reminder that boundary conditions exist, but it does not allow us to predict when they exist, or when they matter enough to be included in our theories. Currently, there is a gap between the factors that in theory could matter, and the factors that we should in practice incorporate. Maybe it is my pragmatic nature, but without such a discussion, I think the paper offers relatively little progress compared to previous discussions about generalizability (of which there are plenty).


A large part of Yarkoni’s argument is based on the fact that theories and tests should be closely aligned, while in a deductive approach based on severe tests of predictions, models are seen as simple, tentative, and wrong, and this is not considered a problem. Yarkoni does not convincingly argue researchers want to generalize extremely broadly (although I agree papers would benefit from including Constraints on Generality statements as proposed by Simons and colleagues (2017), but mainly because this improves falsifiability, not because it improves induction), and even if there is the tendency to overclaim in articles, I do not think this leads to an inferential crisis. Previous authors have made many of the same points, but in a more pragmatic manner (e.g., Barr et al., 2013; Clark, 1973). Yarkoni fails to provide any insights into where the balance between generalizing to everything, and generalizing to factors that matter, should lie, nor does he provide an evaluation of how far off this balance research areas are. It is easy to argue any specific approach to science will not work in theory – but it is much more difficult to convincingly argue it does not work in practice. Until Yarkoni does the latter convincingly, I don’t think the generalizability crisis as he sketches it is something that will keep me up at night.


Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., Ravenzwaaij, D. van, White, C. N., Boeck, P. D., & Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences, 115(11), 2607–2612. https://doi.org/10.1073/pnas.1708285114

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3). https://doi.org/10.1016/j.jml.2012.11.001

Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799. https://doi.org/10/gdm28w

Clark, H. H. (1969). Linguistic processes in deductive reasoning. Psychological Review, 76(4), 387–404. https://doi.org/10.1037/h0027578

Cornfield, J., & Tukey, J. W. (1956). Average Values of Mean Squares in Factorials. The Annals of Mathematical Statistics, 27(4), 907–949. https://doi.org/10.1214/aoms/1177728067

Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2(2), 161.

Dienes, Z. (2008). Understanding psychology as a science: An introduction to scientific and statistical inference. Palgrave Macmillan.

Fujisaki, W., & Nishida, S. (2009). Audio–tactile superiority over visuo–tactile and audio–visual combinations in the temporal resolution of synchrony perception. Experimental Brain Research, 198(2), 245–259. https://doi.org/10.1007/s00221-009-1870-x

Hacking, I. (1965). Logic of Statistical Inference. Cambridge University Press.

Henrich, J., Heine, S. J., & Norenzayan, A. (2010). Most people are not WEIRD. Nature, 466(7302), 29–29.

Lakens, D. (2020). The Value of Preregistration for Psychological Science: A Conceptual Analysis. Japanese Psychological Review. https://doi.org/10.31234/osf.io/jbh4w

Munafò, M. R., & Smith, G. D. (2018). Robust research needs many lines of evidence. Nature, 553(7689), 399–401. https://doi.org/10.1038/d41586-018-01023-3

Orben, A., & Lakens, D. (2019). Crud (Re)defined. https://doi.org/10.31234/osf.io/96dpy

Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017). Constraints on Generality (COG): A Proposed Addition to All Empirical Papers. Perspectives on Psychological Science, 12(6), 1123–1128. https://doi.org/10.1177/1745691617708630

Do You Really Want to Test a Hypothesis?

I’ve uploaded one of my favorite lectures in my new MOOC “Improving Your Statistical Questions” to YouTube. It asks the question whether you really want to test a hypothesis. A hypothesis test is a very specific tool to answer a very specific question. I like hypothesis tests, because in experimental psychology it is common to perform lines of research where you can design a bunch of studies that test simple predictions about the presence or absence of differences on some measure. I think they have a role to play in science. I also think hypothesis testing is widely overused. As we are starting to do hypothesis tests better (e.g., by preregistering our predictions and controlling our error rates in more severe tests) I predict many people will start to feel a bit squeamish as they become aware that doing hypothesis tests as they were originally designed to be used isn’t really what they want in their research. One of the often overlooked gains in teaching people how to do something well, is that they finally realize that they actually don’t want to do it.
The lecture “Do You Really Want to Test a Hypothesis” aims to explain which question a hypothesis test asks, and discusses when a hypothesis test answers a question you are interested in. It is very easy to say what not to do, or to point out what is wrong with statistical tools. Statistical tools are very limited, even under ideal circumstances. It’s more difficult to say what you can do. If you follow my work, you know that this latter question is what I spend my time on. Instead of telling you optional stopping can’t be done because it is p-hacking, I explain how you can do it correctly through sequential analysis. Instead of telling you it is wrong to conclude the absence of an effect from p > 0.05, I explain how to use equivalence testing. Instead of telling you p-values are the devil, I explain how they answer a question you might be interested in when used well. Instead of saying preregistration is redundant, I explain from which philosophy of science preregistration has value. And instead of saying we should abandon hypothesis tests, I try to explain in this video how to use them wisely. This is all part of my ongoing #JustifyEverything educational tour. I think it is a reasonable expectation that researchers should be able to answer at least a simple ‘why’ question if you ask why they use a specific tool, or use a tool in a specific manner.
This might help to move beyond the simplistic discussion I often see about these topics. If you ask me if I prefer frequentist or Bayesian statistics, or confirmatory or exploratory research, I am most likely to respond ? (see Wikipedia). It is tempting to think about these topics in a polarized either-or mindset – but then you would miss asking the real questions. When would any approach give you meaningful insights? Just as not every hypothesis test is an answer to a meaningful question, so will not every exploratory study provide interesting insights. The most important question to ask yourself when you plan a study is ‘when will the tools you use lead to interesting insights’? In the second week of my MOOC I discuss when effects in hypothesis tests could be deemed meaningful, but the same question applies to exploratory or descriptive research. Not all exploration is interesting, and we don’t want to simply describe every property of the world. Again, it is easy to dismiss any approach to knowledge generation, but it is so much more interesting to think about which tools will lead to interesting insights. And above all, realize that in most research lines, researchers will have a diverse set of questions that they want to answer given practical limitations, and they will need to rely on a diverse set of tools, limitations and all.
In this lecture I try to explain what the three limitations of hypothesis tests are, and the very specific question they try to answer. If you like to think about how to improve your statistical questions, you might be interested in enrolling in my free MOOC “Improving Your Statistical Questions”.

The Value of Preregistration for Psychological Science: A Conceptual Analysis

This blog is an excerpt of an invited journal article for a special issue of Japanese Psychological Review, that I am currently one week overdue with (but that I hope to complete soon). I hope this paper will raise the bar in the ongoing discussion about the value of preregistration in psychological science. If you have any feedback on what I wrote here, I would be very grateful to hear it, as it would allow me to improve the paper I am working on. If we want to fruitfully discuss preregistration, researchers need to provide a clear conceptual definition of preregistration, anchored in their philosophy of science.
For as long as data has been used to support scientific claims, people have tried to selectively present data in line with what they wish to be true. In his treatise ‘On the Decline of Science in England: And on Some of its Causes’ Babbage (1830) discusses what he calls cooking: “One of its numerous processes is to make multitudes of observations, and out of these to select those only which agree or very nearly agree. If a hundred observations are made, the cook must be very unlucky if he can not pick out fifteen or twenty that will do up for serving.” In the past researchers have proposed solutions to prevent bias in the literature. With the rise of the internet it has become feasible to create online registries that ask researchers to specify their research design and the planned analyses. Scientific communities have started to make use of this opportunity (for a historical overview, see Wiseman, Watt, & Kornbrot, 2019).
Preregistration in psychology has been a good example of ‘learning by doing’. Best practices are continuously updated as we learn from practical challenges and early meta-scientific investigations into how preregistrations are performed. At the same time, discussions have emerged about what the goal of preregistration is, whether preregistration is desirable, and what preregistration should look like across different research areas. Every practice comes with costs and benefits, and it is useful to evaluate whether and when preregistration is worth it. Finally, it is important to evaluate how preregistration relates to different philosophies of science, and when it facilitates or distracts from goals scientists might have. The discussion about benefits and costs of preregistration has not been productive up to now because there is a general lack of a conceptual analysis of what preregistration entails and aims to accomplish, which leads to disagreements that would be easily resolved if a conceptual definition were available. Any conceptual definition of a tool that scientists use, such as preregistration, must examine the goals it achieves, and thus requires a clearly specified view on philosophy of science, which provides an analysis of different goals scientists might have. Discussing preregistration without discussing philosophy of science is a waste of time.

What is Preregistration For?

Preregistration has the goal to transparently prevent bias due to selectively reporting analyses. Since bias in estimates only occurs in relation to a true population parameter, preregistration as discussed here is limited to scientific questions that involve estimates of population values from samples. Researchers can have many different goals when collecting data, perhaps most notably theory development, as opposed to tests of statistical predictions derived from theories. When testing predictions, researchers might want a specific analysis to yield a null effect, for example to show that including a possible confound in an analysis does not change their main results. More often perhaps, they want an analysis to yield a statistically significant result, for example so that they can argue the results support their prediction, based on a p-value below 0.05. Both examples are sources of bias in the estimate of a population effect size. In this paper I will assume researchers use frequentist statistics, but all arguments can be generalized to Bayesian statistics (Gelman & Shalizi, 2013). When effect size estimates are biased, for example due to the desire to obtain a statistically significant result, hypothesis tests performed on these estimates have inflated Type 1 error rates, and when bias emerges due to the desire to obtain a non-significant test result, hypothesis tests have reduced statistical power. In line with the general tendency to weigh Type 1 error rates (the probability of obtaining a statistically significant result when there is no true effect) as more serious than Type 2 error rates (the probability of obtaining a non-significant result when there is a true effect), publications that discuss preregistration have been more concerned with inflated Type 1 error rates than with low power. However, one can easily think of situations where the latter is a bigger concern.
If the only goal of a researcher is to prevent bias it suffices to make a mental note of the planned analyses, or to verbally agree upon the planned analysis with collaborators, assuming we will perfectly remember our plans when analyzing the data. The reason to write down an analysis plan is not to prevent bias, but to transparently prevent bias. By including transparency in the definition of preregistration it becomes clear that the main goal of preregistration is to convince others that the reported analysis tested a clearly specified prediction. Not all approaches to knowledge generation value prediction, and it is important to evaluate if your philosophy of science values prediction to be able to decide if preregistration is a useful tool in your research. Mayo (2018) presents an overview of different arguments for the role prediction plays in science and arrives at a severity requirement: We can build on claims that passed tests that were highly capable of demonstrating the claim was false, but supported the prediction nevertheless. This requires that researchers who read about claims are able to evaluate the severity of a test. Preregistration facilitates this.
Although falsifying theories is a complex issue, falsifying statistical predictions is straightforward. Researchers can specify when they will interpret data as support for their claim based on the result of a statistical test, and when not. An example is a directional (or one-sided) t-test testing whether an observed mean is larger than zero. Observing a value statistically smaller or equal to zero would falsify this statistical prediction (as long as statistical assumptions of the test hold, and with some error rate in frequentist approaches to statistics). In practice, only range predictions can be statistically falsified. Because resources and measurement accuracy are not infinitely large, there is always a value close enough to zero that is statistically impossible to distinguish from zero. Therefore, researchers will need to specify at least some possible outcomes that would not be considered support for their prediction that statistical tests can pick up on. How such bounds are determined is a massively understudied problem in psychology, but it is essential to have falsifiable predictions.
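The point that there is always a value too close to zero to be distinguished from it can be illustrated with a small calculation. The sketch below assumes a one-sided z-test with a known standard deviation of 1 (a simplification of the t-test mentioned above) and an alpha level of 0.05. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def smallest_significant_mean(n, sd=1.0, alpha=0.05):
    """Smallest observed mean that a one-sided z-test calls larger than zero."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return z_crit * sd / math.sqrt(n)


sample_sizes = [10, 100, 1000, 10000]
thresholds = [smallest_significant_mean(n) for n in sample_sizes]
for n, threshold in zip(sample_sizes, thresholds):
    print(f"n = {n:>5}: observed mean must exceed {threshold:.3f}")
```

The threshold shrinks with the square root of the sample size, but it never reaches zero. With finite resources, a prediction of ‘larger than zero’ therefore always leaves a band of small true values that the test has little chance of detecting, which is why some bound has to be specified.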
Where bounds of a range prediction enable statistical falsification, the specification of these bounds is not enough to evaluate how highly capable a test was to demonstrate a claim was wrong. Meehl (1990) argues that we are increasingly impressed by a prediction, the more ways a prediction could have been wrong.  He writes (1990, p. 128): “The working scientist is often more impressed when a theory predicts something within, or close to, a narrow interval than when it predicts something correctly within a wide one.” Imagine making a prediction about where a dart will land if I throw it at a dartboard. You will be more impressed with my darts skills if I predict I will hit the bullseye, and I hit the bullseye, than when I predict to hit the dartboard, and I hit the dartboard. Making very narrow range predictions is a way to make it statistically likely to falsify your prediction, if it is wrong. It is also possible to make theoretically risky predictions, for example by predicting you will only observe a statistically significant difference from zero in a hypothesis test if a very specific set of experimental conditions is met that all follow from a single theory. Regardless of how researchers increase the capability of a test to be wrong, the approach to scientific progress described here places more faith in claims based on predictions that have a higher capability of being falsified, but where data nevertheless supports the prediction. Anyone is free to choose a different philosophy of science, and create a coherent analysis of the goals of preregistration in that framework, but as far as I am aware, Mayo’s severity argument currently provides one of the few philosophies of science that allows for a coherent conceptual analysis of the value of preregistration.
Researchers admit to research practices that make their predictions, or the empirical support for their prediction, look more impressive than it is. One example of such a practice is optional stopping, where researchers collect a number of datapoints, perform statistical analyses, and continue the data collection if the result is not statistically significant. In theory, a researcher who is willing to continue collecting data indefinitely will always find a statistically significant result. By repeatedly looking at the data, the Type 1 error rate can inflate to 100%. Even though in practice the inflation will be smaller, optional stopping strongly increases the probability that a researcher can interpret their result as support for their prediction. In the extreme case, where a researcher is 100% certain that they will observe a statistically significant result when they perform their statistical test, their prediction will never be falsified. Providing support for a claim by relying on optional stopping should not increase our faith in the claim by much, or even at all. As Mayo (2018, p. 222) writes: “The good scientist deliberately arranges inquiries so as to capitalize on pushback, on effects that will not go away, on strategies to get errors to ramify quickly and force us to pay attention to them. The ability to register how hunting, optional stopping, and cherry picking alter their error-probing capacities is a crucial part of a method’s objectivity.” If researchers were to transparently register their data collection strategy, readers could evaluate the capability of the test to falsify their prediction, conclude this capability is very small, and be relatively unimpressed by the study. If the stopping rule keeps the probability of finding a non-significant result when the prediction is incorrect high, and the data nevertheless support the prediction, we can choose to act as if the claim is correct because it has been severely tested. 
Preregistration thus functions as a tool to allow other researchers to transparently evaluate the severity with which a claim has been tested.
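How much optional stopping inflates the Type 1 error rate can be sketched with a short simulation. The setup is an assumption made for illustration: the true effect is zero, the standard deviation is known to be 1, a two-sided z-test at an alpha of 0.05 is used, and the researcher either analyzes once after 100 observations or looks after every 20 observations and stops as soon as the test is significant. The function name is hypothetical.

```python
import math
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at alpha = 0.05


def type1_error_rate(n_looks, batch_size=20, n_sims=2000, seed=2020):
    """Share of null simulations that reach significance at any look."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # Collect another batch; the true mean is zero, the sd is 1.
            total += sum(rng.gauss(0, 1) for _ in range(batch_size))
            n += batch_size
            z = total / math.sqrt(n)  # z statistic for the mean so far
            if abs(z) > Z_CRIT:
                false_positives += 1
                break  # stop collecting as soon as the result is significant
    return false_positives / n_sims


fixed_rate = type1_error_rate(n_looks=1, batch_size=100)
optional_rate = type1_error_rate(n_looks=5, batch_size=20)
print(fixed_rate, optional_rate)
```

With a single planned analysis the rate should stay near 5%, while five uncorrected looks should push it well above that. A reader who knows the stopping rule can take this into account when judging how severely the claim was tested, and a sequential analysis that lowers the alpha level at each look would bring the overall rate back under control.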
The severity of a test can also be compromised by selecting a hypothesis based on the observed results. In this practice, known as Hypothesizing After the Results are Known (HARKing, Kerr, 1998) researchers look at their data, and then select a prediction. This reversal of the typical hypothesis testing procedure makes the test incapable of demonstrating the claim was false. Mayo (2018) refers to this as ‘bad evidence, no test’. If we choose a prediction from among the options that yield a significant result, the claims we make based on these ‘predictions’ will never be wrong. In philosophies of science that value predictions, such claims do not increase our confidence that the claim is true, because it has not yet been tested. By preregistering our predictions, we transparently communicate to readers that our predictions predated looking at data, and therefore that the data we present as support of our prediction could have falsified our hypothesis. We have not made our test look more severe by narrowing the range of our predictions after looking at the data (like the Texas sharpshooter who draws the circles of the bullseye after shooting at the wall of the barn). A reader can transparently evaluate how severely our claim was tested.
As a final example of the value of preregistration to transparently allow readers to evaluate the capability of our prediction to be falsified, think about the scenario described by Babbage at the beginning of this article, where a researcher makes multitudes of observations, and selects out of all these tests only those that support their prediction. The larger the number of observations to choose from, the higher the probability that one of the possible tests could be presented as support for the hypothesis. Therefore, from a perspective on scientific knowledge generation where severe tests are valued, choosing to selectively report tests from among many tests that were performed strongly reduces the capability of a test to demonstrate the claim was false. This can be prevented by correcting for multiple testing by lowering the alpha level depending on the number of tests.
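The arithmetic behind this can be sketched in a few lines. The calculation assumes the tests are independent and that every null hypothesis is true, which will rarely hold exactly in practice, and it uses the Bonferroni correction as one simple way to lower the alpha level. The function names are hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test alpha level under a Bonferroni correction."""
    return alpha / n_tests


for k in (1, 8, 20):
    uncorrected = familywise_error_rate(k)
    corrected = familywise_error_rate(k, alpha=bonferroni_alpha(k))
    print(f"{k:>2} tests: uncorrected {uncorrected:.3f}, Bonferroni {corrected:.3f}")
```

Under these assumptions, eight uncorrected tests already give roughly a one in three chance of at least one significant result, while the corrected alpha keeps that probability at or just below 5%. The correction only works if the number of tests is known, which is one of the things a preregistration makes transparent.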
The fact that preregistration is about specifying ways in which your claim could be false is not generally appreciated. Preregistrations should carefully specify not just the analysis researchers plan to perform, but also when they would infer from the analyses that their prediction was wrong. As the preceding section explains, successful predictions impress us more when the data that was collected was capable of falsifying the prediction. Therefore, a preregistration document should give us all the required information that allows us to evaluate the severity of the test. Specifying exactly which test will be performed on the data is important, but not enough. Researchers should also specify when they will conclude the prediction was not supported. Beyond specifying the analysis plan in detail, the severity of a test can be increased by narrowing the range of values that are predicted (without increasing the Type 1 and Type 2 error rate), or making the theoretical prediction more specific by specifying detailed circumstances under which the effect will be observed, and when it will not be observed.

When is preregistration valuable?

If one agrees with the conceptual analysis above, it follows that preregistration adds value for people who choose to increase their faith in claims that are supported by severe tests and predictive successes. Whether this seems reasonable depends on your philosophy of science. Preregistration itself does not make a study better or worse compared to a non-preregistered study. Sometimes, being able to transparently evaluate a study (and its capability to demonstrate claims were false) will reveal a study was completely uninformative. Other times we might be able to evaluate the capability of a study to demonstrate a claim was false even if the study is not transparently preregistered. Examples are studies where there is no room for bias, because the analyses are perfectly constrained by theory, or because it is not possible to analyze the data in any other way than was reported. Although the severity of a test is in principle unrelated to whether it is preregistered or not, in practice there will be a positive correlation that is caused by the studies where the ability to evaluate how capable these studies were to demonstrate a claim was false is improved by transparently preregistering, such as studies with multiple dependent variables to choose from, which do not use a standardized measurement scale so that the dependent variable can be calculated in different ways, or where additional data is easily collected, to name a few.
We can apply our conceptual analysis of preregistration to hypothetical real-life situations to gain a better insight into when preregistration is a valuable tool, and when not. For example, imagine a researcher who preregisters an experiment where the main analysis tests a linear relationship between two variables. This test yields a non-significant result, thereby failing to support the prediction. In an exploratory analysis the authors find that fitting a polynomial model yields a significant test result with a low p-value. A reviewer of their manuscript has studied the same relationship, albeit in a slightly different context and with another measure, and has unpublished data from multiple studies that also yielded polynomial relationships. The reviewer also has a tentative idea about the underlying mechanism that causes not a linear, but a polynomial, relationship. The original authors will be of the opinion that the claim of a polynomial relationship has passed a less severe test than their original prediction of a linear relationship would have passed (had it been supported). However, the reviewer would never have preregistered a linear relationship to begin with, and therefore does not evaluate the switch to a polynomial test in the exploratory result section as something that reduces the severity of the test. Given that the experiment was well-designed, the test for a polynomial relationship will be judged as having greater severity by the reviewer than by the authors. In this hypothetical example the reviewer has additional data that would have changed the hypothesis they would have preregistered in the original study. It is also possible that the difference in evaluation of the exploratory test for a polynomial relationship is based purely on a subjective prior belief, or on the basis of knowledge about an existing well-supported theory that would predict a polynomial, but not a linear, relationship.
Now imagine that our reviewer asks for the raw data to test whether their assumed underlying mechanism is supported. They receive the dataset, and looking through the data and the preregistration, the reviewer realizes that the original authors didn’t adhere to their preregistered analysis plan. They violated their stopping rule, analyzing the data in batches of four and stopping earlier than planned. They did not carefully specify how to compute their dependent variable in the preregistration, and although the reviewer has no experience with the measure that has been used, the dataset contains eight ways in which the dependent variable was calculated. Only one of the eight ways in which the dependent variable was calculated yields a significant effect for the polynomial relationship. Faced with this additional information, the reviewer believes it is much more likely that the analysis testing the claim was the result of selective reporting, and now is of the opinion the polynomial relationship was not severely tested.
Both of these evaluations of how severely a hypothesis was tested were perfectly reasonable, given the information the reviewer had available. It reveals how sometimes switching from a preregistered analysis to an exploratory analysis does not impact the evaluation of the severity of the test by a reviewer, while in other cases a selectively reported result does reduce the perceived severity with which a claim has been tested. Preregistration makes more information available to readers that can be used to evaluate the severity of a test, but readers might not always evaluate the information in a preregistration in the same way. Whether a design or analytic choice increases or decreases the capability of a claim to be falsified depends on statistical theory, as well as on prior beliefs about the theory that is tested. Some practices are known to reduce the severity of tests, such as optional stopping and selectively reporting analyses that yield desired results, and therefore it is easier to evaluate how statistical practices impact the severity with which a claim is tested. If a preregistration is followed through exactly as planned then the tests that are performed have desired error rates in the long run, as long as the test assumptions are met. Note that because long run error rates are based on assumptions about the data generating process, which are never known, true error rates are unknown, and thus preregistration makes it relatively more likely that tests have desired long run error rates. The severity of a test also depends on assumptions about the underlying theory, and how the theoretical hypothesis is translated into a statistical hypothesis. There will rarely be unanimous agreement on whether a specific operationalization is a better or worse test of a hypothesis, and thus researchers will differ in their evaluation of how severely specific design choices test a claim.
This once more highlights how preregistration does not automatically increase the severity of a test. When it prevents practices that are known to reduce the severity of tests, such as optional stopping, preregistration leads to a relative increase in the severity of a test compared a non-preregistered study. But when there is no objective evaluation of the severity of a test, as is often the case when we try to judge how severe a test was based on theoretical grounds, preregistration merely enables a transparent evaluation of the capability of a claim to be falsified.
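To make concrete why optional stopping is known to reduce the severity of a test, here is a minimal simulation sketch. It assumes a simplified setting that is not taken from the example above: a one-sample z-test with a known standard deviation of 1, a true effect of zero, data inspected after every batch of four observations, and a planned maximum of 40 observations. The function name false_positive_rates and all parameter values are hypothetical choices for illustration.

```python
import random
from statistics import NormalDist


def false_positive_rates(n_sims=2000, batch=4, n_max=40, alpha=0.05, seed=1):
    """Simulate studies in which the null hypothesis is true.

    Returns (rate_with_peeking, rate_fixed_n): the share of simulated studies
    that reach |z| > critical value at ANY interim look, versus only at n_max.
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        total = 0.0
        significant_at_any_look = False
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)  # one observation, true mean is 0
            if n % batch == 0:  # look at the data after every batch
                z = total / n ** 0.5  # z-test with known SD of 1
                if abs(z) > crit:
                    significant_at_any_look = True
        z_final = total / n_max ** 0.5  # the single planned analysis
        peeking_hits += significant_at_any_look
        fixed_hits += abs(z_final) > crit
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking, fixed = false_positive_rates()
print(f"Stop at first significant look: {peeking:.3f}")
print(f"Analyze once at the planned n:  {fixed:.3f}")
```

Under these assumptions, the rate for the single planned analysis should stay close to the nominal 5%, while the rate for a researcher who stops at the first significant look should be clearly higher, which is the sense in which an undisclosed change to the stopping rule weakens the test.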

Improving Your Statistical Questions

Three years after launching my first massive open online course (MOOC) ‘Improving Your Statistical Inferences’ on Coursera, today I am happy to announce a second completely free online course called ‘Improving Your Statistical Questions’. My first course is a collection of lessons about statistics and methods that we commonly use, but that I wish I had known how to use better when I was taking my first steps into empirical research. My new course is a collection of lessons about statistics and methods that we do not yet commonly use, but that I wish we would start using to improve the questions we ask. Where the first course tries to get people up to speed about commonly accepted best practices, my new course tries to educate researchers about better practices. Most of the modules consist of topics in which there have been more recent developments, or at least increasing awareness, over the last 5 years.

About a year ago, I wrote on this blog: If I ever make a follow up to my current MOOC, I will call it ‘Improving Your Statistical Questions’. The more I learn about how people use statistics, the more I believe the main problem is not how people interpret the numbers they get from statistical tests. The real issue is which statistical questions researchers ask of their data. If you approach a statistician to get help with the data analysis, most of their time will be spent asking you ‘but what is your question?’. I hope this course helps to take a step back, reflect on this question, and get some practical advice on how to answer it.
There are 5 modules, with 15 videos, and 13 assignments that provide hands on explanations of how to use the insights from the lectures in your own research. The first week discusses different questions you might want to ask. Only one of these is a hypothesis test, and I examine in detail if you really want to test a hypothesis, or are simply going through the motions of the statistical ritual. I also discuss why NHST is often not a very risky prediction, and why range predictions are a more exciting question to ask (if you can). Module 2 focuses on falsification in practice and theory, including a lecture and some assignments on how to determine the smallest effect size of interest in the studies you perform. I also share my favorite colloquium question for whenever you dozed off and wake up at the end only to find no one else is asking a question, when you can always raise your hand to ask ‘so, what would falsify your hypothesis?’ Module 3 discusses the importance of justifying error rates, a more detailed discussion on power analysis (following up on the ‘sample size justification’ lecture in MOOC1), and a lecture on the many uses of learning how to simulate data. Module 4 moves beyond single studies, and asks what you can expect from lines of research, how to perform a meta-analysis, and why the scientific literature does not look like reality (and how you can detect, and prevent contributing to, a biased literature). I was tempted to add this to MOOC1, but I am happy I didn’t, as there has been a lot of exciting work on bias detection that is now part of the lecture. The last module has three different topics I think are important: computational reproducibility, philosophy of science (this video would also have been a good first video lecture, but I don’t want to scare people away!) and maybe my favorite lecture in the MOOC on scientific integrity in practice. All are accompanied by assignments, and the assignments are where the real learning happens.
If after this course some people feel more comfortable to abandon hypothesis testing and just describe their data, make their predictions a bit more falsifiable, design more informative studies, publish sets of studies that look a bit more like reality, and make their work more computationally reproducible, I’ll be very happy.
The content of this MOOC is based on over 40 workshops and talks I gave in the last 3 years since my previous MOOC came out, testing this material on live crowds. It comes with some of the pressure a recording artist might feel for a second record when their first was somewhat successful. As my first MOOC hits 30k enrolled learners (many of whom engage with very little of the content, but still with thousands of people taking in a lot of the material) I hope it comes close and lives up to expectations.
I’m very grateful to Chelsea Parlett Pelleriti who checked all assignments for statistical errors or incorrect statements, and provided feedback that made every exercise in this MOOC better. If you need a statistics editor, you can find her at: https://cmparlettpelleriti.github.io/TheChatistician.html. Special thanks to Tim de Jonge who populated the Coursera environment as a student assistant, and Sascha Prudon for recording and editing the videos. Thanks to Uri Simonsohn for feedback on Assignment 2.1, Lars Penke for suggesting the SESOI example in lecture 2.2, Lisa DeBruine for co-developing Assignment 2.4, Joe Hilgard for the PET-PEESE code in assignment 4.3, Matti Heino for the GRIM test example in lecture 4.3, and Michelle Nuijten for feedback on assignment 4.4. Thanks to Seth Green, Russ Zack and Xu Fei at Code Ocean for help in using their platform to make it possible to run the R code online. I am extremely grateful for all alpha testers who provided feedback on early versions of the assignments: Daniel Dunleavy, Robert Gorsch, Emma Henderson, Martine Jansen, Niklas Johannes, Kristin Jankowsky, Cian McGinley, Robert Görsch, Chris Noone, Alex Riina, Burak Tunca, Laura Vowels, and Lara Warmelink, as well as the beta-testers who gave feedback on the material on Coursera: Johannes Breuer, Marie Delacre, Fabienne Ennigkeit, Marton L. Gy, and Sebastian Skejø. Finally, thanks to my wife for buying me six new shirts because ‘your audience has expectations’ (and for accepting how I worked through the summer holiday to complete this MOOC).
All material in the MOOC is shared with a CC-BY-NC-SA license, and you can access all material in the MOOC for free (and use it in your own education). Improving Your Statistical Questions is available from today. I hope you enjoy it!

Requiring high-powered studies from scientists with resource constraints

This blog post is now included in the paper “Sample size justification” available at PsyArXiv. 

Underpowered studies make it very difficult to learn something useful from the studies you perform. Low power means you have a high probability of finding non-significant results, even when there is a true effect. Hypothesis tests with high rates of false negatives (concluding there is nothing, when there is something) become a malfunctioning tool. Low power is even more problematic combined with publication bias (shiny app). After repeated warnings over at least half a century, high quality journals are starting to ask authors who rely on hypothesis tests to provide a sample size justification based on statistical power.
The first time researchers use power analysis software, they typically think they are making a mistake, because the sample sizes required to achieve high power for hypothesized effects are much larger than the sample sizes they collected in the past. After double checking their calculations, and realizing the numbers are correct, a common response is that there is no way they are able to collect this number of observations.
Published articles on power analysis rarely tell researchers what they should do if they are hired on a 4 year PhD project where the norm is to perform between 4 and 10 studies that can cost at most 1000 euro each, learn about power analysis, and realize there is absolutely no way they will have the time and resources to perform high-powered studies, given that an effect size estimate from an unbiased registered report suggests the effect they are examining is half as large as they were led to believe based on a published meta-analysis from 2010. Facing a job market that under the best circumstances is a nontransparent marathon for uncertainty-fetishists, the prospect of high quality journals rejecting your work due to a lack of a solid sample size justification is not pleasant.
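To see why an effect that is half as large hurts so much, here is a minimal sketch of the arithmetic behind a power analysis for a two-group comparison. It uses a normal approximation instead of the exact t-distribution, so it will slightly underestimate what dedicated power analysis software reports. The effect sizes (d = 0.5 and d = 0.25), the 90% power target, and the function name n_per_group are assumptions chosen for illustration, not values from any particular study.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.90, alpha=0.05):
    """Approximate sample size per group for a two-sided, two-group test.

    Normal approximation: n = 2 * (z_(1 - alpha/2) + z_power)^2 / d^2,
    where d is the standardized mean difference (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Hypothetical effect sizes: the estimate you were led to believe, and half of it.
for d in (0.5, 0.25):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Because the effect size enters the formula squared, halving the expected effect roughly quadruples the number of participants needed in each group, under these assumptions.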
The reason that published articles do not guide you towards practical solutions for a lack of resources, is that there are no solutions for a lack of resources. Regrettably, the mathematics do not care about how small the participant payment budget is that you have available. This is not to say that you can not improve your current practices by reading up on best practices to increase the efficiency of data collection. Let me give you an overview of some things that you should immediately implement if you use hypothesis tests, and data collection is costly.
1) Use directional tests where relevant. Just following statements such as ‘we predict X is larger than Y’ up with a logically consistent test of that claim (e.g., a one-sided t-test) will easily give you an increase of 10% power in any well-designed study. If you feel you need to give effects in both directions a non-zero probability, then at least use lopsided tests.
2) Use sequential analysis whenever possible. It’s like optional stopping, but then without the questionable inflation of the false positive rate. The efficiency gains are so great that, if you complain about the recent push towards larger sample sizes without already having incorporated sequential analyses, I will have a hard time taking you seriously.
3) Increase your alpha level. Oh yes, I am serious. Contrary to what you might believe, the recommendation to use an alpha level of 0.05 was not the sixth of the ten commandments – it is nothing more than, as Fisher calls it, a ‘convenient convention’. As we wrote in our Justify Your Alpha paper as an argument to not require an alpha level of 0.005: “without (1) increased funding, (2) a reward system that values large-scale collaboration and (3) clear recommendations for how to evaluate research with sample size constraints, lowering the significance threshold could adversely affect the breadth of research questions examined.” If you *have* to make a decision, and the data you can feasibly collect is limited, take a moment to think about how problematic Type 1 and Type 2 error rates are, and maybe minimize combined error rates instead of rigidly using a 5% alpha level.
4) Use within designs where possible. Especially when measurements are strongly correlated, this can lead to a substantial increase in power.
5) If you read this blog or follow me on Twitter, you’ll already know about 1-4, so let’s take a look at a very sensible paper by Allison, Allison, Faith, Paultre, & Pi-Sunyer from 1997: Power and money: Designing statistically powerful studies while minimizing financial costs (link). They discuss I) better ways to screen participants for studies where participants need to be screened before participation, II) assigning participants unequally to conditions (if the control condition is much cheaper than the experimental condition, for example), III) using multiple measurements to increase measurement reliability (or use well-validated measures, if I may add), and IV) smart use of (preregistered, I’d recommend) covariates.
6) If you are really brave, you might want to use Bayesian statistics with informed priors, instead of hypothesis tests. Regrettably, almost all approaches to statistical inferences become very limited when the number of observations is small. If you are very confident in your predictions (and your peers agree), incorporating prior information will give you a benefit. For a discussion of the benefits and risks of such an approach, see this paper by van de Schoot and colleagues.
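As a rough check on point 1 above, the sketch below compares the approximate power of a two-sided and a one-sided test for the same two-group design. It relies on a normal approximation, and the effect size (d = 0.5), the sample size (50 per group), and the function name power_two_groups are hypothetical choices for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_two_groups(d, n_per_group, alpha=0.05, one_sided=False):
    """Approximate power for comparing two independent groups.

    Normal approximation: the test statistic is treated as normal with
    mean d * sqrt(n_per_group / 2) and standard deviation 1.
    """
    ncp = d * (n_per_group / 2) ** 0.5
    if one_sided:
        # All of alpha is placed in the predicted direction.
        return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - ncp)
    crit = Z.inv_cdf(1 - alpha / 2)
    # Probability of exceeding the critical value in either tail.
    return (1 - Z.cdf(crit - ncp)) + Z.cdf(-crit - ncp)


two_sided = power_two_groups(0.5, 50)
one_sided = power_two_groups(0.5, 50, one_sided=True)
print(f"two-sided power: {two_sided:.2f}")
print(f"one-sided power: {one_sided:.2f}")
print(f"gain:            {one_sided - two_sided:.2f}")
```

With these assumed values the two-sided test has roughly 0.71 power and the one-sided test roughly 0.80, a gain of about ten percentage points without collecting a single extra participant. The size of the gain depends on where the design sits on the power curve, so it will be smaller for designs that already have very high or very low power.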
Now if you care about efficiency, you might already have incorporated all these things. There is no way to further improve the statistical power of your tests, and by all plausible estimates of effect sizes you can expect or the smallest effect size you would be interested in, statistical power is low. Now what should you do?
What to do if best practices in study design won’t save you?
The first thing to realize is that you should not look at statistics to save you. There are no secret tricks or magical solutions. Highly informative experiments require a large number of observations. So what should we do then? The solutions below are, regrettably, a lot more work than making a small change to the design of your study. But it is about time we start to take them seriously. This is a list of solutions I see – but there is no doubt more we can/should do, so by all means, let me know your suggestions on twitter or in the comments.
1) Ask for a lot more money in your grant proposals.
Some grant organizations distribute funds to be awarded as a function of how much money is requested. If you need more money to collect informative data, ask for it. Obviously grants are incredibly difficult to get, but if you ask for money, include a budget that acknowledges that data collection is not as cheap as you hoped some years ago. In my experience, psychologists are often asking for much less money to collect data than other scientists. Increasing the requested funds for participant payment by a factor of 10 is often reasonable, given the requirements of journals to provide a solid sample size justification, and the more realistic effect size estimates that are emerging from preregistered studies.
2) Improve management.
If the implicit or explicit goals that you should meet are still the same now as they were 5 years ago, and you did not receive a miraculous increase in money and time to do research, then an update of the evaluation criteria is long overdue. I sincerely hope your manager is capable of this, but some ‘upward management’ might be needed. In the coda of Lakens & Evers (2014) we wrote “All else being equal, a researcher running properly powered studies will clearly contribute more to cumulative science than a researcher running underpowered studies, and if researchers take their science seriously, it should be the former who is rewarded in tenure systems and reward procedures, not the latter.” and “We believe reliable research should be facilitated above all else, and doing so clearly requires an immediate and irrevocable change from current evaluation practices in academia that mainly focus on quantity.” After publishing this paper, and despite the fact I was an ECR on a tenure track, I thought it would be at least principled if I sent this coda to the head of my own department. He replied that the things we wrote made perfect sense, instituted a recommendation to aim for 90% power in studies our department intends to publish, and has since then tried to make sure quality, and not quantity, is used in evaluations within the faculty (as you might have guessed, I am not on the job market, nor do I ever hope to be).
3) Change what is expected from PhD students.
When I did my PhD, there was the assumption that you performed enough research in the 4 years you are employed as a full-time researcher to write a thesis with 3 to 5 empirical chapters (with some chapters having multiple studies). These studies were ideally published, but at least publishable. If we consider it important for PhD students to produce multiple publishable scientific articles during their PhDs, this will greatly limit the types of research they can do. Instead of evaluating PhD students based on their publications, we can see the PhD as a time where researchers learn skills to become an independent researcher, and evaluate them not based on publishable units, but in terms of clearly identifiable skills. I personally doubt data collection is particularly educational after the 20th participant, and I would probably prefer to hire a post-doc who had well-developed skills in programming, statistics, and who broadly read the literature, than someone who used that time to collect participant 21 to 200. If we make it easier for PhD students to demonstrate their skill level (which would include at least 1 well written article, I personally think) we can evaluate what they have learned in a more sensible manner than now. Currently, differences in the resources PhD students have at their disposal are a huge confound as we try to judge their skill based on their resume. Researchers at rich universities obviously have more resources – it should not be difficult to develop tools that allow us to judge the skills of people where resources are much less of a confound.
4) Think about the questions we collectively want answered, instead of the questions we can individually answer.
Our society has some serious issues that psychologists can help address. These questions are incredibly complex. I have long lost faith in the idea that a bottom-up organized scientific discipline that rewards individual scientists will manage to generate reliable and useful knowledge that can help to solve these societal issues. For some of these questions we need well-coordinated research lines where hundreds of scholars work together, pool their resources and skills, and collectively pursue answers to these important questions. And if we are going to limit ourselves in our research to the questions we can answer in our own small labs, these big societal challenges are not going to be solved. Call me a pessimist. There is a reason we resort to forming unions and organizations that have the goal to collectively coordinate what we do. If you greatly dislike team science, don’t worry – there will always be options to make scientific contributions by yourself. But now, there are almost no ways for scientists who want to pursue huge challenges in large well-organized collectives of hundreds or thousands of scholars to do so (for a recent exception that proves my rule by remaining unfunded: see the Psychological Science Accelerator). If you honestly believe your research question is important enough to be answered, then get together with everyone who also thinks so, and pursue answers collectively. Doing so should, eventually (I know science funders are slow) also be more convincing as you ask for more resources to do the research (as in point 1).
If you are upset that as a science we lost the blissful ignorance surrounding statistical power, and are requiring researchers to design informative studies, which hits substantially harder in some research fields than in others: I feel your pain. I have argued against universally lower alpha levels for you, and have tried to write accessible statistics papers that make you more efficient without increasing sample sizes. But if you are in a research field where even best practices in designing studies will not allow you to perform informative studies, then you need to accept the statistical reality you are in. I have already written too long a blog post, even though I could keep going on about this. My main suggestions are to ask for more money, get better management, change what we expect from PhD students, and self-organize – but there is much more we can do, so do let me know your top suggestions. This will be one of the many challenges our generation faces, but if we manage to address it, it will lead to a much better science.

The New Heuristics

You can derive the age of a researcher based on the sample size they were told to use in a two independent group design. When I started my PhD, this number was 15, and when I ended, it was 20. This tells you I did my PhD between 2005 and 2010. If your number was 10, you have been in science much longer than I have, and if your number is 50, good luck with the final chapter of your PhD.
All these numbers are only sporadically the sample size you really need. As with a clock stuck at 9:30 in the morning, heuristics are sometimes right, but most often wrong. I think we rely way too often on heuristics for all sorts of important decisions we make when we do research. You can easily test whether you rely on a heuristic, or whether you can actually justify a decision you make. Ask yourself: Why?
I vividly remember talking to a researcher in 2012, a time when it started to become clear that many of the heuristics we relied on were wrong, and there was a lot of uncertainty about what good research practices looked like. She said: ‘I just want somebody to tell me what to do’. As psychologists, we work in a science where the answer to almost every research question is ‘it depends’. It should not be a surprise the same holds for how you design a study. For example, Neyman & Pearson (1933) perfectly illustrate how a statistician can explain the choices that need to be made, but in the end, only the researcher can make the final decision:
Due to a lack of training, most researchers do not have the skills to make these decisions. They need help, but do not even always have access to someone who can help them. It is therefore not surprising that articles and books that explain how to use useful tools provide some heuristics to get researchers started. An excellent example of this is Cohen’s classic work on power analysis. Although you need to think about the statistical power you want, as a heuristic, a minimum power of 80% is recommended. Let’s take a look at how Cohen (1988) introduces this benchmark.
It is rarely ignored. Note that we have a meta-heuristic here. Cohen argues a Type 1 error is 4 times as serious as a Type 2 error, and the Type 1 error is at 5%. Why? According to Fisher (1935) because it is a ‘convenient convention’. We are building a science on heuristics built on heuristics.
There has been a lot of discussion about how we need to improve psychological science in practice, and what good research practices look like. In my view, we will not have real progress when we replace old heuristics by new heuristics. People regularly complain to me about people who use what I would like to call ‘The New Heuristics’ (instead of The New Statistics), or ask me to help them write a rebuttal to a reviewer who is too rigidly applying a new heuristic. Let me give some recent examples.
People who used optional stopping in the past, and have learned this is p-hacking, think you can not look at the data as it comes in (you can, when done correctly, using sequential analyses, see Lakens, 2014). People make directional predictions, but test them with two-sided tests (even when you can pre-register your directional prediction). They think you need 250 participants (as an editor of a flagship journal claimed), even though there is no magical number that leads to high enough accuracy. They think you always need to justify sample sizes based on a power analysis (as a reviewer of a grant proposal claimed when rejecting a proposal) even though there are many ways to justify sample sizes. They argue meta-analysis is not a ‘valid technique’ only because the meta-analytic estimate can be biased (ignoring meta-analyses have many uses, including an analysis of heterogeneity, and all tests can be biased). They think all research should be preregistered or published as Registered Reports, even when the main benefit (preventing inflation of error rates for hypothesis tests due to flexibility in the data analysis) is not relevant for all research psychologists do. They think p-values are invalid and should be removed from scientific articles, even when in well-designed controlled experiments they might be the outcome of interest, especially early on in new research lines. I could go on.
Change is like a pendulum, swinging from one side to the other of a multi-dimensional space. People might be too loose, or too strict, too risky, or too risk-averse, too sexy, or too boring. When there is a response to newly identified problems, we often see people overreacting. If you can’t justify your decisions, you will just be pushed from one extreme on one of these dimensions to the opposite extreme. What you need is the weight of a solid justification to be able to resist being pulled in the direction of whatever you perceive to be the current norm. Learning The New Heuristics (for example setting the alpha level to 0.005 instead of 0.05) is not an improvement – it is just a change.
If we teach people The New Heuristics, we will get lost in the Bog of Meaningless Discussions About Why These New Norms Do Not Apply To Me. This is a waste of time. From a good justification it logically follows whether something applies to you or not. Don’t discuss heuristics – discuss justifications.
‘Why’ questions come at different levels. Surface level ‘why’ questions are explicitly left to the researcher – no one else can answer them. Why are you collecting 50 participants in each group? Why are you aiming for 80% power? Why are you using an alpha level of 5%? Why are you using this prior when calculating a Bayes factor? Why are you assuming equal variances and using Student’s t-test instead of Welch’s t-test? Part of the problem I am addressing here is that we do not discuss which questions are up to the researcher, and which are questions on a deeper level that you can simply accept without needing to provide a justification in your paper. This makes it relatively easy for researchers to pretend some ‘why’ questions are on a deeper level, and can be assumed without having to be justified. A field needs a continuing discussion about what we expect researchers to justify in their papers (for example by developing improved and detailed reporting guidelines). This will be an interesting discussion to have. For now, let’s limit ourselves to surface level questions that were always left up to researchers to justify (even though some researchers might not know any better than using a heuristic). In the spirit of the name of this blog, let’s focus on 20% of the problems that will improve 80% of what we do.
My new motto is ‘Justify Everything’ (it also works as a hashtag: #JustifyEverything). Your first response will be that this is not possible. You will think this is too much to ask. This is because you think that you will have to be able to justify everything. But that is not my view on good science. You do not have the time to learn enough to be able to justify all the choices you need to make when doing science. Instead, you could be working in a team of as many people as you need so that within your research team, there is someone who can give an answer if I ask you ‘Why?’. As a rule of thumb, a large enough research team in psychology has between 50 and 500 researchers, because that is how many people you need to make sure one of the researchers is able to justify why research teams in psychology need between 50 and 500 researchers.
Until we have transitioned into a more collaborative psychological science, we will be limited in how much and how well we can justify our decisions in our scientific articles. But we will be able to improve. Many journals are starting to require sample size justifications, which is a great example of what I am advocating for. Expert peer reviewers can help by pointing out where heuristics are used, but justifications are possible (preferably in open peer review, so that the entire community can learn). The internet makes it easier than ever before to ask other people for help and advice. And as with anything in a job as difficult as science, just get started. The #Justify20% hashtag will work just as well for now.

Does Your Philosophy of Science Matter in Practice?

In my personal experience philosophy of science rarely directly plays a role in how most scientists do research. Here I’d like to explore ways in which your philosophy of science might slightly shift how you weigh what you focus on when you do research. I’ll focus on different philosophies of science (instrumentalism, constructive empiricism, entity realism, and scientific realism), and explore how it might impact what you see as the most valuable way to make progress in science, how much you value theory-driven or data-driven research, and whether you believe your results should be checked against reality.

We can broadly distinguish philosophies of science (following Niiniluoto, 1999) in three main categories. First, there is the view that there is no truth, known as anarchism. An example of this can be found in Paul Feyerabend’s ‘Against Method’ where he writes: “Science is an essentially anarchic enterprise: theoretical anarchism is more humanitarian and more likely to encourage progress than its law-and-order alternatives” and “The only principle that does not inhibit progress is: anything goes.” The second category contains pragmatism, in which ‘truth’ is replaced by some surrogate, such as social consensus. For example, Peirce (1878) writes: “The opinion which is fated to be ultimately agreed to by all who investigate, is what we mean by the truth, and the object represented in this opinion is the real. That is the way I would explain reality.” Rorty doubts such a final end-point of consensus can ever be reached, and suggests giving up on the concept of truth and talking about an indefinite adjustment of belief. The third category, which we will focus on mostly below, consists of approaches that define truth as some correspondence between language and reality, known as correspondence theories. In essence, these approaches adhere to a dictionary definition of truth as ‘being in accord with fact or reality’. However, these approaches differ in whether they believe scientific theories have a truth value (i.e., whether theories can be true or false), and if theories have truth value, whether this is relevant for scientific practice.

What do you think? Is anarchy the best way to do science? Is there no truth, but at best an infinite updating of belief with some hope of social consensus? Or is there some real truth that we can get closer to over time?
Scientific Progress and Goals of Science
It is possible to have different philosophies of science because success in science is not measured by whether we discover the truth. After all, how would we ever know for sure we discovered the truth? Instead, a more tangible goal for science is to make scientific progress. This means we can leave philosophical discussions about what truth is aside, but it means we will have to define what progress in science looks like. And to have progress in science, science needs to have a goal.
Kitcher (1993, chapter 4) writes: “One theme recurs in the history of thinking about the goals of science: science ought to contribute to “the relief of man’s estate,” it should enable us to control nature—or perhaps, where we cannot control, to predict, and so adjust our behavior to an uncooperative world—it should supply the means for improving the quality and duration of human lives, and so forth.” Truth alone is not a sufficient aim for scientific progress, as Popper (1934/1959) already noted, because then we would just limit ourselves to positing trivial theories (e.g., for psychological science the theory that ‘it depends’), or collect detailed but boring information (the temperature in the room I am now in is 19.3 degrees). Kitcher highlights two important types of progress: conceptual progress and explanatory progress. Conceptual progress comes from refining the concepts we talk about, such that we can clearly specify these concepts, and preferably reach consensus about them. Explanatory progress is improved by getting a better understanding of the causal mechanisms underlying phenomena. Scientists will probably recognize the need for both. We need to clearly define our concepts, and know how to measure them, and we often want to know how things are related, or how to manipulate things.
This distinction between conceptual progress and explanatory progress aligns roughly with a distinction between progress with respect to the entities we study and progress with respect to the theories we build to explain how these entities are related. A scientific theory is defined as a set of testable statements about the relation between observations. As noted before, philosophies of science differ in whether they believe statements about entities and theories are related to the truth, and if they are, whether this matters for how we do science. Let’s discuss four flavors of philosophies of science that differ in how much value they place on whether the way we talk about theories and entities corresponds to an objective truth.
Instrumentalism
According to instrumentalism, theories should be seen mainly as tools to solve practical problems, and not as truthful descriptions of the world. Theories are instruments that generate predictions about things we can observe. Theories often refer to unobservable entities, but these entities do not have truth or falsity, and neither do the theories. Scientific theories should not be evaluated based on whether they correspond to the true state of the world, but based on how well they perform.
One important reason to suspend judgment about whether theories are true or false is underdetermination (for an explanation, see Ladyman, 2002). We often do not have enough data to distinguish between different possible theories. If it really isn’t possible to distinguish between theories because we would not be able to collect the required data, it is difficult to say whether one theory is closer to the truth than another.
From an instrumentalist view on scientific progress, and assuming that all theories are underdetermined by data, additional criteria to evaluate theories become important, such as simplicity. Researchers might use approximations to make theories easier to implement, for example in computational models, based on the conviction that simpler theories provide more useful instruments, even if they are slightly less accurate about the true state of the world.
Constructive Empiricism
As opposed to instrumentalism, constructive empiricism acknowledges that theories can be true or not. However, it limits belief in theories to what they say about observable events. Van Fraassen, one of the main proponents of constructive empiricism, suggests we can use a theory without believing it is true when it is empirically adequate. He says: “a theory is empirically adequate exactly if what it says about the observable things and events in the world, is true”. Constructive empiricists might decide to use a theory, but do not have to believe it is true. Theories often make statements that go beyond what we can observe, but constructive empiricists limit truth statements to observable entities. Because no truth values are ascribed to unobservable entities that are assumed to exist in the real world, this approach is grouped under ‘anti-realist’ philosophies of science.
Entity Realism
Entity realists are willing to take one step beyond constructive empiricism and acknowledge a belief in unobservable entities when a researcher can demonstrate impressive causal knowledge of an (unobservable) entity. When knowledge about an unobservable entity can be used to manipulate its behavior, or when knowledge about the entity can be used to manipulate other phenomena, one can believe that it is real. However, entity realists remain skeptical about scientific theories.
Hacking (1982) writes, in a very accessible article, how: “The vast majority of experimental physicists are realists about entities without a commitment to realism about theories. The experimenter is convinced of the existence of plenty of “inferred” and “unobservable” entities. But no one in the lab believes in the literal truth of present theories about those entities. Although various properties are confidently ascribed to electrons, most of these properties can be embedded in plenty of different inconsistent theories about which the experimenter is agnostic.” Researchers can be realists about entities, but anti-realists about models.
Scientific Realism
We can compare the constructive empiricist and entity realist views with scientific realism. For example, Niiniluoto (1999) writes that in contrast to a constructive empiricist:
“a scientific realist sees theories as attempts to reveal the true nature of reality even beyond the limits of empirical observation. A theory should be cognitively successful in the sense that the theoretical entities it postulates really exist and the lawlike descriptions of these entities are true. Thus, the basic aim of science for a realist is true information about reality. The realist of course appreciates empirical success like the empiricist. But for the realist, the truth of a theory is a precondition for the adequacy of scientific explanations.”
For scientific realists, verisimilitude, or ‘truthlikeness’ is treated as the basic epistemic utility of science. It is based on the empirical success of theories. As De Groot (1969) writes: “The criterion par excellence of true knowledge is to be found in the ability to predict the results of a testing procedure. If one knows something to be true, he is in a position to predict; where prediction is impossible, there is no knowledge.” Failures to predict are thus very impactful for a scientific realist.
Progress in Science
There are more similarities than differences between almost all philosophies of science. All approaches believe a goal of science is progress. Anarchists refrain from specifying what progress looks like. Feyerabend writes: “my thesis is that anarchism helps to achieve progress in any one of the senses one cares to choose” – but progress is still a goal of science. For instrumentalists, the proof is in the pudding – theories are good as long as they lead to empirical progress, regardless of whether these theories are true. For a scientific realist, theories are better the more verisimilitude they have, that is, the closer they get to an unknown truth. For all approaches (except perhaps anarchism) conceptual progress and explanatory progress are valued.
Conceptual progress is measured by increased accuracy in how a concept is measured, and increased consensus on what is measured. Progress concerning measurement accuracy is easily demonstrated since it is mainly dependent on the amount of data that is collected, and can be quantified by the standard error of the measurement. Consensus is perhaps less easily demonstrated, but Meehl (2004) provides some suggestions, such as a theory being generally talked about as a ‘fact’, research and technological applications using the theory without any need to study it directly anymore, and the only discussions of the theory at scientific meetings being in panels about history or celebrations of past successes. We then wait for an (arguably arbitrary) 50 years to see if there is any change, and if not, we consider the theory accepted by consensus. Although Meehl acknowledged this is a somewhat brute-force approach to epistemology, he believes philosophers of science should be less distracted by exceptions such as Newtonian physics, which was overthrown after 200 years, and acknowledge something like his approach will probably work in practice most of the time.
Explanatory progress is mainly measured by our ability to predict novel facts. Whether prediction (showing a theoretical prediction is supported by data) should be valued more than accommodation (adjusting a theory to accommodate unexpected observations) is a matter of debate. Some have argued that it doesn’t matter if a theory is stated before data is observed or after data is observed. Keynes writes: “The peculiar virtue of prediction or predesignation is altogether imaginary. The number of instances examined and the analogy between them are the essential points, and the question as to whether a particular hypothesis happens to be propounded before or after their examination is quite irrelevant.” It seems as if Keynes dismisses practices such as pre-registration, but his statement comes with a strong caveat, namely that researchers are completely unbiased. He writes: “to approach statistical evidence without preconceptions based on general grounds, because the temptation to ‘cook’ the evidence will prove otherwise to be irresistible, has no logical basis and need only be considered when the impartiality of an investigator is in doubt.”
Keynes’ analysis of prediction versus accommodation is limited to the evidence in the data. However, Mayo (2018) convincingly argues we put more faith in predicted findings than accommodated findings because the former have passed a severe test. If data is used when generating a hypothesis (i.e., the hypothesis has no use-novelty) the hypothesis will fit the data, no matter whether the theory is true or false. It is guaranteed to match the data, because the theory was constructed with this aim. A theory that is constructed based on the data has not passed a severe test. When novel data is collected in a well-constructed experiment, a hypothesis is unlikely to pass a test (e.g., yield a significant result) if the hypothesis is false. The strength of not using the data when constructing a hypothesis comes from the fact that it has passed a more severe test: it had a higher probability of being proven wrong (but wasn’t).
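As a small illustration of how measurement accuracy depends on the amount of data, the sketch below computes the standard error of a mean for increasingly large samples. The population values (a mean of 100 and a standard deviation of 15), the sample sizes, and the function names are assumptions chosen for the example, not values from the literature discussed here.

```python
import math
import random
import statistics


def standard_error(sample):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def simulate_standard_errors(sample_sizes, true_mean=100.0, true_sd=15.0, seed=1):
    """Draw one sample per size from a normal distribution (assumed values)
    and return a dict mapping each sample size to its standard error."""
    rng = random.Random(seed)
    results = {}
    for n in sample_sizes:
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        results[n] = standard_error(sample)
    return results


if __name__ == "__main__":
    for n, se in simulate_standard_errors([25, 100, 400, 1600]).items():
        print(f"n = {n:5d}  standard error = {se:.2f}")
```

Because the standard error scales with one over the square root of the sample size, each fourfold increase in the sample should roughly halve it, which is why this kind of progress is comparatively easy to demonstrate.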
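The difference between accommodation and prediction can be made concrete with a small simulation. The sketch below is not Mayo’s formal definition of severity, only an illustration under assumed settings: 20 variables that are all pure noise, 30 observations per variable, and a z-test with a known standard deviation of 1. A hypothesis is constructed by picking the variable with the largest observed effect, and is then ‘tested’ either on the same data (accommodation) or on newly collected data (prediction).

```python
import math
import random

CRITICAL_Z = 1.96  # two-sided 5% threshold for a z-test with known SD


def z_statistic(sample):
    """z statistic for H0: mean = 0, assuming a known SD of 1."""
    return (sum(sample) / len(sample)) * math.sqrt(len(sample))


def pass_rates(n_studies=1000, n_variables=20, n=30, seed=2018):
    """Return (accommodated, predicted) pass rates when every true effect is zero."""
    rng = random.Random(seed)
    accommodated = 0
    predicted = 0
    for _ in range(n_studies):
        # Exploratory data: every variable is pure noise.
        data = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_variables)]
        z_values = [abs(z_statistic(variable)) for variable in data]
        # The hypothesis is constructed after seeing the data: pick the largest effect.
        chosen = max(range(n_variables), key=lambda j: z_values[j])
        # Accommodation: "test" the hypothesis on the data that suggested it.
        if z_values[chosen] > CRITICAL_Z:
            accommodated += 1
        # Prediction: test the same hypothesis on newly collected data.
        new_sample = [rng.gauss(0, 1) for _ in range(n)]
        if abs(z_statistic(new_sample)) > CRITICAL_Z:
            predicted += 1
    return accommodated / n_studies, predicted / n_studies


if __name__ == "__main__":
    accommodated, predicted = pass_rates()
    print(f"passes on the data that suggested it: {accommodated:.2f}")
    print(f"passes on novel data:                 {predicted:.2f}")
```

Even though every hypothesis in this simulation is false, the accommodated hypothesis passes its ‘test’ in a majority of simulated studies, while the same hypothesis passes on novel data only about as often as the alpha level allows. The first test had little chance of proving the hypothesis wrong; the second did.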
Does Your Philosophy of Science Matter?
Even if scientists generally agree that conceptual progress and explanatory progress are valuable, and that explanatory progress can be demonstrated by testing theoretical predictions, your philosophy of science likely influences how much you weigh the different questions researchers ask when they do scientific research. Research can be more theory driven, or more exploratory, and it seems plausible that your views on which you value more are in part determined by your philosophy of science.
For example, do you perform research by formalizing strict theoretical predictions, and collect data to corroborate or falsify these predictions to increase the verisimilitude of the theory? Or do you largely ignore theories in your field, and aim to accurately measure relationships between variables? Developing strong theories can be useful for a scientific field, because they facilitate the organization of known phenomena, help to predict what will happen in new situations, and guide new research. Collecting reliable information about phenomena can provide the information needed to make decisions, and provides important empirical information that can be used to develop theories.
For a scientific realist, a main aim is to test whether theories reflect reality. Scientific research starts with specifying a falsifiable theory. The goal of an experiment is to test the theory. If the theory passes the test, the theory gains verisimilitude; if it fails a test, it loses verisimilitude, and needs to be adjusted. If a theory repeatedly fails to make successful predictions (what Lakatos calls a degenerating research programme) it is eventually abandoned. If the theory proves successful in making predictions, it becomes established knowledge.
For an entity realist like Hacking (1982), experiments provide knowledge about entities, and therefore experiments determine what we believe, not theories. He writes: “Hence, engineering, not theorizing, is the proof of scientific realism about entities.” Van Fraassen similarly stresses the importance of experiments, which are crucial in establishing facts about observable phenomena. He sees a role for theory, but it is quite different from the role it plays in scientific realism. Van Fraassen writes: “Scientists aim to discover facts about the world – about the regularities in the observable part of the world. To discover these, one needs experimentation as opposed to reason and reflection. But those regularities are exceedingly subtle and complex, so experimental design is exceedingly difficult. Hence the need for the construction of theories, and for appeal to previously constructed theories to guide the experimental inquiry.”
Theory-driven versus data-driven
One might be tempted to align philosophies of science along a continuum of how strongly theory-driven (or confirmatory) they are, and how strongly data-driven (or exploratory) they are. Indeed, Van Fraassen writes: “The phenomenology of scientific theoretical advance may indeed be exactly like the phenomenology of exploration and discovery on the Dark Continent or in the South Seas, in certain respects.” Note that exploratory data-driven research is not void of theory – but the role theories play has changed. There are two roles, according to Van Fraassen. First, the outcome of an experiment is ‘filling in the blanks in a developing theory’. Second, as the regularities we aim to uncover become more complex, we need theory to guide experimental design. Often a theory states there must be something, but it is very unclear what this something actually is.
For example, a theory might predict there are individual differences, or contextual moderators, but the scientist needs to discover which individual differences, or what contextual moderators. In this instance, the theory has many holes in it that need to be filled. As scientists fill in the blanks, there are typically new consequences that can be tested. As Van Fraassen writes: “This is how experimentation guides the process of theory construction, while at the same time the part of the theory that has already been constructed guides the design of the experiments that will guide the continuation”. For example, if we learn that, as expected, individual differences moderate an effect, and the effect is more pronounced for older compared to younger individuals, these experimental results guide theory construction. Van Fraassen goes as far as to say that “experimentation is the continuation of theory construction by other means.”
For a scientific realist, the main goal of experimentation is to test theories, not to construct them. Exploration is still valuable but is less prominent in scientific realism. If a theory is at the stage where it predicts something will happen, but is not specific about what this something is, it is difficult to come up with a result that would falsify that prediction (except for cases where it is plausible that nothing would happen, which might be limited to highly controlled randomized experiments). Scientific realism requires well-specified theories. When data do not support theoretical predictions, this should be consequential. It means a theory is less ‘truth-like’ than we thought before.
Subjective or Objective Inferences?
Subjective beliefs in a theory or hypothesis play an important role in science. Beliefs are likely to have strong motivational power, leading scientists to invest time and effort in examining the things they examine. It has been a matter of debate whether subjective beliefs should play a role in the evaluation of scientific facts.
Both Fisher (1935) and Popper (1934/1959) disapproved of introducing subjective probabilities into statistical inferences. Fisher writes: “advocates of inverse probability seem forced to regard mathematical probability, not as an objective quantity measured by observed frequencies, but as measuring merely psychological tendencies, theorems respecting which are useless for scientific purposes.” Popper writes: “We must distinguish between, on the one hand, our subjective experiences or our feelings of conviction, which can never justify any statement (though they can be made the subject of psychological investigation) and, on the other hand, the objective logical relations subsisting among the various systems of scientific statements, and within each of them.” For Popper, objectivity does not reside in theories, which he believes are never verifiable, but in tests of theories: “the objectivity of scientific statements lies in the fact that they can be inter-subjectively tested.”
This concern cuts across statistical approaches. Taper and Lele (2011), who are likelihoodists, write: “We dismiss Bayesianism for its use of subjective priors and a probability concept that conceives of probability as a measure of personal belief.” They continue: “It is not that we believe that Bayes’ rule or Bayesian mathematics is flawed, but that from the axiomatic foundational definition of probability Bayesianism is doomed to answer questions irrelevant to science. We do not care what you believe, we barely care what we believe, what we are interested in is what you can show.” Their approach seems closest to a scientific realism perspective. Although they acknowledge all models are false, they also write: “some models are better approximations of reality than other models” and “we believe that growth in scientific knowledge can be seen as the continual replacement of current models with models that approximate reality more closely.”
Gelman and Shalizi, who use Bayesian statistics but dislike subjective Bayes, write: “To reiterate, it is hard to claim that the prior distributions used in applied work represent statisticians’ states of knowledge and belief before examining their data, if only because most statisticians do not believe their models are true, so their prior degree of belief in all of Θ [the parameter space used to generate a model] is not 1 but 0”. Although subjective Bayesians would argue that having no belief that your models are true is too dogmatic (it means you would never be convinced otherwise, regardless of how much data is collected) it is not unheard of in practice. Physicists know the Standard Model is wrong, but it works. This means they assign a probability of 0 to the Standard Model being true – which violates one of the core assumptions of Bayesian inference (which is that the plausibility assigned to a hypothesis can be represented as a number between 0 and 1). Gelman and Shalizi approach statistical inferences from a philosophy of science perhaps closest to constructive empiricism when they write: “Either way, we are using deductive reasoning as a tool to get the most out of a model, and we test the model – it is falsifiable, and when it is consequentially falsified, we alter or abandon it.”
Both Taper and Lele (2011) and Gelman and Shalizi (2013) stress that models should be tested against reality. Taper and Lele want procedures that are reliable (i.e., are unlikely to yield incorrect conclusions in the long run), and that provide good evidence (i.e., when the data is in, it should provide strong relative support for one model over another). They write: “We strongly believe that one of the foundations of effective epistemology is some form of reliabilism. Under reliabilism, a belief (or inference) is justified if it is formed from a reliable process.” Similarly, Gelman and Shalizi write: “the hypothesis linking mathematical models to empirical data is not that the data-generating process is exactly isomorphic to the model, but that the data source resembles the model closely enough, in the respects which matter to us, that reasoning based on the model will be reliable.”
As an example of an alternative viewpoint, we can consider a discussion about whether optional stopping (repeatedly analyzing data and stopping the data collection whenever the data supports predictions) is problematic or not. In frequentist statistics, the practice of optional stopping inflates the error rate (and thus, to control the error rate, the alpha level needs to be adjusted when sequential analyses are performed). Rouder (2014) believes optional stopping is no problem for Bayesians. He writes: “In my opinion, the key to understanding Bayesian analysis is to focus on the degree of belief for considered models, which need not and should not be calibrated relative to some hypothetical truth.” It is the responsibility of researchers to choose models they are interested in (although how this should be done is still a matter of debate). Bayesian statistics allows researchers to update their belief concerning these models based on the data – irrespective of whether these models have a relation with reality. Although optional stopping increases error rates (for an excellent discussion, see Mayo, 2018), and this reduces the severity of the tests the hypotheses pass (which is why researchers worry about such practices undermining reproducibility) such concerns are not central to a subjective Bayesian approach to statistical inferences.
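The remark that a prior probability of 0 means you can never be convinced otherwise follows directly from Bayes’ rule, and can be checked with a few lines of code. The sketch below is a toy example: the function names and the likelihood values (each observation is assumed to be nine times more likely if the hypothesis is true than if it is false) are made up for illustration.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule for a single hypothesis versus its negation."""
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence


def update_repeatedly(prior, n_observations,
                      likelihood_if_true=0.9, likelihood_if_false=0.1):
    """Update a belief on n observations that all favor the hypothesis.
    The likelihood values are hypothetical and chosen only for illustration."""
    belief = prior
    for _ in range(n_observations):
        belief = posterior(belief, likelihood_if_true, likelihood_if_false)
    return belief


if __name__ == "__main__":
    for prior in (0.0, 0.01, 0.5):
        print(f"prior = {prior:.2f}  belief after 50 observations = "
              f"{update_repeatedly(prior, 50):.4f}")
```

A skeptical but non-zero prior is overwhelmed by consistent evidence, whereas a prior of exactly 0 stays at 0, because the prior multiplies the likelihood in the numerator of Bayes’ rule.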
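The frequentist claim that optional stopping inflates the error rate is easy to verify by simulation. The sketch below is a minimal version under assumed settings (a z-test with a known standard deviation of 1, batches of 20 observations, and an unadjusted two-sided alpha of 5% at every look); the function names are made up for the example. All data are generated with a true effect of zero, so every significant result is a Type 1 error.

```python
import math
import random

CRITICAL_Z = 1.96  # unadjusted two-sided 5% threshold, used at every look


def study_is_significant(rng, looks, batch_size):
    """Run one study under H0 and return True if any interim look is significant."""
    total = 0.0
    n = 0
    for _ in range(looks):
        for _ in range(batch_size):
            total += rng.gauss(0, 1)
            n += 1
        z = total / math.sqrt(n)  # z-test with a known SD of 1
        if abs(z) > CRITICAL_Z:
            return True  # stop collecting data as soon as the result is significant
    return False


def type_1_error_rate(looks, batch_size=20, n_studies=5000, seed=2014):
    """Proportion of simulated null studies that end with a significant result."""
    rng = random.Random(seed)
    hits = sum(study_is_significant(rng, looks, batch_size) for _ in range(n_studies))
    return hits / n_studies


if __name__ == "__main__":
    for looks in (1, 2, 5, 10):
        print(f"{looks:2d} look(s): Type 1 error rate = {type_1_error_rate(looks):.3f}")
```

With a single look the error rate should stay close to the nominal 5%, and it should climb as more unadjusted looks are added. Lowering the alpha level at each look, as sequential analysis procedures do, is what brings the overall rate back under control.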
Can You Pick Only One Philosophy of Science?
Ice-cream stores fare well selling cones with more than one scoop. Sure, it can get a bit messy, but sometimes choosing is just too difficult. When it comes to philosophy of science, do you need to pick one approach and stick to it for all problems? I am not sure. Philosophers seem to implicitly suggest this (or at least they don’t typically discuss the pre-conditions to adopt their proposed philosophy of science, and seem to imply their proposal generalizes across fields and problems within fields).
Some viewpoints (such as whether there is a truth or not, and if theories have some relation to truth) seem rather independent of the research context. It is still fine to change your view over time (philosophers of science themselves change their opinion over time!) but they are probably somewhat stable.
Other viewpoints seem to leave more room for flexibility depending on the research you are doing. You might not believe theories in a specific field are good enough to be used as anything but crude verbal descriptions of phenomena. I teach an introduction to psychology course to students at the Eindhoven Technical University, and near the end of one term a physics student approached me after class and said: “You very often use the word ‘theory’, but many of these ‘theories’ don’t really sound like theories”. If you have ever tried to create computational models of psychological theories, you will have experienced that it typically cannot be done: theories lack sufficient detail. Furthermore, you might feel the concepts used in your research area are not specified well enough to really know what we are talking about. If this is the case, you might not have the goal to test theories (or try to explain phenomena) but mainly want to focus on conceptual progress by improving measurement techniques or accurately estimating effect sizes. Or you might work on more applied problems and believe that a specific theory is just a useful instrument that guides you towards possibly interesting questions, but is not in itself something that can be tested, or that accurately describes reality.
Researchers often cite and use theories in their research, but they are rarely explicit about what these theories mean to them. Do you believe theories reflect some truth in the world, or are they just useful instruments to guide research that should not be believed to be true? Is the goal of your research to test theories, or to construct theories? Do you have a strong belief that the unobservable entities you are studying are real, or do you prefer to limit your belief to statements about things you can directly observe? Being clear about where you stand with respect to these questions might make it clear what different scientists expect scientific progress to look like, and clarify what their goals are when they collect data. It might explain differences in how people respond when a theoretical prediction is not confirmed, or why some researchers prefer to accurately measure the entities they study, while others prefer to test theoretical predictions.
References
Feyerabend, P. (1993). Against method. London: Verso.
Fisher, R. A. (1935). The design of experiments. London: Oliver and Boyd.
Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38. https://doi.org/10/f4k2h4
Hacking, I. (1982). Experimentation and Scientific Realism. Philosophical Topics, 13(1), 71–87. https://doi.org/10/fz8ftm
Keynes, J. M. (1921). A Treatise on Probability. Cambridge University Press.
Kitcher, P. (1993). The advancement of science: science without legend, objectivity without illusions. New York: Oxford University Press.
Ladyman, J. (2002). Understanding philosophy of science. London; New York: Routledge.
Mayo, D. G. (2018). Statistical inference as severe testing: how to get beyond the statistics wars. Cambridge: Cambridge University Press.
Meehl, P. E. (2004). Cliometric metatheory III: Peircean consensus, verisimilitude and asymptotic method. The British Journal for the Philosophy of Science, 55(4), 615–643.
Niiniluoto, I. (1999). Critical Scientific Realism. Oxford University Press.
Popper, K. R. (1959). The logic of scientific discovery. London; New York: Routledge.
Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review, 21(2), 301–308.
Taper, M. L., & Lele, S. R. (2011). Evidence, Evidence Functions, and Error Probabilities. In P. S. Bandyopadhyay & M. R. Forster (Eds.), Philosophy of Statistics (Vol. 7, pp. 513–532). Amsterdam: North-Holland. https://doi.org/10.1016/B978-0-444-51862-0.50015-0