Jerzy Neyman: A Positive Role Model in the History of Frequentist Statistics

Many of the facts in this blog post come from the biography ‘Neyman’ by Constance Reid. I highly recommend reading this book if you find this blog interesting.

In recent years researchers have become increasingly interested in the relationship between eugenics and statistics, especially focusing on the lives of Francis Galton, Karl Pearson, and Ronald Fisher. Some have gone as far as to argue for a causal relationship between eugenics and frequentist statistics. For example, in a recent book ‘Bernoulli’s Fallacy’, Aubrey Clayton speculates that Fisher’s decision to reject prior probabilities and embrace a frequentist approach was “also at least partly political”. Rejecting prior probabilities, Clayton argues, makes science seem more ‘objective’, which would have helped Ronald Fisher and his predecessors to establish eugenics as a scientific discipline, despite the often-racist conclusions eugenicists reached in their work.

When I was asked to review an early version of Clayton’s book for Columbia University Press, I found the main narrative rather unconvincing, and the presented history of frequentist statistics too one-sided and biased. Authors who link statistics to problematic political views often do not mention equally important figures in the history of frequentist statistics who were in all ways the opposite of Ronald Fisher. In this blog post, I want to briefly discuss the work and life of Jerzy Neyman, for two reasons.


Jerzy Neyman (image from https://statistics.berkeley.edu/people/jerzy-neyman)

First, the focus on Fisher’s role in the history of frequentist statistics is surprising, given that the dominant approach to frequentist statistics used in many scientific disciplines is the Neyman-Pearson approach. If you have ever rejected a null hypothesis because a p-value was smaller than an alpha level, or if you have performed a power analysis, you have used the Neyman-Pearson approach to frequentist statistics, and not the Fisherian approach. Neyman and Fisher disagreed vehemently about their statistical philosophies (in 1961 Neyman published an article titled ‘Silver Jubilee of My Dispute with Fisher’), but it was Neyman’s philosophy that won out and became the default approach to hypothesis testing in most fields[i]. Anyone discussing the history of frequentist hypothesis testing should therefore seriously engage with the work of Jerzy Neyman and Egon Pearson. Their work was not in line with the views of Karl Pearson, Egon’s father, nor the views of Fisher. Indeed, it was a great source of satisfaction to Neyman that their seminal 1933 paper was presented to the Royal Society by Karl Pearson, who was hostile and skeptical of the work, and (as Neyman thought) reviewed by Fisher[ii], who strongly disagreed with their philosophy of statistics.

Second, Jerzy Neyman was also the opposite of Fisher in his political viewpoints. Instead of promoting eugenics, Neyman worked throughout his life to improve the position of those less privileged, teaching disadvantaged people in Poland and creating educational opportunities for Americans at UC Berkeley. He hired David Blackwell, who was the first Black tenured faculty member at UC Berkeley. This is important, because it falsifies the idea put forward by Clayton[iii] that frequentist statistics became the dominant approach in science because the most important scientists who worked on it wanted to pretend their dubious viewpoints were based on ‘objective’ scientific methods.

I think it is useful to broaden the discussion of the history of statistics beyond the work of Fisher and Karl Pearson, and to credit the work of others[iv] who contributed in at least as important ways to the statistics we use today. I am continually surprised by how few people working outside of statistics even know the name of Jerzy Neyman, even though they regularly use his insights when testing hypotheses. In this blog post, I will try to describe his work and life to add some balance to the history of statistics that most people seem to learn about. And more importantly, I hope Jerzy Neyman can be a positive role model for young frequentist statisticians, who might so far have only been educated about the life of Ronald Fisher.


Neyman’s personal life


Neyman was born in 1894 in Russia, but raised in Poland. After attending the gymnasium, he studied at the University of Kharkov. He initially wanted to become an experimental physicist, but he was too clumsy with his hands, and switched to conceptual mathematics, completing his undergraduate degree in 1917 in politically tumultuous times. In 1919 he met his wife, and they married in 1920. Ten days later, because of the war between Russia and Poland, Neyman was imprisoned for a short time, and in 1921 he fled to a small village to avoid being arrested again, obtaining food by teaching the children of farmers. He worked for the Agricultural Institute, and then at the University in Warsaw. He obtained his doctoral degree in 1924 at age 30. In September 1925 he was sent to London for a year to learn about the latest developments in statistics from Karl Pearson himself. It was there that he met Egon Pearson, Karl’s son, and a friendship and scientific collaboration started.

Neyman always spent a lot of time teaching, often at the expense of doing scientific work. He was involved in equal opportunity education in 1918 in Poland, teaching in dimly lit classrooms where the rag he used to wipe the blackboard would sometimes freeze. He always had a weak spot for intellectuals from ‘disadvantaged’ backgrounds. He and his wife were themselves very poor until he moved to UC Berkeley in 1938. In 1929, back in Poland, his wife became ill due to their bad living conditions, and the doctor who came to examine her was so struck by their miserable circumstances that he offered to let the couple stay in his house, for the same rent they were already paying, while he visited France for six months. In his letters to Egon Pearson from this time, Neyman often complained that the struggle for existence took all his time and energy, and that he could not do any scientific work.

Even much later in his life, in 1978, he kept in mind that many people have very little money, and called ahead to restaurants to make sure a dinner before a seminar would not cost too much for the students. It is perhaps no surprise that most of his students (and he had many) talk about Neyman with a lot of appreciation. He wasn’t perfect (for example, Erich Lehmann – one of Neyman’s students – remarked that Lehmann was no longer allowed to teach a class after his notes, building on but extending the work by Neyman, became extremely popular – suggesting Neyman was no stranger to envy). But his students were extremely positive about the atmosphere he created in his lab. For example, job applicants were told around 1947 that “there is no discrimination on the basis of age, sex, or race … authors of joint papers are always listed alphabetically.”

Neyman himself often suffered discrimination, sometimes because of his difficulty mastering the English language, sometimes for being Polish (when, in Paris, a piece of clothing, an ermine wrap, was stolen from their room, the police responded “What can you expect – only Poles live there!”), sometimes because he did not believe in God, and sometimes because his wife was Russian and very emancipated (living independently in Paris as an artist). He was fiercely against discrimination. In 1933, as anti-Semitism was on the rise among students at the university where he worked in Poland, he complained in a letter to Egon Pearson that the students were behaving towards Jews as Americans do towards people of color. In 1941 at UC Berkeley he hired women at a time when it was not easy for a woman to get a job in mathematics.

In 1942, Neyman examined the possibility of hiring David Blackwell, a Black statistician, then still a student. Neyman met him in New York (so that Blackwell would not need to travel to Berkeley at his own expense) and considered Blackwell the best candidate for the job. The wife of a mathematics professor (who was born in the south of the US) learned about the possibility that a Black statistician might be hired, warned she would not invite a Black man to her house, and there was enough concern about the effect the hire would have on the department that Neyman could not make an offer to Blackwell. He was able to get Blackwell to Berkeley in 1953 as a visiting professor, and offered him a tenured job in 1954, making David Blackwell the first Black tenured faculty member at the University of California, Berkeley. And Neyman did this even though Blackwell was a Bayesian[v] ;).

In 1963, Neyman travelled to the south of the US and for the first time directly experienced segregation. Back in Berkeley, he wrote a letter requesting contributions for the Southern Christian Leadership Conference (founded by Martin Luther King, Jr. and others); 4000 copies were printed and shared with colleagues at the university and friends around the country, which brought in more than $3000. He wrote a letter to his friend Harald Cramér saying he believed Martin Luther King, Jr. deserved a Nobel Peace Prize (a letter Cramér forwarded to the chairman of the Nobel Committee, and which Neyman believed might have contributed at least a tiny bit to the fact that Martin Luther King, Jr. was awarded the Nobel Peace Prize a year later). Neyman also worked towards the establishment of a Special Scholarships Committee at UC Berkeley with the goal of providing educational opportunities to disadvantaged Americans.

Neyman was not a pacifist. In the Second World War he actively looked for ways to contribute to the war effort. He was involved in statistical models that computed the optimal spacing of bombs dropped by planes to clear a path across a beach of land mines. (When at a certain moment he needed specifics about the beach, a representative from the military who was not allowed to directly provide this information asked if Neyman had ever been to the seashore in France; Neyman replied that he had been to Normandy, and the representative answered “Then use that beach!”). But Neyman early and actively opposed the Vietnam War, despite the risk of losing lucrative contracts the Statistical Laboratory had with the Department of Defense. In 1964 he joined a group of people who bought advertisements in local newspapers with a picture of a napalmed Vietnamese child and the quote “The American people will bluntly and plainly call it murder”.


A positive role model


It is important to know the history of a scientific discipline. Histories are complex, and we should resist overly simplistic narratives. If your teacher explains frequentist statistics to you, it is good if they highlight that someone like Fisher had questionable ideas about eugenics. But the early developments in frequentist statistics involved many researchers beyond Fisher, and, luckily, there are many more positive role-models that also deserve to be mentioned – such as Jerzy Neyman. Even though Neyman’s philosophy on statistical inferences forms the basis of how many scientists nowadays test hypotheses, his contributions and personal life are still often not discussed in histories of statistics – an oversight I hope the current blog post can somewhat mitigate. If you want to learn more about the history of statistics through Neyman’s personal life, I highly recommend the biography of Neyman by Constance Reid, which was the source for most of the content of this blog post.




[i] See Hacking, 1965: “The mature theory of Neyman and Pearson is very nearly the received theory on testing statistical hypotheses.”

[ii] The biography reveals that it was not Fisher, but A. C. Aitken, who reviewed the paper, and that the review was positive.

[iii] Clayton’s book seems to be mainly intended as an attempt to persuade readers to become a Bayesian, and not as an accurate analysis of the development of frequentist statistics.

[iv] William Gosset (or ‘Student’, from ‘Student’s t-test’), who was the main inspiration for the work by Neyman and Pearson, is another giant in frequentist statistics who does not in any way fit into the narrative that frequentist statistics is tied to eugenics, as his statistical work was motivated by applied research questions in the Guinness brewery. Gosset was a modest man – which is probably why he rarely receives the credit he is due.

[v] When asked about his attitude towards Bayesian statistics in 1979, he answered: “It does not interest me. I am interested in frequencies.” He did note that multiple legitimate approaches to statistics exist, and that the choice one makes is largely a matter of personal taste. Neyman opposed subjective Bayesian statistics because their use could lead to bad decision procedures, but he was very positive about the later work by Wald, which inspired Bayesian statistical decision theory.


P-values vs. Bayes Factors

At the first partially in-person scientific meeting I attended after the COVID-19 pandemic, the Perspectives on Scientific Error conference at the Lorentz Center in Leiden, the organizers asked Eric-Jan Wagenmakers and me to engage in a discussion about p-values and Bayes factors. We each gave 15-minute presentations to set up our arguments, centered around three questions: What is the goal of statistical inference? What is the advantage of your approach in a practical/applied context? And when do you think the other approach may be applicable?


What is the goal of statistical inference?


When browsing through the latest issue of Psychological Science, many of the titles of scientific articles make scientific claims. “Parents Fine-Tune Their Speech to Children’s Vocabulary Knowledge”, “Asymmetric Hedonic Contrast: Pain is More Contrast Dependent Than Pleasure”, “Beyond the Shape of Things: Infants Can Be Taught to Generalize Nouns by Objects’ Functions”, “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis”, or “Response Bias Reflects Individual Differences in Sensory Encoding”. These authors are telling you that if you take away one thing from the work they have been doing, it is a claim that some statistical relationship is present or absent. This approach to science, where researchers collect data to make scientific claims, is extremely common (we discuss this extensively in our preprint “The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests” by Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). It is not the only way to do science – there is purely descriptive work, or estimation, where researchers present data without making any claims beyond the observed data, so there is no single goal in statistical inference – but if you browse through scientific journals, you will see that a large percentage of published articles have the goal to make one or more scientific claims.


Claims can be correct or wrong. If scientists used a coin flip as their preferred methodological approach to make scientific claims, they would be right and wrong 50% of the time. This error rate is considered too high for scientific claims to be useful, and therefore scientists have developed somewhat more advanced methodological approaches to making claims. One such approach, widely used across scientific fields, is Neyman-Pearson hypothesis testing. If you have performed a statistical power analysis when designing a study, and if you think it would be problematic to p-hack when analyzing the data from your study, you have engaged in Neyman-Pearson hypothesis testing. The goal of Neyman-Pearson hypothesis testing is to control the maximum rate of incorrect scientific claims the scientific community collectively makes. For example, when authors write “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis” we could expect a study design where the authors specified a smallest effect size of interest, and statistically rejected the presence of any worthwhile effect of language status on children’s executive functioning in an equivalence test. They would make such a claim with a pre-specified maximum Type 1 error rate, the alpha level, often set to 5%. Formally, the authors are saying “We might be wrong, but we claim there is no meaningful effect here, and if all scientists collectively act as if we are correct about claims generated by this methodological procedure, we would be misled no more than alpha% of the time, which we deem acceptable, so let’s for the foreseeable future (until new data emerge that prove us wrong) assume our claim is correct”. Discussion sections are often less formal, and researchers often violate the code of conduct for research integrity by selectively publishing only those results that confirm their predictions, which messes up many of the statistical conclusions we draw in science.
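The error-control guarantee described above can be checked with a small simulation. The sketch below is my own illustration (the one-sample z-test with known sigma, the sample size, and the number of simulated studies are all arbitrary choices, not taken from any study discussed here): when the null hypothesis is true, a test with alpha = 0.05 should reject in roughly 5% of studies, no more.

```python
import math
import random

def one_sample_z_test(xs, mu0=0.0, sigma=1.0):
    """Two-sided p-value for a one-sample z-test with known sigma
    (a deliberately simple test, used only to illustrate error control)."""
    n = len(xs)
    z = (sum(xs) / n - mu0) / (sigma / math.sqrt(n))
    # normal CDF via math.erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
alpha, reps, n = 0.05, 20_000, 30

# simulate many studies in which H0 is exactly true (mean really is 0)
rejections = sum(
    one_sample_z_test([random.gauss(0, 1) for _ in range(n)]) < alpha
    for _ in range(reps)
)
print(f"observed Type 1 error rate: {rejections / reps:.3f}")  # close to alpha
```

The point of the simulation is not the specific test but the long-run property: whatever the community's chosen alpha, claims made by this procedure mislead us at most alpha% of the time when the null is true.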


The process of claim making described above does not depend on an individual’s personal beliefs, unlike some Bayesian approaches. As Taper and Lele (2011) write: “It is not that we believe that Bayes’ rule or Bayesian mathematics is flawed, but that from the axiomatic foundational definition of probability Bayesianism is doomed to answer questions irrelevant to science. We do not care what you believe, we barely care what we believe, what we are interested in is what you can show.” This view is strongly based on the idea that the goal of statistical inference is the accumulation of correct scientific claims through methodological procedures that lead all scientists who evaluate the tests of these claims to the same conclusions. Incorporating individual priors into statistical inferences, and making claims dependent on prior beliefs, does not provide science with a methodological procedure that generates collectively established scientific claims. Bayes factors provide a useful and coherent approach to updating individual beliefs, but they are not a useful tool for establishing collectively agreed upon scientific claims.


What is the advantage of your approach in a practical/applied context?


A methodological procedure built around a Neyman-Pearson perspective works well in a science where scientists want to make claims, but where we want to prevent too many incorrect scientific claims. One attractive property of this methodological approach is that the scientific community can collectively agree upon the severity with which a claim has been tested. If we design a study with 99.9% power for the smallest effect size of interest and use a 0.1% alpha level, everyone agrees the risk of an erroneous claim is low. If you personally do not like the claim, several options for criticism are possible. First, you can argue that no matter how small the error rate was, errors still occur with their appropriate frequency, no matter how surprised we would be if they occurred to us (I am paraphrasing Fisher). Thus, you might want to run two or three replications, until the probability of an error has become so small that the scientific community no longer considers it sensible to perform additional replication studies, based on a cost-benefit analysis. Because it is practically very difficult to reach agreement on cost-benefit analyses, the field often resorts to rules or regulations. Just as we could debate whether it is sensible to allow a driver with a certain level of experience to drive 138 kilometers per hour on some stretches of road at some times of day, such fine-grained discussions are too complex to implement in practice; instead, speed limits of 50, 80, 100, and 130 kilometers per hour are used (depending on location and time of day). Similarly, scientific organizations decide upon thresholds that certain subfields are expected to use (such as an alpha level of 0.000003 in physics to declare a discovery, or the two-study rule of the FDA).
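The arithmetic behind running "two or three replications" is worth making explicit. Under the idealized assumption that replications are independent and each controls its Type 1 error rate at alpha, the probability that an entire series of k studies consists of nothing but Type 1 errors is alpha^k, which shrinks very quickly:

```python
alpha = 0.05  # maximum Type 1 error rate of each individual study

# probability that ALL of k independent studies are Type 1 errors,
# i.e., that every study shows an effect that is not really there
for k in (1, 2, 3):
    print(f"{k} stud(ies): {alpha ** k:.6f}")
# 1 stud(ies): 0.050000
# 2 stud(ies): 0.002500
# 3 stud(ies): 0.000125
```

Real replications are of course not perfectly independent (shared methods, shared biases, publication bias), so this is a best-case bound; but it shows why a handful of well-designed replications can push the collective error probability far below any threshold a cost-benefit analysis would plausibly demand.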


Subjective Bayesian approaches can be used in practice to make scientific claims. For example, one can preregister that a claim will be made when a BF > 10 or a BF < 0.1. This is done in practice, for example in Registered Reports in Nature Human Behaviour. The problem is that this methodological procedure does not in itself control the rate of erroneous claims. Some researchers have published frequentist analyses of Bayesian methodological decision rules (Note: Leonhard Held brought up these Bayesian/frequentist compromise methods as well – during coffee after our discussion, EJ and I agreed that we like these approaches, as they allow researchers to control frequentist error rates while interpreting the evidential value in the data – it is a win-win solution). This works by determining through simulations which test-statistic cut-off value should be used to make claims. The process is often a bit laborious, but if you have the expertise and care about evidential interpretations of data, it is worth doing.
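A minimal version of such a frequentist analysis of a Bayesian decision rule might look as follows. As a stand-in for whatever default Bayes factor one prefers, I use the rough BIC-based approximation from Wagenmakers (2007) for a one-sample test of mean = 0; the sample size, seed, and BF > 10 threshold are arbitrary illustration choices, not a recommendation. The simulation estimates how often the "claim an effect when BF10 > 10" rule errs when the null is exactly true:

```python
import math
import random

def bf10_bic(xs):
    """BIC approximation (Wagenmakers, 2007) to the Bayes factor for
    H1 (mean free) over H0 (mean = 0); a crude stand-in for a default BF."""
    n = len(xs)
    m = sum(xs) / n
    sse0 = sum(x * x for x in xs)          # residual SS under H0: mu = 0
    sse1 = sum((x - m) ** 2 for x in xs)   # residual SS under H1: mu free
    bf01 = math.sqrt(n) * (sse1 / sse0) ** (n / 2)
    return 1 / bf01

random.seed(2)
reps, n = 5_000, 50

# simulate studies in which H0 is true and count how often the
# preregistered decision rule "claim an effect if BF10 > 10" fires
false_claims = sum(
    bf10_bic([random.gauss(0, 1) for _ in range(n)]) > 10
    for _ in range(reps)
)
print(f"rate of erroneous 'effect' claims under H0: {false_claims / reps:.4f}")
```

Running such a simulation over a grid of thresholds (or sample sizes) is exactly the calibration step described above: one picks the cut-off whose simulated error rate matches the frequentist guarantee one wants, while still reporting the Bayes factor as a measure of evidence.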


In practice, an advantage of frequentist approaches is that criticism has to focus on the data and the experimental design, which can be resolved in additional experiments. In subjective Bayesian approaches, researchers can ignore the data and the experimental design, and instead waste time criticizing priors. For example, in a comment on Bem (2011) Wagenmakers and colleagues concluded that “We reanalyze Bem’s data with a default Bayesian t test and show that the evidence for psi is weak to nonexistent.” In a response, Bem, Utts, and Johnson stated “We argue that they have incorrectly selected an unrealistic prior distribution for their analysis and that a Bayesian analysis using a more reasonable distribution yields strong evidence in favor of the psi hypothesis.” I expect that most reasonable people would agree more with the prior chosen by Bem and colleagues than with the prior chosen by Wagenmakers and colleagues (Note: In the discussion EJ agreed that in hindsight the prior in the main paper was not the best choice, but noted that the supplementary files included a sensitivity analysis demonstrating the conclusions were robust across a range of priors, and that the analysis by Bem et al. combined Bayes factors in a flawed manner). More productive than discussing priors is the fact that data collected in direct replications since 2011 consistently lead to claims that there is no precognition effect. As Bem has not been able to successfully counter the claims based on the data collected in these replication studies, we can currently collectively act as if Bem’s studies were all Type 1 errors (in part caused by extensive p-hacking).


When do you think the other approach may be applicable?


Even though, in the approach to science I have described here, Bayesian approaches based on individual beliefs are not useful for making collectively agreed upon scientific claims, there are situations in which all scientists are Bayesians. First, we have to rely on our beliefs when we cannot collect sufficient data to repeatedly test a prediction. When data are scarce, we cannot use a methodological procedure that makes claims with low error rates. Second, we can benefit from prior information when we know we cannot be wrong. Incorrect priors can mislead, but if we know our priors are correct, even though this might be rare, we should use them. Finally, use individual beliefs when you are not interested in convincing others, but only want to guide individual actions where being right or wrong does not impact others. For example, you can use your personal beliefs when you decide which study to run next.


Conclusion


In practice, analyses based on p-values and Bayes factors will often agree. Indeed, one of the points of discussion in the rest of the day was that we have bigger problems than the choice between statistical paradigms. A study with a flawed sample size justification or a bad measure is flawed, regardless of how we analyze the data. Yet a good understanding of the value of the frequentist paradigm is important to be able to push back against problematic developments, such as researchers or journals who ignore the error rates of their claims, leading to scientific claims that are incorrect too often. Furthermore, a discussion of this topic helps us think about whether we actually want to pursue the goals that our statistical tools achieve, and whether we actually want to organize knowledge generation by making scientific claims that others have to accept or criticize (a point we develop further in Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). Yes, discussions about p-values and Bayes factors might in practice not have the biggest impact on improving our science, but it is still important and enjoyable to discuss these fundamental questions, and I’d like to thank EJ Wagenmakers and the audience for an extremely pleasant discussion.

Why I care about replication studies

In 2009 I attended a European Social Cognition Network meeting in Poland. I only remember one talk from that meeting: A short presentation in a nearly empty room. The presenter was a young PhD student – Stephane Doyen. He discussed two studies where he tried to replicate a well-known finding in social cognition research related to elderly priming, which had shown that people walked more slowly after being subliminally primed with elderly related words, compared to a control condition.

His presentation blew my mind. But it wasn’t because the studies failed to replicate – it was widely known in 2009 that these studies couldn’t be replicated. Indeed, around 2007, I had overheard two professors in a corridor discussing the problem that there were studies in the literature everyone knew would not replicate. And they used this exact study on elderly priming as one example. The best solution the two professors came up with to correct the scientific record was to establish an independent committee of experts that would have the explicit task of replicating studies and sharing their conclusions with the rest of the world. To me, this sounded like a great idea.

And yet, in this small conference room in Poland, there was this young PhD student, acting as if we didn’t need specially convened institutions of experts to inform the scientific community that a study could not be replicated. He just got up, told us about how he wasn’t able to replicate this study, and sat down.

It was heroic.

If you’re struggling to understand why on earth I thought this was heroic, then this post is for you. You might have entered science in a different time. The results of replication studies are no longer communicated only face to face when running into a colleague in the corridor, or at a conference. But I was impressed in 2009. I had never seen anyone give a talk in which the only message was that an original effect didn’t stand up to scrutiny. People sometimes presented successful replications. They presented null effects in lines of research where the absence of an effect was predicted in some (but not all) tests. But I’d never seen a talk where the main conclusion was just: “This doesn’t seem to be a thing”.

On 12 September 2011 I sent Stephane Doyen an email. “Did you ever manage to publish some of that work? I wondered what has happened to it.” Honestly, I didn’t really expect that he would manage to publish these studies. After all, I couldn’t remember ever having seen a paper in the literature that was just a replication. So I asked, even though I did not expect he would have been able to publish his findings.

Surprisingly enough, he responded that the study would soon appear in press. I wasn’t fully aware of new developments in the publication landscape, where Open Access journals such as PlosOne published articles as long as the work was methodologically solid, and the conclusions followed from the data. I shared this news with colleagues, and many people couldn’t wait to read the paper: An article, in print, reporting the failed replication of a study many people knew to be not replicable. The excitement was not about learning something new. The excitement was about seeing replication studies with a null effect appear in print.

Regrettably, not everyone was equally excited. The publication also led to extremely harsh online comments from the original researcher about the expertise of the authors (e.g., suggesting that findings can fail to replicate due to “Incompetent or ill-informed researchers”), and about the quality of PlosOne (“which quite obviously does not receive the usual high scientific journal standards of peer-review scrutiny”). This type of response happened again, and again, and again. Another failed replication led to a letter from the original author that circulated over email among eminent researchers in the area, was addressed to the replicating authors, and ended with “do yourself, your junior co-authors, and the rest of the scientific community a favor. Retract your paper.”

Some of the historical record of the discussions between researchers between 2012 and 2015 survives online, in Twitter and Facebook discussions and blogs. But recently, I started to realize that most early career researchers don’t read about the replication crisis through these original materials, but through summaries, which don’t give the same impression as having lived through these times. It was weird to see established researchers argue that people performing replications lacked expertise. That null results were never informative. That thanks to dozens of conceptual replications, the original theoretical point would still hold up even if direct replications failed. As time went by, it became even weirder to see that none of the researchers whose work was not corroborated in replication studies ever published a preregistered replication study to silence the critics. And why were there even two sides to this debate? Although most people agreed there was room for improvement and that replications should play some role in improving psychological science, there was no agreement on how this should work. I remember being surprised that the field was only now thinking about how to perform and interpret replication studies, given that we had been doing psychological research for more than a century.

I wanted to share this autobiographical memory, not just because I am getting old and nostalgic, but also because young researchers are most likely to learn about the replication crisis through summaries and high-level overviews. Summaries of history aren’t very good at communicating how confusing this time was when we lived through it. There was a lot of uncertainty, diversity in opinions, and lack of knowledge. And there were a lot of feelings involved. Most of those things don’t make it into written histories. This can make historical developments look cleaner and simpler than they actually were.

It might be difficult to understand why people got so upset about replication studies. After all, we live in a time where it is possible to publish a null result (e.g., in journals that only evaluate methodological rigor, but not novelty, journals that explicitly invite replication studies, and in Registered Reports). Don’t get me wrong: We still have a long way to go when it comes to funding, performing, and publishing replication studies, given their important role in establishing regularities, especially in fields that desire a reliable knowledge base. But perceptions about replication studies have changed in the last decade. Today, it is difficult to feel how unimaginable it used to be that researchers in psychology would share their results at a conference or in a scientific journal when they were not able to replicate the work by another researcher. I am sure it sometimes happened. But there was clearly a reason those professors I overheard in 2007 were suggesting to establish an independent committee to perform and publish studies of effects that were widely known to be not replicable.

As people started to talk about their experiences trying to replicate the work of others, the floodgates opened, and the scales fell from people's eyes. Let me tell you that, from my personal experience, we didn't call it a replication crisis for nothing. All of a sudden, many researchers who thought it was their own fault when they couldn't replicate a finding started to realize the problem was systemic. It didn't help that in those days it was difficult to communicate with people you didn't already know. Twitter (which is most likely the medium through which you learned about this blog post) launched in 2006, but up to 2010 hardly any academics used the platform. Back then, it wasn't easy to get information outside of the published literature. It's difficult to express how it feels when you realize 'it's not me – it's all of us'. Our environment influences which phenotypic traits express themselves. These experiences made me care about replication studies.

If you started in science when replications were at least somewhat more rewarded, it might be difficult to understand what people were making a fuss about in the past. It’s difficult to go back in time, but you can listen to the stories by people who lived through those times. Some highly relevant stories were shared after the recent multi-lab failed replication of ego-depletion (see tweets by Tom Carpenter and Dan Quintana). You can ask any older researcher at your department for similar stories, but do remember that it will be a lot more difficult to hear the stories of the people who left academia because most of their PhD consisted of failures to build on existing work.

If you want to try to feel what living through those times must have been like, consider this thought experiment. You attend a conference organized by a scientific society where all society members get to vote on who will be a board member next year. Before the votes are cast, the president of the society informs you that one of the candidates has been disqualified. The reason is that it has come to the society's attention that this candidate selectively reported results from their research lines: The candidate submitted only those studies for publication that confirmed their predictions, and did not share studies with null results, even though these null results came from well-designed studies that tested sensible predictions. Most people in the audience, including yourself, were already aware of the fact that this person selectively reported their results. You knew publication bias was problematic from the moment you started to work in science, and the field had known it was problematic for centuries. Yet here you are, in a room at a conference, where this status quo is not accepted. All of a sudden, it feels like it is possible to actually do something about a problem that has made you feel uneasy ever since you started to work in academia.

You might live through a time where publication bias is no longer silently accepted as an unavoidable aspect of how scientists work, and if this happens, the field will likely have a discussion very similar to the one it had when it started to publish failed replication studies. And ten years later, a new generation will have been raised under different scientific norms and practices, where extreme publication bias is a thing of the past. It will be difficult to explain to them why this topic was a big deal a decade ago. But since you're getting old and nostalgic yourself, you think that it's useful to remind them, and you just might try to explain it to them in a 2-minute TikTok video.

History merely repeats itself. It has all been done before. Nothing under the sun is truly new.
Ecclesiastes 1:9

Thanks to Farid Anvari, Ruben Arslan, Noah van Dongen, Patrick Forscher, Peder Isager, Andrea Kis, Max Maier, Anne Scheel, Leonid Tiokhin, and Duygu Uygun for discussing this blog post with me (and in general for providing such a stimulating social and academic environment in times of a pandemic).

The p-value misconception eradication challenge

If you have educational material that you think will do a better job at preventing p-value misconceptions than the material in my MOOC, join the p-value misconception eradication challenge by proposing an improvement to my current material in a new A/B test in my MOOC.

I launched a massive open online course "Improving your statistical inferences" in October 2016. So far around 47k students have enrolled, and the evaluations suggest it has been a useful resource for many researchers. The first week focuses on p-values: what they are, what they aren't, and how to interpret them.

Arianne Herrera-Bennet was interested in whether an understanding of p-values was indeed "impervious to correction" as some statisticians believe (Haller & Krauss, 2002, p. 1) and collected data on accuracy rates on 'pop quizzes' between August 2017 and 2018 to see whether there was any improvement in the p-value misconceptions commonly examined in the literature. The questions were asked at the beginning of the course, after the relevant content was taught, and at the end of the course. As the figure below from the preprint shows, there was clear improvement, and accuracy rates were quite high for 5 items, and reasonable for 3 items.


We decided to perform a follow-up from September 2018 where we added an assignment to week one for half the students in an ongoing A/B test in the MOOC. In this new assignment, we didn’t just explain what p-values are (as in the first assignment in the module all students do) but we also tried to specifically explain common misconceptions, to explain what p-values are not. The manuscript is still in preparation, but there was additional improvement for at least some misconceptions. It seems we can develop educational material that prevents p-value misconceptions – but I am sure more can be done. 

In my paper to appear in Perspectives on Psychological Science on “The practical alternative to the p-value is the correctly used p-value” I write:

“Looking at the deluge of papers published in the last half century that point out how researchers have consistently misunderstood p-values, I am left to wonder: Where is the innovative coordinated effort to create world class educational materials that can freely be used in statistical training to prevent such misunderstandings? It is nowadays relatively straightforward to create online apps where people can simulate studies and see the behavior of p values across studies, which can easily be combined with exercises that fit the knowledge level of bachelor and master students. The second point I want to make in this article is that a dedicated attempt to develop evidence based educational material in a cross-disciplinary team of statisticians, educational scientists, cognitive psychologists, and designers seems worth the effort if we really believe young scholars should understand p values. I do not think that the effort statisticians have made to complain about p-values is matched with a similar effort to improve the way researchers use p values and hypothesis tests. We really have not tried hard enough.”
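The kind of simulation app described in this quote can be prototyped in a few lines. Below is a minimal sketch of my own (not material from the MOOC or the paper): it generates many two-group studies and shows that p-values are uniformly distributed when the null hypothesis is true, but pile up near zero when there is a real effect. The sample size, effect size, and number of simulated studies are arbitrary illustrative choices.

```python
# Minimal sketch: the behavior of p-values across many simulated studies.
# All parameters below are illustrative assumptions, not values from the text.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n = 10_000, 50  # simulated studies, participants per group

def simulate_p_values(effect_size):
    """Run n_sims two-sample t-tests and return their p-values."""
    p_values = np.empty(n_sims)
    for i in range(n_sims):
        control = rng.normal(0.0, 1.0, n)
        treatment = rng.normal(effect_size, 1.0, n)
        p_values[i] = stats.ttest_ind(treatment, control).pvalue
    return p_values

# When the null is true, p-values are uniformly distributed:
# about 5% fall below the .05 alpha level, as they should.
p_null = simulate_p_values(effect_size=0.0)
print(f"H0 true: {np.mean(p_null < 0.05):.3f} of p-values < .05")

# When there is a true effect, small p-values pile up; the proportion
# below .05 is the statistical power of the test.
p_alt = simulate_p_values(effect_size=0.5)
print(f"H1 true: {np.mean(p_alt < 0.05):.3f} of p-values < .05")
```

Plotting a histogram of `p_null` versus `p_alt` makes the contrast immediate, and is exactly the kind of exercise that can be matched to the knowledge level of bachelor and master students.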

If we honestly feel that misconceptions of p-values are a problem, and there are early indications that good education material can help, let’s try to do all we can to eradicate p-value misconceptions from this world.

We have collected enough data in the current A/B test. I am convinced the experimental condition adds some value to people’s understanding of p-values, so I think it would be best educational practice to stop presenting students with the control condition.

However, there might be educational material out there that does a much better job than the material I made at training away misconceptions. So instead of giving all students my own new assignment, I want to give anyone who thinks they can do an even better job the opportunity to demonstrate this. If you have educational material that you think will work even better than my current material, I will create a new experimental condition that contains your teaching material. Over time, we can see which materials perform better, and work towards creating the best educational material to prevent misunderstandings of p-values we can.

If you are interested in working on improving p-value education material, take a look at the first assignment in the module that all students do, and look at the new second assignment I have created to train away misconceptions (and the answers). Then, create (or adapt) educational material such that the assignment is similar in length and content. The learning goal should be to train away common p-value misconceptions – you can focus on any and all you want. If there are multiple people who are interested, we will collectively vote on which material to test first (but people are free to combine their efforts and work together on one assignment). What I can offer is getting your material in front of the 300 to 900 students who enroll each week. Not all of them will start, and not all of them will do the assignments, but your material should reach at least several hundred learners a year, of whom around 40% have a master's degree and 20% have a PhD – so you will be teaching fellow scientists (and beyond) to improve how they work.

I will incorporate this new assignment, and make it publicly available on my blog, as soon as it is done and decided on by all people who expressed interest in creating high quality teaching material. We can evaluate the performance by looking at the accuracy rates on test items. I look forward to seeing your material, and hope this can be a small step towards an increased effort in improving statistics education. We might have a long way to go to completely eradicate p-value misconceptions, but we can start.

P-hacking and optional stopping have been judged violations of scientific integrity

On July 28, 2020, the first Dutch academic was judged to have violated the code of conduct for research integrity for p-hacking and optional stopping with the aim of improving the chances of obtaining a statistically significant result. I think this is a noteworthy event that marks a turning point in how the scientific research community interprets research practices that were widely practiced up to a decade ago. The researcher in question violated scientific integrity in several other important ways, including withdrawing blood without ethical consent, and writing grant proposals in which studies and data were presented that had not been performed and collected. But here, I want to focus on the judgment about p-hacking and optional stopping.

When I studied at Leiden University from 1998 to 2002 and commuted by train from my hometown of Rotterdam, I would regularly smoke a cigar in the smoking compartment of the train during my commute. If I entered a train today and lit a cigar, the responses I would get from my fellow commuters would be markedly different from those 20 years ago. They would probably display moral indignation, or call the train conductor, who would give me a fine. Times change.

When the report on the fraud case of Diederik Stapel came out, the three committees were surprised by a research culture that accepted "sloppy science". But they did not directly refer to these practices as violations of the code of conduct for research integrity. For example, on page 57 they wrote:

 “In the recommendations, the Committees not only wish to focus on preventing or reducing fraud, but also on improving the research culture. The European Code refers to ‘minor misdemeanours’: some data massage, the omission of some unwelcome observations, ‘favourable’ rounding off or summarizing, etc. This kind of misconduct is not categorized under the ‘big three’ (fabrication, falsification, plagiarism) but it is, in principle, equally unacceptable and may, if not identified or corrected, easily lead to more serious breaches of standards of integrity.”

Compare this to the report by LOWI, the Dutch National Body for Scientific Integrity, on a researcher at Leiden University who was judged to have violated the code of conduct for research integrity through p-hacking and optional stopping (note this is my translation from Dutch of the advice on page 17, point IV, and point V on page 4):

“The Board has rightly ruled that Petitioner has violated standards of academic integrity with regard to points 2 to 5 of the complaint.”

With this, LOWI has judged that the Scientific Integrity Committee of Leiden University (abbreviated as CWI in Dutch) ruled correctly with respect to the following:

“According to the CWI, the applicant also acted in violation of scientific integrity by incorrectly using statistical methods (p-hacking) by continuously conducting statistical tests during the course of an experiment and by supplementing the research population with the aim of improving the chances of obtaining a statistically significant result.”

As norms change, what we deemed a misdemeanor before is now simply classified as a violation of academic integrity. I am sure this is very upsetting for this researcher. We've seen similar responses in the past years, where single individuals suffered more than the average researcher for behaviors that many others engaged in as well. They might feel unfairly singled out. The only difference between this researcher at Leiden University and several others who behaved identically was that someone in their environment took the 2018 Netherlands Code of Conduct for Research Integrity seriously when they read section 3.7, point 56:

Call attention to other researchers’ non-compliance with the standards as well as inadequate institutional responses to non-compliance, if there is sufficient reason for doing so.

When it comes to smoking, rules in The Netherlands are regulated through laws. You’d think this would mitigate confusion, insecurity, and negative emotions during a transition – but that would be wishful thinking. In The Netherlands the whole transition has been ongoing for close to two decades, from an initial law allowing a smoke-free working environment in 2004, to a completely smoke-free university campus in August 2020.

The code of conduct for research integrity is not governed by laws, and enforcement of the code of conduct for research integrity is typically not anyone's full-time job. We can therefore expect the change to be even slower than the changes in what we consider acceptable behavior when it comes to smoking. But there is notable change over time.

We see a shift from the “big three” types of misconduct (fabrication, falsification, plagiarism), and somewhat vague language of misdemeanors, that is “in principle” equally unacceptable, and might lead to more serious breaches of integrity, to a clear classification of p-hacking and optional stopping as violations of scientific integrity. Indeed, if you ask me, the ‘bigness’ of plagiarism pales compared to how publication bias and selective reporting distort scientific evidence.

Compare this to smoking laws in The Netherlands, where early on it was still allowed to create separate smoking rooms in buildings, while from August 2020 onwards all school and university terrain (i.e., the entire campus, inside and outside of the buildings) needs to be a smoke-free environment. Slowly but surely, what is seen as acceptable changes.

I do not consider myself to be an exceptionally big idiot – I would say I am pretty average on that dimension – but it did not occur to me how stupid it was to enter a smoke-filled train compartment and light up a cigar during my 30 minute commute around the year 2000. At home, I regularly smoked a pipe (a gift from my father). I still have it. Just looking at the tar stains now makes me doubt my own sanity.



This is despite the fact that the relation between smoking and cancer had been pretty well established since the 1960s. Similarly, when I did my PhD between 2005 and 2009 I was pretty oblivious to the error rate inflation due to optional stopping, despite the fact that one of the more important papers on this topic was published by Armitage, McPherson, and Rowe in 1969. I did realize that flexibility in analyzing data could not be good for the reliability of the findings we reported, but just like when I lit a cigar in the smoking compartment of the train, I failed to adequately understand how bad it was.
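The inflation that Armitage and colleagues described is easy to demonstrate with a short simulation. The sketch below is my own illustration (the batch size and number of looks are arbitrary choices, not values from their paper): even though the null hypothesis is true in every simulated study, repeatedly testing as the data come in and stopping at the first p < .05 produces far more than 5% false positives.

```python
# Minimal sketch of error rate inflation under optional stopping.
# The null hypothesis is true throughout, yet peeking inflates the
# Type 1 error rate well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2020)
n_sims, batch, max_looks = 5_000, 10, 10  # up to 100 participants per group

false_positives = 0
for _ in range(n_sims):
    control, treatment = np.empty(0), np.empty(0)
    for _ in range(max_looks):
        # Collect another batch; both groups come from the same distribution.
        control = np.append(control, rng.normal(0, 1, batch))
        treatment = np.append(treatment, rng.normal(0, 1, batch))
        if stats.ttest_ind(treatment, control).pvalue < 0.05:
            false_positives += 1  # stop and declare a 'significant' effect
            break

# A single fixed-n test would yield ~5% false positives; Armitage et al.
# (1969) tabulate roughly 19% for ten looks at alpha = .05.
print(f"False positive rate with optional stopping: {false_positives / n_sims:.3f}")
```

Sequential analyses that control the error rate across looks exist, of course; the point of the sketch is only to show why unplanned, uncorrected peeking is not a harmless flexibility.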

When smoking laws became stricter, there was a lot of discussion in society. One might even say there was a strong polarization, where on the one hand newspaper articles appeared that claimed how outlawing smoking in the train was ‘totalitarian’, while we also had family members who would no longer allow people to smoke inside their house, which led my parents (both smokers) to stop visiting these family members. Changing norms leads to conflict. People feel personally attacked, they become uncertain, and in the discussions that follow we will see all opinions ranging from how people should be free to do what they want, to how people who smoke should pay more for healthcare.

We’ve seen the same in scientific reform, although the discussion is more often along the lines of how smoking can’t be that bad if my 95 year old grandmother has been smoking a packet a day for 70 years and feels perfectly fine, to how alcohol use or lack of exercise are much bigger problems and why isn’t anyone talking about those.

But throughout all this discussion, norms just change. Even my parents stopped smoking inside their own home around a decade ago. The Dutch National Body for Scientific Integrity has classified p-hacking and optional stopping as violations of research integrity. Science is continuously improving, but change is slow. Someone once explained to me that correcting the course of science is like steering an oil tanker – any change in direction takes a while to be noticed. But when change happens, it’s worth standing still to reflect on it, and look at how far we’ve come.

The Red Team Challenge (Part 4): The Wildcard Reviewer

This is a guest blog by Tiago Lubiana, Ph.D. Candidate in Bioinformatics, University of São Paulo.
Read also Part 1, Part 2, and Part 3 of The Red Team Challenge

Two remarkable moments as a researcher are publishing your first first-author article and the first time a journal editor asks you to review a paper.

Well, at least I imagine so. I haven't experienced either yet. Yet, for some reason, the author of the Red Team Challenge accepted me as a (paid) reviewer for their audacious project.

I believe I am one of the few scientists to receive money for a peer review before doing any unpaid peer-reviews. I’m also perhaps one of the few to review a paper before having any first-author papers. Quite likely, I am the first one to do both at the same time.

I am, nevertheless, a science aficionado. I've breathed science for the past 9 years, working in 10 different laboratories before joining the Computational Systems Biology Lab at the University of São Paulo, where I am pursuing my PhD. I like this whole thing of understanding more about the world, reading, doing experiments, sharing findings with people. That is my thing (to be fair, that is likely our thing).

I also had my crises with the scientific world. A lot of findings in the literature are contradictory. And many others are simply wrong. And they stay wrong, right? It is incredible, but people usually do not update articles even given a need for corrections. The all-powerful, waxy stamp of peer-reviewed is given to a monolithic text-and-figure-and-table PDF, and this PDF is then frozen forever in the hall of fame. And it costs a crazy amount of money to lock this frozen PDF behind paywalls.

I have always been very thorough in my evaluation of any work. With time, I discovered that, for some reason, people don't like to have their works criticized (who would imagine, huh?). That can be attenuated by a lot of training in how to communicate. But even then, people frown upon even constructive criticism. If it is criticism of something that is already published, then it is even worse. So I got quite excited when I saw this call for people to have carte blanche to criticize a piece of work as much as possible.

I got to know the Red Team Challenge via a Whatsapp message sent by Olavo Amaral, who leads the Brazilian Reproducibility Initiative. Well, it looked cool, it paid a fairly decent amount of money, and the application was simple (it did not require a letter of recommendation or anything like this). So I thought: “Why not? I do not know a thing about psychology, but I can maybe spot a few uncorrected multiple comparisons here and there, and I can definitely look at the code.”

I got lucky that the Blue Team (Coles et al.) had a place for random skills (the so-called wildcard category) in their system for selecting reviewers. About a week after applying, I received a message in my mail that stated that I had been chosen as a reviewer. “Great! What do I do now?”

I was obviously a bit scared of making a big blunder or at least just making it way below expectations. But there was a thing that tranquilized me: I was hired. This was not an invitation based on expectations or a pre-existing relationship with the ones hiring me. People were actually paying me, and my tasks for the job were crystal clear.

If I somehow failed to provide a great review, it would not affect my professional life whatsoever (I guess). I had just the responsibility to do a good job that any person has after signing a contract.

I am not a psychologist by training, and so I knew beforehand that the details of the work would be beyond my reach. However, after reading the manuscript, I felt even worse: the manuscript was excellent. Or at least, as far as I could tell as an outsider, a lot of care was taken in planning and reporting the important experimental details.

It is not uncommon for me to cringe upon a few dangling uncorrected p-values here and there, even when reading something slightly out of my expertise. Or to find some evidence of optional stopping. Or pinpoint some statistical tests from which you cannot really tell what the null hypothesis is and what actually is being tested.

That did not happen. However, everyone involved knew that I was not a psychologist. I was plucked from the class of miscellaneous reviewers. From the start, I knew that I could contribute the most by reviewing the code.

I am a computational biologist, and our peers in the computer sciences usually look down on our code. Software engineers, for example, harshly criticized the code behind a high-profile epidemiological model. I would say that lack of computational reproducibility is pervasive throughout science, and not restricted to one discipline or another.

Luckily, I have always been interested in best practices. I might not always follow them (“do what I say not what I do”), mainly because of environmental constraints. Journals don’t require clean code, for example. And I’ve never heard about “proofreadings” of scripts that come alongside papers.

It was a pleasant surprise to see that the code from the paper was good, better than most of the code I've seen in biology-related scripts. It was filled with comments. The required packages were laid down at the beginning of the script. The environment was cleared in the first line so as to avoid dangling global variables.

These are all good practices. If journals asked reviewers to check code (which they usually do not), it would come out virtually unscathed.

But I was being paid to review it, so I had to find something. There are some things that can improve one’s code and make it much easier to check and validate. One can avoid commenting too much by using clear variable names, and you do not have to lay down the packages used if the code is containerized (with Docker, for example). A bit of refactoring could be done here and there, also, extracting out functions that were repeatedly used across the code. That was mostly what my review focused on, honestly.

Although these things are relatively minor, they do make a difference. It is a bit like the difference in prose between a non-writer and an experienced writer. The raw content might be the same, but the effectiveness of communication can vary a lot. And reading code can be already challenging, so it is always good to make it easier for the reader (and the reviewer, by extension).

Anyways, I have sent 11 issue reports (below the mean of ~20, but precisely the median of 11 reports/reviewer), and Ruben Arslan, the neutral arbiter, considered one of them to be a major issue. Later, Daniël and Nicholas mentioned that the reviews were helpful, so I am led to believe that somehow I contributed to future improvements in this report. Science wins, I guess.

One interesting aspect of being hired by the authors is that I did not feel compelled to state whether I thought the work was relevant or novel. The work is obviously important for the authors who hired me. The current peer-review system mixes the evaluation of thoroughness and novelty under the same brand. That might be suboptimal in some cases. A good reviewer for statistics, or code, for example, might not feel that they can tell how much a “contribution is significant or only incremental,” as currently required. If that was a requirement for the Red Team Challenge, I would not have been able to be a part of the Red Team.

This mix of functions may be preventing us from getting more efficient reviews. We know that gross mistakes pass peer review. I’d trust a regularly updated preprint, with thorough, open, commissioned peer review, for example. I am sure we can come up with better ways of giving “this-is-good-science” stamps and improve the effectiveness of peer reviews.

To sum up, it felt very good to be in a system with the right incentives. Amidst this whole pandemic thing and chaos everywhere, I ended up being part of something really wonderful. Nicholas, Daniël, and all the others involved in the Red Team challenge are providing prime evidence that an alternate system is viable. Maybe one day, we will have reviewer-for-hire marketplaces and more adequate review incentives. When that day comes, I will be there, be it hiring or being hired.

The Red Team Challenge (Part 3): Is it Feasible in Practice?

By Daniel Lakens & Leo Tiokhin

Also read Part 1 and Part 2 in this series on our Red Team Challenge.


Six weeks ago, we launched the Red Team Challenge: a feasibility study to see whether it could be worthwhile to pay people to find errors in scientific research. In our project, we wanted to see to what extent a “Red Team” – people hired to criticize a scientific study with the goal to improve it – would improve the quality of the resulting scientific work.

Currently, the way that error detection works in science is a bit peculiar. Papers go through the peer-review process and get the peer-reviewed “stamp of approval”. Then, upon publication, some of these same papers receive immediate and widespread criticism. Sometimes this even leads to formal corrections or retractions. And this happens even at some of the most prestigious scientific journals.

So, it seems that our current mechanisms of scientific quality control leave something to be desired. Nicholas Coles, Ruben Arslan, and the authors of this post (Leo Tiokhin and Daniël Lakens) were interested in whether Red Teams might be one way to improve quality control in science.

Ideally, a Red Team joins a research project from the start and criticizes each step of the process. However, doing this would have taken the duration of an entire study. At the time, it also seemed a bit premature — we didn’t know whether anyone would be interested in a Red Team approach, how it would work in practice, and so on. So, instead, Nicholas Coles, Brooke Frohlich, Jeff Larsen, and Lowell Gaertner volunteered one of their manuscripts (a completed study that they were ready to submit for publication). We put out a call on Twitter, Facebook, and the 20% Statistician blog, and 22 people expressed interest. On May 15th, we randomly selected five volunteers based on five areas of expertise: Åse Innes-Ker (affective science), Nicholas James (design/methods), Ingrid Aulike (statistics), Melissa Kline (computational reproducibility), and Tiago Lubiana (wildcard category). The Red Team was then given three weeks to report errors.

Our Red Team project was somewhat similar to traditional peer review, except that we 1) compensated Red Team members’ time with a $200 stipend, 2) explicitly asked the Red Teamers to identify errors in any part of the project (i.e., not just writing), 3) gave the Red Team full access to the materials, data, and code, and 4) provided financial incentives for identifying critical errors (a donation to the GiveWell charity non-profit for each unique “critical error” discovered).

The Red Team submitted 107 error reports. Ruben Arslan – who helped inspire this project with his Bug Bounty Program – served as the neutral arbiter. Ruben examined the reports, evaluated the authors' responses, and ultimately decided whether an issue was "critical" (see this post for Ruben's reflection on the Red Team Challenge). Of the 107 reports, Ruben concluded that there were 18 unique critical issues (for details, see this project page). Ruben decided that any major issue that potentially invalidated inferences was worth $100, minor issues related to computational reproducibility were worth $20, and minor issues that could be resolved without much work were worth $10. After three weeks, the total final donation was $660. The Red Team detected 5 major errors. These included two previously unknown limitations of a key manipulation, inadequacies in the design and description of the power analysis, an incorrectly reported statistical test in the supplemental materials, and a lack of information about the sample in the manuscript. Minor issues concerned the reproducibility of code and clarifications about the procedure.



After receiving this feedback, Nicholas Coles and his co-authors decided to hold off submitting their manuscript (see this post for Nicholas’ personal reflection). They are currently conducting a new study to address some of the issues raised by the Red Team.

We consider this to be a feasibility study of whether a Red Team approach is practical and worthwhile. So, based on this study, we shouldn’t draw any conclusions about a Red Team approach in science except one: it can be done.

That said, our study does provide some food for thought. Many people were eager to join the Red Team. The study’s corresponding author, Nicholas Coles, was graciously willing to acknowledge issues when they were pointed out. And it was obvious that, had these issues been pointed out earlier, the study would have been substantially improved before being carried out. These findings make us optimistic that Red Teams can be useful and feasible to implement.

In an earlier column, the issue was raised that rewarding Red Team members with co-authorship on the subsequent paper would create a conflict of interest – overly severe criticism might make the paper unpublishable. So, instead, we paid each Red Teamer $200 for their service. We wanted to reward people for their time. We did not want to reward them only for finding issues because, before we knew that 18 unique critical issues would be found, we were naively worried that the Red Team might find few things wrong with the paper. In interviews with Red Team members, it became clear that the charitable donations for each issue were not a strong motivator. Instead, people were just happy to detect issues for decent pay. They didn't think that they deserved authorship for their work, and several Red Team members didn't consider authorship on an academic paper to be valuable, given their career goals.

After talking with the Red Team members, we started to think that certain people might enjoy Red Teaming as a job – it is challenging, requires skills, and improves science. This opens up the possibility of a freelance services marketplace (such as Fiverr) for error detection, where Red Team members are hired at an hourly rate and potentially rewarded for finding errors. It should be feasible to hire people to check for errors at each phase of a project, depending on their expertise and reputation as good error-detectors. If researchers do not have money for such a service, they might be able to set up a volunteer network where people “Red Team” each other’s projects. It could also be possible for universities to create Red Teams (e.g., Cornell University has a computational reproducibility service that researchers can hire).

As scientists, we should ask ourselves when, and for which type of studies, we want to invest time and/or money to make sure that published work is as free from errors as possible. As we continue to consider ways to increase the reliability of science, a Red Team approach might be something to further explore.

Red Team Challenge

by Nicholas A. Coles, Leo Tiokhin, Ruben Arslan, Patrick Forscher, Anne Scheel, & Daniël Lakens

All else equal, scientists should trust studies and theories that have been more critically evaluated. The more that a scientific product has been exposed to processes designed to detect flaws, the more that researchers can trust the product (Lakens, 2019; Mayo, 2018). Yet, there are barriers to adopting critical approaches in science. Researchers are susceptible to biases, such as confirmation bias, the “better than average” effect, and groupthink. Researchers may gain a competitive advantage for jobs, funding, and promotions by sacrificing rigor in order to produce larger quantities of research (Heesen, 2018; Higginson & Munafò, 2016) or to win priority races (Tiokhin & Derex, 2019). And even if researchers were transparent enough to allow others to critically examine their materials, code, and ideas, there is little incentive for others, including peer reviewers, to do so. These combined factors may hinder the ability of science to detect errors and self-correct (Vazire, 2019).

Today we announce an initiative that we hope can incentivize critical feedback and error detection in science: the Red Team Challenge. Daniël Lakens and Leo Tiokhin are offering a total of $3,000 for five individuals to provide critical feedback on the materials, code, and ideas in the forthcoming preprint titled “Are facial feedback effects solely driven by demand characteristics? An experimental investigation”. This preprint examines the role of demand characteristics in research on the controversial facial feedback hypothesis: the idea that an individual’s facial expressions can influence their emotions. This is a project that Coles and colleagues will submit for publication in parallel with the Red Team Challenge. We hope that the challenge will serve as a useful case study of the role Red Teams might play in science.

We are looking for five individuals to join “The Red Team”. Unlike traditional peer review, this Red Team will receive financial incentives to identify problems. Each Red Team member will receive a $200 stipend to find problems, including (but not limited to) errors in the experimental design, materials, code, analyses, logic, and writing. In addition to these stipends, we will donate $100 to a GiveWell top-ranked charity (maximum total donations: $2,000) for every new “critical problem” detected by a Red Team member. Defining a “critical problem” is subjective, but a neutral arbiter, Ruben Arslan, will make these decisions transparently. At the end of the challenge, we will release: (1) the names of the Red Team members (if they wish to be identified), (2) a summary of the Red Team’s feedback, (3) how much each Red Team member raised for charity, and (4) the authors’ responses to the Red Team’s feedback.

If you are interested in joining the Red Team, you have until May 14th to sign up here. At this link, you will be asked for your name, email address, and a brief description of your expertise. If more than five people wish to join the Red Team, we will ad-hoc categorize people based on expertise (e.g., theory, methods, reproducibility) and randomly select individuals from each category. On May 15th, we will notify people whether they have been chosen to join the Red Team.

For us, this is a fun project for several reasons. Some of us are just interested in the feasibility of Red Team challenges in science (Lakens, 2020). Others want feedback about how to make such challenges more scientifically useful and to develop best practices. And some of us (mostly Nick) are curious to see what good and bad might come from throwing their project into the crosshairs of financially-incentivized research skeptics. Regardless of our diverse motivations, we’re united by a common interest: improving science by recognizing and rewarding criticism (Vazire, 2019).

References
Heesen, R. (2018). Why the reward structure of science makes reproducibility problems inevitable. The Journal of Philosophy, 115(12), 661-674.
Higginson, A. D., & Munafò, M. R. (2016). Current incentives for scientists lead to underpowered studies with erroneous conclusions. PLoS Biology, 14(11), e2000995.
Lakens, D. (2019). The value of preregistration for psychological science: A conceptual analysis. Japanese Psychological Review.
Lakens, D. (2020). Pandemic researchers — recruit your own best critics. Nature, 581, 121.
Mayo, D. G. (2018). Statistical inference as severe testing. Cambridge: Cambridge University Press.
Tiokhin, L., & Derex, M. (2019). Competition for novelty reduces information sampling in a research game-a registered report. Royal Society Open Science, 6(5), 180934.
Vazire, S. (2019). A toast to the error detectors. Nature, 577(9).

What’s a family in family-wise error control?

When you perform multiple comparisons in a study, you need to control your alpha level for multiple comparisons. It is generally recommended to control for the family-wise error rate, but there is some confusion about what a ‘family’ is. As Bretz, Hothorn, & Westfall (2011) write in their excellent book “Multiple Comparisons Using R” on page 15: “The appropriate choice of null hypotheses being of primary interest is a controversial question. That is, it is not always clear which set of hypotheses should constitute the family H1,…,Hm. This topic has often been in dispute and there is no general consensus.” In one of the best papers on controlling for multiple comparisons out there, Bender & Lange (2001) write: “Unfortunately, there is no simple and unique answer to when it is appropriate to control which error rate. Different persons may have different but nevertheless reasonable opinions. In addition to the problem of deciding which error rate should be under control, it has to be defined first which tests of a study belong to one experiment.” The Wikipedia page on family-wise error rate is a mess.

I will be honest: I have never understood this confusion about what a family of tests is when controlling the family-wise error rate. At least not in a Neyman-Pearson approach to hypothesis testing, where the goal is to use data to make decisions about how to act. Neyman (1957) calls his approach inductive behavior. The outcome of an experiment leads one to take different possible actions, which can be either practical (e.g., implement a new procedure, abandon a research line) or scientific (e.g., claim there is or is no effect). From an error-statistical approach (Mayo, 2018), inflated Type 1 error rates mean that it has become very likely that you will be able to claim support for your hypothesis, even when the hypothesis is wrong. This reduces the severity of the test. To prevent this, we need to control our error rate at the level of our claim.
One reason the issue of family-wise error rates might remain vague is that researchers are often vague about their claims. We do not specify our hypotheses unambiguously, and therefore this issue remains unclear. To be honest, I suspect another reason there is a continuing debate about whether and how to lower the alpha level to control for multiple comparisons in some disciplines is that 1) there are a surprisingly large number of papers written on this topic that argue you do not need to control for multiple comparisons, which are 2) cited a huge number of times, giving rise to the feeling that surely they must have a point. Regrettably, the main reason these papers are written is that there are people who don’t think a Neyman-Pearson approach to hypothesis testing is a good idea, and the main reason these papers are cited is that doing so is convenient for researchers who want to publish statistically significant results, as they can justify why they are not lowering their alpha level, making that p = 0.02 in one of three tests really ‘significant’. All papers that argue against the need to control for multiple comparisons when testing hypotheses are wrong. Yes, their existence and massive citation counts frustrate me. It is fine not to test a hypothesis, but when you do, and you make a claim based on a test, you need to control your error rates.

But let’s get back to our first problem, which we can solve by making the claims people need to control Type 1 error rates for less vague. Lisa DeBruine and I recently proposed machine readable hypothesis tests to remove any ambiguity in the tests we will perform to examine statistical predictions, and when we will consider a claim corroborated or falsified. In this post, I am going to use our R package ‘scienceverse’ to clarify what constitutes a family of tests when controlling the family-wise error rate.

An example of formalizing family-wise error control

Let’s assume we collect data from 100 participants in a control and a treatment condition. We collect 3 dependent variables (dv1, dv2, and dv3). In the population there is no difference between groups on any of these three variables (the true effect size is 0). We will analyze the three dv’s in independent t-tests. This requires specifying our alpha level, and thus deciding whether we need to correct for multiple comparisons. How we control error rates depends on the claim we want to make.
We might want to act as if (or claim that) our treatment works if there is a difference between the treatment and control conditions on any of the three variables. In scienceverse terms, this means we consider the prediction corroborated when the p-value of the first t-test is smaller than the alpha level, the p-value of the second t-test is smaller than the alpha level, or the p-value of the third t-test is smaller than the alpha level. In the scienceverse code, we specify a criterion for each test (a p-value smaller than the alpha level, p.value < alpha_level) and conclude the hypothesis is corroborated if any of these criteria is met (“p_t_1 | p_t_2 | p_t_3”).
We could also want to make three different predictions. Instead of one hypothesis (“something will happen”) we have three different hypotheses, and predict there will be an effect on dv1, dv2, and dv3. The criterion for each t-test is the same, but we now have three hypotheses to evaluate (H1, H2, and H3). Each of these claims can be corroborated, or not.
Scienceverse allows you to specify your hypothesis tests unambiguously (for code used in this blog, see the bottom of the post). It also allows you to simulate a dataset, which we can use to examine Type 1 errors by simulating data where no true effects exist. Finally, scienceverse allows you to run the pre-specified analyses on the (simulated) data, and will automatically create a report that summarizes which hypotheses were corroborated (which is useful when checking whether the conclusions in a manuscript indeed follow from the preregistered analyses). The output for a single simulated dataset, for the scenario where we will interpret any effect on the three dv’s as support for the hypothesis, looks like this:

Evaluation of Statistical Hypotheses

12 March, 2020

Simulating Null Effects Postregistration

Results

Hypothesis 1: H1

Something will happen

  • p_t_1 is confirmed if analysis ttest_1 yields p.value<0.05

    The result was p.value = 0.452 (FALSE)

  • p_t_2 is confirmed if analysis ttest_2 yields p.value<0.05

    The result was p.value = 0.21 (FALSE)

  • p_t_3 is confirmed if analysis ttest_3 yields p.value<0.05

    The result was p.value = 0.02 (TRUE)

Corroboration ( TRUE )

The hypothesis is corroborated if anything is significant.

 p_t_1 | p_t_2 | p_t_3 

Falsification ( FALSE )

The hypothesis is falsified if nothing is significant.

 !p_t_1 & !p_t_2 & !p_t_3 

All criteria were met for corroboration.

We see the hypothesis that ‘something will happen’ is corroborated, because there was a significant difference on dv3 – even though this was a Type 1 error, since we simulated data with a true effect size of 0 – and any difference was taken as support for the prediction. With a 5% alpha level, we will observe 1-(1-0.05)^3 = 14.26% Type 1 errors in the long run. This Type 1 error inflation can be prevented by lowering the alpha level, for example by a Bonferroni correction (0.05/3), after which the expected Type 1 error rate is 4.92% (see Bretz et al., 2011, for more advanced techniques to control error rates). When we examine the report for the second scenario, where each dv tests a unique hypothesis, we get the following output from scienceverse:
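The scienceverse code in this post is R, but the error-rate arithmetic is easy to check in a few lines of any language. Here is a minimal Python sketch (not the author's code) that evaluates the disjunctive corroboration criterion on the three simulated p-values from the report above, and reproduces the 14.26% and 4.92% long-run Type 1 error rates:

```python
# p-values for p_t_1, p_t_2, p_t_3 from the simulated report above
p_values = [0.452, 0.21, 0.02]
alpha = 0.05

# Scenario 1: one claim, corroborated if anything is significant
corroborated = any(p < alpha for p in p_values)   # p_t_1 | p_t_2 | p_t_3
falsified = all(p >= alpha for p in p_values)     # !p_t_1 & !p_t_2 & !p_t_3
print(corroborated, falsified)                    # True False

# Long-run Type 1 error rate for the disjunctive claim at alpha = 0.05
fwer = 1 - (1 - alpha) ** 3
print(round(fwer * 100, 2))                       # 14.26

# ... and after a Bonferroni correction (alpha / 3)
fwer_bonf = 1 - (1 - alpha / 3) ** 3
print(round(fwer_bonf * 100, 2))                  # 4.92
```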

Evaluation of Statistical Hypotheses

12 March, 2020

Simulating Null Effects Postregistration

Results

Hypothesis 1: H1

dv1 will show an effect

  • p_t_1 is confirmed if analysis ttest_1 yields p.value<0.05

    The result was p.value = 0.452 (FALSE)

Corroboration ( FALSE )

The hypothesis is corroborated if dv1 is significant.

 p_t_1 

Falsification ( TRUE )

The hypothesis is falsified if dv1 is not significant.

 !p_t_1 

All criteria were met for falsification.

Hypothesis 2: H2

dv2 will show an effect

  • p_t_2 is confirmed if analysis ttest_2 yields p.value<0.05

    The result was p.value = 0.21 (FALSE)

Corroboration ( FALSE )

The hypothesis is corroborated if dv2 is significant.

 p_t_2 

Falsification ( TRUE )

The hypothesis is falsified if dv2 is not significant.

 !p_t_2 

All criteria were met for falsification.

Hypothesis 3: H3

dv3 will show an effect

  • p_t_3 is confirmed if analysis ttest_3 yields p.value<0.05

    The result was p.value = 0.02 (TRUE)

Corroboration ( TRUE )

The hypothesis is corroborated if dv3 is significant.

 p_t_3 

Falsification ( FALSE )

The hypothesis is falsified if dv3 is not significant.

 !p_t_3 

All criteria were met for corroboration.

We now see that two hypotheses were falsified (yes, yes, I know you should not use p > 0.05 to falsify a prediction in real life; this part of the example is formally wrong, but it saves me from also having to explain equivalence testing to readers not familiar with it – if that is you, read this, and know that scienceverse will allow you to specify an equivalence test as the criterion to falsify a prediction, see the example here). The third hypothesis is corroborated, even though, as above, this is a Type 1 error.

It might seem that the second approach, specifying each dv as its own hypothesis, is the way to go if you do not want to lower the alpha level to control for multiple comparisons. But take a look at the report of the study you have performed. You have made 3 predictions, of which 1 was corroborated. That is not an impressive success rate. Sure, mixed results happen, and you should interpret results not just based on the p-value (but on the strength of the experimental design, assumptions about power, your prior, the strength of the theory, etc.), but if these predictions were derived from the same theory, this set of results is not particularly impressive. Since researchers can never selectively report only those results that ‘work’, because this would be a violation of the code of research integrity, we should always be able to see the meager track record of predictions. If you don’t feel ready to make a specific prediction (and run the risk of sullying your track record), either do unplanned exploratory tests, and do not make claims based on their results, or preregister all possible tests you can think of, and massively lower your alpha level to control error rates (for example, genome-wide association studies sometimes use an alpha level of 5 × 10^-8 to control the Type 1 error rate).

Hopefully, specifying our hypotheses (and what would corroborate them) transparently by using scienceverse makes it clear what happens in the long run in both scenarios. In the long run, both the first scenario, if we would use an alpha level of 0.05/3 instead of 0.05, and the second scenario, with an alpha level of 0.05 for each individual hypothesis, will lead to the same end result: Not more than 5% of our claims will be wrong, if the null hypothesis is true. In the first scenario, we are making one claim in an experiment, and in the second we make three. In the second scenario we will end up with more false claims in an absolute sense, but the relative number of false claims is the same in both scenarios. And that’s exactly the goal of family-wise error control.
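For readers who prefer to see the long-run behavior directly, here is a small Monte Carlo sketch (in Python, as a stand-in for a scienceverse simulation) of both scenarios under true null effects. Under the null hypothesis each p-value is uniformly distributed, so we can simulate the p-values directly:

```python
import random

random.seed(1)
n_sims = 200_000
alpha = 0.05

wrong_claims_1 = claims_1 = 0
wrong_claims_2 = claims_2 = 0
for _ in range(n_sims):
    # Under the null, each of the three p-values is uniform on [0, 1]
    p = [random.random() for _ in range(3)]

    # Scenario 1: one disjunctive claim per experiment, Bonferroni alpha
    claims_1 += 1
    if any(pi < alpha / 3 for pi in p):
        wrong_claims_1 += 1

    # Scenario 2: three separate claims per experiment, uncorrected alpha
    claims_2 += 3
    wrong_claims_2 += sum(pi < alpha for pi in p)

# Both proportions of erroneous claims end up close to (at most) 0.05
print(round(wrong_claims_1 / claims_1, 3))
print(round(wrong_claims_2 / claims_2, 3))
```

Scenario 2 produces three times as many wrong claims in an absolute sense, but the proportion of claims that are wrong is capped at roughly 5% in both scenarios, which is the point of the paragraph above.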
References
Bender, R., & Lange, S. (2001). Adjusting for multiple testing—When and how? Journal of Clinical Epidemiology, 54(4), 343–349.
Bretz, F., Hothorn, T., & Westfall, P. H. (2011). Multiple comparisons using R. CRC Press.
Mayo, D. G. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge University Press.
Neyman, J. (1957). “Inductive Behavior” as a Basic Concept of Philosophy of Science. Revue de l’Institut International de Statistique / Review of the International Statistical Institute, 25(1/3), 7. https://doi.org/10.2307/1401671

Thanks to Lisa DeBruine for feedback on an earlier draft of this blog post.


Review of "Do Effect Sizes in Psychology Laboratory Experiments Mean Anything in Reality?"

Researchers spend a lot of time reviewing papers. These reviews are rarely made public, yet sometimes they might be useful for readers of an article. Here, I’m sharing my review of “Do Effect Sizes in Psychology Laboratory Experiments Mean Anything in Reality?” by Roy Baumeister. I reviewed this (blinded) manuscript in December 2019 for a journal, where it was rejected on January 8 based on 2 reviews. Below you can read the review as I submitted it. I am sharing this review because the paper has been accepted at another journal.

In this opinion piece the authors try to argue for the lack of theoretical meaning of effect sizes in psychology. The opinion piece makes a point I think most reasonable people already agree upon (social psychology makes largely ordinal predictions). The question is how educational and well argued their position is that this means effect sizes are theoretically not that important. On that aspect, I found the paper relatively weak. Too many statements are overly simplistic and the text is far behind the state of the art (it reads as if it was written 20 years ago). I think a slightly brushed-up version might make a relatively trivial but generally educational point for anyone not up to speed on this topic. If the authors put in a bit more effort to have a discussion that incorporates the state of the art, this could be a more nuanced piece with a bit stronger analysis of the issues at play, and a bit more vision about where to go. I think the latter would be worth reading for a general audience at this journal.

Main points.

1) What is the real reason our theories do not predict effect sizes?

The authors argue how most theories in social psychology are verbal theories. I would say most verbal theories in social psych are actually not theories (Fiedler, 2004) but closer to tautologies. That being said, I think the anecdotal examples the authors use throughout their paper (obedience effects, bystander effect, cognitive dissonance) are all especially weak, and the paper would be improved if all these examples are removed. Although the authors are correct in stating we hardly care about the specific effect size in those studies, I am not interested in anecdotes of studies where we know we do not care about effect sizes. This is a weak (as in, not severe) test of the argument the authors are making, with confirmation bias dripping from every sentence. If you want to make a point about effect sizes not mattering, you can easily find situations where they do not theoretically matter. But this is trivial and boring. What would be more interesting is an analysis of why we do not care about effect sizes that generalizes beyond the anecdotal examples. The authors are close, but do not provide such a general argument yet. One reason I think the authors would like to mention is that there is a measurement crisis in psych – people use a hodgepodge of often completely unvalidated measures. It becomes a lot more interesting to quantify effect sizes if we all use the same measurement. This would also remove the concern about standardized vs unstandardized effect sizes. But more generally, I think the authors should make an argument more from basic principles, than based on anecdotes, if they want to be convincing.

2) When is something an ordinal prediction?

Now, if we use the same measures, are there cases where we predict effect sizes? The authors argue we never predict an exact effect size. True, but again, uninteresting. We can predict a range of effect sizes. The authors cite Meehl, but they should really incorporate Meehl’s point from his 1990 paper. Predicting a range of effect sizes is already quite something, and the authors do not give this a fair discussion. It matters a lot if I predict an effect in the range of 0.2 to 0.8, even though this is very wide, than if I say I predict any effect larger than zero. Again, a description of the state of the art is missing in the paper. This question has been discussed in many literatures the authors do not mention. The issue is the same as the discussion about whether we should use a uniform prior in Bayesian stats, or a slightly more informative prior, because we predict effects in some range. My own work specifying the smallest effect size of interest also provides quite a challenge to the arguments of the current authors. See especially the example of Burriss and colleagues in our 2018 paper (Lakens, Scheel, & Isager, 2018). They predicted an effect should be noticeable with the naked eye, and it is an example where a theory very clearly makes a range prediction, falsifying the authors’ arguments in the current paper. That these cases exist means the authors are completely wrong in their central thesis. It also means they need to rephrase their main argument: when do we have range predictions, and when are we predicting *any* effect that is not zero. And why? I think many theories in psychology would argue that effects should be larger than some other effect size. This can be sufficient to make a valid range prediction.
Similarly, if psychologists would just think about effect sizes that are too large, we would not have papers in PNAS, edited by Nobel prize winners, that treat the effect of judges on parole decisions over time as a psychological mechanism, when the effect size is too large to be plausible (Glöckner, 2016). So effects should often be larger or smaller than some value, and this does not align with the current argument by the authors.
I would argue the fact that psych theories predict range effects means psych theories make effect size predictions that are relevant enough to quantify. We do a similar thing when we compare 2 conditions, for example when we predict an effect is larger in condition X than in condition Y. In essence, this means we say: we predict the effect size of X to be in a range that is larger than the effect size of Y. Now, this is again a range prediction. We do not just say both X and Y have effects that differ from zero. It is still an ordinal prediction, so it fits with the basic point of the authors about how we predict, but it no longer fits with their argument that we simply test for significance. Ordinal predictions can be more complex than the authors currently describe. To make a solid contribution they will need to address what ordinal predictions are in practice. With the space that is available after the anecdotes are removed, they can add a real analysis of how we test hypotheses in general, where range predictions fit with ordinal predictions, and how, if we used the same measures and had some widely used paradigms, we could, if we wanted to, create theories that make quantifiable range predictions. I agree with the authors that it can be perfectly valid to choose a unique operationalization of a test, and that this allows you to increase or decrease the effect size depending on the operationalization. This is true. But we can make theories that predict things beyond just any effect if we fix our paradigms and measures, and the authors should discuss whether this might be desirable to give their own argument a bit more oomph and credibility. If the authors want to argue that standard measures in psychology are undesirable or impossible, that might be interesting, but I doubt it will work. And thus, I expect the authors will need to give more credit to the power of ordinal predictions.
In essence, my point here is that if you think we cannot make a range prediction on a standardized measure, you also think there can be no prediction that condition X yields a larger effect than condition Y, and yet we make these predictions all the time. Again, in a Stroop task the effect size with 10 trials differs from the effect size with 10,000 trials – but given a standardized measure, we can predict relative differences.

3) Standardized vs unstandardized effect sizes

I might be a bit too rigid here, but when scientists make claims, I like them to be accompanied by evidence. The authors write “Finally, it is commonly believed that in the case of arbitrary units, a standardized effect size is more meaningful and informative than its equivalent in raw score units.” There is no citation, and this to me sounds 100% incorrect. I am sure they might be able to dig out one misguided paper making this claim. But this is not commonly believed, and the literature abundantly shows researchers argue the opposite; the most salient example is Baguley (2009), but any stats book would suffice. The lack of a citation to Baguley is just one of the examples where the authors seem not to be up to speed on the state of the art, and where their message is not nuanced enough, while the discussion in the literature surpassed many of their simple claims more than a decade ago. I think the authors should improve their discussion of standardized and unstandardized effect sizes. Standardized effect sizes are useful if measurement tools differ, and if you have little understanding of what you are measuring. Although I think this is true in general in social psychology (and the measurement crisis is real), I think the authors are not making the point that *given* that social psychology is such a mess when it comes to how researchers in the field measure things, we cannot make theoretically quantifiable predictions. I would agree with this. I think they try to argue that even if social psychology were not such a mess, we could still not make quantifiable predictions. I disagree. Issues related to standardized and unstandardized effect sizes are a red herring. They do not matter at all. If we understood our measures and standardized them, we would have accurate estimates of the sd’s for what we are interested in, and this whole section could just be deleted.
The authors should be clear about whether they think we will never standardize our measures and there is no value in them, or whether it is just difficult in practice right now. Regardless, the issue with standardized effects is moot, since their first sentence claiming that standardized effect sizes are more meaningful is just wrong (for a discussion, see Lakens, 2013).

Minor points

When discussing Festinger and Carlsmith, it makes sense to point out how low quality and riddled with mistakes the study was: https://mattiheino.com/2016/11/13/legacy-of-psychology/.
The authors use the first studies of several classic research lines as an example that psychology predicts directional effects at best, and that these studies cared about demonstrating an effect. Are the authors sure their idea of scientific progress is that we forever limit ourselves to demonstrating effects? This is criticized in many research fields, and the idea of developing computational models in some domains deserves to be mentioned. Even a simple, stupid model can make range predictions that can be tested theoretically. A broader discussion of psychologists who have a bit more ambition for social psychology than the current authors, and who believe that some progress towards even a rough computational model would allow us to predict not just ranges, but also the shapes of effects (e.g., linear vs exponential effects), would be warranted, I think. I think it is fine if the authors have the opinion that social psychology will not move along by more quantification. But I find the final paragraph a bit vague and uninspiring in what the vision is. No one argues against practical applications or conceptual replications. The authors rightly note it is easier (although I think not as much easier as the authors think) to use effects in cost-benefit analyses in applied research. But what is the vision? Demonstration proofs and an existentialistic leap of faith that we can apply things? That has not worked well. Applied psychological researchers have rightly criticized theoretically focused social psychologists for providing basically completely useless existence proofs that often do not translate to any application, and are too limited to be of any value. I do not know what the solution is here, but I would be curious to hear if the authors have a slightly more ambitious vision. If not, that is fine, but if they have one, I think it would boost the impact of the paper.
Signed,
Daniel Lakens
Baguley, T. (2009). Standardized or simple effect size: What should be reported? British Journal of Psychology, 100(3), 603–617. https://doi.org/10.1348/000712608X377117
Fiedler, K. (2004). Tools, toys, truisms, and theories: Some thoughts on the creative cycle of theory formation. Personality and Social Psychology Review, 8(2), 123–131. https://doi.org/10.1207/s15327957pspr0802_5
Glöckner, A. (2016). The irrational hungry judge effect revisited: Simulations reveal that the magnitude of the effect is overestimated. Judgment and Decision Making, 11(6), 601–610.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4. https://doi.org/10.3389/fpsyg.2013.00863
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963

Review of "The Generalizability Crisis" by Tal Yarkoni

A response to this blog by Tal Yarkoni is here.
In a recent preprint titled “The Generalizability Crisis“, Tal Yarkoni examines whether the current practice of how psychologists generalize from studies to theories is problematic. He writes: “The question taken up in this paper is whether or not the tendency to generalize psychology findings far beyond the circumstances in which they were originally established is defensible. The case I lay out in the next few sections is that it is not, and that unsupported generalization lies at the root of many of the methodological and sociological challenges currently affecting psychological science.” We had a long twitter discussion about the paper, and then read it in our reading group. In this review, I try to make my thoughts about the paper clear in one place, which might be useful if we want to continue to discuss whether there is a generalizability crisis, or not.

First, I agree with Yarkoni that almost all the proposals he makes in the section “Where to go from here?” are good suggestions. I don’t think they follow logically from his points about generalizability, as I detail below, but they are nevertheless solid suggestions a researcher should consider. Second, I agree that there are research lines in psychology where modelling more things as random factors will be productive, and a forceful manifesto (even if it is slightly less practical than similar earlier papers) might be a wake-up call for people who had ignored this issue until now.

Beyond these two points of agreement, I found the main thesis of the article largely unconvincing. I don’t think there is a generalizability crisis, but the article is a nice illustration of why philosophers like Popper abandoned the idea of an inductive science. When Yarkoni concludes that “A direct implication of the arguments laid out above is that a huge proportion of the quantitative inferences drawn in the published psychology literature are so inductively weak as to be at best questionable and at worst utterly insensible.” I am primarily surprised he believes induction is a defensible philosophy of science. There is a very brief discussion of the views of Popper, Meehl, and Mayo on page 19, but their work on testing theories is proposed as a possible, but probably not feasible, solution – which is peculiar, because these authors would probably disagree with most of the points made by Yarkoni, and I would expect at least somewhere in the paper a discussion comparing induction against the deductive approach (especially since the deductive approach is arguably the dominant approach in psychology, and therefore none of the generalizability issues raised by Yarkoni are a big concern). Because I believe the article starts from a faulty position (scientists are not concerned with induction, but use deductive approaches) and because Yarkoni provides no empirical support for any of his claims that generalizability has led to huge problems (such as incredibly high Type 1 error rates), I remain unconvinced there is anything remotely close to the generalizability crisis he so evocatively argues for. The topic addressed by Yarkoni is very broad. It probably needs a book-length treatment to do it justice. My review is already way too long, and I did not get into the finer details of the argument. But I hope this review helps to point out the parts of the manuscript where I feel important arguments lack a solid foundation, and where issues that deserve to be discussed are ignored.

Point 1: “Fast” and “slow” approaches need some grounding in philosophy of science.

Early in the introduction, Yarkoni says there is a “fast” and a “slow” approach of drawing general conclusions from specific observations. When the words you use don’t exactly describe what you mean, putting them in quotation marks is generally not a good idea. The “fast” and “slow” approaches he describes are not, I believe upon closer examination, two approaches “of drawing general conclusions from specific observations”.

The difference is actually between induction (the “slow” approach of generalizing from single observations to general observations) and deduction, as proposed by, for example, Popper. As Popper writes: “According to the view that will be put forward here, the method of critically testing theories, and selecting them according to the results of tests, always proceeds on the following lines. From a new idea, put up tentatively, and not yet justified in any way—an anticipation, a hypothesis, a theoretical system, or what you will—conclusions are drawn by means of logical deduction.”

Yarkoni incorrectly suggests that “upon observing that a particular set of subjects rated a particular set of vignettes as more morally objectionable when primed with a particular set of cleanliness-related words than with a particular set of neutral words, one might draw the extremely broad conclusion that ‘cleanliness reduces the severity of moral judgments’”. This reverses the scientific process as proposed by Popper, which is (as several people have argued, see below) the dominant approach to knowledge generation in psychology. The authors are not concluding that “cleanliness reduces the severity of moral judgments” from their data. This would be induction. Instead, they are positing that “cleanliness reduces the severity of moral judgments”, they collected data and performed an empirical test, and found their hypothesis was corroborated. In other words, the hypothesis came first. It is not derived from the data – the hypothesis is what led them to collect the data.

Yarkoni deviates from what is arguably the common approach in psychological science, and suggests induction might actually work: “Eventually, if the effect is shown to hold when systematically varying a large number of other experimental factors, one may even earn the right to summarize the results of a few hundred studies by stating that ‘cleanliness reduces the severity of moral judgments’”. This approach to science flies in the face of Popper (1959/2002, p. 10), who says: “I never assume that we can argue from the truth of singular statements to the truth of theories. I never assume that by force of ‘verified’ conclusions, theories can be established as ‘true’, or even as merely ‘probable’.” Similarly, Lakatos (1978, p. 2) writes: “One can today easily demonstrate that there can be no valid derivation of a law of nature from any finite number of facts; but we still keep reading about scientific theories being proved from facts. Why this stubborn resistance to elementary logic?” I am personally on the side of Popper and Lakatos, but regardless of my preferences, Yarkoni needs to provide some argument that his inductive approach to science has any possibility of being a success, preferably by embedding his views in some philosophy of science. I would also greatly welcome learning why Popper and Lakatos are wrong. Such an argument, which would overthrow the dominant model of knowledge generation in psychology, could be impactful, although a priori I doubt it will be very successful.

Point 2: Titles are not evidence for psychologists’ tendency to generalize too quickly.

This is a minor point, but I think it is a good illustration of the weakness of some of the main arguments that are made in the paper. On the second page, Yarkoni argues that “the vast majority of psychological scientists have long operated under a regime of (extremely) fast generalization”. I don’t know about the vast majority of scientists, but Yarkoni himself is definitely using fast generalization. He looked through a single journal, and found 3 titles that made general statements (e.g., “Inspiration Encourages Belief in God”). When I downloaded and read this article, I noticed that the discussion contains a ‘constraint on generalizability’ statement, following Simons et al. (2017). The authors wrote: “We identify two possible constraints on generality. First, we tested our ideas only in American and Korean samples. Second, we found that inspiring events that encourage feelings of personal insignificance may undermine these effects.” Is Yarkoni not happy with these two sentences clearly limiting the generalizability in the discussion?

For me, this observation raised serious concerns about Yarkoni’s claim that, simply from the titles of scientific articles, we can make a statement about whether authors make ‘fast’ or ‘slow’ generalizations. One reason is that Yarkoni examined titles from a scientific journal that adheres to the publication manual of the APA. In the section on titles, the APA states: “A title should summarize the main idea of the manuscript simply and, if possible, with style. It should be a concise statement of the main topic and should identify the variables or theoretical issues under investigation and the relationship between them. An example of a good title is ‘Effect of Transformed Letters on Reading Speed’.” To me, it seems the authors are simply following the APA publication manual. I do not think their choice of title provides us with any insight whatsoever into the tendency of authors to have a preference for ‘fast’ generalization. Again, this might be a minor point, but I found it an illustrative example of the strength of arguments in other places (see the next point for the most important example). For there to be a generalizability crisis, Yarkoni needs to make the case that scientists are overgeneralizing – but he does so unconvincingly. I sincerely doubt researchers expect their findings to generalize to all possible situations mentioned in the title, I doubt scientists believe titles are the place to accurately summarize limits of generalizability, and I doubt Yarkoni has made a strong point that psychologists overgeneralize based on this section. More empirical work would be needed to build a convincing case (e.g., code how researchers actually generalize their findings in a random selection of 250 articles, taking into account Gricean communication norms (especially the cooperative principle) in scientific articles).

Point 3: Theories and tests are not perfectly aligned in deductive approaches.

After explaining that psychologists use statistics to test predictions based on experiments that are operationalizations of verbal theories, Yarkoni notes: “From a generalizability standpoint, then, the key question is how closely the verbal and quantitative expressions of one’s hypothesis align with each other.”

Yarkoni writes: “When a researcher verbally expresses a particular hypothesis, she is implicitly defining a set of admissible observations containing all of the hypothetical situations in which some measurement could be taken that would inform that hypothesis. If the researcher subsequently asserts that a particular statistical procedure provides a suitable test of the verbal hypothesis, she is making the tacit but critical assumption that the universe of admissible observations implicitly defined by the chosen statistical procedure (in concert with the experimental design, measurement model, etc.) is well aligned with the one implicitly defined by the qualitative hypothesis. Should a discrepancy between the two be discovered, the researcher will then face a choice between (a) working to resolve the discrepancy in some way (i.e., by modifying either the verbal statement of the hypothesis or the quantitative procedure(s) meant to provide an operational parallel); or (b) giving up on the link between the two and accepting that the statistical procedure does not inform the verbal hypothesis in a meaningful way.”

The critical point, I think, is the assumption that the universe of admissible observations defined by the statistical procedure must be well aligned with the one defined by the verbal hypothesis. To generalize from a single observation to a general theory through induction, the sample and the test should represent the general theory. This is why Yarkoni argues that there has to be a direct correspondence between the theoretical model and the statistical test. This is true in induction.

If I want to generalize beyond my direct observations, which are rarely sampled randomly from all possible factors that might impact my estimate, I need to account for uncertainty in the things I have not observed. As Yarkoni clearly explains, one does this by adding random factors to a model. He writes (p. 7): “Each additional random factor one adds to a model licenses generalization over a corresponding population of potential measurements, expanding the scope of inference beyond only those measurements that were actually obtained. However, adding random factors to one’s model also typically increases the uncertainty with which the fixed effects of interest are estimated”. You don’t need to read Popper to see the problem here – if you want to generalize to all possible random factors, there are so many of them, you will never be able to overcome the uncertainty and learn anything. This is why inductive approaches to science have largely been abandoned. As Yarkoni accurately summarizes based on a large multi-lab study on verbal overshadowing by Alogna et al.: “given very conservative background assumptions, the massive Alogna et al. study—an initiative that drew on the efforts of dozens of researchers around the world—does not tell us much about the general phenomenon of verbal overshadowing. Under more realistic assumptions, it tells us essentially nothing.” This is also why Yarkoni’s first practical recommendation on how to move forward is not to solve the problem, but to do something else: “One perfectly reasonable course of action when faced with the difficulty of extracting meaningful, widely generalizable conclusions from effects that are inherently complex and highly variable is to opt out of the enterprise entirely.”
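Yarkoni’s point that adding a random factor inflates the uncertainty of the fixed effect is easy to demonstrate numerically. The sketch below is my own illustration, not code from the paper; the design (20 stimuli with 50 trials each) and all parameter values are hypothetical. Ignoring the stimulus-level variance yields a deceptively small standard error, while licensing generalization over a population of stimuli yields a markedly larger one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical design: k stimuli, n trials per stimulus, true fixed effect 0.3.
k, n = 20, 50
stimulus_offsets = rng.normal(0, 0.5, size=k)  # random stimulus effects (sd = 0.5)
trials = 0.3 + np.repeat(stimulus_offsets, n) + rng.normal(0, 1, size=k * n)

# Treating all k*n trials as exchangeable ignores the stimulus factor entirely.
se_fixed_only = trials.std(ddof=1) / np.sqrt(k * n)

# Generalizing over stimuli: analyze the k stimulus means, so the between-stimulus
# variance propagates into the standard error of the fixed effect.
stimulus_means = trials.reshape(k, n).mean(axis=1)
se_with_random_factor = stimulus_means.std(ddof=1) / np.sqrt(k)

print(f"SE ignoring the stimulus factor:   {se_fixed_only:.3f}")
print(f"SE with stimulus as random factor: {se_with_random_factor:.3f}")
```

The second standard error is much larger: the price of expanding the scope of inference beyond the stimuli actually observed is exactly the extra uncertainty Yarkoni describes.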

This is exactly the reason Popper (among others) rejected induction, and proposed a deductive approach. Why isn’t the alignment between theories and tests raised by Yarkoni a problem for the deductive approach proposed by Popper, Meehl, and Mayo? The reason is that the theory is tentatively posited as true, but in no way believed to be a complete representation of reality. This is an important difference. Yarkoni relies on an inductive approach, and thus the test needs to be aligned with the theory, and the theory defines “a set of admissible observations containing all of the hypothetical situations in which some measurement could be taken that would inform that hypothesis.” For deductive approaches, this is not true.

For philosophers of science like Popper and Lakatos, a theory is not a complete description of reality. Lakatos writes about theories: “Each of them, at any stage of its development, has unsolved problems and undigested anomalies. All theories, in this sense, are born refuted and die refuted.” Lakatos gives the example that Newton’s Principia could not even explain the motion of the moon when it was published. The main point here: All theories are wrong. The fact that all theories (or models) are wrong should not be surprising. Box’s quote “All models are wrong, some are useful” is perhaps best known, but I prefer Box (1976) on parsimony: “Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William Ockham (1285-1349) he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity (Ockham’s knife).” He follows this up by stating “Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.”

In a deductive approach, the goal of a theoretical model is to make useful predictions. I doubt anyone believes that any of the models they are currently working on is complete. Some researchers might follow an instrumentalist philosophy of science, and don’t expect their theories to be anything more than useful tools. Lakatos’s (1978) main contribution to philosophy of science was to develop an account of how we deal with our incorrect theories: all theories need adjustment, but some adjustments lead to progressive research lines, and others to degenerative research lines.

In a deductive model, it is perfectly fine to posit a theory that eating ice-cream makes people happy, without assuming this holds for all flavors, across all cultures, at all temperatures, and irrespective of the amount of ice-cream eaten previously, and many other factors. After all, it is just a tentative model that we hope is simple enough to be useful, and that we expect to become more complex as we move forward. As we increase our understanding of food preferences, we might be able to modify our theory, so that it is still simple, but also allows us to predict the fact that eggnog and bacon-flavoured ice-cream do not increase happiness (on average). The most important thing is that our theory is tentative, and posited to allow us to make good predictions. As long as the theory is useful, and we have no alternatives to replace it with, the theory will continue to be used – without any expectation that it will generalize to all possible situations. As Box (1976) writes: “Matters of fact can lead to a tentative theory. Deductions from this tentative theory may be found to be discrepant with certain known or specially acquired facts. These discrepancies can then induce a modified, or in some cases a different, theory.” A discussion of this large gap between Yarkoni and the deductive approaches proposed by Popper and Meehl, where Yarkoni thinks theories and tests need to align, and deductive approaches see theories as tentative and wrong, should be included, I think.

Point 4: The dismissal of risky predictions is far from convincing (and generalizability is typically a means to risky predictions, not a goal in itself).

If we read Popper (but also, on the statistical side, the work of Neyman), we see that induction as a possible goal of science is clearly rejected. Yarkoni mentions deductive approaches briefly in his section on adopting better standards, in the sub-section on making riskier predictions. I intuitively expected this section to be crucial – after all, it finally turns to those scholars who would vehemently disagree with most of Yarkoni’s arguments in the preceding sections – but I found this part rather disappointing. Strangely enough, Yarkoni simply proposes risky predictions as one possible solution – but since the deductive approach goes directly against the inductive approach proposed by Yarkoni, it seems very odd to merely mention risky predictions as one option, when it is actually a completely opposite approach that rejects most of what Yarkoni argues for. Yarkoni does not seem to believe that the deductive mode proposed by Popper, Meehl, and Mayo, a hypothesis testing approach that is arguably the dominant approach in most of psychology (Cortina & Dunlap, 1997; Dienes, 2008; Hacking, 1965), has a lot of potential. The reason he doubts severe tests of predictions will be useful is that “in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding” (Yarkoni, p. 19). This could be resolved if risky predictions were possible, which Yarkoni doubts.

Yarkoni’s criticism of the possibility of severe tests is regrettably weak. He says that “Unfortunately, in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding.” From his references (Cohen, Lykken, Meehl) we can see he refers to the crud factor, or the idea that the null hypothesis is always false. As we recently pointed out in a review paper on crud (Orben & Lakens, 2019), Meehl and Lykken disagreed about the definition of the crud factor, the evidence of crud in some datasets cannot be generalized to all studies in psychology, and “The lack of conceptual debate and empirical research about the crud factor has been noted by critics who disagree with how some scientists treat the crud factor as an ‘axiom that needs no testing’ (Mulaik, Raju, & Harshman, 1997).” Altogether, I am unconvinced that this cursory reference to crud makes a convincing case that “there are pervasive and typically very plausible competing explanations for almost every finding”. Risky predictions seem possible, to me, and demonstrating the generalizability of findings is actually one way to perform a severe test.

When Yarkoni discusses risky predictions, he sticks to risky quantitative predictions. As explained in Lakens (2020), “Making very narrow range predictions is a way to make it statistically likely to falsify your prediction if it is wrong. But the severity of a test is determined by all characteristics of a study that increases the capability of a prediction to be wrong, if it is wrong. For example, by predicting you will only observe a statistically significant difference from zero in a hypothesis test if a very specific set of experimental conditions is met that all follow from a single theory, it is possible to make theoretically risky predictions.” I think the reason most psychologists perform studies that demonstrate the generalizability of their findings has nothing to do with a desire to inductively build a theory from all these single observations. They show the findings generalize because doing so increases the severity of their tests. In other words, according to this deductive approach, generalizability is not a goal in itself, but follows from the goal to perform severe tests. It is unclear to me why Yarkoni does not think that approaches such as triangulation (Munafò & Smith, 2018) are severe tests. I think these approaches are the driving force behind many of the more successful theories in social psychology (e.g., social identity theory), and it works fine.

Generalization as a means to severely test a prediction is common, and one of the goals of direct replications (generalizing to new samples) and conceptual replications (generalizing to different procedures). Yarkoni might disagree with me that generalization serves severity, not vice versa. But then what is missing from the paper is a solid argument why people would want to generalize to begin with, assuming at least a decent number of them do not believe in induction. The inherent conflict between the deductive approaches and induction is also not explained in a satisfactory manner.

Point 5: Why care about statistical inferences, if these do not relate to sweeping verbal conclusions?

If we ignore all previous points, we can still read Yarkoni’s paper as a call to introduce more random factors in our experiments. This nicely complements recent calls to vary all factors you do not think should change the conclusions you draw (Baribault et al., 2018), and classic papers on random effects (Barr et al., 2013; Clark, 1969; Cornfield & Tukey, 1956).

Yarkoni generalizes from the fact that most scientists model subjects as a random factor, and then asks why scientists generalize to all sorts of other factors that were not in their models. He asks: “Why not simply model all experimental factors, including subjects, as fixed effects”. It might be worth noting in the paper that sometimes researchers model subjects as fixed effects. For example, Fujisaki and Nishida (2009) write: “Participants were the two authors and five paid volunteers”, and nowhere in their analyses do they assume there is any meaningful or important variation across individuals. In many perception studies, an eye is an eye, and an ear is an ear – whether from the author, or a random participant dragged into the lab from the corridor.

In other research areas, we do model individuals as a random factor. Yarkoni says we model subjects as a random factor because: “The reason we model subjects as random effects is not that such a practice is objectively better, but rather, that this specification more closely aligns the meaning of the quantitative inference with the meaning of the qualitative hypothesis we’re interested in evaluating”. I disagree. I think we model certain factors as random effects because we have a high prior that these factors influence the effect, and leaving them out of the model would reduce the strength of our prediction. Leaving them out reduces the probability a test will show we are wrong, if we are wrong. It impacts the severity of the test. Whether or not we need to model factors (e.g., temperature, the experimenter, or day of the week) as random factors because not doing so reduces the severity of a test is a subjective judgment. Research fields need to decide for themselves. It is very well possible that more random factors are generally needed, but I don’t know how many, and I doubt the situation will ever be as severe as the ‘generalizability crisis’ suggests. If it is as severe as Yarkoni suggests, some empirical demonstrations of this would be nice. Clark (1973) showed his language-as-fixed-effect fallacy using real data. Barr et al. (2013) similarly made their point based on real data. I currently do not find the theoretical point very strong, but real data might convince me otherwise.

The issue of including random factors is discussed in a more complete, and importantly, more applicable, manner in Barr et al. (2013). Yarkoni remains vague on which random factors should be included and which not, and just recommends ‘more expansive’ models. I have no idea when this is done satisfactorily. This is a problem with extreme arguments like the one Yarkoni puts forward. It is fine in theory to argue your test should align with whatever you want to generalize to, but in practice, it is impossible. And in the end, statistics is just a reasonably limited toolset that tries to steer people somewhat in the right direction. The discussion in Barr et al. (2013), which includes trade-offs between converging models (which Yarkoni too easily dismisses as solved by modern computational power – it is not solved) and including all possible factors, and interactions between all possible factors, is a bit more pragmatic. Similarly, Cornfield and Tukey (1956) more pragmatically list options ranging from ignoring factors altogether, to randomizing them, or including them as a factor, and note: “Each of these attitudes is appropriate in its place. In every experiment there are many variables which could enter, and one of the great skills of the experimenter lies in leaving out only inessential ones.” Just as pragmatically, Clark (1973) writes: “The wide-spread capitulation to the language-as-fixed-effect fallacy, though alarming, has probably not been disastrous. In the older established areas, most experienced investigators have acquired a good feel for what will replicate on a new language sample and what will not. They then design their experiments accordingly.” As always, it is easy to argue for extremes in theory, but this is generally uninteresting for an applied researcher.
It would be great if Yarkoni could provide something a bit more pragmatic about what to do in practice than his current recommendation to fit “more expansive models” – and provide some indication of where to stop, or at least suggest what an empirical research program would look like that tells us where to stop, and why. In some ways, Yarkoni’s point generalizes the argument that most findings in psychology do not generalize to non-WEIRD populations (Henrich et al., 2010), and it has the same weakness. WEIRD is a nice acronym, but it is just a fairly arbitrary collection of 5 factors that might limit generalizability. The WEIRD acronym functions more as a nice reminder that boundary conditions exist, but it does not allow us to predict when they exist, or when they matter enough to be included in our theories. Currently, there is a gap between the factors that in theory could matter, and the factors that we should in practice incorporate. Maybe it is my pragmatic nature, but without such a discussion, I think the paper offers relatively little progress compared to previous discussions about generalizability (of which there are plenty).

Conclusion

A large part of Yarkoni’s argument is based on the claim that theories and tests should be closely aligned, while in a deductive approach based on severe tests of predictions, models are seen as simple, tentative, and wrong, and this is not considered a problem. Yarkoni does not convincingly argue that researchers want to generalize extremely broadly (although I agree papers would benefit from including Constraints on Generality statements as proposed by Simons and colleagues (2017), but mainly because this improves falsifiability, not because it improves induction), and even if there is a tendency to overclaim in articles, I do not think this leads to an inferential crisis. Previous authors have made many of the same points, but in a more pragmatic manner (e.g., Barr et al., 2013; Clark, 1973). Yarkoni fails to provide any insight into where the balance between generalizing to everything, and generalizing to the factors that matter, should lie, nor does he provide an evaluation of how far off this balance research areas are. It is easy to argue any specific approach to science will not work in theory – but it is much more difficult to convincingly argue it does not work in practice. Until Yarkoni does the latter convincingly, I don’t think the generalizability crisis as he sketches it is something that will keep me up at night.

References

Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., Ravenzwaaij, D. van, White, C. N., Boeck, P. D., & Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences, 115(11), 2607–2612. https://doi.org/10.1073/pnas.1708285114

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3). https://doi.org/10.1016/j.jml.2012.11.001

Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799. https://doi.org/10/gdm28w

Clark, H. H. (1969). Linguistic processes in deductive reasoning. Psychological Review, 76(4), 387–404. https://doi.org/10.1037/h0027578

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4), 335–359.

Cornfield, J., & Tukey, J. W. (1956). Average Values of Mean Squares in Factorials. The Annals of Mathematical Statistics, 27(4), 907–949. https://doi.org/10.1214/aoms/1177728067

Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2(2), 161.

Dienes, Z. (2008). Understanding psychology as a science: An introduction to scientific and statistical inference. Palgrave Macmillan.

Fujisaki, W., & Nishida, S. (2009). Audio–tactile superiority over visuo–tactile and audio–visual combinations in the temporal resolution of synchrony perception. Experimental Brain Research, 198(2), 245–259. https://doi.org/10.1007/s00221-009-1870-x

Hacking, I. (1965). Logic of Statistical Inference. Cambridge University Press.

Henrich, J., Heine, S. J., & Norenzayan, A. (2010). Most people are not WEIRD. Nature, 466(7302), 29.

Lakens, D. (2020). The Value of Preregistration for Psychological Science: A Conceptual Analysis. Japanese Psychological Review. https://doi.org/10.31234/osf.io/jbh4w

Munafò, M. R., & Smith, G. D. (2018). Robust research needs many lines of evidence. Nature, 553(7689), 399–401. https://doi.org/10.1038/d41586-018-01023-3

Orben, A., & Lakens, D. (2019). Crud (Re)defined. https://doi.org/10.31234/osf.io/96dpy

Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017). Constraints on Generality (COG): A Proposed Addition to All Empirical Papers. Perspectives on Psychological Science, 12(6), 1123–1128. https://doi.org/10.1177/1745691617708630

Observed Type 1 Error Rates (Why Statistical Models are Not Reality)

“In the long run we are all dead.” – John Maynard Keynes
When we perform hypothesis tests in a Neyman-Pearson framework we want to make decisions while controlling the rate at which we make errors. We do this in part by setting an alpha level that guarantees we will not say there is an effect when there is no effect more than α% of the time, in the long run.
I like my statistics applied. And in practice I don’t do an infinite number of studies. As Keynes astutely observed, I will be dead before then. So when I control the error rate for my studies, what is a realistic Type 1 error rate I will observe in the ‘somewhat longer run’?
Let’s assume you publish a paper that contains only a single p-value. Let’s also assume the true effect size is 0, so the null hypothesis is true. Your test will return a p-value smaller than your alpha level (and this would be a Type 1 error) or not. With a single study, you don’t have the granularity to talk about a 5% error rate.

In experimental psychology, 30 seems to be a reasonable average for the number of p-values reported in a single paper (http://doi.org/10.1371/journal.pone.0127872). Let’s assume you perform 30 tests in a single paper and every time the null is true (even though this is often unlikely in a real paper). In the long run, with an alpha level of 0.05 we can expect that 30 × 0.05 = 1.5 p-values will be significant. But in a real set of 30 p-values there is no such thing as half a significant result, so you will observe 0, 1, 2, 3, 4, 5, or even more Type 1 errors, which equals an observed error rate of 0%, 3.33%, 6.67%, 10%, 13.33%, 16.67%, or even more. We can plot the frequency of these Type 1 error rates for 1 million simulated sets of 30 tests.
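A quick way to see this distribution is to simulate it yourself. The sketch below is in Python using only the standard library (an assumption on my part; substitute your language of choice), and draws 100,000 sets rather than the full million, to keep the runtime short:

```python
import random
from collections import Counter

random.seed(2020)  # for reproducibility
alpha, n_tests, n_sets = 0.05, 30, 100_000

# For each simulated set, count how many of the 30 tests come out
# 'significant' purely by chance (the null is true for every test).
type1_counts = Counter(
    sum(random.random() < alpha for _ in range(n_tests))
    for _ in range(n_sets)
)

# Tabulate each possible observed Type 1 error rate and how often it occurs.
for k in sorted(type1_counts):
    print(f"{k:2d} errors ({k / n_tests:6.2%} observed rate): "
          f"{type1_counts[k] / n_sets:.3f} of sets")
```

Plotting `type1_counts` as a bar chart reproduces the kind of figure described in the text.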

Each of these error rates occurs with a certain frequency. 21.5% of the time, you will not make any Type 1 errors. 12.7% of the time, you will make 3 Type 1 errors in 30 tests. The average over thousands of papers reporting 30 tests will be a Type 1 error rate of 5%, but no single set of studies is average.
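These frequencies need not be simulated at all: they follow directly from the binomial distribution. A minimal check (again a Python sketch, standard library only):

```python
from math import comb

def p_type1_errors(k, n_tests=30, alpha=0.05):
    """Binomial probability of exactly k Type 1 errors in n_tests
    independent tests when every null hypothesis is true."""
    return comb(n_tests, k) * alpha**k * (1 - alpha)**(n_tests - k)

print(p_type1_errors(0))  # ~0.215: no Type 1 errors at all
print(p_type1_errors(3))  # ~0.127: three errors, a 10% observed rate

# The expected number of errors is still 30 * 0.05 = 1.5, i.e. 5% on average.
mean_errors = sum(k * p_type1_errors(k) for k in range(31))
print(mean_errors)        # ~1.5
```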

Now maybe a single paper with 30 tests is not ‘long runnerish’ enough. What we really want to control is the Type 1 error rate of the literature: past, present, and future. Except we will never read the entire literature. So let’s assume we are interested in a meta-analysis’ worth of 200 studies that examine a topic where the true effect size is 0 for each test. We can plot the frequency of Type 1 error rates for 1 million simulated sets of 200 tests.
 


Now things start to look a bit more like what you would expect. The Type 1 error rate you will observe in your set of 200 tests is close to 5%. However, it is almost exactly as likely that the observed Type 1 error rate is 4.5%. Ninety percent of the distribution of observed error rates will lie between 0.025 and 0.075. So even in a ‘somewhat longrunnish’ 200 tests, the observed Type 1 error rate will rarely be exactly 5%, and it might be more useful to think of it as lying between 2.5% and 7.5%.
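The same binomial arithmetic backs up these numbers for 200 tests (once more a Python sketch with only the standard library; the exact cutoffs for the central interval are my own illustration):

```python
from math import comb

def p_errors(k, n=200, alpha=0.05):
    """Probability of exactly k Type 1 errors in n tests, all nulls true."""
    return comb(n, k) * alpha**k * (1 - alpha)**(n - k)

# Observing exactly a 5% error rate (10 errors) is barely more likely
# than observing a 4.5% rate (9 errors).
print(p_errors(10))  # ~0.128
print(p_errors(9))   # ~0.127

# Observed rates between 2.5% and 7.5% correspond to 5 through 15 errors;
# this range captures a bit more than 90% of the distribution.
coverage = sum(p_errors(k) for k in range(5, 16))
print(coverage)      # ~0.93
```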

Statistical models are not reality.

A 5% error rate exists only in the abstract world of infinite repetitions, and you will not live long enough to perform an infinite number of studies. In practice, if you (or a group of researchers examining a specific question) do real research, the error rates will be somewhere in the vicinity of 5%. Everything computed from samples drawn from a larger population shows variation – error rates are no exception.
When we quantify things, there is a tendency to get lost in digits. But in practice, the levels of random noise we can reasonably expect quickly overwhelm everything from about the third digit after the decimal onward. I know we can compute the alpha level after a Pocock correction for two looks at the data in a sequential analysis as 0.0294. But this is not the level of granularity we should have in mind when we think of the error rate we will observe in real lines of research. When we control our error rates, we do so with the goal of ending up somewhere reasonably low after a decent number of hypotheses have been tested. Whether we end up observing 2.5% Type 1 errors or 7.5%: potato, potahto.
This does not mean we should stop quantifying precisely where precise quantification is possible, but we should realize what we actually get from the statistical procedures we use. We don’t get a 5% Type 1 error rate in any real set of studies we will perform. Statistical inferences guide us roughly to where we would ideally like to end up. By all means calculate exact numbers where you can, and strictly adhere to hard thresholds to prevent yourself from fooling yourself too often. But maybe in 2020 we can learn to appreciate that statistical inferences are always a bit messy. Do the best you reasonably can, but don’t expect perfection. In 2020, and in statistics.

Code
For a related paper on how alpha levels in practical situations cannot be exactly 5%, see https://psyarxiv.com/erwvk/ by Casper Albers.