Jerzy Neyman: A Positive Role Model in the History of Frequentist Statistics

Many of the facts in this blog post come from the biography ‘Neyman’ by Constance Reid. I highly recommend reading this book if you find this blog interesting.

In recent years researchers have become increasingly interested in the relationship between eugenics and statistics, especially focusing on the lives of Francis Galton, Karl Pearson, and Ronald Fisher. Some have gone as far as to argue for a causal relationship between eugenics and frequentist statistics. For example, in a recent book ‘Bernoulli’s Fallacy’, Aubrey Clayton speculates that Fisher’s decision to reject prior probabilities and embrace a frequentist approach was “also at least partly political”. Rejecting prior probabilities, Clayton argues, makes science seem more ‘objective’, which would have helped Ronald Fisher and his predecessors to establish eugenics as a scientific discipline, despite the often-racist conclusions eugenicists reached in their work.

When I was asked to review an early version of Clayton’s book for Columbia University Press, I found the main narrative rather unconvincing, and the presented history of frequentist statistics too one-sided. Authors who link statistics to problematic political views often do not mention equally important figures in the history of frequentist statistics who were in all ways the opposite of Ronald Fisher. In this blog post, I want to briefly discuss the work and life of Jerzy Neyman, for two reasons.


Jerzy Neyman (image from https://statistics.berkeley.edu/people/jerzy-neyman)

First, the focus on Fisher’s role in the history of frequentist statistics is surprising, given that the dominant approach to frequentist statistics used in many scientific disciplines is the Neyman-Pearson approach. If you have ever rejected a null hypothesis because a p-value was smaller than an alpha level, or if you have performed a power analysis, you have used the Neyman-Pearson approach to frequentist statistics, and not the Fisherian approach. Neyman and Fisher disagreed vehemently about their statistical philosophies (in 1961 Neyman published an article titled ‘Silver Jubilee of My Dispute with Fisher’), but it was Neyman’s philosophy that won out and became the default approach to hypothesis testing in most fields[i]. Anyone discussing the history of frequentist hypothesis testing should therefore seriously engage with the work of Jerzy Neyman and Egon Pearson. Their work was not in line with the views of Karl Pearson, Egon’s father, nor with the views of Fisher. Indeed, it was a great source of satisfaction to Neyman that their seminal 1933 paper was presented to the Royal Society by Karl Pearson, who was hostile to and skeptical of the work, and (as Neyman thought) reviewed by Fisher[ii], who strongly disagreed with their philosophy of statistics.

Second, Jerzy Neyman was also the opposite of Fisher in his political viewpoints. Instead of promoting eugenics, Neyman worked to improve the position of those less privileged throughout his life, teaching disadvantaged people in Poland, and creating educational opportunities for Americans at UC Berkeley. He hired David Blackwell, who was the first Black tenured faculty member at UC Berkeley. This is important, because it falsifies the idea put forward by Clayton[iii] that frequentist statistics became the dominant approach in science because the most important scientists who worked on it wanted to pretend their dubious viewpoints were based on ‘objective’ scientific methods.

I think it is useful to broaden the discussion of the history of statistics, beyond the work by Fisher and Karl Pearson, and credit the work of others[iv] who contributed in at least as important ways to the statistics we use today. I am continually surprised by how few people working outside of statistics even know the name of Jerzy Neyman, even though they regularly use his insights when testing hypotheses. In this blog, I will try to describe his work and life to add some balance to the history of statistics that most people seem to learn about. And more importantly, I hope Jerzy Neyman can be a positive role model for young frequentist statisticians, who might so far have only been educated about the life of Ronald Fisher.


Neyman’s personal life


Neyman was born in 1894 in Russia, but raised in Poland. After attending the gymnasium, he studied at the University of Kharkov. He initially tried to become an experimental physicist, but he was too clumsy with his hands and switched to conceptual mathematics, completing his undergraduate studies in 1917 in politically tumultuous times. In 1919 he met his wife, and they married in 1920. Ten days later, because of the war between Russia and Poland, Neyman was imprisoned for a short time, and in 1921 he fled to a small village to avoid being arrested again, obtaining food by teaching the children of farmers. He worked for the Agricultural Institute, and then at the University in Warsaw. He obtained his doctoral degree in 1924 at age 30. In September 1925 he was sent to London for a year to learn about the latest developments in statistics from Karl Pearson himself. It is here that he met Egon Pearson, Karl’s son, and a friendship and scientific collaboration started.

Neyman always spent a lot of time teaching, often at the expense of doing scientific work. He was involved in equal opportunity education in 1918 in Poland, teaching in dimly lit classrooms where the rag he used to wipe the blackboard would sometimes freeze. He always had a weak spot for intellectuals from ‘disadvantaged’ backgrounds. He and his wife were themselves very poor until he moved to UC Berkeley in 1938. In 1929, back in Poland, his wife became ill due to their bad living conditions, and the doctor who came to examine her was so struck by their miserable living conditions that he offered to let the couple stay in his house, for the same rent they were paying, while he visited France for six months. In his letters to Egon Pearson from this time, Neyman often complained that the struggle for existence took all his time and energy, and that he could not do any scientific work.

Even much later in his life, in 1978, he kept in mind that many people have very little money, and he called ahead to restaurants to make sure a dinner before a seminar would not cost too much for the students. It is perhaps no surprise that most of his students (and he had many) talk about Neyman with a lot of appreciation. He wasn’t perfect (for example, Erich Lehmann – one of Neyman’s students – remarks that he was no longer allowed to teach a class after Lehmann’s notes, building on but extending the work by Neyman, became extremely popular – suggesting Neyman was no stranger to envy). But his students were extremely positive about the atmosphere he created in his lab. For example, job applicants were told around 1947 that “there is no discrimination on the basis of age, sex, or race … authors of joint papers are always listed alphabetically.”

Neyman himself often suffered discrimination, sometimes because of his difficulty mastering the English language, sometimes for being Polish (when in Paris a piece of clothing, an ermine wrap, was stolen from their room, the police responded “What can you expect – only Poles live there!”), sometimes because he did not believe in God, and sometimes because his wife was Russian and very emancipated (living independently in Paris as an artist). He was fiercely against discrimination. In 1933, as anti-Semitism was on the rise among students at the university where he worked in Poland, he complained to Egon Pearson in a letter that the students were behaving toward Jews as Americans do toward people of color. In 1941 at UC Berkeley he hired women at a time when it was not easy for a woman to get a job in mathematics.

In 1942, Neyman examined the possibility of hiring David Blackwell, a Black statistician, then still a student. Neyman met him in New York (so that Blackwell did not need to travel to Berkeley at his own expense) and considered Blackwell the best candidate for the job. The wife of a mathematics professor (who was born in the south of the US) learned about the possibility that a Black statistician might be hired and warned she would not invite a Black man to her house, and there was enough concern about the effect the hire would have on the department that Neyman could not make an offer to Blackwell. He was able to get Blackwell to Berkeley in 1953 as a visiting professor, and offered him a tenured job in 1954, making David Blackwell the first Black tenured faculty member at the University of California, Berkeley. And Neyman did this even though Blackwell was a Bayesian[v] ;).

In 1963, Neyman travelled to the south of the US and for the first time directly experienced segregation. Back in Berkeley, a letter was written with a request for contributions for the Southern Christian Leadership Conference (founded by Martin Luther King, Jr. and others); 4000 copies were printed and shared with colleagues at the university and friends around the country, which brought in more than $3000. He wrote a letter to his friend Harald Cramér saying that he believed Martin Luther King, Jr. deserved a Nobel Peace Prize (which Cramér forwarded to the chairman of the Nobel Committee, and which he believed might have contributed at least a tiny bit to the fact that Martin Luther King, Jr. was awarded the Nobel Peace Prize a year later). Neyman also worked towards the establishment of a Special Scholarships Committee at UC Berkeley with the goal of providing educational opportunities to disadvantaged Americans.

Neyman was not a pacifist. During the Second World War he actively looked for ways he could contribute to the war effort. He was involved in statistical models to compute the optimal spacing of bombs dropped by planes to clear a path across a beach of land mines. (When at a certain moment he needed specifics about the beach, a representative from the military who was not allowed to directly provide this information asked if Neyman had ever been to the seashore in France; Neyman replied he had been to Normandy, and the representative answered “Then use that beach!”). But Neyman actively opposed the Vietnam War early on, despite the risk of losing lucrative contracts the Statistical Laboratory had with the Department of Defense. In 1964 he joined a group of people who bought advertisements in local newspapers with a picture of a napalmed Vietnamese child and the quote “The American people will bluntly and plainly call it murder”.


A positive role model


It is important to know the history of a scientific discipline. Histories are complex, and we should resist overly simplistic narratives. If your teacher explains frequentist statistics to you, it is good if they highlight that someone like Fisher had questionable ideas about eugenics. But the early developments in frequentist statistics involved many researchers beyond Fisher, and, luckily, there are many more positive role-models that also deserve to be mentioned – such as Jerzy Neyman. Even though Neyman’s philosophy on statistical inferences forms the basis of how many scientists nowadays test hypotheses, his contributions and personal life are still often not discussed in histories of statistics – an oversight I hope the current blog post can somewhat mitigate. If you want to learn more about the history of statistics through Neyman’s personal life, I highly recommend the biography of Neyman by Constance Reid, which was the source for most of the content of this blog post.

 



[i] See Hacking, 1965: “The mature theory of Neyman and Pearson is very nearly the received theory on testing statistical hypotheses.”

[ii] It turns out, in the biography, that it was not Fisher, but A. C. Aitken, who reviewed the paper positively.

[iii] Clayton’s book seems to be mainly intended as an attempt to persuade readers to become a Bayesian, and not as an accurate analysis of the development of frequentist statistics.

[iv] William Gosset (or ‘Student’, from ‘Student’s t-test’), who was the main inspiration for the work by Neyman and Pearson, is another giant in frequentist statistics who does not in any way fit into the narrative that frequentist statistics is tied to eugenics, as his statistical work was motivated by applied research questions in the Guinness brewery. Gosset was a modest man – which is probably why he rarely receives the credit he is due.

[v] When asked about his attitude towards Bayesian statistics in 1979, he answered: “It does not interest me. I am interested in frequencies.” He did note that multiple legitimate approaches to statistics exist, and that the choice one makes is largely a matter of personal taste. Neyman opposed subjective Bayesian statistics because their use could lead to bad decision procedures, but he was very positive about the later work by Wald, which inspired Bayesian statistical decision theory.

 

What’s a family in family-wise error control?

When you perform multiple comparisons in a study, you need to adjust your alpha level to control the error rate across those comparisons. It is generally recommended to control the family-wise error rate, but there is some confusion about what a ‘family’ is. As Bretz, Hothorn, & Westfall (2011) write in their excellent book “Multiple Comparisons Using R” on page 15: “The appropriate choice of null hypotheses being of primary interest is a controversial question. That is, it is not always clear which set of hypotheses should constitute the family H1,…,Hm. This topic has often been in dispute and there is no general consensus.” In one of the best papers on controlling for multiple comparisons out there, Bender & Lange (2001) write: “Unfortunately, there is no simple and unique answer to when it is appropriate to control which error rate. Different persons may have different but nevertheless reasonable opinions. In addition to the problem of deciding which error rate should be under control, it has to be defined first which tests of a study belong to one experiment.” The Wikipedia page on family-wise error rate is a mess.

I will be honest: I have never understood this confusion about what a family of tests is when controlling the family-wise error rate. At least not in a Neyman-Pearson approach to hypothesis testing, where the goal is to use data to make decisions about how to act. Neyman (1957) calls his approach inductive behavior. The outcome of an experiment leads one to take different possible actions, which can be either practical (e.g., implement a new procedure, abandon a research line) or scientific (e.g., claim there is or is no effect). From an error-statistical approach (Mayo, 2018), inflated Type 1 error rates mean that it has become very likely that you will be able to claim support for your hypothesis, even when the hypothesis is wrong. This reduces the severity of the test. To prevent this, we need to control our error rate at the level of our claim.
One reason the issue of family-wise error rates might remain vague is that researchers are often vague about their claims: we do not specify our hypotheses unambiguously. To be honest, I suspect another reason there is a continuing debate about whether and how to lower the alpha level to control for multiple comparisons in some disciplines is that 1) there are a surprisingly large number of papers written on this topic that argue you do not need to control for multiple comparisons, which are 2) cited a huge number of times, giving rise to the feeling that surely they must have a point. Regrettably, the main reason these papers are written is that there are people who don’t think a Neyman-Pearson approach to hypothesis testing is a good idea, and the main reason these papers are cited is that doing so is convenient for researchers who want to publish statistically significant results, as they can justify why they are not lowering their alpha level, making that p = 0.02 in one of three tests really ‘significant’. All papers that argue against the need to control for multiple comparisons when testing hypotheses are wrong. Yes, their existence and massive citation counts frustrate me. It is fine not to test a hypothesis, but when you do, and you make a claim based on a test, you need to control your error rates.

But let’s get back to our first problem, which we can solve by making the claims people need to control Type 1 error rates for less vague. Lisa DeBruine and I recently proposed machine readable hypothesis tests to remove any ambiguity in the tests we will perform to examine statistical predictions, and when we will consider a claim corroborated or falsified. In this post, I am going to use our R package ‘scienceverse’ to clarify what constitutes a family of tests when controlling the family-wise error rate.

An example of formalizing family-wise error control

Let’s assume we collect data from 100 participants in a control and treatment condition. We collect 3 dependent variables (dv1, dv2, and dv3). In the population there is no difference between groups on any of these three variables (the true effect size is 0). We will analyze the three dv’s in independent t-tests. This requires specifying our alpha level, and thus deciding whether we need to correct for multiple comparisons. How we control error rates depends on the claim we want to make.
We might want to act as if (or claim that) our treatment works if there is a difference between the treatment and control conditions on any of the three variables. In scienceverse terms, this means we consider the prediction corroborated when the p-value of the first t-test is smaller than the alpha level, the p-value of the second t-test is smaller than the alpha level, or the p-value of the third t-test is smaller than the alpha level. In the scienceverse code, we specify a criterion for each test (a p-value smaller than the alpha level, p.value < alpha_level) and conclude the hypothesis is corroborated if any of these criteria is met (“p_t_1 | p_t_2 | p_t_3”).
We could also want to make three different predictions. Instead of one hypothesis (“something will happen”) we have three different hypotheses, and predict there will be an effect on dv1, dv2, and dv3. The criterion for each t-test is the same, but we now have three hypotheses to evaluate (H1, H2, and H3). Each of these claims can be corroborated, or not.
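Before turning to scienceverse, the two decision rules can be made concrete with a minimal plain-R sketch (this is not scienceverse code; the per-condition sample size and the seed are arbitrary choices for the illustration):

# Simulate one experiment: three dvs, true effect size of 0 in both conditions
set.seed(2020)
n <- 100        # participants per condition (assumed for this illustration)
alpha <- 0.05

p <- sapply(1:3, function(i) t.test(rnorm(n), rnorm(n))$p.value)

# Scenario 1: a single claim ("the treatment works"), corroborated
# if any of the three tests is significant: the "p_t_1 | p_t_2 | p_t_3" rule
any(p < alpha)

# Scenario 2: three separate claims, one per dv, each evaluated on its own
p < alpha       # a TRUE/FALSE verdict for each of H1, H2, and H3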
Scienceverse allows you to specify your hypothesis tests unambiguously (for code used in this blog, see the bottom of the post). It also allows you to simulate a dataset, which we can use to examine Type 1 errors by simulating data where no true effects exist. Finally, scienceverse allows you to run the pre-specified analyses on the (simulated) data, and will automatically create a report that summarizes which hypotheses were corroborated (which is useful when checking if the conclusions in a manuscript indeed follow from the preregistered analyses, or not). The output for a single simulated dataset, for the scenario where we will interpret any effect on the three dv’s as support for the hypothesis, looks like this:

Evaluation of Statistical Hypotheses

12 March, 2020

Simulating Null Effects Postregistration

Results

Hypothesis 1: H1

Something will happen

  • p_t_1 is confirmed if analysis ttest_1 yields p.value<0.05

    The result was p.value = 0.452 (FALSE)

  • p_t_2 is confirmed if analysis ttest_2 yields p.value<0.05

    The result was p.value = 0.21 (FALSE)

  • p_t_3 is confirmed if analysis ttest_3 yields p.value<0.05

    The result was p.value = 0.02 (TRUE)

Corroboration ( TRUE )

The hypothesis is corroborated if anything is significant.

 p_t_1 | p_t_2 | p_t_3 

Falsification ( FALSE )

The hypothesis is falsified if nothing is significant.

 !p_t_1 & !p_t_2 & !p_t_3 

All criteria were met for corroboration.

We see the hypothesis that ‘something will happen’ is corroborated, because there was a significant difference on dv3 – even though this was a Type 1 error, since we simulated data with a true effect size of 0 – and any difference was taken as support for the prediction. With a 5% alpha level, we will observe 1-(1-0.05)^3 = 14.26% Type 1 errors in the long run. This Type 1 error inflation can be prevented by lowering the alpha level, for example by a Bonferroni correction (0.05/3), after which the expected Type 1 error rate is 4.92% (see Bretz et al., 2011, for more advanced techniques to control error rates). When we examine the report for the second scenario, where each dv tests a unique hypothesis, we get the following output from scienceverse:

Evaluation of Statistical Hypotheses

12 March, 2020

Simulating Null Effects Postregistration

Results

Hypothesis 1: H1

dv1 will show an effect

  • p_t_1 is confirmed if analysis ttest_1 yields p.value<0.05

    The result was p.value = 0.452 (FALSE)

Corroboration ( FALSE )

The hypothesis is corroborated if dv1 is significant.

 p_t_1 

Falsification ( TRUE )

The hypothesis is falsified if dv1 is not significant.

 !p_t_1 

All criteria were met for falsification.

Hypothesis 2: H2

dv2 will show an effect

  • p_t_2 is confirmed if analysis ttest_2 yields p.value<0.05

    The result was p.value = 0.21 (FALSE)

Corroboration ( FALSE )

The hypothesis is corroborated if dv2 is significant.

 p_t_2 

Falsification ( TRUE )

The hypothesis is falsified if dv2 is not significant.

 !p_t_2 

All criteria were met for falsification.

Hypothesis 3: H3

dv3 will show an effect

  • p_t_3 is confirmed if analysis ttest_3 yields p.value<0.05

    The result was p.value = 0.02 (TRUE)

Corroboration ( TRUE )

The hypothesis is corroborated if dv3 is significant.

 p_t_3 

Falsification ( FALSE )

The hypothesis is falsified if dv3 is not significant.

 !p_t_3 

All criteria were met for corroboration.

We now see that two hypotheses were falsified (yes, yes, I know you should not use p > 0.05 to falsify a prediction in real life, and this part of the example is formally wrong so I don’t also have to explain equivalence testing to readers not familiar with it – if that is you, read this, and know scienceverse will allow you to specify an equivalence test as the criterion to falsify a prediction, see the example here). The third hypothesis is corroborated, even though, as above, this is a Type 1 error.

It might seem that the second approach, specifying each dv as its own hypothesis, is the way to go if you do not want to lower the alpha level to control for multiple comparisons. But take a look at the report of the study you have performed. You have made 3 predictions, of which 1 was corroborated. That is not an impressive success rate. Sure, mixed results happen, and you should interpret results not just based on the p-value, but also on the strength of the experimental design, assumptions about power, your prior, the strength of the theory, etc. But if these predictions were derived from the same theory, this set of results is not particularly impressive. Since researchers can never selectively report only those results that ‘work’, because this would be a violation of the code of research integrity, we should always be able to see the meager track record of predictions. If you don’t feel ready to make a specific prediction (and run the risk of sullying your track record), either do unplanned exploratory tests, and do not make claims based on their results, or preregister all possible tests you can think of, and massively lower your alpha level to control error rates (for example, genome-wide association studies sometimes use an alpha level of 5 × 10^-8 to control the Type 1 error rate).

Hopefully, specifying our hypotheses (and what would corroborate them) transparently by using scienceverse makes it clear what happens in the long run in both scenarios. In the long run, both the first scenario, if we use an alpha level of 0.05/3 instead of 0.05, and the second scenario, with an alpha level of 0.05 for each individual hypothesis, will lead to the same end result: no more than 5% of our claims will be wrong when the null hypotheses are true. In the first scenario, we are making one claim in an experiment, and in the second we make three. In the second scenario we will end up with more false claims in an absolute sense, but the relative number of false claims is the same in both scenarios. And that’s exactly the goal of family-wise error control.
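A small simulation in base R (not scienceverse; the number of simulated experiments and the per-condition sample size are arbitrary) illustrates this long-run behavior when all null hypotheses are true:

set.seed(42)
n_sim <- 10000   # number of simulated experiments
n     <- 50      # participants per condition
alpha <- 0.05

# 3 x n_sim matrix of p-values: three t-tests per experiment, true effect size 0
p <- replicate(n_sim, sapply(1:3, function(i) t.test(rnorm(n), rnorm(n))$p.value))

# Scenario 1: one claim per experiment, tested with a Bonferroni-corrected alpha
mean(colSums(p < alpha / 3) > 0)   # ~0.049: matches 1 - (1 - 0.05/3)^3

# Scenario 2: three claims per experiment, each tested at alpha = 0.05
sum(p < alpha) / (3 * n_sim)       # ~0.05: false claims among all claims made

# For comparison: one "anything goes" claim per experiment without correction
mean(colSums(p < alpha) > 0)       # ~0.143: matches 1 - (1 - 0.05)^3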
References
Bender, R., & Lange, S. (2001). Adjusting for multiple testing—When and how? Journal of Clinical Epidemiology, 54(4), 343–349.
Bretz, F., Hothorn, T., & Westfall, P. H. (2011). Multiple comparisons using R. CRC Press.
Mayo, D. G. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge University Press.
Neyman, J. (1957). “Inductive Behavior” as a Basic Concept of Philosophy of Science. Revue de l’Institut International de Statistique / Review of the International Statistical Institute, 25(1/3), 7. https://doi.org/10.2307/1401671

Thanks to Lisa DeBruine for feedback on an earlier draft of this blog post.


Observed Type 1 Error Rates (Why Statistical Models are Not Reality)

“In the long run we are all dead.” – John Maynard Keynes
When we perform hypothesis tests in a Neyman-Pearson framework we want to make decisions while controlling the rate at which we make errors. We do this in part by setting an alpha level that guarantees we will not say there is an effect when there is no effect more than α% of the time, in the long run.
I like my statistics applied. And in practice I don’t do an infinite number of studies. As Keynes astutely observed, I will be dead before then. So when I control the error rate for my studies, what is a realistic Type 1 error rate I will observe in the ‘somewhat longer run’?
Let’s assume you publish a paper that contains only a single p-value. Let’s also assume the true effect size is 0, so the null hypothesis is true. Your test will return a p-value smaller than your alpha level (and this would be a Type 1 error) or not. With a single study, you don’t have the granularity to talk about a 5% error rate.

In experimental psychology 30 seems to be a reasonable average for the number of p-values that are reported in a single paper (http://doi.org/10.1371/journal.pone.0127872). Let’s assume you perform 30 tests in a single paper and every time the null is true (even though this is often unlikely in a real paper). In the long run, with an alpha level of 0.05 we can expect that 30 * 0.05 = 1.5 p-values will be significant. But in real sets of 30 p-values there is no half of a p-value, so you will either observe 0, 1, 2, 3, 4, 5, or even more Type 1 errors, which corresponds to observed error rates of 0%, 3.33%, 6.67%, 10%, 13.33%, 16.67%, or even more. We can plot the frequency of Type 1 error rates for 1 million sets of 30 tests.

Each of these error rates occurs with a certain frequency. 21.5% of the time, you will not make any Type 1 errors. 12.7% of the time, you will make 3 Type 1 errors in 30 tests. The average over thousands of papers reporting 30 tests will be a Type 1 error rate of 5%, but no single set of studies is average.

Now maybe a single paper with 30 tests is not ‘long runnerish’ enough. What we really want to control the Type 1 error rate of is the literature, past, present, and future. Except, we will never read the entire literature. So let’s assume we are interested in a meta-analysis’ worth of 200 studies that examine a topic where the true effect size is 0 for each test. We can plot the frequency of Type 1 error rates for 1 million sets of 200 tests.
 


Now things start to look a bit more like what you would expect. The Type 1 error rate you will observe in your set of 200 tests is close to 5%. However, it is almost exactly as likely that the observed Type 1 error rate is 4.5%. 90% of the distribution of observed alpha levels will lie between 0.025 and 0.075. So, even in ‘somewhat longrunnish’ 200 tests, the observed Type 1 error rate will rarely be exactly 5%, and it might be more useful to think about it as being between 2.5 and 7.5%.

Statistical models are not reality.

A 5% error rate exists only in the abstract world of infinite repetitions, and you will not live long enough to perform an infinite number of studies. In practice, if you (or a group of researchers examining a specific question) do real research, the error rates are somewhere in the range of 5%. Everything has variation in samples drawn from a larger population – error rates are no exception.
When we quantify things, there is the tendency to get lost in digits. But in practice, the level of random noise we can reasonably expect quickly overwhelms everything from about the third digit after the decimal onward. I know we can compute the alpha level after a Pocock correction for two looks at the data in sequential analyses as 0.0294. But this is not the level of granularity that we should have in mind when we think of the error rate we will observe in real lines of research. When we control our error rates, we do so with the goal to end up somewhere reasonably low, after a decent number of hypotheses have been tested. Whether we end up observing 2.5% Type 1 errors or 7.5% errors: potato, potahto.
This does not mean we should stop quantifying numbers precisely when they can be quantified precisely, but we should realize what we get from the statistical procedures we use. We don’t get a 5% Type 1 error rate in any real set of studies we will actually perform. Statistical inferences guide us roughly to where we would ideally like to end up. By all means calculate exact numbers where you can. Strictly adhere to hard thresholds to prevent yourself from being fooled too often. But maybe in 2020 we can learn to appreciate that statistical inferences are always a bit messy. Do the best you reasonably can, but don’t expect perfection. In 2020, and in statistics.

For a related paper on alpha levels that in practical situations cannot be 5%, see https://psyarxiv.com/erwvk/ by Casper Albers.

Code
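A minimal base-R sketch of the simulations described above (1 million sets of tests, each test of a true null hypothesis at an alpha level of 0.05; the exact code behind the figures may differ):

set.seed(1)
n_sim   <- 1e6    # number of simulated sets of tests
n_tests <- 30     # tests per set; use 200 for the meta-analysis example
alpha   <- 0.05

# number of Type 1 errors in each set of tests
errors <- rbinom(n_sim, size = n_tests, prob = alpha)

# frequency of each observed Type 1 error rate
round(table(errors) / n_sim, 3)    # e.g. ~0.215 for 0 errors, ~0.127 for 3 errors

barplot(table(errors) / n_sim,
        xlab = "Number of Type 1 errors per set",
        ylab = "Proportion of simulated sets")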

Do You Really Want to Test a Hypothesis?

I’ve uploaded one of my favorite lectures from my new MOOC “Improving Your Statistical Questions” to YouTube. It asks the question whether you really want to test a hypothesis. A hypothesis test is a very specific tool to answer a very specific question. I like hypothesis tests, because in experimental psychology it is common to perform lines of research where you can design a bunch of studies that test simple predictions about the presence or absence of differences on some measure. I think they have a role to play in science. I also think hypothesis testing is widely overused. As we are starting to do hypothesis tests better (e.g., by preregistering our predictions and controlling our error rates in more severe tests), I predict many people will start to feel a bit squeamish as they become aware that doing hypothesis tests as they were originally designed to be used isn’t really what they want in their research. One of the often overlooked gains in teaching people how to do something well is that they finally realize that they actually don’t want to do it.
The lecture “Do You Really Want to Test a Hypothesis” aims to explain which question a hypothesis test asks, and discusses when a hypothesis test answers a question you are interested in. It is very easy to say what not to do, or to point out what is wrong with statistical tools. Statistical tools are very limited, even under ideal circumstances. It’s more difficult to say what you can do. If you follow my work, you know that this latter question is what I spend my time on. Instead of telling you optional stopping can’t be done because it is p-hacking, I explain how you can do it correctly through sequential analysis. Instead of telling you it is wrong to conclude the absence of an effect from p > 0.05, I explain how to use equivalence testing. Instead of telling you p-values are the devil, I explain how they answer a question you might be interested in when used well. Instead of saying preregistration is redundant, I explain from which philosophy of science preregistration has value. And instead of saying we should abandon hypothesis tests, I try to explain in this video how to use them wisely. This is all part of my ongoing #JustifyEverything educational tour. I think it is a reasonable expectation that researchers should be able to answer at least a simple ‘why’ question if you ask why they use a specific tool, or use a tool in a specific manner.
This might help to move beyond the simplistic discussion I often see about these topics. If you ask me if I prefer frequentist or Bayesian statistics, or confirmatory or exploratory research, I am most likely to respond ‘mu’ (see Wikipedia). It is tempting to think about these topics in a polarized either-or mindset – but then you would miss asking the real questions. When would any approach give you meaningful insights? Just as not every hypothesis test is an answer to a meaningful question, so will not every exploratory study provide interesting insights. The most important question to ask yourself when you plan a study is ‘when will the tools you use lead to interesting insights’? In the second week of my MOOC I discuss when effects in hypothesis tests could be deemed meaningful, but the same question applies to exploratory or descriptive research. Not all exploration is interesting, and we don’t want to simply describe every property of the world. Again, it is easy to dismiss any approach to knowledge generation, but it is so much more interesting to think about which tools will lead to interesting insights. And above all, realize that in most research lines, researchers will have a diverse set of questions that they want to answer given practical limitations, and they will need to rely on a diverse set of tools, limitations and all.
In this lecture I try to explain the three limitations of hypothesis tests, and the very specific question they try to answer. If you like to think about how to improve your statistical questions, you might be interested in enrolling in my free MOOC “Improving Your Statistical Questions”.

Justify Your Alpha by Decreasing Alpha Levels as a Function of the Sample Size

A preprint (“Justify Your Alpha: A Primer on Two Practical Approaches”) that extends and improves the ideas in this blog post is available at: https://psyarxiv.com/ts4r6  
 
Testing whether observed data should surprise us, under the assumption that some model of the data is true, is a widely used procedure in psychological science. Tests against a null model, or against the smallest effect size of interest for an equivalence test, can guide your decisions to continue or abandon research lines. Seeing whether a p-value is smaller than an alpha level is rarely the only thing you want to do, but especially early on in experimental research lines where you can randomly assign participants to conditions, it can be a useful thing.

Regrettably, this procedure is performed rather mindlessly. To do Neyman-Pearson hypothesis testing well, you should carefully think about the error rates you find acceptable. How often do you want to miss the smallest effect size you care about, if it is really there? And how often do you want to say there is an effect, but actually be wrong? It is important to justify your error rates when designing an experiment. In this post I will provide one justification for setting the alpha level (an approach we have recommended as making more sense than always using the same fixed alpha level).

Papers explaining how to justify your alpha level are very rare (for an example, see Mudge, Baker, Edge, & Houlahan, 2012). Here I want to discuss one of the least known, but easiest suggestions on how to justify alpha levels in the literature, proposed by Good. The idea is simple, and has been supported by many statisticians in the last 80 years: Lower the alpha level as a function of your sample size.

The idea behind this recommendation is most extensively discussed in a book by Leamer (1978, p. 92). He writes:

The rule of thumb quite popular now, that is, setting the significance level arbitrarily to .05, is shown to be deficient in the sense that from every reasonable viewpoint the significance level should be a decreasing function of sample size.

Leamer (you can download his book for free) correctly notes that this behavior, an alpha level that is a decreasing function of the sample size, makes sense from both a Bayesian and a Neyman-Pearson perspective. Let me explain.

Imagine a researcher who performs a study that has 99.9% power to detect the smallest effect size the researcher is interested in, based on a test with an alpha level of 0.05. Such a study also has 99.8% power when using an alpha level of 0.03. Feel free to follow along here, by setting the sample size to 204, the effect size to 0.5, alpha or p-value (upper limit) to 0.05, and the p-value (lower limit) to 0.03.

We see that if the alternative hypothesis is true only 0.1% of the observed studies will, in the long run, observe a p-value between 0.03 and 0.05. When the null-hypothesis is true 2% of the studies will, in the long run, observe a p-value between 0.03 and 0.05. Note how this makes p-values between 0.03 and 0.05 more likely when there is no true effect, than when there is an effect. This is known as Lindley’s paradox (and I explain this in more detail in Assignment 1 in my MOOC, which you can also do here).
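These numbers can be checked with base R’s power.t.test, assuming a two-sided independent t-test with 204 participants per group and a true effect of d = 0.5:

# Power at an alpha level of 0.05 and of 0.03
pow_05 <- power.t.test(n = 204, delta = 0.5, sd = 1, sig.level = 0.05)$power  # ~0.999
pow_03 <- power.t.test(n = 204, delta = 0.5, sd = 1, sig.level = 0.03)$power  # ~0.998

# If the alternative is true: probability of a p-value between 0.03 and 0.05
pow_05 - pow_03   # ~0.001
# If the null is true: probability of a p-value between 0.03 and 0.05
0.05 - 0.03       # 0.02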

Although you can argue that you are still making a Type 1 error at most 5% of the time in the above situation, I think it makes sense to acknowledge there is something weird about having a Type 1 error rate of 5% when you have a Type 2 error rate of 0.1% (again, see Mudge, Baker, Edge, & Houlahan, 2012, who suggest balancing error rates). To me, it makes sense to design a study where error rates are more balanced, and a significant effect is declared for p-values more likely to occur when the alternative model is true than when the null model is true.

Because power increases as the sample size increases, and because Lindley’s paradox (Lindley, 1957, see also Cousins, 2017) can be prevented by lowering the alpha level sufficiently, the idea to lower the significance level as a function of the sample size is very reasonable. But how?

Zellner (1971) discusses how the critical value for a frequentist hypothesis test approaches a limit as the sample size increases (i.e., a critical value of 1.96 for p = 0.05 in a two-sided test) whereas the critical value for a Bayes factor increases as the sample size increases (see also Rouder, Speckman, Sun, Morey, & Iverson, 2009). This difference lies at the heart of Lindley’s paradox, and under certain assumptions comes down to a factor of √n. As Zellner (1971, footnote 19, page 304) writes (K01 is the formula for the Bayes factor):

If a sampling theorist were to adjust his significance level upward as n grows larger, which seems reasonable, zα would grow with n and tend to counteract somewhat the influence of the √n factor in the expression for K01.

Jeffreys (1939) discusses Neyman and Pearson’s work and writes:

We should therefore get the best result, with any distribution of α, by some form that makes the ratio of the critical value to the standard error increase with n. It appears then that whatever the distribution may be, the use of a fixed P limit cannot be the one that will make the smallest number of mistakes.

He discusses the issue in more detail in Appendix B, where he compares his own test (Bayes factors) against Neyman-Pearson decision procedures, and notes that:

In spite of the difference in principle between my tests and those based on the P integrals, and the omission of the latter to give the increase of the critical values for large n, dictated essentially by the fact that in testing a small departure found from a large number of observations we are selecting a value out of a long range and should allow for selection, it appears that there is not much difference in the practical recommendations. Users of these tests speak of the 5 per cent. point in much the same way as I should speak of the K = 10 point, and of the 1 per cent. point as I should speak of the K = 10^-1 point; and for moderate numbers of observations the points are not very different. At large numbers of observations there is a difference, since the tests based on the integral would sometimes assert significance at departures that would actually give K > 1. Thus there may be opposite decisions in such cases. But they will be very rare.

So even though extremely different conclusions between Bayes factors and frequentist tests will be rare, according to Jeffreys, when the sample size grows, the difference becomes noticeable.

This brings us to Good’s (1982) easy solution. His paper is basically just a single page (I’d love something akin to a Comments, Conjectures, and Conclusions format in Meta-Psychology! – note that Good himself was the editor of that section, which started with ‘Please be succinct but lucid and interesting’, and his paper reads just like a blog post).

He also explains the rationale in Good (1992):

‘we have empirical evidence that sensible P values are related to weights of evidence and, therefore, that P values are not entirely without merit. The real objection to P values is not that they usually are utter nonsense, but rather that they can be highly misleading, especially if the value of N is not also taken into account and is large.’

Based on the observation by Jeffreys (1939) that, under specific circumstances, the Bayes factor against the null-hypothesis is approximately inversely proportional to √N, Good (1982) suggests a standardized p-value to bring p-values in closer relationship with weights of evidence:

pstan = min(0.5, p × √(N/100))

This formula standardizes the p-value to the evidence against the null hypothesis that would be found if pstan were the tail-area probability observed in a sample of 100 participants (I think the formula is only intended for between designs – I would appreciate anyone weighing in in the comments if it can be extended to within-designs). When the sample size is 100, the p-value and pstan are identical. But for larger sample sizes pstan is larger than p. For example, a p = .05 observed in a sample size of 500 would have a pstan of 0.11, which is not enough to reject the null-hypothesis for the alternative. Good (1988) demonstrates great insight when he writes: ‘I guess that standardized p-values will not become standard before the year 2000.’
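In R, a sketch of this standardization, based on the formula above, is a one-liner:

# Good's standardized p-value: pstan = min(0.5, p * sqrt(N / 100))
p_stan <- function(p, N) pmin(0.5, p * sqrt(N / 100))

p_stan(0.05, 100)   # 0.0500: identical to p when N = 100
p_stan(0.05, 500)   # 0.1118: the 0.11 from the example above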


Good doesn’t give a lot of examples of how standardized p-values should be used in practice, but I guess it makes things easier to think about a standardized alpha level (even though the logic is the same, just like you can double the p-value, or halve the alpha level, when you are correcting for 2 comparisons in a Bonferroni correction). So instead of an alpha level of 0.05, we can think of a standardized alpha level:

αstan = α × √(100/N)

Again, with 100 participants α and αstan are the same, but as the sample size increases above 100, the alpha level becomes smaller. For example, an α of .05 used with a sample size of 500 would correspond to an αstan of 0.02236.

So one way to justify your alpha level is by using a decreasing alpha level as the sample size increases. I for one have always thought it was rather nonsensical to use an alpha level of 0.05 in all meta-analyses (especially when testing a meta-analytic effect size based on thousands of participants against zero), or in large collaborative research projects such as Many Labs, where analyses are performed on very large samples. If you have thousands of participants, you have extremely high power for most effect sizes original studies could have detected in a significance test. With such a low Type 2 error rate, why keep the Type 1 error rate fixed at 5%, which is so much larger than the Type 2 error rate in these analyses? It just doesn’t make any sense to me. Alpha levels in meta-analyses or large-scale data analyses should be lowered as a function of the sample size. In case you are wondering: an alpha level of .005 would be used when the sample size is 10,000.

When designing a study based on a specific smallest effect size of interest, where you desire to have decent power (e.g., 90%), we run into a small challenge because in the power analysis we now have two unknowns: the sample size (which is a function of the power, effect size, and alpha), and the standardized alpha level (which is a function of the sample size). Luckily, this is nothing that some R-fu can’t solve with some iterative power calculations. [R code to calculate the standardized alpha level, and perform an iterative power analysis, is at the bottom of the post]
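A minimal sketch of such an iterative calculation, assuming a two-sided independent t-test (base R’s power.t.test), a smallest effect size of interest of d = 0.5, and a standardization based on the per-group sample size (whether total N or per-group n is the right choice is left open here):

# Standardized alpha level: alpha_stan = alpha * sqrt(100 / N)
alpha_stan <- function(alpha, N) alpha * sqrt(100 / N)

d     <- 0.5    # smallest effect size of interest (hypothetical)
power <- 0.90   # desired statistical power
alpha <- 0.05   # unstandardized alpha level
n     <- 50     # starting value for the per-group sample size

for (i in 1:20) {
  a_s   <- alpha_stan(alpha, n)
  n_new <- ceiling(power.t.test(delta = d, sd = 1, sig.level = a_s,
                                power = power)$n)
  if (n_new == n) break   # stop when the sample size has stabilized
  n <- n_new
}
n     # required sample size per group
a_s   # standardized alpha level at that sample size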

When we wrote Justify Your Alpha (I recommend downloading the original draft before peer review because it has more words and more interesting references) one of the criticisms I heard most often was that we gave no solutions for how to justify your alpha. I hope this post makes it clear that statisticians have argued that the alpha level should not be a fixed value ever since significance testing was invented. There are already some solutions available in the literature. I like Good’s approach because it is simple. In my experience, people like simple solutions. It might not be a full-fledged decision theoretical cost-benefit analysis, but it beats using a fixed alpha level. I recently used it in a submission for a Registered Report. At the same time, I think it has never been used in practice, so I look forward to any comments, conjectures, and conclusions you might have.

References

Good, I. J. (1982). C140. Standardized tail-area probabilities. Journal of Statistical Computation and Simulation, 16(1), 65–66. https://doi.org/10.1080/00949658208810607
Good, I. J. (1988). The interface between statistics and philosophy of science. Statistical Science, 3(4), 386–397.
Good, I. J. (1992). The Bayes/Non-Bayes Compromise: A Brief Review. Journal of the American Statistical Association, 87(419), 597. https://doi.org/10.2307/2290192
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x
Leamer, E. E. (1978). Specification searches: Ad hoc inference with nonexperimental data. New York: Wiley.
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. https://doi.org/10.3758/PBR.16.2.225
Zellner, A. (1971). An introduction to Bayesian inference in econometrics. New York: Wiley.



 

Equivalence Testing and the Second Generation P-Value

Recently Blume, D’Agostino McGowan, Dupont, & Greevy (2018) published an article titled: “Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses”. As it happens, I would greatly appreciate more rigor, reproducibility, and transparency in statistical analyses, so my interest was piqued. On Twitter I saw the following slide, promising an updated version of the p-value that can support null-hypotheses, takes practical significance into account, has a straightforward interpretation, and ideally never needs adjustments for multiple comparisons. It sounded like someone had found the goose that lays the golden eggs.

Upon reading the manuscript, I noticed the statistic is surprisingly similar to equivalence testing, which I’ve written about recently and created an R package for (Lakens, 2017). The second generation p-value (SGPV) relies on specifying an equivalence range of values around the null-hypothesis that are practically equivalent to zero (e.g., 0 ± 0.3). If the estimation interval falls completely within the equivalence range, the SGPV is 1. If the confidence interval lies completely outside of the equivalence range, the SGPV is 0. Otherwise the SGPV is a value between 0 and 1 that expresses the overlap of the confidence interval with the equivalence range, divided by the total width of the confidence interval.
Testing whether the confidence interval falls completely within the equivalence bounds is equivalent to the two one-sided tests (TOST) procedure, where the data is tested against the lower equivalence bound in the first one-sided test, and against the upper equivalence bound in the second one-sided test. If both tests allow you to reject effects as extreme as or more extreme than the equivalence bounds, you can reject the presence of an effect large enough to be meaningful, and conclude the observed effect is practically equivalent to zero. You can also simply check if a 90% confidence interval falls completely within the equivalence bounds. Note that testing whether the 95% confidence interval falls completely outside of the equivalence range is known as a minimum-effect test (Murphy, Myors, & Wolach, 2014).
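To make the relationship concrete, here is a minimal base-R sketch of both statistics for a one-sample test, using the 95% confidence interval as the interval estimate for the SGPV and omitting the small-sample correction discussed below:

tost_sgpv <- function(x, lower, upper) {
  n  <- length(x)
  m  <- mean(x)
  se <- sd(x) / sqrt(n)
  df <- n - 1

  # TOST: one-sided t-tests against the lower and the upper equivalence bound
  p_lower <- pt((m - lower) / se, df, lower.tail = FALSE)
  p_upper <- pt((m - upper) / se, df, lower.tail = TRUE)
  p_tost  <- max(p_lower, p_upper)   # equivalence is declared when p_tost < alpha

  # SGPV: proportion of the 95% CI that overlaps with the equivalence range
  ci      <- m + c(-1, 1) * qt(0.975, df) * se
  overlap <- max(0, min(ci[2], upper) - max(ci[1], lower))
  sgpv    <- overlap / (ci[2] - ci[1])

  c(p_tost = p_tost, sgpv = sgpv)
}

# Example: equivalence range of 145 +/- 2, as in Figure 1
set.seed(123)
tost_sgpv(rnorm(100, mean = 144, sd = 2), lower = 143, upper = 147)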
So together with my collaborator Marie Delacre we compared the two approaches, to truly understand how second generation p-values accomplished what they were advertised to do, and what they could contribute to our statistical toolbox.
To examine the relation between the TOST p-value and the SGPV we can calculate both statistics across a range of observed effect sizes. In Figure 1 p-values are plotted for the TOST procedure and the SGPV. The statistics are calculated for hypothetical one-sample t-tests for all means that can be observed in studies ranging from 140 to 150 (on the x-axis). The equivalence range is set to 145 ± 2 (i.e., an equivalence range from 143 to 147), the observed standard deviation is assumed to be 2, and the sample size is 100. The SGPV treats the equivalence range as the null-hypothesis, while the TOST procedure treats the values outside of the equivalence range as the null-hypothesis. For ease of comparison we can reverse the SGPV (by calculating 1-SGPV), which is used in the plot below.
 
 
Figure 1: Comparison of p-values from TOST (black line) and 1-SGPV (dotted grey line) across a range of observed sample means (x-axis) tested against a mean of 145 in a one-sample t-test with a sample size of 30 and a standard deviation of 2.
It is clear the SGPV and the p-value from TOST are very closely related. The situation in Figure 1 is not an exception – in our pre-print we describe how the SGPV and the p-value from the TOST procedure are always directly related when confidence intervals are symmetrical. You can play around with this Shiny app to confirm this for yourself: http://shiny.ieis.tue.nl/TOST_vs_SGPV/.
There are 3 situations where the p-value from the TOST procedure and the SGPV are not directly related. The SGPV is 1 when the confidence interval falls completely within the equivalence bounds. P-values from the TOST procedure continue to differentiate and will for example distinguish between a p = 0.048 and p = 0.002. The same happens when the SGPV is 0 (and p-values fall between 0.975 and 1).
The third situation where the TOST and SGPV differ is when the ‘small sample correction’ is at play in the SGPV. This “correction” kicks in whenever the confidence interval is wider than the equivalence range. However, it is not a correction in the typical sense of the word, since the SGPV is not adjusted to any ‘correct’ value. When the normal calculation would be ‘misleading’ (i.e., the SGPV would be small, which normally would suggest support for the alternative hypothesis, when all values in the equivalence range are also supported), the SGPV is set to 0.5, which according to Blume and colleagues signals that the SGPV is ‘uninformative’. In all three situations the p-value from equivalence tests distinguishes between scenarios where the SGPV yields the same result.
We can examine this situation by calculating the SGPV and performing the TOST for a situation where sample sizes are small and the equivalence range is narrow, such that the CI is more than twice as large as the equivalence range.
 
Figure 2: Comparison of p-values from TOST (black line) and SGPV (dotted grey line) across a range of observed sample means (x-axis). Because the sample size is small (n = 10) and the CI is more than twice as wide as the equivalence range (set to -0.4 to 0.4), the SGPV is set to 0.5 (horizontal light grey line) across a range of observed means.

The main novelty of the SGPV is that it is meant to be used as a descriptive statistic. However, we show that the SGPV is difficult to interpret when confidence intervals are asymmetric, and when the ‘small sample correction’ is operating. For an extreme example, see Figure 3, where the SGPVs are plotted for a correlation (where confidence intervals are asymmetric).

Figure 3: Comparison of p-values from TOST (black line) and 1-SGPV (dotted grey curve) across a range of observed sample correlations (x-axis) tested against equivalence bounds of r = 0.4 and r = 0.8 with n = 10 and an alpha of 0.05.
Even under ideal circumstances, the SGPV is mainly meaningful when it is either 1, 0, or inconclusive (see all examples in Blume et al., 2018). But to categorize your results into one of these three outcomes you don’t need to calculate anything – you can just look at whether the confidence interval falls completely inside of, completely outside of, or overlaps with the equivalence range, and thus the SGPV loses its value as a descriptive statistic.
When discussing the lack of a need for error correction, Blume and colleagues compare the SGPV to null-hypothesis tests. However, the more meaningful comparison is with the TOST procedure, and given the direct relationship, not correcting for multiple comparisons will inflate the probability of concluding the absence of a meaningful effect in exactly the same way as when calculating p-values for an equivalence test. Equivalence tests provide an easier and more formal way to control both the Type 1 error rate (by setting the alpha level) and the Type 2 error rate (by performing an a-priori power analysis, see Lakens, Scheel, & Isager, 2018).
Conclusion
There are strong similarities between p-values from the TOST procedure and the SGPV, and in all situations where the statistics yield different results, the behavior of the p-value from the TOST procedure is more consistent and easier to interpret. More details can be found in our pre-print (where you can also leave comments or suggestions for improvement using hypothes.is). Our comparisons show that when proposing alternatives to null-hypothesis tests, it is important to compare new proposals to already existing procedures. We believe equivalence tests achieve the goals of the second generation p-value while allowing users to more easily control error rates, and while yielding more consistent statistical outcomes.


References
Blume, J. D., D’Agostino McGowan, L., Dupont, W. D., & Greevy, R. A. (2018). Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses. PLOS ONE, 13(3), e0188299. https://doi.org/10.1371/journal.pone.0188299
Lakens, D. (2017). Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science, 2515245918770963. https://doi.org/10.1177/2515245918770963.
Murphy, K. R., Myors, B., & Wolach, A. H. (2014). Statistical power analysis: a simple and general model for traditional and modern hypothesis tests (Fourth edition). New York: Routledge, Taylor & Francis Group.