Before we launch into 2023, a look back at 2022 in The Scholarly Kitchen.
The post The Year in Review: 2022 in The Scholarly Kitchen appeared first on The Scholarly Kitchen.
During a recent workshop on Sample Size Justification, an early career researcher asked me: “You recommend sequential analysis in your paper for when effect sizes are uncertain, where researchers collect data, analyze the data, stop when a test is significant, or continue data collection when a test is not significant. I don’t want to be rude, but isn’t this p-hacking?”
In linguistics there is a term for when children apply a rule they have learned to instances where it does not apply: overregularization. They learn ‘one cow, two cows’, and use the +s rule for plurals where it is not appropriate, such as ‘one mouse, two mouses’ (instead of ‘two mice’). The early career researcher who asked me whether sequential analysis was a form of p-hacking was also overregularizing. We teach young researchers that flexibly analyzing data inflates error rates, is called p-hacking, and is a very bad thing that was one of the causes of the replication crisis. So they apply the rule ‘flexibility in the data analysis is a bad thing’ to cases where it does not apply, such as sequential analyses. Yes, sequential analyses give a lot of flexibility to stop data collection, but they do so while carefully controlling error rates, with the added bonus that they can increase the efficiency of data collection. This makes them a good thing, not p-hacking.
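To see the difference between uncorrected optional stopping and a proper sequential design, a small simulation helps. The sketch below (my own illustration, not from the paper mentioned above) assumes a z-test with known variance, two equally spaced looks at the data, and the standard Pocock-corrected alpha of .0294 per look for an overall 5% error rate across two looks:

```python
import math
import random

def z_test_p(xs):
    """Two-sided p-value for H0: mu = 0, assuming known sigma = 1."""
    n = len(xs)
    z = (sum(xs) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(alpha_per_look, n_sims=20000, n_per_look=50):
    """Proportion of simulations (data generated under H0) that reject
    the null at either the interim look or the final look."""
    rejections = 0
    for _ in range(n_sims):
        batch1 = [random.gauss(0, 1) for _ in range(n_per_look)]
        if z_test_p(batch1) < alpha_per_look:  # interim look: stop if significant
            rejections += 1
            continue
        batch2 = batch1 + [random.gauss(0, 1) for _ in range(n_per_look)]
        if z_test_p(batch2) < alpha_per_look:  # final look at the full sample
            rejections += 1
    return rejections / n_sims

random.seed(1)
naive = simulate(alpha_per_look=0.05)     # 'peeking' at alpha = .05 twice
pocock = simulate(alpha_per_look=0.0294)  # Pocock-corrected alpha for 2 looks
print(f"naive optional stopping: {naive:.3f}")   # inflated above .05 (~.08)
print(f"Pocock sequential test:  {pocock:.3f}")  # controlled at ~.05
```

Peeking twice at an uncorrected alpha of .05 inflates the Type 1 error rate to roughly 8%, while the Pocock-corrected sequential test keeps it at the nominal 5% and still allows early stopping.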
Children increasingly use correct language the longer they are immersed in it. Many researchers are not yet immersed in an academic environment where they see flexibility in the data analysis applied correctly. Many are scared of doing things wrong, which risks making them overly conservative, as the pendulum swings back too far from ‘we are all p-hacking without realizing the consequences’ to ‘all flexibility is p-hacking’. Therefore, I patiently explain during workshops that flexibility is not bad per se, but that making claims without controlling your error rate is problematic.
In a recent episode of the podcast ‘Quantitude’, one of the hosts shared a similar experience about 5 minutes into the episode. A young student remarked that flexibility during the data analysis was ‘unethical’. The remainder of the episode, on ‘researcher degrees of freedom’, discussed how flexibility is part of data analysis. The hosts clearly state that p-hacking is problematic, and that opportunistic motivations to perform analyses that give you what you want to find should be constrained. But they then criticized preregistration in ways many people on Twitter disagreed with. They talk about ‘high priests’ who want to ‘stop bad people from doing bad things’, which they find uncomfortable, and say ‘you can not preregister every contingency’. They remark they would be surprised if data could be analyzed without requiring any on-the-fly judgment.
Although the examples they gave were not very good[1], it is of course true that researchers sometimes need to deviate from an analysis plan. Deviating from an analysis plan is not p-hacking. But when people talk about preregistration, we often see overregularization: “Preregistration requires specifying your analysis plan to prevent inflation of the Type 1 error rate, so deviating from a preregistration is not allowed.” The whole point of preregistration is to transparently allow other researchers to evaluate the severity of a test, both when you stick to the preregistered statistical analysis plan and when you deviate from it. Some researchers have sufficient experience with the research they do that they can preregister an analysis that does not require any deviations[2], and then readers can see that the Type 1 error rate for the study is at the level specified before data collection. Other researchers will need to deviate from their analysis plan because they encounter unexpected data. Some deviations reduce the severity of the test by inflating the Type 1 error rate. But other deviations actually get you closer to the truth. We can not know which is which. A reader needs to form their own judgment about this.
A final example of overregularization comes from a person who discussed a new study that they were preregistering with a junior colleague. They mentioned the possibility of including a covariate in an analysis but thought that was too exploratory to be included in the preregistration. The junior colleague remarked: “But now that we have thought about the analysis, we need to preregister it”. Again, we see an example of overregularization. If you want to control the Type 1 error rate in a test, preregister it, and follow the preregistered statistical analysis plan. But researchers can, and should, explore data to generate hypotheses about things that are going on in their data. You can preregister these, but you do not have to. Not exploring data could even be seen as research waste, as you are missing out on the opportunity to generate hypotheses that are informed by data. A case can be made that researchers should regularly include variables to explore (e.g., measures that are of general interest to peers in their field), as long as these do not interfere with the primary hypothesis test (and as long as these explorations are presented as such).
In the book “Reporting Quantitative Research in Psychology: How to Meet APA Style Journal Article Reporting Standards” by Cooper and colleagues (2020), a very useful distinction is made between primary hypotheses, secondary hypotheses, and exploratory hypotheses. The primary hypotheses are the main tests you are designing the study for. The secondary hypotheses are also of interest when you design the study, but you might not have sufficient power to detect them. You did not design the study to test these hypotheses, and because the power for these tests might be low, you did not control the Type 2 error rate for secondary hypotheses. You can preregister secondary hypotheses to control the Type 1 error rate, as you know you will perform them, and if there are multiple secondary hypotheses, as Cooper et al. (2020) remark, readers will expect “adjusted levels of statistical significance, or conservative post hoc means tests, when you conducted your secondary analysis”.
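The “adjusted levels of statistical significance” that Cooper and colleagues mention can be implemented in several ways. One common option is the Holm-Bonferroni step-down procedure; a minimal sketch, applied to hypothetical p-values for three secondary hypotheses:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure: test p-values from smallest to largest,
    comparing the k-th smallest (0-indexed) against alpha / (m - k).
    As soon as one test fails, all remaining larger p-values also fail."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break
    return reject

# Hypothetical p-values for three secondary hypotheses:
ps = [0.004, 0.041, 0.060]
print(holm_bonferroni(ps))  # [True, False, False]
```

Note that 0.041 would be ‘significant’ at an unadjusted alpha of .05, but is not rejected after the Holm correction: its threshold in the step-down sequence is .05 / 2 = .025.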
If you think of the possibility to analyze a covariate, but decide this is an exploratory analysis, you can decide to control neither the Type 1 error rate nor the Type 2 error rate. These are analyses, but not tests of a hypothesis, as any findings from these analyses have an unknown Type 1 error rate. Of course, that does not mean these analyses can not be correct in what they reveal – we just have no way to know the long-run probability that exploratory conclusions are wrong. Future tests of the hypotheses generated in exploratory analyses are needed. But as long as you follow the Journal Article Reporting Standards and distinguish exploratory analyses, readers know what they are getting. Exploring is not p-hacking.
People in psychology are re-learning the basic rules of hypothesis testing in the wake of the replication crisis. But because they are not yet immersed in good research practices, their lack of experience means they overregularize simplistic rules to situations where those rules do not apply. Not all flexibility is p-hacking, preregistered studies do not prevent you from deviating from your analysis plan, and you do not need to preregister every possible test that you think of. A good cure for overregularization is reasoning from basic principles. Do not follow simple rules (or what you see in published articles), but make decisions based on an understanding of how to achieve your inferential goal. If the goal is to make claims with controlled error rates, prevent Type 1 error inflation, for example by correcting the alpha level where needed. If your goal is to explore data, feel free to do so, but know that these explorations should be reported as such. When you design a study, follow the Journal Article Reporting Standards and distinguish tests with different inferential goals.
[1] E.g., they discuss having to choose between Student’s t-test and Welch’s t-test depending on whether Levene’s test indicates the assumption of homogeneity is violated, which is not best practice – just follow R, and use Welch’s t-test by default.
[2] But this is rare – only 2 out of 27 preregistered studies in Psychological Science made no deviations (https://royalsocietypublishing.org/doi/full/10.1098/rsos.211037). We can probably do a bit better if we only preregistered predictions at a time when we really understand our manipulations and measures.
Many of the facts in this blog post come from the biography ‘Neyman’ by Constance Reid. I highly recommend reading this book if you find this blog interesting.
In recent years researchers have become increasingly interested in the relationship between eugenics and statistics, especially focusing on the lives of Francis Galton, Karl Pearson, and Ronald Fisher. Some have gone as far as to argue for a causal relationship between eugenics and frequentist statistics. For example, in a recent book, ‘Bernoulli’s Fallacy’, Aubrey Clayton speculates that Fisher’s decision to reject prior probabilities and embrace a frequentist approach was “also at least partly political”. Rejecting prior probabilities, Clayton argues, makes science seem more ‘objective’, which would have helped Ronald Fisher and his predecessors to establish eugenics as a scientific discipline, despite the often racist conclusions eugenicists reached in their work.
When I was asked to review an early version of Clayton’s book for Columbia University Press, I found the main narrative rather unconvincing, and the presented history of frequentist statistics too one-sided and biased. Authors who link statistics to problematic political views often do not mention equally important figures in the history of frequentist statistics who were in all ways the opposite of Ronald Fisher. In this blog post, I want to briefly discuss the work and life of Jerzy Neyman, for two reasons.
Jerzy Neyman (image from https://statistics.berkeley.edu/people/jerzy-neyman)
First, the focus on Fisher’s role in the history of frequentist statistics is surprising, given that the dominant approach to frequentist statistics used in many scientific disciplines is the Neyman-Pearson approach. If you have ever rejected a null hypothesis because a p-value was smaller than an alpha level, or if you have performed a power analysis, you have used the Neyman-Pearson approach to frequentist statistics, and not the Fisherian approach. Neyman and Fisher disagreed vehemently about their statistical philosophies (in 1961 Neyman published an article titled ‘Silver Jubilee of My Dispute with Fisher’), but it was Neyman’s philosophy that won out and became the default approach to hypothesis testing in most fields[i]. Anyone discussing the history of frequentist hypothesis testing should therefore seriously engage with the work of Jerzy Neyman and Egon Pearson. Their work was not in line with the views of Karl Pearson, Egon’s father, nor the views of Fisher. Indeed, it was a great source of satisfaction to Neyman that their seminal 1933 paper was presented to the Royal Society by Karl Pearson, who was hostile and skeptical of the work, and (as Neyman thought) reviewed by Fisher[ii], who strongly disagreed with their philosophy of statistics.
Second, Jerzy Neyman was also the opposite of Fisher in his political viewpoints. Instead of promoting eugenics, Neyman worked to improve the position of the less privileged throughout his life, teaching disadvantaged people in Poland, and creating educational opportunities for Americans at UC Berkeley. He hired David Blackwell, who became the first Black tenured faculty member at UC Berkeley. This is important, because it falsifies the idea put forward by Clayton[iii] that frequentist statistics became the dominant approach in science because the most important scientists who worked on it wanted to pretend their dubious viewpoints were based on ‘objective’ scientific methods.
I think it is useful to broaden the discussion of the history of statistics beyond the work by Fisher and Karl Pearson, and credit the work of others[iv] who contributed in at least as important ways to the statistics we use today. I am continually surprised by how few people working outside of statistics even know the name of Jerzy Neyman, even though they regularly use his insights when testing hypotheses. In this blog, I will try to describe his work and life to add some balance to the history of statistics that most people seem to learn about. And more importantly, I hope Jerzy Neyman can be a positive role model for young frequentist statisticians, who might so far have only been educated about the life of Ronald Fisher.
Neyman’s personal life
Neyman was born in 1894 in Russia, but raised in Poland. After attending the gymnasium, he studied at the University of Kharkov. He initially tried to become an experimental physicist, but he was too clumsy with his hands, and switched to conceptual mathematics, completing his undergraduate degree in 1917 in politically tumultuous times. In 1919 he met his wife, and they married in 1920. Ten days later, because of the war between Russia and Poland, Neyman was imprisoned for a short time, and in 1921 he fled to a small village to avoid being arrested again, where he obtained food by teaching the children of farmers. He worked for the Agricultural Institute, and then at the University in Warsaw. He obtained his doctoral degree in 1924, at age 30. In September 1925 he was sent to London for a year to learn about the latest developments in statistics from Karl Pearson himself. It was here that he met Egon Pearson, Karl’s son, and a friendship and scientific collaboration started.
Neyman always spent a lot of time teaching, often at the expense of doing scientific work. He was involved in equal opportunity education in 1918 in Poland, teaching in dimly lit classrooms where the rag he used to wipe the blackboard would sometimes freeze. He always had a weak spot for intellectuals from ‘disadvantaged’ backgrounds. He and his wife were themselves very poor until he moved to UC Berkeley in 1938. In 1929, back in Poland, his wife became ill due to their bad living conditions, and the doctor who came to examine her was so struck by their miserable living conditions that he offered to let the couple stay in his house, for the same rent they were paying, while he visited France for 6 months. In his letters to Egon Pearson from this time, Neyman often complained that the struggle for existence took all his time and energy, and that he could not do any scientific work.
Even much later in his life, in 1978, he kept in mind that many people have very little money, and he called ahead to restaurants to make sure a dinner before a seminar would not cost too much for the students. It is perhaps no surprise that most of his students (and he had many) talk about Neyman with a lot of appreciation. He wasn’t perfect (for example, Erich Lehmann – one of Neyman’s students – remarks how he was no longer allowed to teach a class after his lecture notes, building on but extending the work by Neyman, became extremely popular – suggesting Neyman was no stranger to envy). But his students were extremely positive about the atmosphere he created in his lab. For example, job applicants were told around 1947 that “there is no discrimination on the basis of age, sex, or race … authors of joint papers are always listed alphabetically.”
Neyman himself often suffered discrimination: sometimes because of his difficulty mastering the English language, sometimes for being Polish (when a piece of clothing, an ermine wrap, was stolen from their room in Paris, the police responded “What can you expect – only Poles live there!”), sometimes because he did not believe in God, and sometimes because his wife was Russian and very emancipated (living independently in Paris as an artist). He was fiercely against discrimination. In 1933, as anti-Semitism was on the rise among students at the university where he worked in Poland, he complained in a letter to Egon Pearson that the students were behaving toward Jews as Americans did toward people of color. In 1941 at UC Berkeley he hired women at a time when it was not easy for a woman to get a job in mathematics.
In 1942, Neyman examined the possibility of hiring David Blackwell, a Black statistician, then still a student. Neyman met him in New York (so that Blackwell did not need to travel to Berkeley at his own expense) and considered Blackwell the best candidate for the job. The wife of a mathematics professor (who was born in the south of the US) learned about the possibility that a Black statistician might be hired and warned that she would not invite a Black man to her house, and there was enough concern about the effect the hire would have on the department that Neyman could not make an offer to Blackwell. He was able to get Blackwell to Berkeley in 1953 as a visiting professor, and offered him a tenured job in 1954, making David Blackwell the first Black tenured faculty member at UC Berkeley. And Neyman did this even though Blackwell was a Bayesian[v] ;).
In 1963, Neyman travelled to the south of the US and for the first time directly experienced segregation. Back in Berkeley, a letter was written with a request for contributions for the Southern Christian Leadership Conference (founded by Martin Luther King, Jr. and others), and 4000 copies were printed and shared with colleagues at the university and friends around the country, which brought in more than $3000. He wrote a letter to his friend Harald Cramér saying that he believed Martin Luther King, Jr. deserved a Nobel Peace Prize (Cramér forwarded the letter to the chairman of the Nobel Committee, and Neyman believed it might have contributed at least a tiny bit to the fact that Martin Luther King, Jr. was awarded the Nobel Peace Prize a year later). Neyman also worked towards the establishment of a Special Scholarships Committee at UC Berkeley with the goal of providing educational opportunities to disadvantaged Americans.
Neyman was not a pacifist. In the Second World War he actively looked for ways to contribute to the war effort. He was involved in statistical models that computed the optimal spacing of bombs dropped by planes to clear a path across a beach of land mines. (When at a certain moment he needed specifics about the beach, a representative from the military who was not allowed to directly provide this information asked if Neyman had ever been to the seashore in France; Neyman replied that he had been to Normandy, and the representative answered “Then use that beach!”) But Neyman opposed the Vietnam war early and actively, despite the risk of losing lucrative contracts the Statistical Laboratory had with the Department of Defense. In 1964 he joined a group of people who bought advertisements in local newspapers with a picture of a napalmed Vietnamese child and the quote “The American people will bluntly and plainly call it murder”.
A positive role model
It is important to know the history of a scientific discipline. Histories are complex, and we should resist overly simplistic narratives. If your teacher explains frequentist statistics to you, it is good if they highlight that someone like Fisher had questionable ideas about eugenics. But the early developments in frequentist statistics involved many researchers beyond Fisher, and, luckily, there are many more positive role models that also deserve to be mentioned – such as Jerzy Neyman. Even though Neyman’s philosophy on statistical inferences forms the basis of how many scientists nowadays test hypotheses, his contributions and personal life are still often not discussed in histories of statistics – an oversight I hope the current blog post can somewhat mitigate. If you want to learn more about the history of statistics through Neyman’s personal life, I highly recommend the biography of Neyman by Constance Reid, which was the source for most of the content of this blog post.
[i] See Hacking, 1965: “The mature theory of Neyman and Pearson is very nearly the received theory on testing statistical hypotheses.”
[ii] It turns out, according to the biography, that the paper was reviewed not by Fisher but by A. C. Aitken, who reviewed it positively.
[iii] Clayton’s book seems to be mainly intended as an attempt to persuade readers to become a Bayesian, and not as an accurate analysis of the development of frequentist statistics.
[iv] William Gosset (or ‘Student’, from ‘Student’s t-test’), who was the main inspiration for the work by Neyman and Pearson, is another giant in frequentist statistics who does not in any way fit into the narrative that frequentist statistics is tied to eugenics, as his statistical work was motivated by applied research questions in the Guinness brewery. Gosset was a modest man – which is probably why he rarely receives the credit he is due.
[v] When asked about his attitude towards Bayesian statistics in 1979, he answered: “It does not interest me. I am interested in frequencies.” He did note multiple legitimate approaches to statistics exist, and the choice one makes is largely a matter of personal taste. Neyman opposed subjective Bayesian statistics because their use could lead to bad decision procedures, but was very positive about later work by Wald, which inspired Bayesian statistical decision theory.
At the first partially in-person scientific meeting I attended after the COVID-19 pandemic, the Perspectives on Scientific Error conference at the Lorentz Center in Leiden, the organizers asked Eric-Jan Wagenmakers and myself to engage in a discussion about p-values and Bayes factors. We each gave a 15-minute presentation to set up our arguments, centered around three questions: What is the goal of statistical inference? What is the advantage of your approach in a practical/applied context? And when do you think the other approach may be applicable?
What is the goal of statistical inference?
When browsing through the latest issue of Psychological Science, many of the titles of scientific articles make scientific claims. “Parents Fine-Tune Their Speech to Children’s Vocabulary Knowledge”, “Asymmetric Hedonic Contrast: Pain is More Contrast Dependent Than Pleasure”, “Beyond the Shape of Things: Infants Can Be Taught to Generalize Nouns by Objects’ Functions”, “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis”, or “Response Bias Reflects Individual Differences in Sensory Encoding”. These authors are telling you that if you take away one thing from the work they have been doing, it is a claim that some statistical relationship is present or absent. This approach to science, where researchers collect data to make scientific claims, is extremely common (we discuss this extensively in our preprint “The Epistemic and Pragmatic Function of Dichotomous Claims Based on Statistical Hypothesis Tests” by Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). It is not the only way to do science – there is purely descriptive work, or estimation, where researchers present data without making any claims beyond the observed data, so there is never a single goal in statistical inference – but if you browse through scientific journals, you will see that a large percentage of published articles have the goal of making one or more scientific claims.
Claims can be correct or wrong. If scientists used a coin flip as their preferred methodological approach to make scientific claims, they would be right and wrong 50% of the time. This error rate is considered too high for scientific claims to be useful, and therefore scientists have developed somewhat more advanced methodological approaches to make claims. One such approach, widely used across scientific fields, is Neyman-Pearson hypothesis testing. If you have performed a statistical power analysis when designing a study, and if you think it would be problematic to p-hack when analyzing the data from your study, you have engaged in Neyman-Pearson hypothesis testing. The goal of Neyman-Pearson hypothesis testing is to control the maximum rate of incorrect scientific claims the scientific community collectively makes. For example, when authors write “The Bilingual Advantage in Children’s Executive Functioning is Not Related to Language Status: A Meta-Analysis” we could expect a study design where people specified a smallest effect size of interest, and statistically rejected, in an equivalence test, the presence of any worthwhile bilingual advantage in children’s executive functioning based on language status. They would make such a claim with a pre-specified maximum Type 1 error rate, or alpha level, often set to 5%. Formally, authors are saying: “We might be wrong, but we claim there is no meaningful effect here, and if all scientists collectively act as if we are correct about claims generated by this methodological procedure, we would be misled no more than alpha% of the time, which we deem acceptable, so let’s for the foreseeable future (until new data emerges that proves us wrong) assume our claim is correct.”
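An equivalence test of the kind described above can be sketched in a few lines. This is a minimal illustration under a normal approximation, with hypothetical summary statistics (an effect estimate of 0.02, a standard error of 0.05, and a smallest effect size of interest of 0.2 are all assumed numbers, not taken from the meta-analysis mentioned above):

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def tost(effect, se, bound, alpha=0.05):
    """Two one-sided tests (TOST) against equivalence bounds (-bound, +bound).
    Equivalence is claimed when both one-sided tests reject at alpha, i.e.
    when the larger of the two one-sided p-values is below alpha."""
    p_lower = 1 - norm_cdf((effect + bound) / se)  # H0: effect <= -bound
    p_upper = norm_cdf((effect - bound) / se)      # H0: effect >= +bound
    p_tost = max(p_lower, p_upper)
    return p_tost, p_tost < alpha

# Hypothetical estimate near zero, precisely estimated:
p, equivalent = tost(effect=0.02, se=0.05, bound=0.2)
print(f"p = {p:.5f}, claim equivalence: {equivalent}")  # p well below .05
```

If both one-sided tests reject, the effect is statistically smaller than the smallest effect size of interest, and the authors can claim the absence of a meaningful effect with a maximum Type 1 error rate of alpha.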
Discussion sections are often less formal, and researchers often violate the code of conduct for research integrity by selectively publishing only those results that confirm their predictions, which messes up many of the statistical conclusions we draw in science.
The process of claim making described above does not depend on an individual’s personal beliefs, unlike some Bayesian approaches. As Taper and Lele (2011) write: “It is not that we believe that Bayes’ rule or Bayesian mathematics is flawed, but that from the axiomatic foundational definition of probability Bayesianism is doomed to answer questions irrelevant to science. We do not care what you believe, we barely care what we believe, what we are interested in is what you can show.” This view is strongly based on the idea that the goal of statistical inference is the accumulation of correct scientific claims through methodological procedures that lead to the same claims by all scientists who evaluate the tests of these claims. Incorporating individual priors into statistical inferences, and making claims dependent on their prior belief, does not provide science with a methodological procedure that generates collectively established scientific claims. Bayes factors provide a useful and coherent approach to update individual beliefs, but they are not a useful tool to establish collectively agreed upon scientific claims.
What is the advantage of your approach in a practical/applied context?
A methodological procedure built around a Neyman-Pearson perspective works well in a science where scientists want to make claims, but we want to prevent too many incorrect scientific claims. One attractive property of this methodological approach is that the scientific community can collectively agree upon the severity with which a claim has been tested. If we design a study with 99.9% power for the smallest effect size of interest and use a 0.1% alpha level, everyone agrees the risk of an erroneous claim is low. If you personally do not like the claim, several options for criticism are possible. First, you can argue that no matter how small the error rate was, errors still occur with their appropriate frequency, no matter how surprised we would be if they occur to us (I am paraphrasing Fisher). Thus, you might want to run two or three replications, until the probability of an error has become too small for the scientific community to consider additional replication studies sensible based on a cost-benefit analysis. Because it is practically very difficult to reach agreement on cost-benefit analyses, the field often resorts to rules or regulations. Just as we can debate whether it is sensible to allow people to drive 138 kilometers per hour on some stretches of road at some times of day if they have a certain level of driving experience, such fine-grained discussions are too complex to practically implement, and instead thresholds of 50, 80, 100, and 130 are used (depending on location and time of day). Similarly, scientific organizations decide upon thresholds that certain subfields are expected to use (such as an alpha level of 0.0000003 in physics to declare a discovery, or the two-study rule of the FDA).
Subjective Bayesian approaches can be used in practice to make scientific claims. For example, one can preregister that a claim will be made when a BF > 10 (or, for evidence for the null, a BF < 0.1). This is done in practice, for example in Registered Reports in Nature Human Behaviour. The problem is that this methodological procedure does not in itself control the rate of erroneous claims. Some researchers have published frequentist analyses of Bayesian methodological decision rules (Note: Leonhard Held brought up these Bayesian/frequentist compromise methods as well – during coffee after our discussion, EJ and I agreed that we like those approaches, as they allow researchers to control frequentist errors while interpreting the evidential value in the data – it is a win-win solution). This works by determining through simulations which test statistic should be used as a cut-off value to make claims. The process is often a bit laborious, but if you have the expertise and care about evidential interpretations of data, do it.
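A stripped-down sketch of such a frequentist analysis of a Bayesian decision rule is shown below. To stay self-contained it uses a point-alternative Bayes factor (a simple likelihood ratio) rather than the default Bayes factors used in Registered Reports, and all numbers (n = 20 per study, an alternative of mu = 0.5) are illustrative assumptions. The simulation estimates how often a “claim H1 when BF10 > 10” rule produces a false claim when the null is true:

```python
import math
import random

def bf10_point(xs, mu1=0.5):
    """Bayes factor for H1: mu = mu1 vs H0: mu = 0, known sigma = 1.
    For two point hypotheses this is simply the likelihood ratio."""
    n, xbar = len(xs), sum(xs) / len(xs)
    return math.exp(n * (xbar * mu1 - mu1 ** 2 / 2))

random.seed(2)
n, n_sims = 20, 20000
false_claims = 0
for _ in range(n_sims):
    xs = [random.gauss(0, 1) for _ in range(n)]  # data generated under H0
    if bf10_point(xs) > 10:                      # decision rule: claim H1
        false_claims += 1
rate = false_claims / n_sims
print(f"P(BF10 > 10 | H0) = {rate:.3f}")
```

For this simple-versus-simple comparison the false claim rate is guaranteed to be below 1/10 (here it comes out around 1-2%), but for composite hypotheses and default priors the error rate of a Bayes factor threshold is not known in advance, which is exactly why the calibration simulations described above are needed.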
In practice, an advantage of frequentist approaches is that criticism has to focus on the data and the experimental design, which can be resolved in additional experiments. In subjective Bayesian approaches, researchers can ignore the data and the experimental design, and instead waste time criticizing priors. For example, in a comment on Bem (2011), Wagenmakers and colleagues concluded that “We reanalyze Bem’s data with a default Bayesian t test and show that the evidence for psi is weak to nonexistent.” In a response, Bem, Utts, and Johnson stated: “We argue that they have incorrectly selected an unrealistic prior distribution for their analysis and that a Bayesian analysis using a more reasonable distribution yields strong evidence in favor of the psi hypothesis.” I strongly expect that most reasonable people would agree more strongly with the prior chosen by Bem and colleagues than with the prior chosen by Wagenmakers and colleagues (Note: In the discussion EJ agreed that in hindsight he did not believe the prior in the main paper was the best choice, but noted that the supplementary files included a sensitivity analysis that demonstrated the conclusions were robust across a range of priors, and that the analysis by Bem et al. combined Bayes factors in a flawed approach). More productively than discussing priors, data collected in direct replications since 2011 have consistently led to claims that there is no precognition effect. As Bem has not been able to successfully counter the claims based on data collected in these replication studies, we can currently collectively act as if Bem’s studies were all Type 1 errors (in part caused by extensive p-hacking).
When do you think the other approach may be applicable?
Even though, in the approach to science I have described here, Bayesian approaches based on individual beliefs are not useful to make collectively agreed-upon scientific claims, all scientists are Bayesians. First, we have to rely on our beliefs when we can not collect sufficient data to repeatedly test a prediction. When data is scarce, we can’t use a methodological procedure that makes claims with low error rates. Second, we can benefit from prior information when we know we can not be wrong. Incorrect priors can mislead, but if we know our priors are correct – even though this might be rare – use them. Finally, use individual beliefs when you are not interested in convincing others, but only want to guide individual actions where being right or wrong does not impact others. For example, you can use your personal beliefs when you decide which study to run next.
In practice, analyses based on p-values and Bayes factors will often agree. Indeed, one of the points of discussion in the rest of the day was how we have bigger problems than the choice between statistical paradigms. A study with a flawed sample size justification or a bad measure is flawed, regardless of how we analyze the data. Yet, a good understanding of the value of the frequentist paradigm is important to be able to push back against problematic developments, such as researchers or journals who ignore the error rates of their claims, leading to scientific claims that are incorrect too often. Furthermore, a discussion of this topic helps us think about whether we actually want to pursue the goals that our statistical tools achieve, and whether we actually want to organize knowledge generation by making scientific claims that others have to accept or criticize (a point we develop further in Uygun-Tunç, Tunç, & Lakens, https://psyarxiv.com/af9by/). Yes, discussions about p-values and Bayes factors might in practice not have the biggest impact on improving our science, but it is still important and enjoyable to discuss these fundamental questions, and I’d like to thank EJ Wagenmakers and the audience for an extremely pleasant discussion.
When you perform multiple comparisons in a study, you need to control your alpha level for multiple comparisons. It is generally recommended to control for the family-wise error rate, but there is some confusion about what a ‘family’ is. As Bretz, Hothorn, & Westfall (2011) write in their excellent book “Multiple Comparisons Using R” on page 15: “The appropriate choice of null hypotheses being of primary interest is a controversial question. That is, it is not always clear which set of hypotheses should constitute the family H1,…,Hm. This topic has often been in dispute and there is no general consensus.” In one of the best papers on controlling for multiple comparisons out there, Bender & Lange (2001) write: “Unfortunately, there is no simple and unique answer to when it is appropriate to control which error rate. Different persons may have different but nevertheless reasonable opinions. In addition to the problem of deciding which error rate should be under control, it has to be defined first which tests of a study belong to one experiment.” The Wikipedia page on family-wise error rate is a mess.
But let’s get back to our first problem, which we can solve by making the claims for which people need to control Type 1 error rates less vague. Lisa DeBruine and I recently proposed machine-readable hypothesis tests to remove any ambiguity about the tests we will perform to examine statistical predictions, and about when we will consider a claim corroborated or falsified. In this post, I am going to use our R package ‘scienceverse’ to clarify what constitutes a family of tests when controlling the family-wise error rate.
We see the hypothesis that ‘something will happen’ is corroborated, because there was a significant difference on dv3 – even though this was a Type 1 error, since we simulated data with a true effect size of 0 – and any difference was taken as support for the prediction. With a 5% alpha level, we will observe 1-(1-0.05)^3 = 14.26% Type 1 errors in the long run. This Type 1 error inflation can be prevented by lowering the alpha level, for example by a Bonferroni correction (0.05/3), after which the expected Type 1 error rate is 4.92% (see Bretz et al., 2011, for more advanced techniques to control error rates). When we examine the report for the second scenario, where each dv tests a unique hypothesis, we get the following output from scienceverse:
We now see that two hypotheses were falsified (yes, yes, I know you should not use p > 0.05 to falsify a prediction in real life, and this part of the example is formally wrong so that I don’t also have to explain equivalence testing to readers not familiar with it – if that is you, read this, and know that scienceverse will allow you to specify an equivalence test as the criterion to falsify a prediction; see the example here). The third hypothesis is corroborated, even though, as above, this is a Type 1 error.
It might seem that the second approach, specifying each dv as its own hypothesis, is the way to go if you do not want to lower the alpha level to control for multiple comparisons. But take a look at the report of the study you have performed. You have made 3 predictions, of which 1 was corroborated. That is not an impressive success rate. Sure, mixed results happen, and you should interpret results not just based on the p-value, but also on the strength of the experimental design, assumptions about power, your prior, the strength of the theory, etc. Yet if these predictions were derived from the same theory, this set of results is not particularly impressive. Since researchers can never selectively report only those results that ‘work’ (because this would be a violation of the code of research integrity), we should always be able to see the meager track record of predictions. If you don’t feel ready to make a specific prediction (and run the risk of sullying your track record), either do unplanned exploratory tests, and do not make claims based on their results, or preregister all possible tests you can think of, and massively lower your alpha level to control error rates (for example, genome-wide association studies sometimes use an alpha level of 5 × 10^-8 to control the Type 1 error rate).
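As an aside, the alpha-level arithmetic used in these examples is easy to verify; here is a minimal sketch in plain Python (the one-million-test line is an idealized illustration assuming independent tests, not an actual GWAS derivation):

```python
alpha = 0.05
m = 3  # number of tests in the family (one per dv)

# Unadjusted: familywise Type 1 error rate across m independent tests.
fwer = 1 - (1 - alpha) ** m
print(f"unadjusted FWER: {fwer:.4f}")        # 0.1426

# Bonferroni: divide alpha by the number of tests in the family.
fwer_bonferroni = 1 - (1 - alpha / m) ** m
print(f"Bonferroni FWER: {fwer_bonferroni:.4f}")  # 0.0492

# A GWAS-style alpha of 5e-8 keeps the familywise rate around 5%
# even across a million (assumed independent) tests.
fwer_gwas = 1 - (1 - 5e-8) ** 1_000_000
print(f"GWAS-scale FWER: {fwer_gwas:.4f}")   # 0.0488
```

The same logic explains why larger families require smaller per-test alpha levels: the unadjusted familywise rate grows quickly with the number of tests.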
Thanks to Lisa DeBruine for feedback on an earlier draft of this blog post.
First, I agree with Yarkoni that almost all the proposals he makes in the section “Where to go from here?” are good suggestions. I don’t think they follow logically from his points about generalizability, as I detail below, but they are nevertheless solid suggestions a researcher should consider. Second, I agree that there are research lines in psychology where modelling more things as random factors will be productive, and a forceful manifesto (even if it is slightly less practical than similar earlier papers) might be a wake up call for people who had ignored this issue until now.
Beyond these two points of agreement, I found the main thesis in his article largely unconvincing. I don’t think there is a generalizability crisis, but the article is a nice illustration of why philosophers like Popper abandoned the idea of an inductive science. When Yarkoni concludes that “A direct implication of the arguments laid out above is that a huge proportion of the quantitative inferences drawn in the published psychology literature are so inductively weak as to be at best questionable and at worst utterly insensible.” I am primarily surprised he believes induction is a defensible philosophy of science. There is a very brief discussion of views by Popper, Meehl, and Mayo on page 19, but their work on testing theories is proposed as a probably not feasible solution – which is peculiar, because these authors would probably disagree with most of the points made by Yarkoni, and I would expect at least somewhere in the paper a discussion comparing induction against the deductive approach (especially since the deductive approach is arguably the dominant approach in psychology, and therefore none of the generalizability issues raised by Yarkoni are a big concern). Because I believe the article starts from a faulty position (scientists are not concerned with induction, but use deductive approaches) and because Yarkoni provides no empirical support for any of his claims that generalizability has led to huge problems (such as incredibly high Type 1 error rates), I remain unconvinced there is anything remotely close to the generalizability crisis he so evocatively argues for. The topic addressed by Yarkoni is very broad. It probably needs a book-length treatment to do it justice. My review is already way too long, and I did not get into the finer details of the argument. But I hope this review helps to point out the parts of the manuscript where I feel important arguments lack a solid foundation, and where issues that deserve to be discussed are ignored.
Early in the introduction, Yarkoni says there is a “fast” and “slow” approach of drawing general conclusions from specific observations. Whenever people use words that don’t exactly describe what they mean, putting them in quotation marks is generally not a good idea. The “fast” and “slow” approaches he describes are not, I believe upon closer examination, two approaches “of drawing general conclusions from specific observations”.
The difference is actually between induction (the “slow” approach of generalizing from single observations to general observations) and deduction, as proposed by for example Popper. As Popper writes “According to the view that will be put forward here, the method of critically testing theories, and selecting them according to the results of tests, always proceeds on the following lines. From a new idea, put up tentatively, and not yet justified in any way—an anticipation, a hypothesis, a theoretical system, or what you will—conclusions are drawn by means of logical deduction.”
Yarkoni incorrectly suggests that “upon observing that a particular set of subjects rated a particular set of vignettes as more morally objectionable when primed with a particular set of cleanliness-related words than with a particular set of neutral words, one might draw the extremely broad conclusion that ‘cleanliness reduces the severity of moral judgments’”. This reverses the scientific process as proposed by Popper, which is (as several people have argued, see below) the dominant approach to knowledge generation in psychology. The authors are not concluding that “cleanliness reduces the severity of moral judgments” from their data. This would be induction. Instead, they are positing that “cleanliness reduces the severity of moral judgments”, they collected data and performed an empirical test, and found their hypothesis was corroborated. In other words, the hypothesis came first. It is not derived from the data – the hypothesis is what led them to collect the data.
Yarkoni deviates from what is arguably the common approach in psychological science, and suggests induction might actually work: “Eventually, if the effect is shown to hold when systematically varying a large number of other experimental factors, one may even earn the right to summarize the results of a few hundred studies by stating that “cleanliness reduces the severity of moral judgments””. This approach to science flies in the face of Popper (1959/2002, p. 10), who says: “I never assume that we can argue from the truth of singular statements to the truth of theories. I never assume that by force of ‘verified’ conclusions, theories can be established as ‘true’, or even as merely ‘probable’.” Similarly, Lakatos (1978, p. 2) writes: “One can today easily demonstrate that there can be no valid derivation of a law of nature from any finite number of facts; but we still keep reading about scientific theories being proved from facts. Why this stubborn resistance to elementary logic?” I am personally on the side of Popper and Lakatos, but regardless of my preferences, Yarkoni needs to provide some argument that his inductive approach to science has any possibility of being a success, preferably by embedding his views in some philosophy of science. I would also greatly welcome learning why Popper and Lakatos are wrong. Such an argument, which would overthrow the dominant model of knowledge generation in psychology, could be impactful, although a priori I doubt it will be very successful.
This is a minor point, but I think a good illustration of the weakness of some of the main arguments that are made in the paper. On the second page, Yarkoni argues that “the vast majority of psychological scientists have long operated under a regime of (extremely) fast generalization”. I don’t know about the vast majority of scientists, but Yarkoni himself is definitely using fast generalization. He looked through a single journal, and found 3 titles that made general statements (e.g., “Inspiration Encourages Belief in God”). When I downloaded and read this article, I noticed the discussion contains a constraints on generality statement, following Simons et al. (2017). The authors wrote: “We identify two possible constraints on generality. First, we tested our ideas only in American and Korean samples. Second, we found that inspiring events that encourage feelings of personal insignificance may undermine these effects.” Is Yarkoni not happy with these two sentences clearly limiting the generalizability in the discussion?
For me, this observation raised serious concerns about the statement Yarkoni makes that, simply from the titles of scientific articles, we can make a statement about whether authors make ‘fast’ or ‘slow’ generalizations. One reason is that Yarkoni examined titles from a journal that adheres to the publication manual of the APA. In the section on titles, the APA states: “A title should summarize the main idea of the manuscript simply and, if possible, with style. It should be a concise statement of the main topic and should identify the variables or theoretical issues under investigation and the relationship between them. An example of a good title is “Effect of Transformed Letters on Reading Speed.”” To me, it seems the authors are simply following the APA publication manual. I do not think their choice of a title provides us with any insight whatsoever into the tendency of authors to have a preference for ‘fast’ generalization. Again, this might be a minor point, but I found it an illustrative example of the strength of arguments in other places (see the next point for the most important example). Yarkoni needs to make a case that scientists are overgeneralizing, for there to be a generalizability crisis – but he does so unconvincingly. I sincerely doubt researchers expect their findings to generalize to all possible situations mentioned in the title, I doubt scientists believe titles are the place to accurately summarize limits of generalizability, and I doubt Yarkoni has made a strong point that psychologists overgeneralize based on this section. More empirical work would be needed to build a convincing case (e.g., code how researchers actually generalize their findings in a random selection of 250 articles, taking into account Gricean communication norms (especially the cooperative principle) in scientific articles).
After explaining that psychologists use statistics to test predictions based on experiments that are operationalizations of verbal theories, Yarkoni notes: “From a generalizability standpoint, then, the key question is how closely the verbal and quantitative expressions of one’s hypothesis align with each other.”
Yarkoni writes: “When a researcher verbally expresses a particular hypothesis, she is implicitly defining a set of admissible observations containing all of the hypothetical situations in which some measurement could be taken that would inform that hypothesis. If the researcher subsequently asserts that a particular statistical procedure provides a suitable test of the verbal hypothesis, she is making the tacit but critical assumption that the universe of admissible observations implicitly defined by the chosen statistical procedure (in concert with the experimental design, measurement model, etc.) is well aligned with the one implicitly defined by the qualitative hypothesis. Should a discrepancy between the two be discovered, the researcher will then face a choice between (a) working to resolve the discrepancy in some way (i.e., by modifying either the verbal statement of the hypothesis or the quantitative procedure(s) meant to provide an operational parallel); or (b) giving up on the link between the two and accepting that the statistical procedure does not inform the verbal hypothesis in a meaningful way.”
I highlighted what I think is the critical point in bold font. To generalize from a single observation to a general theory through induction, the sample and the test should represent the general theory. This is why Yarkoni argues that there has to be a direct correspondence between the theoretical model and the statistical test. This is true in induction.
If I want to generalize beyond my direct observations, which are rarely sampled randomly from all possible factors that might impact my estimate, I need to account for uncertainty in the things I have not observed. As Yarkoni clearly explains, one does this by adding random factors to a model. He writes (p. 7) “Each additional random factor one adds to a model licenses generalization over a corresponding population of potential measurements, expanding the scope of inference beyond only those measurements that were actually obtained. However, adding random factors to one’s model also typically increases the uncertainty with which the fixed effects of interest are estimated”. You don’t need to read Popper to see the problem here – if you want to generalize to all possible random factors, there are so many of them, you will never be able to overcome the uncertainty and learn anything. This is why inductive approaches to science have largely been abandoned. As Yarkoni accurately summarizes based on a large multi-lab study on verbal overshadowing by Alogna: “given very conservative background assumptions, the massive Alogna et al. study—an initiative that drew on the efforts of dozens of researchers around the world—does not tell us much about the general phenomenon of verbal overshadowing. Under more realistic assumptions, it tells us essentially nothing.” This is also why Yarkoni’s first practical recommendation on how to move forward is to not solve the problem, but to do something else: “One perfectly reasonable course of action when faced with the difficulty of extracting meaningful, widely generalizable conclusions from effects that are inherently complex and highly variable is to opt out of the enterprise entirely.”
This is exactly the reason Popper (among others) rejected induction, and proposed a deductive approach. Why isn’t the alignment between theories and tests raised by Yarkoni a problem for the deductive approach proposed by Popper, Meehl, and Mayo? The reason is that the theory is tentatively posited as true, but in no way believed to be a complete representation of reality. This is an important difference. Yarkoni relies on an inductive approach, and thus the test needs to be aligned with the theory, and the theory defines “a set of admissible observations containing all of the hypothetical situations in which some measurement could be taken that would inform that hypothesis.” For deductive approaches, this is not true.
For philosophers of science like Popper and Lakatos, a theory is not a complete description of reality. Lakatos writes about theories: “Each of them, at any stage of its development, has unsolved problems and undigested anomalies. All theories, in this sense, are born refuted and die refuted.” Lakatos gives the example that Newton’s Principia could not even explain the motion of the moon when it was published. The main point here: All theories are wrong. The fact that all theories (or models) are wrong should not be surprising. Box’s quote “All models are wrong, some are useful” is perhaps best known, but I prefer Box (1976) on parsimony: “Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William Ockham (1285-1349) he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity (Ockham’s knife).” He follows this up by stating “Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.”
In a deductive approach, the goal of a theoretical model is to make useful predictions. I doubt anyone believes that any of the models they are currently working on is complete. Some researchers might follow an instrumentalist philosophy of science, and don’t expect their theories to be anything more than useful tools. Lakatos’s (1978) main contribution to philosophy of science was to develop a way to deal with our incorrect theories, admitting that all theories need adjustment, but that some adjustments lead to progressive research lines, and others to degenerative research lines.
In a deductive model, it is perfectly fine to posit a theory that eating ice-cream makes people happy, without assuming this holds for all flavors, across all cultures, at all temperatures, and irrespective of the amount of ice-cream eaten previously, and many other factors. After all, it is just a tentative model that we hope is simple enough to be useful, and that we expect to become more complex as we move forward. As we increase our understanding of food preferences, we might be able to modify our theory, so that it is still simple, but also allows us to predict the fact that eggnog and bacon flavoured ice-cream do not increase happiness (on average). The most important thing is that our theory is tentative, and posited to allow us to make good predictions. As long as the theory is useful, and we have no alternatives to replace it with, the theory will continue to be used – without any expectation that it will generalize to all possible situations. As Box (1976) writes: “Matters of fact can lead to a tentative theory. Deductions from this tentative theory may be found to be discrepant with certain known or specially acquired facts. These discrepancies can then induce a modified, or in some cases a different, theory.” A discussion of this large gap between Yarkoni and the deductive approaches proposed by Popper and Meehl, where Yarkoni thinks theories and tests need to align, and deductive approaches see theories as tentative and wrong, should be included, I think.
If we read Popper (but also, on the statistical side, the work of Neyman) we see that induction as a possible goal of science is clearly rejected. Yarkoni mentions deductive approaches briefly in his section on adopting better standards, in the sub-section on making riskier predictions. I intuitively expected this section to be crucial – after all, it finally turns to those scholars who would vehemently disagree with most of Yarkoni’s arguments in the preceding sections – but I found this part rather disappointing. Strangely enough, Yarkoni simply proposes predictions as a possible solution – but since the deductive approach goes directly against the inductive approach proposed by Yarkoni, it seems very weird to just mention risky predictions as one possible solution, when it is actually a completely opposite approach that rejects most of what Yarkoni argues for. Yarkoni does not seem to believe that the deductive mode proposed by Popper, Meehl, and Mayo, a hypothesis testing approach that is arguably the dominant approach in most of psychology (Cortina & Dunlap, 1997; Dienes, 2008; Hacking, 1965), has a lot of potential. The reason he doubts severe tests of predictions will be useful is that “in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding” (Yarkoni, p. 19). This could be resolved if risky predictions were possible, which Yarkoni doubts.
Yarkoni’s criticism of the possibility of severe tests is regrettably weak. Yarkoni says that “Unfortunately, in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding.” From his references (Cohen, Lykken, Meehl) we can see he refers to the crud factor, or the idea that the null hypothesis is always false. As we recently pointed out in a review paper on crud (Orben & Lakens, 2019), Meehl and Lykken disagreed about the definition of the crud factor, the evidence of crud in some datasets can not be generalized to all studies in psychology, and “The lack of conceptual debate and empirical research about the crud factor has been noted by critics who disagree with how some scientists treat the crud factor as an “axiom that needs no testing” (Mulaik, Raju, & Harshman, 1997).” Altogether, I am not convinced that this cursory reference to crud makes a convincing case that “there are pervasive and typically very plausible competing explanations for almost every finding”. Risky predictions seem possible, to me, and demonstrating the generalizability of findings is actually one way to perform a severe test.
When Yarkoni discusses risky predictions, he sticks to risky quantitative predictions. As explained in Lakens (2020), “Making very narrow range predictions is a way to make it statistically likely to falsify your prediction if it is wrong. But the severity of a test is determined by all characteristics of a study that increases the capability of a prediction to be wrong, if it is wrong. For example, by predicting you will only observe a statistically significant difference from zero in a hypothesis test if a very specific set of experimental conditions is met that all follow from a single theory, it is possible to make theoretically risky predictions.” I think the reason most psychologists perform studies that demonstrate the generalizability of their findings has nothing to do with a desire to inductively build a theory from all these single observations. They show the findings generalize, because it increases the severity of their tests. In other words, according to this deductive approach, generalizability is not a goal in itself, but it follows from the goal to perform severe tests. It is unclear to me why Yarkoni does not think that approaches such as triangulation (Munafò & Smith, 2018) are severe tests. I think these approaches are the driving force behind many of the more successful theories in social psychology (e.g., social identity theory), and it works fine.
Generalization as a means to severely test a prediction is common, and one of the goals of direct replications (generalizing to new samples) and conceptual replications (generalizing to different procedures). Yarkoni might disagree with me that generalization serves severity, not vice versa. But then what is missing from the paper is a solid argument why people would want to generalize to begin with, assuming at least a decent number of them do not believe in induction. The inherent conflict between the deductive approaches and induction is also not explained in a satisfactory manner.
If we ignore all previous points, we can still read Yarkoni’s paper as a call to introduce more random factors in our experiments. This nicely complements recent calls to vary all factors you do not think should change the conclusions you draw (Baribault et al., 2018), and classic papers on random effects (Barr et al., 2013; Clark, 1969; Cornfield & Tukey, 1956).
Yarkoni generalizes from the fact that most scientists model subjects as a random factor, and then asks why scientists generalize to all sorts of other factors that were not in their models. He asks “Why not simply model all experimental factors, including subjects, as fixed effects”. It might be worth noting in the paper that sometimes researchers model subjects as fixed effects. For example, Fujisaki and Nishida (2009) write: “Participants were the two authors and five paid volunteers” and nowhere in their analyses do they assume there is any meaningful or important variation across individuals. In many perception studies, an eye is an eye, and an ear is an ear – whether from the author, or a random participant dragged into the lab from the corridor.
In other research areas, we do model individuals as a random factor. Yarkoni says we model subjects as random effects because: “The reason we model subjects as random effects is not that such a practice is objectively better, but rather, that this specification more closely aligns the meaning of the quantitative inference with the meaning of the qualitative hypothesis we’re interested in evaluating”. I disagree. I think we model certain factors as random effects because we have a high prior that these factors influence the effect, and leaving them out of the model would reduce the strength of our prediction. Leaving them out reduces the probability that a test will show we are wrong, if we are wrong. It impacts the severity of the test. Whether or not we need to model factors (e.g., temperature, the experimenter, or day of the week) as random factors because not doing so reduces the severity of a test is a subjective judgment. Research fields need to decide for themselves. It is very well possible that more random factors are generally needed, but I don’t know how many, and I doubt the problem will ever be as severe as the ‘generalizability crisis’ label suggests. If it is as severe as Yarkoni suggests, some empirical demonstrations of this would be nice. Clark (1973) showed his language-as-fixed-effect fallacy using real data. Barr et al. (2013) similarly made their point based on real data. I currently do not find the theoretical point very strong, but real data might convince me otherwise.
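Clark’s point can also be illustrated with simulated rather than real data. Below is a self-contained Monte Carlo sketch (plain Python; all parameter values are hypothetical choices for illustration, not Clark’s actual design): under a true null effect, pooling trials while ignoring random stimulus variation inflates the false positive rate far above the nominal 5%, while first aggregating per stimulus keeps it close to nominal.

```python
import math
import random
import statistics

def z_test_p(x, y):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(statistics.variance(x) / len(x) +
                   statistics.variance(y) / len(y))
    z = (statistics.mean(x) - statistics.mean(y)) / se
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rates(n_sims=500, n_stimuli=10, n_subjects=20,
                         sd_stimulus=1.0, sd_noise=1.0, seed=42):
    """Simulate a two-condition design in which H0 is true.

    Each condition uses its own set of stimuli, and stimuli differ randomly.
    Returns the false positive rate when all trials are pooled (stimulus
    treated as a fixed effect) and when trials are first averaged per
    stimulus (stimulus treated as the unit of analysis).
    """
    rng = random.Random(seed)
    pooled_fp = per_stimulus_fp = 0
    for _ in range(n_sims):
        trials, stimulus_means = [], []
        for _cond in range(2):
            obs, means = [], []
            for _stim in range(n_stimuli):
                effect = rng.gauss(0, sd_stimulus)  # random stimulus effect
                scores = [effect + rng.gauss(0, sd_noise)
                          for _ in range(n_subjects)]
                obs.extend(scores)
                means.append(statistics.mean(scores))
            trials.append(obs)
            stimulus_means.append(means)
        if z_test_p(trials[0], trials[1]) < 0.05:
            pooled_fp += 1        # ignores clustering by stimulus
        if z_test_p(stimulus_means[0], stimulus_means[1]) < 0.05:
            per_stimulus_fp += 1  # respects stimulus as the unit
    return pooled_fp / n_sims, per_stimulus_fp / n_sims
```

With these (arbitrary) settings the pooled analysis rejects the null in a large fraction of the simulations, while the per-stimulus analysis stays near the nominal level – the kind of empirical demonstration that, in my view, a claim of a crisis needs.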
The issue of including random factors is discussed in a more complete, and importantly, more applicable, manner in Barr et al. (2013). Yarkoni remains vague on which random factors should be included and which not, and just recommends ‘more expansive’ models. I have no idea when this is done satisfactorily. This is a problem with extreme arguments like the one Yarkoni puts forward. It is fine in theory to argue your test should align with whatever you want to generalize to, but in practice, it is impossible. And in the end, statistics is just a reasonably limited toolset that tries to steer people somewhat in the right direction. The discussion in Barr et al. (2013), which includes trade-offs between converging models (which Yarkoni too easily dismisses as solved by modern computational power – it is not solved) and including all possible factors, and interactions between all possible factors, is a bit more pragmatic. Similarly, Cornfield & Tukey (1956) more pragmatically list options ranging from ignoring factors altogether, to randomizing them, or including them as a factor, and note “Each of these attitudes is appropriate in its place. In every experiment there are many variables which could enter, and one of the great skills of the experimenter lies in leaving out only inessential ones.” Just as pragmatically, Clark (1973) writes: “The wide-spread capitulation to the language-as-fixed-effect fallacy, though alarming, has probably not been disastrous. In the older established areas, most experienced investigators have acquired a good feel for what will replicate on a new language sample and what will not. They then design their experiments accordingly.” As always, it is easy to argue for extremes in theory, but this is generally uninteresting for an applied researcher.
It would be great if Yarkoni could provide something a bit more pragmatic about what to do in practice than his current recommendation about fitting “more expansive models” – and provide some indication of where to stop, or at least suggestions for what an empirical research program would look like that tells us where to stop, and why. In some ways, Yarkoni’s point generalizes the argument that most findings in psychology do not generalize to non-WEIRD populations (Henrich et al., 2010), and it has the same weakness. WEIRD is a nice acronym, but it is just a rather arbitrary collection of 5 factors that might limit generalizability. The WEIRD acronym functions more as a nice reminder that boundary conditions exist, but it does not allow us to predict when they exist, or when they matter enough to be included in our theories. Currently, there is a gap between the factors that in theory could matter, and the factors that we should in practice incorporate. Maybe it is my pragmatic nature, but without such a discussion, I think the paper offers relatively little progress compared to previous discussions about generalizability (of which there are plenty).
A large part of Yarkoni’s argument is based on the idea that theories and tests should be closely aligned, while in a deductive approach based on severe tests of predictions, models are seen as simple, tentative, and wrong, and this is not considered a problem. Yarkoni does not convincingly argue researchers want to generalize extremely broadly (although I agree papers would benefit from including Constraints on Generality statements as proposed by Simons and colleagues (2017), but mainly because this improves falsifiability, not because it improves induction), and even if there is a tendency to overclaim in articles, I do not think this leads to an inferential crisis. Previous authors have made many of the same points, but in a more pragmatic manner (e.g., Barr et al., 2013; Clark, 1973). Yarkoni fails to provide any insights into where the balance between generalizing to everything, and generalizing to factors that matter, should lie, nor does he provide an evaluation of how far off this balance research areas are. It is easy to argue any specific approach to science will not work in theory – but it is much more difficult to convincingly argue it does not work in practice. Until Yarkoni does the latter convincingly, I don’t think the generalizability crisis as he sketches it is something that will keep me up at night.
Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., Ravenzwaaij, D. van, White, C. N., Boeck, P. D., & Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences, 115(11), 2607–2612. https://doi.org/10.1073/pnas.1708285114
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3). https://doi.org/10.1016/j.jml.2012.11.001
Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799. https://doi.org/10/gdm28w
Clark, H. H. (1969). Linguistic processes in deductive reasoning. Psychological Review, 76(4), 387–404. https://doi.org/10.1037/h0027578
Cornfield, J., & Tukey, J. W. (1956). Average Values of Mean Squares in Factorials. The Annals of Mathematical Statistics, 27(4), 907–949. https://doi.org/10.1214/aoms/1177728067
Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2(2), 161.
Dienes, Z. (2008). Understanding psychology as a science: An introduction to scientific and statistical inference. Palgrave Macmillan.
Fujisaki, W., & Nishida, S. (2009). Audio–tactile superiority over visuo–tactile and audio–visual combinations in the temporal resolution of synchrony perception. Experimental Brain Research, 198(2), 245–259. https://doi.org/10.1007/s00221-009-1870-x
Hacking, I. (1965). Logic of Statistical Inference. Cambridge University Press.
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). Most people are not WEIRD. Nature, 466(7302), 29. https://doi.org/10.1038/466029a
Lakens, D. (2020). The Value of Preregistration for Psychological Science: A Conceptual Analysis. Japanese Psychological Review. https://doi.org/10.31234/osf.io/jbh4w
Munafò, M. R., & Smith, G. D. (2018). Robust research needs many lines of evidence. Nature, 553(7689), 399–401. https://doi.org/10.1038/d41586-018-01023-3
Orben, A., & Lakens, D. (2019). Crud (Re)defined. https://doi.org/10.31234/osf.io/96dpy
Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017). Constraints on Generality (COG): A Proposed Addition to All Empirical Papers. Perspectives on Psychological Science, 12(6), 1123–1128. https://doi.org/10.1177/1745691617708630
Now maybe a single paper with 30 tests is not ‘long runnerish’ enough. What we really want to control is the Type 1 error rate of the literature – past, present, and future. Except, we will never read the entire literature. So let’s assume we are interested in a meta-analysis worth of 200 studies that examine a topic where the true effect size is 0 for each test. We can plot the frequency of Type 1 errors for 1 million sets of 200 tests.
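This distribution can be simulated directly. Below is a minimal Python sketch (not the code used for the original plot, which was in R): under the null hypothesis, each test is an independent chance of alpha to yield a false positive, so the number of Type 1 errors per set of 200 tests is binomially distributed. For speed, the sketch uses 10,000 sets rather than 1 million.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sets, n_tests, alpha = 10_000, 200, 0.05

# Under H0 every p-value is uniform, so each of the 200 tests is an
# independent Bernoulli(alpha) chance of a Type 1 error.
type1_counts = rng.binomial(n_tests, alpha, size=n_sets)

print(type1_counts.mean())  # close to 200 * 0.05 = 10 errors per set
```

Plotting a histogram of `type1_counts` gives the frequency distribution of Type 1 errors across the simulated sets of studies.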
For a related paper on alpha levels that in practical situations cannot be 5%, see https://psyarxiv.com/erwvk/ by Casper Albers.
Three years after launching my first massive open online course (MOOC) ‘Improving Your Statistical Inferences’ on Coursera, today I am happy to announce a second completely free online course called ‘Improving Your Statistical Questions’. My first course is a collection of lessons about statistics and methods that we commonly use, but that I wish I had known how to use better when I was taking my first steps into empirical research. My new course is a collection of lessons about statistics and methods that we do not yet commonly use, but that I wish we would start using to improve the questions we ask. Where the first course tries to get people up to speed on commonly accepted best practices, my new course tries to educate researchers about better practices. Most of the modules cover topics in which there have been recent developments, or at least increasing awareness, over the last 5 years.
This blog post is now included in the paper “Sample size justification” available at PsyArXiv.
A preprint (“Justify Your Alpha: A Primer on Two Practical Approaches”) that extends the ideas in this blog post is available at: https://psyarxiv.com/ts4r6
In 1957 Neyman wrote: “it appears desirable to determine the level of significance in accordance with quite a few circumstances that vary from one particular problem to the next.” Despite this good advice, social scientists developed the norm to always use an alpha level of 0.05 as a threshold when making predictions. In this blog post I will explain how you can set the alpha level so that it minimizes the combined Type 1 and Type 2 error rates (thus efficiently making decisions), or balance Type 1 and Type 2 error rates. You can use this approach to justify your alpha level, and guide your thoughts about how to design studies more efficiently.
Neyman (1933) provides an example of the reasoning process he believed researchers should go through. He explains how a researcher might have derived an important hypothesis that H0 is true (there is no effect), and will not want to ‘throw it aside too lightly’. Such a researcher would choose a low alpha level (e.g., 0.01). In another line of research, an experimenter might be interested in detecting factors that would lead to the modification of a standard law, where the “importance of finding some new line of development here outweighs any loss due to a certain waste of effort in starting on a false trail”, and Neyman suggests setting the alpha level to, for example, 0.1.
As you perform lines of research the data you collect are used as a guide to continue or abandon a hypothesis, to use one paradigm or another. One goal of well-designed experiments is to control the error rates as you make these decisions, so that you do not fool yourself too often in the long run.
Many researchers implicitly assume that Type 1 errors are more problematic than Type 2 errors. Cohen (1988) suggested a Type 2 error rate of 20%, and hence to aim for 80% power, but wrote “.20 is chosen with the idea that the general relative seriousness of these two kinds of errors is of the order of .20/.05, i.e., that Type I errors are of the order of four times as serious as Type II errors. This .80 desired power convention is offered with the hope that it will be ignored whenever an investigator can find a basis in his substantive concerns in his specific research investigation to choose a value ad hoc”. More recently, researchers have argued that false negatives constitute a much more serious problem in science (Fiedler, Kutzner, & Krueger, 2012). I always ask my 3rd year bachelor students: What do you think? Is a Type 1 error in your next study worse than a Type 2 error?
Last year I listened to someone who decided whether new therapies would be covered by the German healthcare system. She discussed Eye Movement Desensitization and Reprocessing (EMDR) therapy. I knew that the evidence that the therapy worked was very weak. As the talk started, I hoped they had decided not to cover EMDR. They did cover it, and the researcher convinced me this was a good decision. She said that, although the available evidence that it works was not strong enough, the costs of the therapy (which can be done behind a computer) are very low, it was applied in settings where no really good alternatives were available (e.g., inside prisons), and the risk of negative consequences was basically zero. They were aware that there was a very high probability that the apparent effectiveness of EMDR was a Type 1 error, but compared to the cost of a Type 2 error, it was still better to accept the treatment. Another of my favorite examples comes from Field et al. (2004), who perform a cost-benefit analysis on whether to intervene when examining whether a koala population is declining, and show the alpha level should be set at 1 (one should always assume a decline is occurring and intervene).
Making these decisions is difficult – but it is better to think about them than to end up with error rates that do not reflect the errors you actually want to make. As Ulrich and Miller (2019) describe, the long run error rates you actually make depend on several unknown factors, such as the true effect size, and the prior probability that the null hypothesis is true. Despite these unknowns, you can design studies that have good error rates for an effect size you are interested in, given some sample size you are planning to collect. Let’s see how.
Mudge, Baker, Edge, and Houlahan (2012) explain how researchers might want to minimize the total combined error rate. If both Type 1 and Type 2 errors are costly, then it makes sense to optimally reduce both errors as you do studies. This would make decision making most efficient overall. You choose an alpha level that, when used in the power analysis, leads to the lowest combined error rate. For example, with a 5% alpha and 80% power, the combined error rate is 5 + 20 = 25%, and if power is 99% and the alpha is 5% the combined error rate is 1 + 5 = 6%. Mudge and colleagues show that increasing or reducing the alpha level can lower the combined error rate. This is one of the approaches we mentioned in our ‘Justify Your Alpha’ paper from 2018.
When we wrote ‘Justify Your Alpha’ we knew it would be a lot of work to actually develop methods that people can use. For months, I would occasionally revisit the code Mudge and colleagues used in their paper, which is an adaptation of the pwr library in R, but the code was too complex and I could not get to the bottom of how it worked. After leaving this aside for some months, during which I improved my R skills, some days ago I took a long shower and suddenly realized that I did not need to understand the code by Mudge and colleagues. Instead of getting their code to work, I could write my own code from scratch. Such realizations are my justification for taking showers that are longer than is environmentally friendly.
If you want to balance or minimize error rates, the tricky thing is that the alpha level you set determines the Type 1 error rate, but, through its influence on the statistical power, it also influences the Type 2 error rate. So I wrote a function that examines the range of possible alpha levels (from 0 to 1) and either minimizes the total error (Type 1 + Type 2) or minimizes the difference between the Type 1 and Type 2 error rates, balancing the error rates. It then returns the alpha (Type 1 error rate) and the beta (Type 2 error rate). You can enter any analytic power function that normally works in R and would output the calculated power.
Below is the version of the optimal_alpha function used in this blog. Yes, I am defining a function inside another function and this could all look a lot prettier – but it works for now. I plan to clean up the code when I archive my blog posts on how to justify alpha level in a journal, and will make an R package when I do.
The code requires you to specify the power function for your test (in a way that the code returns the power, hence the $power at the end), where the significance level is a variable ‘x’. In this power function you specify the effect size (such as the smallest effect size you are interested in) and the sample size. In my experience, sometimes the sample size is determined by factors outside the control of the researcher. For example, you are working with existing data, or you are studying a sample that is limited (e.g., all students in a school). Other times, people have a maximum sample size they can feasibly collect, and accept the error rates that follow from this feasibility limitation. If your sample size is not limited, you can increase the sample size until you are happy with the error rates.
The code calculates the Type 2 error (1 – power) across a range of alpha values. For example, say we want to calculate the optimal alpha level for an independent t-test. Assume our smallest effect size of interest is d = 0.5, and we are planning to collect 100 participants in each group. We would normally calculate power as follows:
pwr.t.test(d = 0.5, n = 100, sig.level = 0.05, type = "two.sample", alternative = "two.sided")$power
This analysis tells us that we have 94% power with a 5% alpha level for our smallest effect size of interest, d = 0.5, when we collect 100 participants in each condition.
If we want to minimize our total error rates, we would enter this function in our optimal_alpha function (while replacing the sig.level argument with ‘x’ instead of 0.05, because we are varying the value to determine the lowest combined error rate).
res = optimal_alpha(power_function = "pwr.t.test(d = 0.5, n = 100, sig.level = x, type = 'two.sample', alternative = 'two.sided')$power")
##  0.05101728
##  0.05853977
We see that an alpha level of 0.051 slightly improved the combined error rate, since it will lead to a Type 2 error rate of 0.059 for a smallest effect size of interest of d = 0.5. The combined error rate is 0.11. For comparison, lowering the alpha level to 0.005 would lead to a much larger combined error rate of 0.25.
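The numbers above can be reproduced with a simple grid search. Below is a minimal Python sketch of the same idea (the blog’s actual optimal_alpha function is in R); it assumes the standard noncentral-t power calculation that pwr.t.test also uses:

```python
import numpy as np
from scipy.stats import nct, t

def power_t(alpha, d=0.5, n=100):
    """Power of a two-sided independent t-test with n per group,
    based on the noncentral t distribution."""
    df = 2 * n - 2
    ncp = d * np.sqrt(n / 2)           # noncentrality parameter
    crit = t.ppf(1 - alpha / 2, df)    # two-sided critical value
    return nct.sf(crit, df, ncp) + nct.cdf(-crit, df, ncp)

alphas = np.arange(0.001, 0.3, 0.0001)      # candidate alpha levels
betas = 1 - power_t(alphas)                 # Type 2 error at each alpha
a_min = alphas[np.argmin(alphas + betas)]   # minimize combined error

print(round(a_min, 3), round(1 - power_t(a_min), 3))  # ≈ 0.051, ≈ 0.059
```

The grid search lands on the same alpha of about 0.051 with a Type 2 error rate of about 0.059, i.e., a combined error rate of roughly 0.11.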
What would happen if we had decided to collect 200 participants per group, or only 50? With 200 participants per group we would have more than 99% power for d = 0.5, and relatively speaking, a 5% Type 1 error rate with a 1% Type 2 error rate is slightly out of balance. In the age of big data, researchers nevertheless use such suboptimal error rates all the time due to the mindless choice of a 5% alpha level. When power is large the combined error rates can be smaller if the alpha level is lowered. If we just replace 100 with 200 in the function above, we see the combined Type 1 and Type 2 error rate is lowest if we set the alpha level to 0.00866. If you collect large amounts of data, you should really consider lowering your alpha level.
If the maximum sample size we were willing to collect was 50 per group, the optimal alpha level to reduce the combined Type 1 and Type 2 error rates is 0.13. This means that we would have a 13% probability of deciding there is an effect when the null hypothesis is true. This is quite high! However, if we had used a 5% Type 1 error rate, the power would have been 69.69%, with a 30.31% Type 2 error rate, while the Type 2 error rate is ‘only’ 16.56% after increasing the alpha level to 0.13. We increase the Type 1 error rate by 8 percentage points to reduce the Type 2 error rate by 13.75 percentage points. This increases the overall efficiency of the decisions we make.
This example relies on the pwr.t.test function in R, but any power function can be used. For example, the code to minimize the combined error rates for the power analysis for an equivalence test would be:
res = optimal_alpha(power_function = "powerTOSTtwo(alpha = x, N = 200, low_eqbound_d = -0.4, high_eqbound_d = 0.4)")
You can choose to minimize the combined error rates, but you can also decide that it makes most sense to you to balance the error rates. For example, you think a Type 1 error is just as problematic as a Type 2 error, and therefore, you want to design a study that has balanced error rates for a smallest effect size of interest (e.g., a 5% Type 1 error rate and a 5% Type 2 error rate). Whether to minimize error rates or balance them can be specified in an additional argument in the function. The default is to minimize, but by adding error = "balance" an alpha level is returned such that the Type 1 error rate equals the Type 2 error rate.
res = optimal_alpha(power_function = "pwr.t.test(d = 0.5, n = 100, sig.level = x, type = 'two.sample', alternative = 'two.sided')$power", error = "balance")
##  0.05488516
##  0.05488402
Repeating our earlier example, the alpha level is 0.055, such that the Type 2 error rate, given the smallest effect size of interest and the sample size, is also 0.055. I feel that even though this does not minimize the overall error rates, it is a justification strategy for your alpha level that often makes sense. If both Type 1 and Type 2 errors are equally problematic, we design a study where we are just as likely to make either mistake, for the effect size we care about.
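The balanced solution can be checked the same way as the minimized one: search for the alpha level where the two error rates meet. A minimal Python sketch (again assuming the standard noncentral-t power calculation, not the blog’s R code):

```python
import numpy as np
from scipy.stats import nct, t

def power_t(alpha, d=0.5, n=100):
    """Power of a two-sided independent t-test with n per group."""
    df = 2 * n - 2
    ncp = d * np.sqrt(n / 2)
    crit = t.ppf(1 - alpha / 2, df)
    return nct.sf(crit, df, ncp) + nct.cdf(-crit, df, ncp)

alphas = np.arange(0.001, 0.3, 0.0001)
betas = 1 - power_t(alphas)
# pick the alpha where Type 1 and Type 2 error rates are (nearly) equal
a_bal = alphas[np.argmin(np.abs(alphas - betas))]

print(round(a_bal, 3))  # ≈ 0.055
```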
So far we have assumed a Type 1 error and a Type 2 error are equally problematic. But you might believe Cohen (1988) was right, and Type 1 errors are exactly 4 times as bad as Type 2 errors. Or you might think they are twice as problematic, or 10 times as problematic. However you weigh them, as explained by Mudge et al. (2012) and Ulrich and Miller (2019), you should incorporate those weights into your decisions.
The function has another optional argument, costT1T2, that allows you to specify the relative cost of Type 1 versus Type 2 errors. By default this is set to 1, but you can set it to 4 (or any other value) such that Type 1 errors are 4 times as costly as Type 2 errors. This will change the weight of Type 1 errors compared to Type 2 errors, and thus also the choice of the best alpha level.
res = optimal_alpha(power_function = "pwr.t.test(d = 0.5, n = 100, sig.level = x, type = 'two.sample', alternative = 'two.sided')$power", error = "minimal", costT1T2 = 4)
##  0.01918735
##  0.1211773
Now, the alpha level that minimizes the weighted Type 1 and Type 2 error rates is 0.019.
Similarly, you can take into account prior probabilities that either the null hypothesis is true (and you will observe a Type 1 error), or that the alternative hypothesis is true (and you will observe a Type 2 error). By incorporating these expectations, you can minimize or balance error rates in the long run (assuming your priors are correct). Priors can be specified using the prior_H1H0 argument, which by default is 1 (H1 and H0 are equally likely). Setting it to 4 means you think the alternative hypothesis (and hence Type 2 errors) is 4 times more likely than the null hypothesis (and Type 1 errors).
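Both the cost ratio and the prior fold into a single weighted objective. The exact weighting used inside optimal_alpha is not shown in this excerpt; the Python sketch below assumes the objective costT1T2 · P(H0) · alpha + P(H1) · beta, which reproduces the alpha levels of roughly 0.019 (costT1T2 = 4) and 0.079 (prior_H1H0 = 2) reported in this post:

```python
import numpy as np
from scipy.stats import nct, t

def power_t(alpha, d=0.5, n=100):
    """Power of a two-sided independent t-test with n per group."""
    df = 2 * n - 2
    ncp = d * np.sqrt(n / 2)
    crit = t.ppf(1 - alpha / 2, df)
    return nct.sf(crit, df, ncp) + nct.cdf(-crit, df, ncp)

def weighted_alpha(costT1T2=1, prior_H1H0=1, d=0.5, n=100):
    """Alpha minimizing cost- and prior-weighted error rates
    (assumed objective: costT1T2 * P(H0) * alpha + P(H1) * beta)."""
    alphas = np.arange(0.001, 0.3, 0.0001)
    betas = 1 - power_t(alphas, d, n)
    p_H0 = 1 / (1 + prior_H1H0)
    p_H1 = 1 - p_H0
    return alphas[np.argmin(costT1T2 * p_H0 * alphas + p_H1 * betas)]

print(round(weighted_alpha(costT1T2=4), 3))    # Type 1 errors 4x as costly
print(round(weighted_alpha(prior_H1H0=2), 3))  # H1 twice as likely as H0
```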
res = optimal_alpha(power_function = "pwr.t.test(d = 0.5, n = 100, sig.level = x, type = 'two.sample', alternative = 'two.sided')$power", error = "minimal", prior_H1H0 = 2)
##  0.07901679
##  0.03875676
If you think H1 is twice as likely to be true as H0, you need to worry less about Type 1 errors, and now the alpha level that minimizes the weighted error rates is 0.079. It is always difficult to decide on priors (unless you are Omniscient Jones), but even if you ignore them, you are implicitly deciding that H1 and H0 are equally plausible.
You can’t abandon a practice without an alternative. Minimizing the combined error rate, or balancing error rates, provides two alternatives to the normative practice of setting the alpha level to 5%. Together with the approach of reducing the alpha level as a function of the sample size, I invite you to explore ways to set error rates based on something other than convention. A downside of abandoning mindless statistics is that you need to think about difficult questions. How much more negative is a Type 1 error than a Type 2 error? Do you have any ideas about the prior probabilities? And what is the smallest effect size of interest? Answering these questions is difficult, but considering them is important for any study you design. The experiments you design might very well be more informative, and more efficient. So give it a try.
The Black Death, a pandemic at its height in Europe during the mid-14th century, was a virulent killer. It was so effective that it wiped out approximately one third of Europe’s population. Recent studies have shown that the elderly and the sick were most susceptible. But was the Black Death a “smart” killer?
A recent PLOS ONE study indicates that the Black Death’s virulence might have affected genetic variation in the surviving human population by reducing frailty, resulting in less virulent subsequent outbreaks of the plague. By examining the differences in survival rates and mortality risks in both pre-Black Death and post-Black Death samples of a London population—in combination with other, extrinsic factors, like differences in diet between the two groups—the researcher found that in London, on average, people lived longer following the plague than they did before it, despite repeated plague outbreaks. In other words, in terms of genetic variation, the Black Death positively affected the health of the surviving population.
To uncover differences in the health of medieval Londoners, Dr. Sharon DeWitte of the University of South Carolina examined 464 pre-Black Death individuals from three cemeteries and 133 post-Black Death individuals from one. She chose a diverse range of samples for a comprehensive view of the population, including both the rich and the poor, and women and children, but targeted one geographic location: London.
The ages-at-death of the samples were determined by calculating best estimates—in statistics these are called point estimates—based on particular indicators of age found on the skeletons’ hip and skull bones. Individuals’ ages were then evaluated against those in the Anthropological Database of Odense University, a pre-existing database comprising the Smithsonian’s Terry Collection and prior age-at-death data from 17th-century Danish parish records.
After estimating how old these individuals were when they died and comparing the age indicators against the Odense reference tool, the author conducted statistical analyses on the data to examine what the ages-at-death could tell us about the proportion of pre- and post- Black Death medieval Londoners who lived to a ripe old age, as well as the likelihood of death.
Survivorship was estimated using the Kaplan-Meier estimator, a method that estimates the probability of surviving past a given age from observed lifespans; in this case it evaluated how long people lived in a given period (pre-Black Death or post-Black Death). The calculated differences were significant: in particular, the proportion of adults who lived beyond the age of 50 was much greater in the post-Black Death group than in the pre-Black Death group.
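For readers curious about the mechanics of the estimator: at each observed death, the running survival probability is multiplied by the fraction of individuals still “at risk” who survived that age. A toy Python sketch on made-up ages (purely illustrative, not the study’s data or code):

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate for right-censored data.
    times: observed ages; events: 1 = death observed, 0 = censored."""
    times, events = np.asarray(times), np.asarray(events)
    order = np.argsort(times)
    times, events = times[order], events[order]
    n_at_risk = len(times)
    surv, s = [], 1.0
    for e_i in events:
        if e_i:                      # a death reduces the survival curve
            s *= 1 - 1 / n_at_risk
        surv.append(s)               # censored cases only shrink the risk set
        n_at_risk -= 1
    return times, np.array(surv)

# Toy example: six ages at death, none censored
t, s = kaplan_meier([12, 15, 15, 40, 52, 60], [1, 1, 1, 1, 1, 1])
```

Plotting `s` against `t` as a step function gives a survival curve like the one in the paper; computing it separately for the pre- and post-Black Death samples allows the two curves to be compared.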
In the pre-Black Death group, death was most likely to occur between the ages of 10 and 19, as seen above.
The Kaplan-Meier survival plot shows how the chances of survival, which decrease with age, differ for Pre-Black Death and Post-Black Death groups, as seen below.
As the survival plot indicates, post-Black Death Londoners lived longer than their pre-Black Death predecessors.
Finally, Dr. DeWitte estimated the risk of mortality by applying the age data to the statistical model known as the Gompertz hazard, which shows the typical pattern of increased risk in mortality with age. She found that overall post-Black Death Londoners faced lower risks of mortality than their pre-Black Death counterparts.
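The Gompertz model itself is compact: the hazard, or instantaneous risk of death, grows exponentially with age. A tiny sketch with made-up parameters (the values of a and b here are illustrative, not Dr. DeWitte’s estimates):

```python
import math

def gompertz_hazard(age, a=0.001, b=0.08):
    """Gompertz hazard h(x) = a * exp(b * x): baseline risk a,
    with mortality risk rising exponentially in age at rate b."""
    return a * math.exp(b * age)

# With these toy parameters, a 60-year-old faces exp(0.08 * 40)
# times the mortality risk of a 20-year-old.
ratio = gompertz_hazard(60) / gompertz_hazard(20)
```

Fitting a and b separately to the pre- and post-Black Death age data is what allows the two groups’ mortality risks to be compared across the whole age range.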
To make long and complicated methodology short, these analyses indicate that post-Black Death Londoners appear to have lived longer than pre-Black Death Londoners. The author estimates that the general population of London enjoyed a period of about 200 years of improved survivorship, based on these results.
The virulent killer, the Black Death, may have helped select for a healthier London by influencing genetic variation, at least in the short term. However, to better understand the improved quality of life of post-Black Death London, the author suggests further study to disentangle two major factors: the selectivity of the Black Death, coupled with improvements in lifestyle for post-Black Death individuals. For example, the massive depopulation in Europe resulted in increased wages for workers and improvements to diet following the plague, which also likely improved health for medieval Londoners. By unraveling intrinsic, biological changes in genetic variation from outside extrinsic factors like improvements in diet, it may be possible to better understand the aftermath of one of the most devastating killers in infectious disease history.
The EveryONE blog has more on the medieval killer here.
Citation: DeWitte SN (2014) Mortality Risk and Survival in the Aftermath of the Medieval Black Death. PLoS ONE 9(5): e96513. doi:10.1371/journal.pone.0096513
Image 1: The Black Death from Simple Wikipedia
Image 2: pone.0096513
Image 3: pone.0096513
The post Statistics Predicted a Healthier Medieval London Following the Black Death appeared first on EveryONE.