How I tried to get a paper that I own retracted: A journey and call to action against Prime Scholars

This is a Guest Post by Noah van Dongen


Right after the replication emergency in mental science, numerous clinicians have embraced rehearses that support the vigor and straightforwardness of the logical cycle, including preregistration, information sharing, code sharing, and enormous reproducibility studies
 
not Noah van Dongen, but somebody else (2022)

This is a true story about how I tried to get a paper that I own retracted. The copyright of the paper in question is attributed to me. Several requests and demands for removal were issued, but the journal has not yet fully complied. Of course, this might be due to the fact that I did not write the paper, the paper is an incoherent mess of academic gibberish, and Prime Scholars is the publisher of the journal. This post tells my story. It ends with a call to action against Prime Scholars and similar outfits.

For those of you who are unfamiliar with Prime Scholar, this is what Wikipedia has to say about them:

As Bishop notes, some famous, though long dead, authors are counted among the people that publish in Prime Scholars’ journals.1

Apparently, I can now count myself among the illustrious company of Hesse, Bronte, and Whitman. For I too, without my knowledge and against my wishes, am now a scholar that published a word soup essay in a Prime Scholar journal.

The paper that I never wrote

 

I will not keep you in suspense any longer and tell you which journal and publication I am talking about. Acta Psychopathologica is one of the 56 journals owned by Prime Scholars. It claims a Journal Impact Factor of 2.15 (or 2.4 according to their about page) and reports 524 Google Scholar citations (or 447 according to Google Scholar itself). The journal has published 9 volumes, some with as many as 9 issues, and a total of 9 special issues. The academic trainwreck attributed to me was published in May 2022, as part of the fourth issue of volume 8. There are four other publications in this issue. As far as I can tell, all of their authors are real, living scientists. And all publications are of similar quality: syntactically it looks like English, but it reads like a bad acid trip (or the ramblings of a drunk philosopher).

And now, the word vomit in question. The literary disaster is titled “Phenomena can be Characterized as General Patterns in Observations of Psychology Theories.” and is attributed to a single author, Noah Van Dongen. The efficiency of the review process is astounding:

  • 1 April 2022: submission (nice touch on April fools)

  • 4 April 2022: editor assigned

  • 18 April 2022: reviewed

  • 25 April 2022: revised

  • 2 May 2022: published

One month and a day between submission and publication! Receiving and correcting proofs from a journal's copy editor usually takes me more than a month! It is also very neat that, apart from the submission to editor assignment, everything else was done in exactly a week; like clockwork! I wish my colleagues and I could work in such an organized and systematic way. The terrible text(ual) trifle2 cannot be found on Google Scholar, it does not show up in the Google Scholar page of the journal, and its DOI does not exist. Services like Google Scholar and ResearchGate, which typically notify me when something academic is published under my name, have not picked up on this addition to my oeuvre. The only reason I know of its existence is that a colleague came across it in a Google search for the definition of "phenomena".

As for the content of the linguistic barf: just like the other papers in Issue 4 of Volume 8, it consists of a collection of English sentences that approach syntactic correctness while remaining devoid of any clear meaning. The only thing it has going for it is that it is blessedly short. It is good to read that there were no conflicts of interest, but it's too bad that they misspelled my name. In the Netherlands, our surnames can have prefixes, like "van" or "van der" (which translates to "from"), which are not capitalized. My surname is spelled "van Dongen", not "Van Dongen". Just pointing this out. However, considering the quality of the commentary, it is surprising they got so close to getting my personal information correct. Yes, this verbose vacuity is published as a commentary, though for the life of me I cannot figure out what it is supposed to be commenting on.

Trying to make sense of the senseless drivel, it seems that the true authors took part of the introduction of a preprint (that I did write) and ran it through a thesaurus to avoid being flagged for plagiarism, creating what are called tortured phrases. The paper they used as the mold seems to be an early version of Productive Explanation: A Framework for Evaluating Explanations in Psychological Science, which was posted on PsyArXiv on 13 April 20223. For comparison, here are the first sentences of the original and the forgery, respectively:

In the wake of the replication crisis in psychological science, many psychologists have adopted practices that bolster the robustness and transparency of the scientific process, including preregistration (Chambers, 2013), data sharing (Wicherts et al., 2006), code sharing, and massive reproducibility studies (e.g., Aarts et al., 2015; Walters, 2020).

Right after the replication emergency in mental science, numerous clinicians have embraced rehearses that support the vigor and straightforwardness of the logical cycle, including preregistration, information sharing, code sharing, and enormous reproducibility studies.

I did not take the time to figure out which other sentences they used for the figurative butchery (or is ‘literal’ more appropriate here?) and what kind of procedure they used to end up with this collection of sentences. It looks like they just selected a certain part of the introduction, but I am not sure.

The process of getting the paper retracted and succeeding partially

 

I think this is enough for setting the stage. Let me now tell you about my journey to get the verbal catastrophe, whose copyright I explicitly own, retracted. This story started in November of 2022, when a colleague stumbled across the linguistic salad. Ironically enough, we were working on the original, wanting to improve our definition of phenomena, and searching for definitions by others that we might be able to use. It is quite a surreal experience to see your credentials on work you know you didn't produce while simultaneously searching your memory for what you could have done to make this happen.

As a true academic, I acted immediately, ten days later. My first attempt was emailing the journal to request the removal of the textual horror. This was on 8 December 2022.4 I gave the journal ample time to respond (or I forgot about this problem due to other things that were going on at the time). But, after three months without a reply, I contacted the legal team of the University of Amsterdam. They were very understanding and wanted to be of assistance.

On 11 April 2023, the UvA's legal team emailed Acta Psychopathologica demanding the retraction of the atrocious article and threatening legal action if they did not comply. Acta Psychopathologica did not respond to this email either. On 20 April, the legal team sent a reminder, to which they also did not reply. At the moment, Acta Psychopathologica has not responded to any of our messages.

The UvA’s legal team had also advised me to report the identity theft to the Rijksdienst Identiteitsgegevens (RVIG; translation: the identity data safety services of the Dutch government), which I did on 11 April 2023. Conveniently, the RVIG has an online form for this. However, identity theft is usually about the illegitimate use of your credit cards or passports. It was more than a bit awkward to write about how you have been impersonated to publish shoddy scientific work in (a website that calls itself a) journal. Nine days after I submitted the form, the RVIG contacted me. They were very understanding, but told me there was nothing they could do for me. They advised me to report the crime to the police.

Again, other responsibilities got in the way. About a month later, on 1 June 2023, I called the police. The central operator noted down the specifics of my predicament and told me that the department responsible for identity theft would contact me to make an appointment, which they did on Saturday 3 June 2023. As it turns out, reporting identity theft must be done in person at the police station. Maybe this is to make sure that it is actually you who is reporting the theft of your identity. On 13 June 2023, I went to my appointment to report the crime, which took about an hour. The officer taking my report was friendly and understanding. Actually, she was very understanding considering the curiousness of the situation. Noteworthy about this experience is that she wrote the report from my (first person) perspective, but in her own words. There are many new experiences I am gaining throughout this journey, and this was one of the stranger ones. Reading something as if you said it, correct in terms of content but not phrased the way you would say it, is surreal to say the least. You are instantly aware of your own idiolect. Or at least, that is what I experienced.

A week later, I informed the UvA’s legal team of the police report. They promptly sent another email to Acta Psychopathologica requesting them again to remove the semantic puree, though this time adding that the crime had been reported to the police and that legal action would follow.

The UvA’s legal team also contacted the Editors-in-Chief of Acta Psychopathologica to request them to remove the language pit stain. They received two replies to this request, which can be summarized as: I did not actually do anything for this journal and I would like to be removed from the editorial board.

On 7 July 2023 the horrendous word swirl was no longer visible on the website. There are now only four papers in Volume 8 Issue 4 instead of five. However, the pdf version of the paper can still be reached and I am still listed as an author on the Prime Scholars website.

What is next?

 

My paper is not the only instance of identity theft in Acta Psychopathologica and other Prime Scholars journals. Currently, I am contacting the other authors in Acta Psychopathologica one by one to ask if they wrote the papers that are attributed to them, but this is a slow process. My aim is to make them aware of the fraud and to start requesting the retraction of the fraudulent papers.

I’ve also come across this article in The Times Higher Education, which mentions legal actions being undertaken. I’m trying to find out if legal actions against Prime Scholars are indeed in the works. If so, I hope I can join them. If not, I want to get them started.

What you can do to help

  1. Check if you or people you know have papers attributed to them in a Prime Scholar journal.
  2. Contact me if you are also a victim of identity theft and want to join legal actions against Prime Scholars.
  3. Share this story and (ask people to) take action against Prime Scholars.

Generating academic articles that appear authentic has become much easier now that large language models like ChatGPT have arrived on the scene. I think it is safe to assume that we don't want fake papers to start invading our academic corpora, devaluing our work and eroding the public's trust in science. This does not seem to be a problem that will solve itself. We need to take active steps against these practices, and we need to act now!

One last thought about publishers

Don't you think that 'respectable' academic publishers should accept some responsibility for this predicament we find ourselves in? If outfits like Prime Scholars are making a mockery of their business and poisoning the well, should they not also undertake (legal) steps to protect their profession?

 
 

1 Also, see this article on Retraction Watch.

2 For the interested, a trifle is a layered dessert of English origin. In Friends' episode 9 of season 6, Rachel tries to make this dessert for Thanksgiving and accidentally adds beef sautéed with peas and onion to the dessert. The result still sounds better than the paper in question.

3 The astute reader will have realized that this is 12 days after it was supposedly submitted to Acta Psychopathologica. Everybody else is now also aware, thanks to this informative footnote.

4 I know you Americans like to put the month before the day, but that just looks wrong.

Not All Flexibility P-Hacking Is, Young Padawan

During a recent workshop on Sample Size Justification an early career researcher asked me: "You recommend sequential analysis in your paper for when effect sizes are uncertain, where researchers collect data, analyze the data, stop when a test is significant, or continue data collection when a test is not significant, and, I don't want to be rude, but isn't this p-hacking?"

In linguistics there is a term for when children apply a rule they have learned to instances where it does not apply: Overregularization. They learn 'one cow, two cows', and use the +s rule for plurals where it is not appropriate, such as 'one mouse, two mouses' (instead of 'two mice'). The early career researcher who asked me if sequential analysis was a form of p-hacking was also overregularizing. We teach young researchers that flexibly analyzing data inflates error rates, is called p-hacking, and is a very bad thing that was one of the causes of the replication crisis. So, they apply the rule 'flexibility in the data analysis is a bad thing' to cases where it does not apply, such as in the case of sequential analyses. Yes, sequential analyses give a lot of flexibility to stop data collection, but they do so while carefully controlling error rates, with the added bonus that they can increase the efficiency of data collection. This makes it a good thing, not p-hacking.
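
To make the error-rate claim concrete, here is a small simulation sketch (my own illustration, not taken from the workshop or any specific paper): a two-group design with one interim look at 50 participants per group and a final look at 100 per group, analyzed with a standard t-test. Testing both looks at a Pocock-corrected alpha of roughly 0.0294 keeps the overall Type 1 error rate near 5%, whereas naively testing both looks at 0.05 inflates it to roughly 8%.

set.seed(42)
nsim <- 10000
sig_corrected <- sig_naive <- logical(nsim)

for (i in seq_len(nsim)) {
  x <- rnorm(100)  # group 1, simulated under H0 (no true effect)
  y <- rnorm(100)  # group 2, simulated under H0
  p_interim <- t.test(x[1:50], y[1:50])$p.value  # look 1: n = 50 per group
  p_final   <- t.test(x, y)$p.value              # look 2: n = 100 per group
  sig_corrected[i] <- p_interim < 0.0294 || p_final < 0.0294  # Pocock boundary for 2 looks
  sig_naive[i]     <- p_interim < 0.05   || p_final < 0.05    # uncorrected optional stopping
}

mean(sig_corrected)  # ~0.05: the Type 1 error rate is controlled despite the interim look
mean(sig_naive)      # ~0.08: stopping at p < .05 at every look inflates the error rate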

 

Children increasingly use correct language the longer they are immersed in it. Many researchers are not yet immersed in an academic environment where they see flexibility in the data analysis applied correctly. Many are scared of doing things wrong, which risks becoming overly conservative, as the pendulum swings back too far from 'we are all p-hacking without realizing the consequences' to 'all flexibility is p-hacking'. Therefore, I patiently explain during workshops that flexibility is not bad per se, but that making claims without controlling your error rate is problematic.

In a recent podcast episode of ‘Quantitude’ one of the hosts shared a similar experience 5 minutes into the episode. A young student remarked that flexibility during the data analysis was ‘unethical’. The remainder of the podcast episode on ‘researcher degrees of freedom’ discussed how flexibility is part of data analysis. They clearly state that p-hacking is problematic, and opportunistic motivations to perform analyses that give you what you want to find should be constrained. But they then criticized preregistration in ways many people on Twitter disagreed with. They talk about ‘high priests’ who want to ‘stop bad people from doing bad things’ which they find uncomfortable, and say ‘you can not preregister every contingency’. They remark they would be surprised if data could be analyzed without requiring any on the fly judgment.

Although the examples they gave were not very good,1 it is of course true that researchers sometimes need to deviate from an analysis plan. Deviating from an analysis plan is not p-hacking. But when people talk about preregistration, we often see overregularization: "Preregistration requires specifying your analysis plan to prevent inflation of the Type 1 error rate, so deviating from a preregistration is not allowed." The whole point of preregistration is to transparently allow other researchers to evaluate the severity of a test, both when you stick to the preregistered statistical analysis plan and when you deviate from it. Some researchers have sufficient experience with the research they do that they can preregister an analysis that does not require any deviations2, and then readers can see that the Type 1 error rate for the study is at the level specified before data collection. Other researchers will need to deviate from their analysis plan because they encounter unexpected data. Some deviations reduce the severity of the test by inflating the Type 1 error rate. But other deviations actually get you closer to the truth. We can not know which is which. A reader needs to form their own judgment about this.

A final example of overregularization comes from a person who discussed a new study that they were preregistering with a junior colleague. They mentioned the possibility of including a covariate in an analysis but thought that was too exploratory to be included in the preregistration. The junior colleague remarked: “But now that we have thought about the analysis, we need to preregister it”. Again, we see an example of overregularization. If you want to control the Type 1 error rate in a test, preregister it, and follow the preregistered statistical analysis plan. But researchers can, and should, explore data to generate hypotheses about things that are going on in their data. You can preregister these, but you do not have to. Not exploring data could even be seen as research waste, as you are missing out on the opportunity to generate hypotheses that are informed by data. A case can be made that researchers should regularly include variables to explore (e.g., measures that are of general interest to peers in their field), as long as these do not interfere with the primary hypothesis test (and as long as these explorations are presented as such).

In the book "Reporting quantitative research in psychology: How to meet APA Style Journal Article Reporting Standards" by Cooper and colleagues from 2020 a very useful distinction is made between primary hypotheses, secondary hypotheses, and exploratory hypotheses. The first consist of the main tests you are designing the study for. The secondary hypotheses are also of interest when you design the study – but you might not have sufficient power to detect them. You did not design the study to test these hypotheses, and because the power for these tests might be low, you did not control the Type 2 error rate for secondary hypotheses. You can preregister secondary hypotheses to control the Type 1 error rate, as you know you will perform them, and if there are multiple secondary hypotheses, as Cooper et al. (2020) remark, readers will expect "adjusted levels of statistical significance, or conservative post hoc means tests, when you conducted your secondary analysis".
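
As a concrete (and hypothetical) illustration of that last point, not taken from Cooper et al.: with three preregistered secondary hypotheses, a Holm correction through R's p.adjust() is one common way to provide the adjusted significance levels readers will expect.

p_secondary <- c(0.012, 0.034, 0.21)    # hypothetical p-values for three secondary tests
p.adjust(p_secondary, method = "holm")  # 0.036 0.068 0.210: only the first survives at alpha = .05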

If you think of the possibility to analyze a covariate, but decide this is an exploratory analysis, you can decide to neither control the Type 1 error rate nor the Type 2 error rate. These are analyses, but not tests of a hypothesis, as any findings from these analyses have an unknown Type 1 error rate. Of course, that does not mean these analyses can not be correct in what they reveal – we just have no way to know the long run probability that exploratory conclusions are wrong. Future tests of the hypotheses generated in exploratory analyses are needed. But as long as you follow Journal Article Reporting Standards and distinguish exploratory analyses, readers know what they are getting. Exploring is not p-hacking.

People in psychology are re-learning the basic rules of hypothesis testing in the wake of the replication crisis. But because they are not yet immersed in good research practices, the lack of experience means they are overregularizing simplistic rules to situations where they do not apply. Not all flexibility is p-hacking, preregistered studies do not prevent you from deviating from your analysis plan, and you do not need to preregister every possible test that you think of. A good cure for overregularization is reasoning from basic principles. Do not follow simple rules (or what you see in published articles) but make decisions based on an understanding of how to achieve your inferential goal. If the goal is to make claims with controlled error rates, prevent Type 1 error inflation, for example by correcting the alpha level where needed. If your goal is to explore data, feel free to do so, but know these explorations should be reported as such. When you design a study, follow the Journal Article Reporting Standards and distinguish tests with different inferential goals.

 

1 E.g., they discuss having to choose between Student's t-test and Welch's t-test, depending on whether Levene's test indicates the assumption of homogeneity is violated, which is not best practice – just follow R, and use Welch's t-test by default.
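
For what it is worth, R's own default already does this; a two-line illustration with the built-in sleep data:

t.test(extra ~ group, data = sleep)                    # Welch's t-test (the default, var.equal = FALSE)
t.test(extra ~ group, data = sleep, var.equal = TRUE)  # Student's t-test, only if you insist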

2 But this is rare – only 2 out of 27 preregistered studies in Psychological Science made no deviations (https://royalsocietypublishing.org/doi/full/10.1098/rsos.211037). We can probably do a bit better if we only preregister predictions at a time when we really understand our manipulations and measures.

Why I care about replication studies

In 2009 I attended a European Social Cognition Network meeting in Poland. I only remember one talk from that meeting: A short presentation in a nearly empty room. The presenter was a young PhD student – Stephane Doyen. He discussed two studies where he tried to replicate a well-known finding in social cognition research related to elderly priming, which had shown that people walked more slowly after being subliminally primed with elderly related words, compared to a control condition.

His presentation blew my mind. But it wasn’t because the studies failed to replicate – it was widely known in 2009 that these studies couldn’t be replicated. Indeed, around 2007, I had overheard two professors in a corridor discussing the problem that there were studies in the literature everyone knew would not replicate. And they used this exact study on elderly priming as one example. The best solution the two professors came up with to correct the scientific record was to establish an independent committee of experts that would have the explicit task of replicating studies and sharing their conclusions with the rest of the world. To me, this sounded like a great idea.

And yet, in this small conference room in Poland, there was this young PhD student, acting as if we didn’t need specially convened institutions of experts to inform the scientific community that a study could not be replicated. He just got up, told us about how he wasn’t able to replicate this study, and sat down.

It was heroic.

If you’re struggling to understand why on earth I thought this was heroic, then this post is for you. You might have entered science in a different time. The results of replication studies are no longer communicated only face to face when running into a colleague in the corridor, or at a conference. But I was impressed in 2009. I had never seen anyone give a talk in which the only message was that an original effect didn’t stand up to scrutiny. People sometimes presented successful replications. They presented null effects in lines of research where the absence of an effect was predicted in some (but not all) tests. But I’d never seen a talk where the main conclusion was just: “This doesn’t seem to be a thing”.

On 12 September 2011 I sent Stephane Doyen an email. “Did you ever manage to publish some of that work? I wondered what has happened to it.” Honestly, I didn’t really expect that he would manage to publish these studies. After all, I couldn’t remember ever having seen a paper in the literature that was just a replication. So I asked, even though I did not expect he would have been able to publish his findings.

Surprisingly enough, he responded that the study would soon appear in press. I wasn’t fully aware of new developments in the publication landscape, where Open Access journals such as PlosOne published articles as long as the work was methodologically solid, and the conclusions followed from the data. I shared this news with colleagues, and many people couldn’t wait to read the paper: An article, in print, reporting the failed replication of a study many people knew to be not replicable. The excitement was not about learning something new. The excitement was about seeing replication studies with a null effect appear in print.

Regrettably, not everyone was equally excited. The publication also led to extremely harsh online comments from the original researcher about the expertise of the authors (e.g., suggesting that findings can fail to replicate due to "Incompetent or ill-informed researchers"), and the quality of PlosOne ("which quite obviously does not receive the usual high scientific journal standards of peer-review scrutiny"). This type of response happened again, and again, and again. Another failed replication led to a letter by the original authors that circulated over email among eminent researchers in the area, was addressed to the authors of the replication, and ended with "do yourself, your junior co-authors, and the rest of the scientific community a favor. Retract your paper."

Some of the historical record of discussions between researchers from around 2012-2015 survives online, in Twitter and Facebook discussions, and blogs. But recently, I started to realize that most early career researchers don't read about the replication crisis through these original materials, but through summaries, which don't give the same impression as having lived through these times. It was weird to see established researchers argue that people performing replications lacked expertise. That null results were never informative. That thanks to dozens of conceptual replications, the original theoretical point would still hold up even if direct replications failed. As time went by, it became even weirder to see that none of the researchers whose work was not corroborated in replication studies ever published a preregistered replication study to silence the critics. And why were there even two sides to this debate? Although most people agreed there was room for improvement and that replications should play some role in improving psychological science, there was no agreement on how this should work. I remember being surprised that a field was only now thinking about how to perform and interpret replication studies, given that we had been doing psychological research for more than a century.
 

I wanted to share this autobiographical memory, not just because I am getting old and nostalgic, but also because young researchers are most likely to learn about the replication crisis through summaries and high-level overviews. Summaries of history aren’t very good at communicating how confusing this time was when we lived through it. There was a lot of uncertainty, diversity in opinions, and lack of knowledge. And there were a lot of feelings involved. Most of those things don’t make it into written histories. This can make historical developments look cleaner and simpler than they actually were.

It might be difficult to understand why people got so upset about replication studies. After all, we live in a time where it is possible to publish a null result (e.g., in journals that only evaluate methodological rigor, but not novelty, journals that explicitly invite replication studies, and in Registered Reports). Don’t get me wrong: We still have a long way to go when it comes to funding, performing, and publishing replication studies, given their important role in establishing regularities, especially in fields that desire a reliable knowledge base. But perceptions about replication studies have changed in the last decade. Today, it is difficult to feel how unimaginable it used to be that researchers in psychology would share their results at a conference or in a scientific journal when they were not able to replicate the work by another researcher. I am sure it sometimes happened. But there was clearly a reason those professors I overheard in 2007 were suggesting to establish an independent committee to perform and publish studies of effects that were widely known to be not replicable.

As people started to talk about their experiences trying to replicate the work of others, the floodgates opened, and the scales fell from people's eyes. Let me tell you that, from my personal experience, we didn't call it a replication crisis for nothing. All of a sudden, many researchers who thought it was their own fault when they couldn't replicate a finding started to realize this problem was systemic. It didn't help that in those days it was difficult to communicate with people you didn't already know. Twitter (which is most likely the medium through which you learned about this blog post) launched in 2006, but up to 2010 hardly any academics used this platform. Back then, it wasn't easy to get information outside of the published literature. It's difficult to express how it feels when you realize 'it's not me – it's all of us'. Our environment influences which phenotypic traits express themselves. These experiences made me care about replication studies.

If you started in science when replications were at least somewhat more rewarded, it might be difficult to understand what people were making a fuss about in the past. It’s difficult to go back in time, but you can listen to the stories by people who lived through those times. Some highly relevant stories were shared after the recent multi-lab failed replication of ego-depletion (see tweets by Tom Carpenter and Dan Quintana). You can ask any older researcher at your department for similar stories, but do remember that it will be a lot more difficult to hear the stories of the people who left academia because most of their PhD consisted of failures to build on existing work.

If you want to try to feel what living through those times must have been like, consider this thought experiment. You attend a conference organized by a scientific society where all society members get to vote on who will be a board member next year. Before the votes are cast, the president of the society informs you that one of the candidates has been disqualified. The reason is that it has come to the society’s attention that this candidate selectively reported results from their research lines: The candidate submitted only those studies for publication that confirmed their predictions, and did not share studies with null results, even though these null results were well designed studies that tested sensible predictions. Most people in the audience, including yourself, were already aware of the fact that this person selectively reported their results. You knew publication bias was problematic from the moment you started to work in science, and the field knew it was problematic for centuries. Yet here you are, in a room at a conference, where this status quo is not accepted. All of a sudden, it feels like it is possible to actually do something about a problem that has made you feel uneasy ever since you started to work in academia.

You might live through a time where publication bias is no longer silently accepted as an unavoidable aspect of how scientists work, and if this happens, the field will likely have a very similar discussion as it did when it started to publish failed replication studies. And ten years later, a new generation will have been raised under different scientific norms and practices, where extreme publication bias is a thing of the past. It will be difficult to explain to them why this topic was a big deal a decade ago. But since you’re getting old and nostalgic yourself, you think that it’s useful to remind them, and you just might try to explain it to them in a 2 minute TikTok video.

History merely repeats itself. It has all been done before. Nothing under the sun is truly new.
Ecclesiastes 1:9

Thanks to Farid Anvari, Ruben Arslan, Noah van Dongen, Patrick Forscher, Peder Isager, Andrea Kis, Max Maier, Anne Scheel, Leonid Tiokhin, and Duygu Uygun for discussing this blog post with me (and in general for providing such a stimulating social and academic environment in times of a pandemic).

Requiring high-powered studies from scientists with resource constraints

This blog post is now included in the paper “Sample size justification” available at PsyArXiv. 

Underpowered studies make it very difficult to learn something useful from the studies you perform. Low power means you have a high probability of finding non-significant results, even when there is a true effect. Hypothesis tests with high rates of false negatives (concluding there is nothing, when there is something) become a malfunctioning tool. Low power is even more problematic combined with publication bias (shiny app). After repeated warnings over at least half a century, high quality journals are starting to ask authors who rely on hypothesis tests to provide a sample size justification based on statistical power.
The first time researchers use power analysis software, they typically think they are making a mistake, because the sample sizes required to achieve high power for hypothesized effects are much larger than the sample sizes they collected in the past. After double checking their calculations, and realizing the numbers are correct, a common response is that there is no way they are able to collect this number of observations.
Published articles on power analysis rarely tell researchers what they should do if they are hired on a 4 year PhD project where the norm is to perform between 4 and 10 studies that can cost at most 1000 euro each, learn about power analysis, and realize there is absolutely no way they will have the time and resources to perform high-powered studies, given that an effect size estimate from an unbiased registered report suggests the effect they are examining is half as large as they were led to believe based on a published meta-analysis from 2010. Facing a job market that under the best circumstances is a nontransparent marathon for uncertainty-fetishists, the prospect of high quality journals rejecting your work due to a lack of a solid sample size justification is not pleasant.
The reason that published articles do not guide you towards practical solutions for a lack of resources, is that there are no solutions for a lack of resources. Regrettably, the mathematics do not care about how small the participant payment budget is that you have available. This is not to say that you can not improve your current practices by reading up on best practices to increase the efficiency of data collection. Let me give you an overview of some things that you should immediately implement if you use hypothesis tests, and data collection is costly.
1) Use directional tests where relevant. Just following up statements such as 'we predict X is larger than Y' with a logically consistent test of that claim (e.g., a one-sided t-test) will easily give you an increase of 10% power in any well-designed study (see the sketch after this list). If you feel you need to give effects in both directions a non-zero probability, then at least use lopsided tests.
2) Use sequential analysis whenever possible. It’s like optional stopping, but then without the questionable inflation of the false positive rate. The efficiency gains are so great that, if you complain about the recent push towards larger sample sizes without already having incorporated sequential analyses, I will have a hard time taking you seriously.
3) Increase your alpha level. Oh yes, I am serious. Contrary to what you might believe, the recommendation to use an alpha level of 0.05 was not the sixth of the ten commandments – it is nothing more than, as Fisher calls it, a ‘convenient convention’. As we wrote in our Justify Your Alpha paper as an argument to not require an alpha level of 0.005: “without (1) increased funding, (2) a reward system that values large-scale collaboration and (3) clear recommendations for how to evaluate research with sample size constraints, lowering the significance threshold could adversely affect the breadth of research questions examined.” If you *have* to make a decision, and the data you can feasibly collect is limited, take a moment to think about how problematic Type 1 and Type 2 error rates are, and maybe minimize combined error rates instead of rigidly using a 5% alpha level.
4) Use within designs where possible. Especially when measurements are strongly correlated, this can lead to a substantial increase in power.
5) If you read this blog or follow me on Twitter, you’ll already know about 1-4, so let’s take a look at a very sensible paper by Allison, Allison, Faith, Paultre, & Pi-Sunyer from 1997: Power and money: Designing statistically powerful studies while minimizing financial costs (link). They discuss I) better ways to screen participants for studies where participants need to be screened before participation, II) assigning participants unequally to conditions (if the control condition is much cheaper than the experimental condition, for example), III) using multiple measurements to increase measurement reliability (or use well-validated measures, if I may add), and IV) smart use of (preregistered, I’d recommend) covariates.
6) If you are really brave, you might want to use Bayesian statistics with informed priors, instead of hypothesis tests. Regrettably, almost all approaches to statistical inferences become very limited when the number of observations is small. If you are very confident in your predictions (and your peers agree), incorporating prior information will give you a benefit. For a discussion of the benefits and risks of such an approach, see this paper by van de Schoot and colleagues.
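
As a rough illustration of points 1 and 4 (the numbers are my own, computed with the pwr package, and assume a smallest effect size of interest of d = 0.5):

library(pwr)

# Point 1: a directional prediction tested one-sided instead of two-sided (n = 50 per group)
pwr.t.test(d = 0.5, n = 50, sig.level = 0.05, type = "two.sample", alternative = "two.sided")$power  # ~0.70
pwr.t.test(d = 0.5, n = 50, sig.level = 0.05, type = "two.sample", alternative = "greater")$power    # ~0.80

# Point 4: a within design; with a correlation of 0.5 between measurements dz equals d,
# so 50 participants measured twice already outperform 100 participants measured once
pwr.t.test(d = 0.5, n = 50, sig.level = 0.05, type = "paired", alternative = "two.sided")$power      # ~0.93
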
Now if you care about efficiency, you might already have incorporated all these things. There is no way to further improve the statistical power of your tests, and by all plausible estimates of the effect sizes you can expect, or the smallest effect size you would be interested in, statistical power is low. Now what should you do?
What to do if best practices in study design won’t save you?
The first thing to realize is that you should not look at statistics to save you. There are no secret tricks or magical solutions. Highly informative experiments require a large number of observations. So what should we do then? The solutions below are, regrettably, a lot more work than making a small change to the design of your study. But it is about time we start to take them seriously. This is a list of solutions I see – but there is no doubt more we can/should do, so by all means, let me know your suggestions on twitter or in the comments.
1) Ask for a lot more money in your grant proposals.
Some grant organizations distribute funds to be awarded as a function of how much money is requested. If you need more money to collect informative data, ask for it. Obviously grants are incredibly difficult to get, but if you ask for money, include a budget that acknowledges that data collection is not as cheap as you hoped some years ago. In my experience, psychologists are often asking for much less money to collect data than other scientists. Increasing the requested funds for participant payment by a factor of 10 is often reasonable, given the requirements of journals to provide a solid sample size justification, and the more realistic effect size estimates that are emerging from preregistered studies.
2) Improve management.
If the implicit or explicit goals that you should meet are still the same now as they were 5 years ago, and you did not receive a miraculous increase in money and time to do research, then an update of the evaluation criteria is long overdue. I sincerely hope your manager is capable of this, but some ‘upward management’ might be needed. In the coda of Lakens & Evers (2014) we wrote “All else being equal, a researcher running properly powered studies will clearly contribute more to cumulative science than a researcher running underpowered studies, and if researchers take their science seriously, it should be the former who is rewarded in tenure systems and reward procedures, not the latter.” and “We believe reliable research should be facilitated above all else, and doing so clearly requires an immediate and irrevocable change from current evaluation practices in academia that mainly focus on quantity.” After publishing this paper, and despite the fact I was an ECR on a tenure track, I thought it would be at least principled if I sent this coda to the head of my own department. He replied that the things we wrote made perfect sense, instituted a recommendation to aim for 90% power in studies our department intends to publish, and has since then tried to make sure quality, and not quantity, is used in evaluations within the faculty (as you might have guessed, I am not on the job market, nor do I ever hope to be).
3) Change what is expected from PhD students.
When I did my PhD, there was the assumption that you performed enough research in the 4 years you are employed as a full-time researcher to write a thesis with 3 to 5 empirical chapters (with some chapters having multiple studies). These studies were ideally published, but at least publishable. If we consider it important for PhD students to produce multiple publishable scientific articles during their PhD's, this will greatly limit the types of research they can do. Instead of evaluating PhD students based on their publications, we can see the PhD as a time where researchers learn skills to become an independent researcher, and evaluate them not based on publishable units, but in terms of clearly identifiable skills. I personally doubt data collection is particularly educational after the 20th participant, and I would probably prefer to hire a post-doc who had well-developed skills in programming and statistics, and who broadly read the literature, than someone who used that time to collect participants 21 to 200. If we make it easier for PhD students to demonstrate their skill levels (which would include at least 1 well written article, I personally think) we can evaluate what they have learned in a more sensible manner than we do now. Currently, differences in the resources PhD students have at their disposal are a huge confound as we try to judge their skill based on their resume. Researchers at rich universities obviously have more resources – it should not be difficult to develop tools that allow us to judge the skills of people where resources are much less of a confound.
4) Think about the questions we collectively want answered, instead of the questions we can individually answer.
Our society has some serious issues that psychologists can help address. These questions are incredibly complex. I have long lost faith in the idea that a bottom-up organized scientific discipline that rewards individual scientists will manage to generate reliable and useful knowledge that can help to solve these societal issues. For some of these questions we need well-coordinated research lines where hundreds of scholars work together, pool their resources and skills, and collectively pursue answers to these important questions. And if we are going to limit ourselves in our research to the questions we can answer in our own small labs, these big societal challenges are not going to be solved. Call me a pessimist. There is a reason we resort to forming unions and organizations that have the goal of collectively coordinating what we do. If you greatly dislike team science, don't worry – there will always be options to make scientific contributions by yourself. But for now, there are almost no ways for scientists who want to pursue huge challenges in large, well-organized collectives of hundreds or thousands of scholars to do so (for a recent exception that proves my rule by remaining unfunded: see the Psychological Science Accelerator). If you honestly believe your research question is important enough to be answered, then get together with everyone who also thinks so, and pursue answers collectively. Doing so should, eventually (I know science funders are slow), also be more convincing as you ask for more resources to do the research (as in point 1).
If you are upset that as a science we lost the blissful ignorance surrounding statistical power, and are requiring researchers to design informative studies, which hits substantially harder in some research fields than in others: I feel your pain. I have argued against universally lower alpha levels for you, and have tried to write accessible statistics papers that make you more efficient without increasing sample sizes. But if you are in a research field where even best practices in designing studies will not allow you to perform informative studies, then you need to accept the statistical reality you are in. I have already written too long a blog post, even though I could keep going on about this. My main suggestions are to ask for more money, get better management, change what we expect from PhD students, and self-organize – but there is much more we can do, so do let me know your top suggestions. This will be one of the many challenges our generation faces, but if we manage to address it, it will lead to a much better science.

Justify Your Alpha by Minimizing or Balancing Error Rates

A preprint (“Justify Your Alpha: A Primer on Two Practical Approaches”) that extends the ideas in this blog post is available at: https://psyarxiv.com/ts4r6

In 1957 Neyman wrote: “it appears desirable to determine the level of significance in accordance with quite a few circumstances that vary from one particular problem to the next.” Despite this good advice, social scientists developed the norm to always use an alpha level of 0.05 as a threshold when making predictions. In this blog post I will explain how you can set the alpha level so that it minimizes the combined Type 1 and Type 2 error rates (thus efficiently making decisions), or balance Type 1 and Type 2 error rates. You can use this approach to justify your alpha level, and guide your thoughts about how to design studies more efficiently.

Neyman (1933) provides an example of the reasoning process he believed researchers should go through. He explains how a researcher might have derived an important hypothesis that H0 is true (there is no effect), and will not want to 'throw it aside too lightly'. This researcher would choose a low alpha level (e.g., 0.01). In another line of research, an experimenter might be interested in detecting factors that would lead to the modification of a standard law, where the "importance of finding some new line of development here outweighs any loss due to a certain waste of effort in starting on a false trail", and Neyman suggests setting the alpha level to, for example, 0.1.

Which is worse? A Type 1 Error or a Type 2 Error?

As you perform lines of research the data you collect are used as a guide to continue or abandon a hypothesis, to use one paradigm or another. One goal of well-designed experiments is to control the error rates as you make these decisions, so that you do not fool yourself too often in the long run.

Many researchers implicitly assume that Type 1 errors are more problematic than Type 2 errors. Cohen (1988) suggested a Type 2 error rate of 20%, and hence to aim for 80% power, but wrote ".20 is chosen with the idea that the general relative seriousness of these two kinds of errors is of the order of .20/.05, i.e., that Type I errors are of the order of four times as serious as Type II errors. This .80 desired power convention is offered with the hope that it will be ignored whenever an investigator can find a basis in his substantive concerns in his specific research investigation to choose a value ad hoc". More recently, researchers have argued that false negatives constitute a much more serious problem in science (Fiedler, Kutzner, & Krueger, 2012). I always ask my 3rd year bachelor students: What do you think? Is a Type 1 error in your next study worse than a Type 2 error?

Last year I listened to someone who decided whether new therapies would be covered by the German healthcare system. She discussed Eye Movement Desensitization and Reprocessing (EMDR) therapy. I knew that the evidence that the therapy worked was very weak. As the talk started, I hoped they had decided not to cover EMDR. They did, and the researcher convinced me this was a good decision. She said that, although no strong enough evidence was available that it works, the costs of the therapy (which can be done behind a computer) are very low, it was applied in settings where no really good alternatives were available (e.g., inside prisons), and risk of negative consequences was basically zero. They were aware of the fact that there was a very high probability that EMDR was a Type 1 error, but compared to the cost of a Type 2 error, it was still better to accept the treatment. Another of my favorite examples comes from Field et al. (2004) who perform a cost-benefit analysis on whether to intervene when examining if a koala population is declining, and show the alpha should be set at 1 (one should always assume a decline is occurring and intervene). 

Making these decisions is difficult – but it is better to think about them than to end up with error rates that do not reflect the errors you actually want to make. As Ulrich and Miller (2019) describe, the long run error rates you actually make depend on several unknown factors, such as the true effect size, and the prior probability that the null hypothesis is true. Despite these unknowns, you can design studies that have good error rates for an effect size you are interested in, given some sample size you are planning to collect. Let's see how.

Balancing or minimizing error rates

Mudge, Baker, Edge, and Houlahan (2012) explain how researchers might want to minimize the total combined error rate. If both Type 1 and Type 2 errors are costly, then it makes sense to optimally reduce both errors as you do studies. This would make decision making overall most efficient. You choose an alpha level that, when used in the power analysis, leads to the lowest combined error rate. For example, with a 5% alpha and 80% power, the combined error rate is 5+20 = 25%, and if power is 99% and the alpha is 5% the combined error rate is 1 + 5 = 6%. Mudge and colleagues show that increasing or reducing the alpha level can lower the combined error rate. This is one of the approaches we mentioned in our 'Justify Your Alpha' paper from 2018.
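
As a quick numerical illustration of what the lowest combined error rate means here (my own example, using the same design as later in this post: d = 0.5 with 100 participants per group), you can compute alpha + beta over a grid of alpha levels and look for the minimum:

library(pwr)
alphas <- seq(0.001, 0.2, by = 0.001)
betas  <- sapply(alphas, function(a)
  1 - pwr.t.test(d = 0.5, n = 100, sig.level = a, type = "two.sample", alternative = "two.sided")$power)
alphas[which.min(alphas + betas)]  # ~0.051: for this design the optimum lies close to 0.05
min(alphas + betas)                # ~0.11: the lowest achievable combined error rate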

When we wrote ‘Justify Your Alpha’ we knew it would be a lot of work to actually develop methods that people can use. For months, I would occasionally revisit the code Mudge and colleagues used in their paper, which is an adaptation of the pwr library in R, but the code was too complex and I could not get to the bottom of how it worked. After leaving this aside for some months, during which I improved my R skills, some days ago I took a long shower and suddenly realized that I did not need to understand the code by Mudge and colleagues. Instead of getting their code to work, I could write my own code from scratch. Such realizations are my justification for taking showers that are longer than is environmentally friendly.

If you want to balance or minimize error rates, the tricky thing is that the alpha level you set determines the Type 1 error rate, but through its influence on the statistical power, it also influences the Type 2 error rate. So I wrote a function that examines the range of possible alpha levels (from 0 to 1) and either minimizes the total error (Type 1 + Type 2) or minimizes the difference between the Type 1 and Type 2 error rates, balancing the error rates. It then returns the alpha (Type 1 error rate) and the beta (Type 2 error rate). You can enter any analytic power function that normally works in R and would output the calculated power.

Minimizing Error Rates

Below is the version of the optimal_alpha function used in this blog. Yes, I am defining a function inside another function and this could all look a lot prettier – but it works for now. I plan to clean up the code when I archive my blog posts on how to justify alpha level in a journal, and will make an R package when I do.
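
What follows is a minimal reconstruction based on the description in this post rather than the original listing (the exact values reported further down come from the original version, so treat this as a sketch). It accepts the power function as a string in which the alpha level is the variable x, and searches alpha levels between 0 and 1 with optimize():

optimal_alpha <- function(power_function, error = "minimal",
                          costT1T2 = 1, prior_H1H0 = 1) {
  # power_function: a string such as "pwr.t.test(..., sig.level = x, ...)$power",
  # in which 'x' is the alpha level that will be optimized
  errors_at <- function(x) {
    power <- eval(parse(text = power_function),
                  envir = list(x = x), enclos = globalenv())
    c(alpha = x, beta = 1 - power)
  }
  objective <- function(x) {
    e <- errors_at(x)
    if (error == "balance") {
      # find the alpha where the (weighted) Type 1 and Type 2 error rates are equal
      abs(costT1T2 * e[["alpha"]] - prior_H1H0 * e[["beta"]])
    } else {
      # "minimal": minimize the weighted combined error rate
      (costT1T2 * e[["alpha"]] + prior_H1H0 * e[["beta"]]) / (costT1T2 + prior_H1H0)
    }
  }
  best <- optimize(objective, interval = c(0.00001, 0.99999))
  out  <- errors_at(best$minimum)
  list(alpha = out[["alpha"]], beta = out[["beta"]])
}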


The code requires you to specify the power function (in a way that the code returns the power, hence the $power at the end) for your test, where the significance level is a variable 'x'. In this power function you specify the effect size (such as the smallest effect size you are interested in) and the sample size. In my experience, sometimes the sample size is determined by factors outside the control of the researcher. For example, you are working with existing data, or you are studying a sample whose size is limited (e.g., all students in a school). Other times, people have a maximum sample size they can feasibly collect, and accept the error rates that follow from this feasibility limitation. If your sample size is not limited, you can increase the sample size until you are happy with the error rates.

The code calculates the Type 2 error (1-power) across a range of alpha values. For example, we want to calculate the optimal alpha level for an independent t-test. Assume our smallest effect size of interest is d = 0.5, and we are planning to collect 100 participants in each group. We would normally calculate power as follows:

pwr.t.test(d = 0.5, n = 100, sig.level = 0.05, type = 'two.sample', alternative = 'two.sided')$power

This analysis tells us that we have 94% power with a 5% alpha level for our smallest effect size of interest, d = 0.5, when we collect 100 participants in each condition.

If we want to minimize our total error rates, we would enter this function in our optimal_alpha function (while replacing the sig.level argument with ‘x’ instead of 0.05, because we are varying the value to determine the lowest combined error rate).

res = optimal_alpha(power_function = "pwr.t.test(d=0.5, n=100, sig.level = x, type='two.sample', alternative='two.sided')$power")

res$alpha
## [1] 0.05101728
res$beta
## [1] 0.05853977

We see that an alpha level of 0.051 slightly improved the combined error rate, since it will lead to a Type 2 error rate of 0.059 for a smallest effect size of interest of d = 0.5. The combined error rate is 0.11. For comparison, lowering the alpha level to 0.005 would lead to a much larger combined error rate of 0.25.
What would happen if we had decided to collect 200 participants per group, or only 50? With 200 participants per group we would have more than 99% power for d = 0.5, and relatively speaking, a 5% Type 1 error with a 1% Type 2 error is slightly out of balance. In the age of big data, researchers nevertheless use such suboptimal error rates all the time, due to the mindless choice of an alpha level of 0.05. When power is large the combined error rates can be smaller if the alpha level is lowered. If we just replace 100 by 200 in the function above, we see the combined Type 1 and Type 2 error rate is the lowest if we set the alpha level to 0.00866. If you collect large amounts of data, you should really consider lowering your alpha level.

If the maximum sample size we were willing to collect was 50 per group, the optimal alpha level to reduce the combined Type 1 and Type 2 error rates is 0.13. This means that we would have a 13% probability of deciding there is an effect when the null hypothesis is true. This is quite high! However, if we had used a 5% Type 1 error rate, the power would have been 69.69%, with a 30.31% Type 2 error rate, while the Type 2 error rate is ‘only’ 16.56% after increasing the alpha level to 0.13. We increase the Type 1 error rate by 8%, to reduce the Type 2 error rate by 13.5%. This increases the overall efficiency of the decisions we make.
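
For reference, the numbers in this paragraph can be checked with calls along these lines (using the optimal_alpha sketch above, so the exact decimals may differ slightly from those reported here):

pwr.t.test(d = 0.5, n = 50, sig.level = 0.05, type = "two.sample", alternative = "two.sided")$power  # ~0.70

res_50 = optimal_alpha(power_function = "pwr.t.test(d=0.5, n=50, sig.level = x, type='two.sample', alternative='two.sided')$power")
res_50$alpha  # ~0.13
res_50$beta   # ~0.17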

This example relies on the pwr.t.test function in R, but any power function can be used. For example, the code to minimize the combined error rates for the power analysis for an equivalence test would be:

library(TOSTER)  # powerTOSTtwo() comes from the TOSTER package

res = optimal_alpha(power_function = "powerTOSTtwo(alpha=x, N=200, low_eqbound_d=-0.4, high_eqbound_d=0.4)")

Balancing Error Rates

You can choose to minimize the combined error rates, but you can also decide that it makes most sense to you to balance the error rates. For example, you think a Type 1 error is just as problematic as a Type 2 error, and therefore, you want to design a study that has balanced error rates for a smallest effect size of interest (e.g., a 5% Type 1 error rate and a 5% Type 2 error rate). Whether to minimize error rates or balance them can be specified in an additional argument in the function. The default is to minimize, but by adding error = "balance" an alpha level is given such that the Type 1 error rate equals the Type 2 error rate.

res = optimal_alpha(power_function = "pwr.t.test(d = 0.5, n = 100, sig.level = x, type = 'two.sample', alternative = 'two.sided')$power", error = "balance")

res$alpha
## [1] 0.05488516
res$beta
## [1] 0.05488402

Repeating our earlier example, the alpha level is now 0.055, such that the Type 2 error rate, given the smallest effect size of interest and the sample size, is also 0.055. Even though this does not minimize the overall error rates, I feel it is a justification strategy for your alpha level that often makes sense: if both Type 1 and Type 2 errors are equally problematic, we design a study in which we are just as likely to make either mistake for the effect size we care about.
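As a quick sanity check (a sketch, assuming the pwr package is loaded), plugging the balanced alpha level back into the power function shows that the two error rates indeed match:

pwr.t.test(d = 0.5, n = 100, sig.level = res$alpha, type = 'two.sample', alternative = 'two.sided')$power
# roughly 0.945, so the Type 2 error rate is roughly 0.055, matching the Type 1 error rate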

Relative costs and prior probabilities

So far we have assumed that a Type 1 error and a Type 2 error are equally problematic. But you might believe Cohen (1988) was right, and Type 1 errors are exactly 4 times as bad as Type 2 errors. Or you might think they are twice as problematic, or 10 times as problematic. However you weigh them, as explained by Mudge et al. (2012) and Miller and Ulrich (2019), you should incorporate those weights into your decisions.

The function has another optional argument, costT1T2, that allows you to specify the relative cost of Type 1 versus Type 2 errors. By default this is set to 1, but you can set it to 4 (or any other value) such that Type 1 errors are considered 4 times as costly as Type 2 errors. This changes the weight of Type 1 errors relative to Type 2 errors, and thus also the choice of the best alpha level.

res = optimal_alpha(power_function = "pwr.t.test(d = 0.5, n = 100, sig.level = x, type = 'two.sample', alternative = 'two.sided')$power", error = "minimal", costT1T2 = 4)

res$alpha
## [1] 0.01918735
res$beta
## [1] 0.1211773

Now, the alpha level that minimizes the weighted Type 1 and Type 2 error rates is 0.019.

Similarly, you can take into account prior probabilities that either the null hypothesis is true (and you will observe a Type 1 error) or the alternative hypothesis is true (and you will observe a Type 2 error). By incorporating these expectations, you can minimize or balance error rates in the long run (assuming your priors are correct). Priors can be specified with the prior_H1H0 argument, which by default is 1 (H1 and H0 are equally likely). Setting it to 4 means you think the alternative hypothesis (and hence Type 2 errors) is 4 times more likely than the null hypothesis (and Type 1 errors).

res = optimal_alpha(power_function = "pwr.t.test(d = 0.5, n = 100, sig.level = x, type = 'two.sample', alternative = 'two.sided')$power", error = "minimal", prior_H1H0 = 2)

res$alpha
## [1] 0.07901679
res$beta
## [1] 0.03875676

If you think H1 is twice as likely to be true as H0, as in the call above, you need to worry less about Type 1 errors, and the alpha level that minimizes the weighted error rates is now 0.079. It is always difficult to decide on priors (unless you are Omniscient Jones), but even if you ignore them, you are implicitly making the decision that H1 and H0 are equally plausible.

Conclusion

You can’t abandon a practice without an alternative. Minimizing the combined error rate, or balancing error rates, provides two alternatives to the normative practice of setting the alpha level to 5%. Together with the approach of lowering the alpha level as a function of the sample size, I invite you to explore ways to set error rates based on something other than convention. A downside of abandoning mindless statistics is that you need to think about difficult questions. How much more problematic is a Type 1 error than a Type 2 error? Do you have any idea about the prior probabilities? And what is the smallest effect size of interest? Answering these questions is difficult, but considering them is important for any study you design. The experiments you design might very well be more informative, and more efficient. So give it a try.

References
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed). Hillsdale, N.J: L. Erlbaum Associates.
Fiedler, K., Kutzner, F., & Krueger, J. I. (2012). The Long Way From α-Error Control to Validity Proper: Problems With a Short-Sighted False-Positive Debate. Perspectives on Psychological Science, 7(6), 661–669. https://doi.org/10.1177/1745691612462587
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x
Miller, J., & Ulrich, R. (2019). The quest for an optimal alpha. PLOS ONE, 14(1), e0208631. https://doi.org/10.1371/journal.pone.0208631
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734

The New Heuristics

You can derive the age of a researcher based on the sample size they were told to use in a two independent group design. When I started my PhD, this number was 15, and when I ended, it was 20. This tells you I did my PhD between 2005 and 2010. If your number was 10, you have been in science much longer than I have, and if your number is 50, good luck with the final chapter of your PhD.
All these numbers are only sporadically the sample size you really need. As with a clock stuck at 9:30 in the morning, heuristics are sometimes right, but most often wrong. I think we rely way too often on heuristics for all sorts of important decisions we make when we do research. You can easily test whether you rely on a heuristic, or whether you can actually justify a decision you make. Ask yourself: Why?
I vividly remember talking to a researcher in 2012, a time when it started to become clear that many of the heuristics we relied on were wrong, and there was a lot of uncertainty about what good research practices looked like. She said: ‘I just want somebody to tell me what to do’. As psychologists, we work in a science where the answer to almost every research question is ‘it depends’. It should not be a surprise that the same holds for how you design a study. For example, Neyman & Pearson (1933) perfectly illustrate how a statistician can explain the choices that need to be made, but in the end, only the researcher can make the final decision.
Due to a lack of training, most researchers do not have the skills to make these decisions. They need help, but they do not always have access to someone who can provide it. It is therefore not surprising that articles and books that explain how to use a useful tool provide some heuristics to get researchers started. An excellent example of this is Cohen’s classic work on power analysis. Although you need to think about the statistical power you want, as a heuristic, a minimum power of 80% is recommended. Cohen (1988) offers this value as a convention to use when the researcher has no other basis for setting the desired power, and expresses the hope that it will be ignored whenever there is such a basis. It is rarely ignored. Note that we have a meta-heuristic here: the 80% benchmark follows from Cohen’s argument that a Type 1 error is 4 times as serious as a Type 2 error, combined with a Type 1 error rate of 5%. Why 5%? According to Fisher (1935), because it is a ‘convenient convention’. We are building a science on heuristics built on heuristics.
There has been a lot of discussion about how we need to improve psychological science in practice, and about what good research practices look like. In my view, we will not make real progress if we simply replace old heuristics with new heuristics. People regularly complain to me about researchers who use what I would like to call ‘The New Heuristics’ (instead of The New Statistics), or ask me to help them write a rebuttal to a reviewer who is too rigidly applying a new heuristic. Let me give some recent examples.
People who used optional stopping in the past, and have learned that this is p-hacking, think you cannot look at the data as they come in (you can, when done correctly, using sequential analyses; see Lakens, 2014). People make directional predictions, but test them with two-sided tests (even though you can preregister your directional prediction). They think you need 250 participants (as an editor of a flagship journal claimed), even though there is no magical number that leads to high enough accuracy. They think you always need to justify sample sizes based on a power analysis (as a reviewer of a grant proposal claimed when rejecting the proposal), even though there are many ways to justify sample sizes. They argue meta-analysis is not a ‘valid technique’ merely because the meta-analytic estimate can be biased (ignoring that meta-analyses have many uses, including the analysis of heterogeneity, and that all tests can be biased). They think all research should be preregistered or published as Registered Reports, even when the main benefit (preventing inflation of error rates for hypothesis tests due to flexibility in the data analysis) is not relevant for all the research psychologists do. They think p-values are invalid and should be removed from scientific articles, even though in well-designed, controlled experiments they might be exactly the outcome of interest, especially early on in new research lines. I could go on.
Change is like a pendulum, swinging from one side to the other of a multi-dimensional space. People might be too loose, or too strict, too risky, or too risk-averse, too sexy, or too boring. When there is a response to newly identified problems, we often see people overreacting. If you can’t justify your decisions, you will just be pushed from one extreme on one of these dimensions to the opposite extreme. What you need is the weight of a solid justification to be able to resist being pulled in the direction of whatever you perceive to be the current norm. Learning The New Heuristics (for example setting the alpha level to 0.005 instead of 0.05) is not an improvement – it is just a change.
If we teach people The New Heuristics, we will get lost in the Bog of Meaningless Discussions About Why These New Norms Do Not Apply To Me. This is a waste of time. From a good justification it logically follows whether something applies to you or not. Don’t discuss heuristics – discuss justifications.
‘Why’ questions come at different levels. Surface level ‘why’ questions are explicitly left to the researcher – no one else can answer them. Why are you collecting 50 participants in each group? Why are you aiming for 80% power? Why are you using an alpha level of 5%? Why are you using this prior when calculating a Bayes factor? Why are you assuming equal variances and using Student’s t-test instead of Welch’s t-test? Part of the problem I am addressing here is that we do not discuss which questions are up to the researcher, and which are questions on a deeper level that you can simply accept without needing to provide a justification in your paper. This makes it relatively easy for researchers to pretend some ‘why’ questions are on a deeper level, and can be assumed without having to be justified. A field needs a continuing discussion about what we expect researchers to justify in their papers (for example by developing improved and detailed reporting guidelines). This will be an interesting discussion to have. For now, let’s limit ourselves to surface level questions that were always left up to researchers to justify (even though some researchers might not know any better than using a heuristic). In the spirit of the name of this blog, let’s focus on 20% of the problems that will improve 80% of what we do.
My new motto is ‘Justify Everything’ (it also works as a hashtag: #JustifyEverything). Your first response will be that this is not possible. You will think this is too much to ask. This is because you think that you will have to be able to justify everything. But that is not my view on good science. You do not have the time to learn enough to be able to justify all the choices you need to make when doing science. Instead, you could be working in a team of as many people as you need so that within your research team, there is someone who can give an answer if I ask you ‘Why?’. As a rule of thumb, a large enough research team in psychology has between 50 and 500 researchers, because that is how many people you need to make sure one of the researchers is able to justify why research teams in psychology need between 50 and 500 researchers.
Until we have transitioned into a more collaborative psychological science, we will be limited in how much and how well we can justify our decisions in our scientific articles. But we will be able to improve. Many journals are starting to require sample size justifications, which is a great example of what I am advocating for. Expert peer reviewers can help by pointing out where heuristics are used, but justifications are possible (preferably in open peer review, so that the entire community can learn). The internet makes it easier than ever before to ask other people for help and advice. And as with anything in a job as difficult as science, just get started. The #Justify20% hashtag will work just as well for now.