The Value of Preregistration for Psychological Science: A Conceptual Analysis

This blog is an excerpt of an invited journal article for a special issue of Japanese Psychological Review, which I am currently one week overdue on (but hope to complete soon). I hope this paper will raise the bar in the ongoing discussion about the value of preregistration in psychological science. If you have any feedback on what I wrote here, I would be very grateful to hear it, as it would allow me to improve the paper I am working on. If we want to fruitfully discuss preregistration, researchers need to provide a clear conceptual definition of preregistration, anchored in their philosophy of science.
For as long as data has been used to support scientific claims, people have tried to selectively present data in line with what they wish to be true. In his treatise ‘On the Decline of Science in England: And on Some of Its Causes’ Babbage (1830) discusses what he calls cooking: “One of its numerous processes is to make multitudes of observations, and out of these to select those only which agree or very nearly agree. If a hundred observations are made, the cook must be very unlucky if he cannot pick out fifteen or twenty that will do up for serving.” In the past, researchers have proposed solutions to prevent bias in the literature. With the rise of the internet it has become feasible to create online registries that ask researchers to specify their research design and the planned analyses. Scientific communities have started to make use of this opportunity (for a historical overview, see Wiseman, Watt, & Kornbrot, 2019).
Preregistration in psychology has been a good example of ‘learning by doing’. Best practices are continuously updated as we learn from practical challenges and early meta-scientific investigations into how preregistrations are performed. At the same time, discussions have emerged about what the goal of preregistration is, whether preregistration is desirable, and what preregistration should look like across different research areas. Every practice comes with costs and benefits, and it is useful to evaluate whether and when preregistration is worth it. Finally, it is important to evaluate how preregistration relates to different philosophies of science, and when it facilitates or distracts from goals scientists might have. The discussion about the benefits and costs of preregistration has not been productive up to now because there is a general lack of a conceptual analysis of what preregistration entails and aims to accomplish, which leads to disagreements that would be easily resolved if a conceptual definition were available. Any conceptual definition of a tool that scientists use, such as preregistration, must examine the goals it achieves, and thus requires a clearly specified view on philosophy of science, which provides an analysis of different goals scientists might have. Discussing preregistration without discussing philosophy of science is a waste of time.

What is Preregistration For?

The goal of preregistration is to transparently prevent bias due to the selective reporting of analyses. Since bias in estimates only occurs in relation to a true population parameter, preregistration as discussed here is limited to scientific questions that involve estimates of population values from samples. Researchers can have many different goals when collecting data, perhaps most notably theory development, as opposed to tests of statistical predictions derived from theories. When testing predictions, researchers might want a specific analysis to yield a null effect, for example to show that including a possible confound in an analysis does not change their main results. More often perhaps, they want an analysis to yield a statistically significant result, for example so that they can argue the results support their prediction, based on a p-value below 0.05. Both desires are sources of bias in the estimate of a population effect size. In this paper I will assume researchers use frequentist statistics, but all arguments can be generalized to Bayesian statistics (Gelman & Shalizi, 2013). When effect size estimates are biased, for example due to the desire to obtain a statistically significant result, hypothesis tests performed on these estimates have inflated Type 1 error rates, and when bias emerges due to the desire to obtain a non-significant test result, hypothesis tests have reduced statistical power. In line with the general tendency to weigh Type 1 error rates (the probability of obtaining a statistically significant result when there is no true effect) as more serious than Type 2 error rates (the probability of obtaining a non-significant result when there is a true effect), publications that discuss preregistration have been more concerned with inflated Type 1 error rates than with low power. However, one can easily think of situations where the latter is a bigger concern.
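To make the notion of bias concrete, the following minimal simulation (my own illustration, not part of the paper; the true effect, sample size, and number of studies are arbitrary assumptions) shows how selectively reporting only the significant results inflates effect size estimates relative to the true population value.

```python
# A minimal simulation of bias through selective reporting: many studies
# estimate a small true effect, but only the 'significant' ones are reported.
# The average reported effect size then overestimates the population value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, n_studies = 0.2, 30, 10_000  # assumptions chosen for illustration

observed_d = np.empty(n_studies)
significant = np.empty(n_studies, dtype=bool)
for i in range(n_studies):
    x = rng.normal(true_d, 1, n)              # one-sample design, SD = 1
    observed_d[i] = x.mean() / x.std(ddof=1)  # Cohen's d for a one-sample test
    significant[i] = stats.ttest_1samp(x, 0).pvalue < .05

print(f"True effect size:                d = {true_d:.2f}")
print(f"Mean d across all studies:       d = {observed_d.mean():.2f}")
print(f"Mean d of 'significant' studies: d = {observed_d[significant].mean():.2f}")
```

With these settings the significant subset typically overestimates the true effect by a factor of two or more, which is exactly the kind of bias that transparent reporting of all planned analyses is meant to prevent.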
If the only goal of a researcher is to prevent bias, it suffices to make a mental note of the planned analyses, or to verbally agree upon the planned analysis with collaborators, assuming we will perfectly remember our plans when analyzing the data. The reason to write down an analysis plan is not to prevent bias, but to transparently prevent bias. By including transparency in the definition of preregistration it becomes clear that the main goal of preregistration is to convince others that the reported analysis tested a clearly specified prediction. Not all approaches to knowledge generation value prediction, and it is important to evaluate whether your philosophy of science values prediction in order to decide whether preregistration is a useful tool in your research. Mayo (2018) presents an overview of different arguments for the role prediction plays in science and arrives at a severity requirement: we can build on claims that passed tests that were highly capable of demonstrating the claim was false, but that supported the prediction nevertheless. This requires that researchers who read about claims are able to evaluate the severity of a test. Preregistration facilitates this.
Although falsifying theories is a complex issue, falsifying statistical predictions is straightforward. Researchers can specify when they will interpret data as support for their claim based on the result of a statistical test, and when not. An example is a directional (or one-sided) t-test testing whether an observed mean is larger than zero. Observing a value statistically smaller than or equal to zero would falsify this statistical prediction (as long as the statistical assumptions of the test hold, and with some error rate in frequentist approaches to statistics). In practice, only range predictions can be statistically falsified: because resources and measurement accuracy are not infinitely large, there is always a value close enough to zero that is statistically impossible to distinguish from zero. Therefore, researchers will need to specify at least some possible outcomes that would not be considered support for their prediction, and that statistical tests can pick up on. How such bounds are determined is a massively understudied problem in psychology, but it is essential to have falsifiable predictions.
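As a minimal sketch of what such a falsifiable range prediction could look like in practice (my own illustration, with an arbitrarily chosen smallest effect of interest, not a recommendation from the paper), the test below only counts the prediction as supported if the mean is statistically larger than zero, and as falsified if the mean is statistically smaller than the smallest effect deemed worth caring about.

```python
# A sketch of a falsifiable range prediction: support requires a mean
# statistically larger than zero; falsification requires a mean statistically
# smaller than a prespecified smallest effect of interest (here 0.5, an
# arbitrary assumption for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(0.1, 1.0, 100)        # simulated outcome scores; true effect is tiny
smallest_effect_of_interest = 0.5    # would be justified and preregistered in advance

p_support = stats.ttest_1samp(x, 0.0, alternative="greater").pvalue
p_falsify = stats.ttest_1samp(x, smallest_effect_of_interest, alternative="less").pvalue

if p_support < .05:
    print("Prediction supported: the mean is statistically larger than zero.")
elif p_falsify < .05:
    print("Prediction falsified: the mean is statistically smaller than the "
          "smallest effect of interest.")
else:
    print("Inconclusive: the data neither support nor falsify the prediction.")
```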
While the bounds of a range prediction enable statistical falsification, specifying these bounds is not enough to evaluate how capable a test was of demonstrating that a claim was wrong. Meehl (1990) argues that we are more impressed by a prediction the more ways the prediction could have been wrong. He writes (1990, p. 128): “The working scientist is often more impressed when a theory predicts something within, or close to, a narrow interval than when it predicts something correctly within a wide one.” Imagine making a prediction about where a dart will land if I throw it at a dartboard. You will be more impressed with my darts skills if I predict I will hit the bullseye, and I hit the bullseye, than when I predict I will hit the dartboard, and I hit the dartboard. Making very narrow range predictions is a way to make it statistically likely that your prediction will be falsified if it is wrong. It is also possible to make theoretically risky predictions, for example by predicting you will only observe a statistically significant difference from zero in a hypothesis test if a very specific set of experimental conditions is met that all follow from a single theory. Regardless of how researchers increase the capability of a test to show that a prediction is wrong, the approach to scientific progress described here places more faith in claims based on predictions that have a higher capability of being falsified, but where the data nevertheless support the prediction. Anyone is free to choose a different philosophy of science, and create a coherent analysis of the goals of preregistration in that framework, but as far as I am aware, Mayo’s severity argument currently provides one of the few philosophies of science that allows for a coherent conceptual analysis of the value of preregistration.
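The dartboard intuition can be quantified with a small simulation (my own illustration, resting on an arbitrary assumption about where effects land when the prediction is wrong): a narrow range prediction is far less likely to be corroborated by chance alone than a wide one, which is what makes its success more impressive.

```python
# Quantifying the dartboard analogy: if the prediction were wrong and the
# observed effect simply landed somewhere in a plausible range, how often
# would a wide versus a narrow prediction be corroborated by luck alone?
import numpy as np

rng = np.random.default_rng(0)
# Assumption for illustration: under "the prediction is wrong", standardized
# effects land uniformly anywhere between -1 and 1.
effects = rng.uniform(-1, 1, 100_000)

wide_hit = np.mean(effects > 0)                            # "some positive effect"
narrow_hit = np.mean((effects > 0.45) & (effects < 0.55))  # "d between .45 and .55"

print(f"Chance of corroborating the wide prediction:   {wide_hit:.2f}")   # ~.50
print(f"Chance of corroborating the narrow prediction: {narrow_hit:.2f}") # ~.05
```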
Researchers admit to research practices that make their predictions, or the empirical support for their predictions, look more impressive than they are. One example of such a practice is optional stopping, where researchers collect a number of datapoints, perform statistical analyses, and continue the data collection if the result is not statistically significant. In theory, a researcher who is willing to continue collecting data indefinitely will always find a statistically significant result. By repeatedly looking at the data, the Type 1 error rate can inflate to 100%. Even though in practice the inflation will be smaller, optional stopping strongly increases the probability that a researcher can interpret their result as support for their prediction. In the extreme case, where a researcher is 100% certain that they will observe a statistically significant result when they perform their statistical test, their prediction will never be falsified. Providing support for a claim by relying on optional stopping should not increase our faith in the claim by much, or even at all. As Mayo (2018, p. 222) writes: “The good scientist deliberately arranges inquiries so as to capitalize on pushback, on effects that will not go away, on strategies to get errors to ramify quickly and force us to pay attention to them. The ability to register how hunting, optional stopping, and cherry picking alter their error-probing capacities is a crucial part of a method’s objectivity.” If researchers were to transparently register their data collection strategy, readers could evaluate the capability of the test to falsify the prediction, conclude this capability is very small, and be relatively unimpressed by the study. If the stopping rule keeps the probability of finding a non-significant result high when the prediction is incorrect, and the data nevertheless support the prediction, we can choose to act as if the claim is correct because it has been severely tested. Preregistration thus functions as a tool that allows other researchers to transparently evaluate the severity with which a claim has been tested.
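How much optional stopping inflates the Type 1 error rate is easy to check with a short simulation (my own illustration; the batch size and maximum number of looks are arbitrary assumptions): under a true null effect, testing after every batch and stopping at the first significant result pushes the long-run error rate far above the nominal 5%.

```python
# Optional stopping under a true null effect: test after every batch of 10
# observations, stop at the first p < .05, up to 10 looks. The long-run
# Type 1 error rate ends up well above the nominal alpha of .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, batch_size, max_looks, alpha = 5_000, 10, 10, .05

false_positives = 0
for _ in range(n_sims):
    data = np.empty(0)
    for _ in range(max_looks):
        data = np.append(data, rng.normal(0, 1, batch_size))  # true effect is zero
        if stats.ttest_1samp(data, 0).pvalue < alpha:
            false_positives += 1
            break

print(f"Type 1 error rate with optional stopping: {false_positives / n_sims:.3f}")
# Typically around .17-.20 with ten looks, instead of the promised .05.
```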
The severity of a test can also be compromised by selecting a hypothesis based on the observed results. In this practice, known as Hypothesizing After the Results are Known (HARKing; Kerr, 1998), researchers look at their data, and then select a prediction. This reversal of the typical hypothesis testing procedure makes the test incapable of demonstrating that the claim was false. Mayo (2018) refers to this as ‘bad evidence, no test’. If we choose a prediction from among the options that yield a significant result, the claims we make based on these ‘predictions’ will never be wrong. In philosophies of science that value predictions, such claims do not increase our confidence that the claim is true, because it has not yet been tested. By preregistering our predictions, we transparently communicate to readers that our predictions predated looking at the data, and therefore that the data we present as support for our prediction could have falsified our hypothesis. We have not made our test look more severe by narrowing the range of our predictions after looking at the data (like the Texas sharpshooter who draws the circles of the bullseye after shooting at the wall of the barn). A reader can transparently evaluate how severely our claim was tested.
As a final example of the value of preregistration in transparently allowing readers to evaluate the capability of our prediction to be falsified, think about the scenario described by Babbage at the beginning of this article, where a researcher makes multitudes of observations and selects from among all these tests only those that support their prediction. The larger the number of observations to choose from, the higher the probability that one of the possible tests could be presented as support for the hypothesis. Therefore, from a perspective on scientific knowledge generation where severe tests are valued, selectively reporting tests from among the many tests that were performed strongly reduces the capability of a test to demonstrate the claim was false. This can be prevented by correcting for multiple testing, for example by lowering the alpha level depending on the number of tests.
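A short simulation (my own illustration, with an arbitrary number of outcome measures) makes Babbage’s ‘cooking’ concrete: with twenty independent outcome measures and no true effects at all, most studies still contain at least one ‘significant’ test to report, while lowering the alpha level in line with the number of tests (a Bonferroni correction) restores the intended error rate.

```python
# Selective reporting from many tests: 20 independent outcome measures, all
# with a true effect of zero. Reporting any significant test gives a false
# positive in most studies; a Bonferroni-corrected alpha (alpha / 20) does not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, n_tests, n, alpha = 5_000, 20, 50, .05

any_sig_uncorrected = 0
any_sig_corrected = 0
for _ in range(n_sims):
    pvals = np.array([stats.ttest_1samp(rng.normal(0, 1, n), 0).pvalue
                      for _ in range(n_tests)])
    any_sig_uncorrected += (pvals < alpha).any()
    any_sig_corrected += (pvals < alpha / n_tests).any()

print(f"At least one 'significant' test, uncorrected: {any_sig_uncorrected / n_sims:.2f}")  # ~.64
print(f"At least one 'significant' test, corrected:   {any_sig_corrected / n_sims:.2f}")    # ~.05
```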
The fact that preregistration is about specifying ways in which your claim could be false is not generally appreciated. Preregistrations should carefully specify not just the analysis researchers plan to perform, but also when they would infer from the analyses that their prediction was wrong. As the preceding section explains, successful predictions impress us more when the data that were collected were capable of falsifying the prediction. Therefore, a preregistration document should give us all the information required to evaluate the severity of the test. Specifying exactly which test will be performed on the data is important, but not enough. Researchers should also specify when they will conclude the prediction was not supported. Beyond specifying the analysis plan in detail, the severity of a test can be increased by narrowing the range of values that are predicted (without increasing the Type 1 and Type 2 error rates), or by making the theoretical prediction more specific, specifying the detailed circumstances under which the effect will be observed, and when it will not be observed.

When Is Preregistration Valuable?

If one agrees with the conceptual analysis above, it follows that preregistration adds value for people who choose to increase their faith in claims that are supported by severe tests and predictive successes. Whether this seems reasonable depends on your philosophy of science. Preregistration itself does not make a study better or worse compared to a non-preregistered study. Sometimes, being able to transparently evaluate a study (and its capability to demonstrate claims were false) will reveal that the study was completely uninformative. Other times we might be able to evaluate the capability of a study to demonstrate a claim was false even if the study is not transparently preregistered. Examples are studies where there is no room for bias, because the analyses are perfectly constrained by theory, or because it is not possible to analyze the data in any other way than was reported. Although the severity of a test is in principle unrelated to whether it is preregistered or not, in practice there will be a positive correlation, driven by those studies where transparent preregistration improves our ability to evaluate how capable they were of demonstrating a claim was false: studies with multiple dependent variables to choose from, studies that do not use a standardized measurement scale so that the dependent variable can be calculated in different ways, or studies where additional data are easily collected, to name a few.
We can apply our conceptual analysis of preregistration to hypothetical real-life situations to gain a better insight into when preregistration is a valuable tool, and when not. For example, imagine a researcher who preregisters an experiment where the main analysis tests a linear relationship between two variables. This test yields a non-significant result, thereby failing to support the prediction. In an exploratory analysis the authors find that fitting a polynomial model yields a significant test result with a low p-value. A reviewer of their manuscript has studied the same relationship, albeit in a slightly different context and with another measure, and has unpublished data from multiple studies that also yielded polynomial relationships. The reviewer also has a tentative idea about the underlying mechanism that causes not a linear, but a polynomial, relationship. The original authors will be of the opinion that the claim of a polynomial relationship has passed a less severe test than their original prediction of a linear relationship would have passed (had it been supported). However, the reviewer would never have preregistered a linear relationship to begin with, and therefore does not evaluate the switch to a polynomial test in the exploratory results section as something that reduces the severity of the test. Given that the experiment was well designed, the test for a polynomial relationship will be judged as having greater severity by the reviewer than by the authors. In this hypothetical example the reviewer has additional data that would have changed the hypothesis they would have preregistered in the original study. It is also possible that the difference in evaluation of the exploratory test for a polynomial relationship is based purely on a subjective prior belief, or on knowledge of an existing well-supported theory that would predict a polynomial, but not a linear, relationship.
Now imagine that our reviewer asks for the raw data to test whether their assumed underlying mechanism is supported. They receive the dataset, and looking through the data and the preregistration, the reviewer realizes that the original authors didn’t adhere to their preregistered analysis plan. They violated their stopping rule, analyzing the data in batches of four and stopping earlier than planned. They did not carefully specify how to compute their dependent variable in the preregistration, and although the reviewer has no experience with the measure that was used, the dataset contains eight ways in which the dependent variable was calculated. Only one of these eight ways of calculating the dependent variable yields a significant effect for the polynomial relationship. Faced with this additional information, the reviewer believes it is much more likely that the analysis testing the claim was the result of selective reporting, and is now of the opinion that the polynomial relationship was not severely tested.
Both of these evaluations of how severely a hypothesis was tested were perfectly reasonable, given the information the reviewer had available. This reveals how sometimes switching from a preregistered analysis to an exploratory analysis does not impact the evaluation of the severity of the test by a reviewer, while in other cases a selectively reported result does reduce the perceived severity with which a claim has been tested. Preregistration makes more information available to readers that can be used to evaluate the severity of a test, but readers might not always evaluate the information in a preregistration in the same way. Whether a design or analytic choice increases or decreases the capability of a claim to be falsified depends on statistical theory, as well as on prior beliefs about the theory that is tested. Some practices are known to reduce the severity of tests, such as optional stopping and selectively reporting analyses that yield desired results, and therefore it is relatively easy to evaluate how such statistical practices impact the severity with which a claim is tested. If a preregistration is followed through exactly as planned, then the tests that are performed have the desired error rates in the long run, as long as the test assumptions are met. Note that because long-run error rates are based on assumptions about the data generating process, which are never known, true error rates are unknown, and thus preregistration only makes it relatively more likely that tests have the desired long-run error rates. The severity of a test also depends on assumptions about the underlying theory, and on how the theoretical hypothesis is translated into a statistical hypothesis. There will rarely be unanimous agreement on whether a specific operationalization is a better or worse test of a hypothesis, and thus researchers will differ in their evaluation of how severely specific design choices test a claim. This once more highlights how preregistration does not automatically increase the severity of a test. When it prevents practices that are known to reduce the severity of tests, such as optional stopping, preregistration leads to a relative increase in the severity of a test compared to a non-preregistered study. But when there is no objective evaluation of the severity of a test, as is often the case when we try to judge how severe a test was on theoretical grounds, preregistration merely enables a transparent evaluation of the capability of a claim to be falsified.

Could #Blockchain provide the technical fix to solve science’s reproducibility crisis?

Soenke Bartling and Benedikt Fecher on the use of blockchain technology in research.

Blockchain is currently being hyped. Many claim that the blockchain revolution will affect not only our online life, but will profoundly change many more aspects of our society. Many foresee these changes as potentially being more far-reaching than those brought by the internet in the last two decades. If this holds true, research and knowledge creation will certainly be affected as well. So, what is blockchain all about? More importantly, could knowledge creation benefit from it? One area where it could be useful is in addressing the credibility and reproducibility crisis in science.

Article on #openaccess scholarly innovation and research infrastructure

In this article Benedikt Fecher and Gert Wagner argue that the current endeavors to achieve open access in scientific literature require a discussion about innovation in scholarly publishing and research infrastructure. Drawing on path dependence theory and addressing different open access (OA) models and recent political endeavors, the authors argue that academia is once again running the risk of outsourcing the organization of its content.

Council of the European Union calls for full open access to scientific research by 2020

[Image: Science! by Alexandro Lacadena, CC BY-NC-ND 2.0]
A few weeks ago we wrote about how the European Union is pushing ahead its support for open access to EU-funded scientific research and data. Today at the meeting of the Council of the European Union, the Council reinforced the commitment to making all scientific articles and data […]


Complying With HEFCE’s Open Access Policy: What You Need To Know

Most researchers working in the UK will know that the Higher Education Funding Council for England (HEFCE) open access policy took effect from April 1st of this year, but what does that mean for you, and how can you make sure you are fully compliant? What is the HEFCE open access policy? Around…

The ResearchGate Score: a good example of a bad metric

According to ResearchGate, the academic social networking site, their RG Score is “a new way to measure your scientific reputation”. With such high aims, Peter Kraker, Katy Jordan and Elisabeth Lex take a closer look at the opaque metric. By reverse engineering the score, they find that a significant weight is linked to ‘impact points’ – a similar metric to the widely discredited journal impact factor. Transparency in metrics is the only way scholarly measures can be put into context and the only way biases – which are inherent in all socially created metrics – can be uncovered.

RECODE 2015

Find yourself wanting more after the workshop? Here you can find the slides, resources, and more!


Year Conference 2015

Find yourself wanting more after the workshop? Here you can find the slides, resources, and more!


Innovating open at Mozfest

In late October, more than sixteen hundred developers, science buffs, and Open Web advocates converged on the Ravensbourne campus in South-East London to kick off MozFest, a hands-on festival dedicated to envisioning and creating the future of an open, global web. MozFest, now in its fifth year, began as a small, community-driven gathering with an…

A modest proposal

Dear Professor X,

Thank you for the invitation to review for the Journal of X.  I appreciate the work you do and have done for the X community.

That said, I have decided not to review for Elsevier journals unless the journal making the request is willing to convert one mutually agreed-upon article in the same journal to Gold Open Access status.  If that condition can be met, I would be happy to review this paper, but if not, I’m afraid I must decline.

With best regards,

 –Dan Gezelter

Open Science Codefest

The National Center for Ecological Analysis and Synthesis (NCEAS) at UCSB is co-sponsoring the Open Science Codefest 2014, which aims to bring together researchers from ecology, biodiversity science, and other earth and environmental sciences with computer scientists, software engineers, and developers to collaborate on coding projects of mutual interest.

Do you have a coding project that could benefit from collaboration, or software skills you’d like to share? The codefest will be held from September 2-4 in Santa Barbara, CA.

Inspired by hack-a-thons and organized in the participant-driven, unconference style, the Open Science Codefest is for anyone with an interesting problem, solution, or idea that intersects environmental science and computer programming. This is the conference where you will actually get stuff done – whether that’s coding up a new R module, developing an ontology, working on a data repository, creating data visualizations, dreaming up an interactive eco-game, discussing an idea, or any other concrete collaborative goal that interests a group of people.

Looks like a great program!