The Red Team Challenge (Part 3): Is it Feasible in Practice?

By Daniel Lakens & Leo Tiokhin

Also read Part 1 and Part 2 in this series on our Red Team Challenge.

Six weeks ago, we launched the Red Team Challenge: a feasibility study to see whether it could be worthwhile to pay people to find errors in scientific research. In our project, we wanted to see to what extent a “Red Team” – people hired to criticize a scientific study with the goal of improving it – would improve the quality of the resulting scientific work.

Currently, the way that error detection works in science is a bit peculiar. Papers go through the peer-review process and get the peer-reviewed “stamp of approval”. Then, upon publication, some of these same papers receive immediate and widespread criticism. Sometimes this even leads to formal corrections or retractions. And this happens even at some of the most prestigious scientific journals.

So, it seems that our current mechanisms of scientific quality control leave something to be desired. Nicholas Coles, Ruben Arslan, and the authors of this post (Leo Tiokhin and Daniël Lakens) were interested in whether Red Teams might be one way to improve quality control in science.

Ideally, a Red Team joins a research project from the start and criticizes each step of the process. However, doing this would have taken the duration of an entire study. At the time, it also seemed a bit premature — we didn’t know whether anyone would be interested in a Red Team approach, how it would work in practice, and so on. So, instead, Nicholas Coles, Brooke Frohlich, Jeff Larsen, and Lowell Gaertner volunteered one of their manuscripts (a completed study that they were ready to submit for publication). We put out a call on Twitter, Facebook, and the 20% Statistician blog, and 22 people expressed interest. On May 15th, we randomly selected five volunteers based on five areas of expertise: Åse Innes-Ker (affective science), Nicholas James (design/methods), Ingrid Aulike (statistics), Melissa Kline (computational reproducibility), and Tiago Lubiana (wildcard category). The Red Team was then given three weeks to report errors.

Our Red Team project was somewhat similar to traditional peer review, except that we 1) compensated Red Team members’ time with a $200 stipend, 2) explicitly asked the Red Teamers to identify errors in any part of the project (i.e., not just writing), 3) gave the Red Team full access to the materials, data, and code, and 4) provided financial incentives for identifying critical errors (a donation to the GiveWell charity non-profit for each unique “critical error” discovered).

The Red Team submitted 107 error reports. Ruben Arslan – who helped inspire this project with his Bug Bounty Program – served as the neutral arbiter. Ruben examined the reports, evaluated the authors’ responses, and ultimately decided whether an issue was “critical” (see this post for Ruben’s reflection on the Red Team Challenge). Of the 107 reports, Ruben concluded that there were 18 unique critical issues (for details, see this project page). Ruben decided that any major issue that potentially invalidated inferences was worth $100, minor issues related to computational reproducibility were worth $20, and minor issues that could be resolved without much work were worth $10. After three weeks, the total final donation was $660. The Red Team detected five major errors: two previously unknown limitations of a key manipulation, inadequacies in the design and description of the power analysis, an incorrectly reported statistical test in the supplemental materials, and a lack of information about the sample in the manuscript. Minor issues concerned the reproducibility of code and clarifications about the procedure.

After receiving this feedback, Nicholas Coles and his co-authors decided to hold off submitting their manuscript (see this post for Nicholas’ personal reflection). They are currently conducting a new study to address some of the issues raised by the Red Team.

We consider this to be a feasibility study of whether a Red Team approach is practical and worthwhile. So, based on this study, we shouldn’t draw any conclusions about a Red Team approach in science except one: it can be done.

That said, our study does provide some food for thought. Many people were eager to join the Red Team. The study’s corresponding author, Nicholas Coles, was graciously willing to acknowledge issues when they were pointed out. And it was obvious that, had these issues been pointed out earlier, the study would have been substantially improved before being carried out. These findings make us optimistic that Red Teams can be useful and feasible to implement.

In an earlier column, the issue was raised that rewarding Red Team members with co-authorship on the subsequent paper would create a conflict of interest: overly severe criticism might make the paper unpublishable. So, instead, we paid each Red Teamer $200 for their service. We wanted to reward people for their time. We did not want to reward them only for finding issues because, before we knew how many unique issues would be found, we were naively worried that the Red Team might find few things wrong with the paper. In interviews with Red Team members, it became clear that the charitable donations for each issue were not a strong motivator. Instead, people were just happy to detect issues for decent pay. They didn’t think that they deserved authorship for their work, and several Red Team members didn’t consider authorship on an academic paper to be valuable, given their career goals.

After talking with the Red Team members, we started to think that certain people might enjoy Red Teaming as a job – it is challenging, requires skills, and improves science. This opens up the possibility of a freelance services marketplace (such as Fiverr) for error detection, where Red Team members are hired at an hourly rate and potentially rewarded for finding errors. It should be feasible to hire people to check for errors at each phase of a project, depending on their expertise and reputation as good error-detectors. If researchers do not have money for such a service, they might be able to set up a volunteer network where people “Red Team” each other’s projects. It could also be possible for universities to create Red Teams (e.g., Cornell University has a computational reproducibility service that researchers can hire).

As scientists, we should ask ourselves when, and for which type of studies, we want to invest time and/or money to make sure that published work is as free from errors as possible. As we continue to consider ways to increase the reliability of science, a Red Team approach might be something to further explore.

Justify Your Alpha by Minimizing or Balancing Error Rates

A preprint (“Justify Your Alpha: A Primer on Two Practical Approaches”) that extends the ideas in this blog post is available at:

In 1957, Neyman wrote: “it appears desirable to determine the level of significance in accordance with quite a few circumstances that vary from one particular problem to the next.” Despite this good advice, social scientists developed the norm of always using an alpha level of 0.05 as a threshold when testing predictions. In this blog post I will explain how you can set the alpha level so that it minimizes the combined Type 1 and Type 2 error rates (thus making decisions efficiently), or so that it balances the Type 1 and Type 2 error rates. You can use these approaches to justify your alpha level, and to guide your thoughts about how to design studies more efficiently.

Neyman (1933) provides an example of the reasoning process he believed researchers should go through. He explains how a researcher might have derived an important hypothesis that H0 is true (there is no effect), and will not want to ‘throw it aside too lightly’. The researcher would choose a low alpha level (e.g., 0.01). In another line of research, an experimenter might be interested in detecting factors that would lead to the modification of a standard law, where the “importance of finding some new line of development here outweighs any loss due to a certain waste of effort in starting on a false trail”, and Neyman suggests setting the alpha level to, for example, 0.1.

Which is worse? A Type 1 Error or a Type 2 Error?

As you pursue lines of research, the data you collect serve as a guide to continue or abandon a hypothesis, or to use one paradigm or another. One goal of well-designed experiments is to control the error rates as you make these decisions, so that you do not fool yourself too often in the long run.

Many researchers implicitly assume that Type 1 errors are more problematic than Type 2 errors. Cohen (1988) suggested a Type 2 error rate of 20%, and hence to aim for 80% power, but wrote “.20 is chosen with the idea that the general relative seriousness of these two kinds of errors is of the order of .20/.05, i.e., that Type I errors are of the order of four times as serious as Type II errors. This .80 desired power convention is offered with the hope that it will be ignored whenever an investigator can find a basis in his substantive concerns in his specific research investigation to choose a value ad hoc”. More recently, researchers have argued that false negatives constitute a much more serious problem in science (Fiedler, Kutzner, & Krueger, 2012). I always ask my 3rd year bachelor students: What do you think? Is a Type 1 error in your next study worse than a Type 2 error?

Last year I listened to a talk by someone who decides whether new therapies will be covered by the German healthcare system. She discussed Eye Movement Desensitization and Reprocessing (EMDR) therapy. I knew that the evidence that the therapy worked was very weak. As the talk started, I hoped they had decided not to cover EMDR. They had decided to cover it, and the researcher convinced me this was a good decision. She said that, although there was not strong enough evidence that it works, the costs of the therapy (which can be done behind a computer) are very low, it was applied in settings where no really good alternatives were available (e.g., inside prisons), and the risk of negative consequences was basically zero. They were aware that there was a very high probability that covering EMDR was a Type 1 error, but compared to the cost of a Type 2 error, it was still better to accept the treatment. Another of my favorite examples comes from Field et al. (2004), who performed a cost-benefit analysis on whether to intervene when examining whether a koala population is declining, and showed that the alpha should be set at 1 (one should always assume a decline is occurring and intervene).

Making these decisions is difficult – but it is better to think about them than to end up with error rates that do not reflect the errors you actually want to make. As Miller and Ulrich (2019) describe, the long-run error rates you actually make depend on several unknown factors, such as the true effect size and the prior probability that the null hypothesis is true. Despite these unknowns, you can design studies that have good error rates for an effect size you are interested in, given some sample size you are planning to collect. Let’s see how.

Balancing or minimizing error rates

Mudge, Baker, Edge, and Houlahan (2012) explain how researchers might want to minimize the total combined error rate. If both Type 1 and Type 2 errors are costly, then it makes sense to optimally reduce both errors as you do studies. This would make decision making most efficient overall. You choose an alpha level that, when used in the power analysis, leads to the lowest combined error rate. For example, with a 5% alpha and 80% power, the combined error rate is 5 + 20 = 25%, and if power is 99% and the alpha is 5%, the combined error rate is 1 + 5 = 6%. Mudge and colleagues show that increasing or reducing the alpha level can lower the combined error rate. This is one of the approaches we mentioned in our ‘Justify Your Alpha’ paper from 2018.
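The arithmetic in the example above is simple enough to write down directly. A minimal Python sketch (the function name is mine, and the power values are the ones assumed in the text, not computed from a design):

```python
def combined_error_rate(alpha, power):
    # Type 1 error rate plus Type 2 error rate (1 - power)
    return alpha + (1 - power)

print(combined_error_rate(0.05, 0.80))  # 5% + 20% = 25%
print(combined_error_rate(0.05, 0.99))  # 5% + 1% = 6%
```
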

When we wrote ‘Justify Your Alpha’ we knew it would be a lot of work to actually develop methods that people can use. For months, I would occasionally revisit the code Mudge and colleagues used in their paper, which is an adaptation of the pwr library in R, but the code was too complex and I could not get to the bottom of how it worked. After leaving this aside for some months, during which I improved my R skills, some days ago I took a long shower and suddenly realized that I did not need to understand the code by Mudge and colleagues. Instead of getting their code to work, I could write my own code from scratch. Such realizations are my justification for taking showers that are longer than is environmentally friendly.

If you want to balance or minimize error rates, the tricky thing is that the alpha level you set determines the Type 1 error rate, but, through its influence on statistical power, it also influences the Type 2 error rate. So I wrote a function that examines the range of possible alpha levels (from 0 to 1) and either minimizes the total error rate (Type 1 + Type 2) or minimizes the difference between the Type 1 and Type 2 error rates, balancing them. It then returns the alpha (Type 1 error rate) and the beta (Type 2 error rate). You can enter any analytic power function that normally works in R and outputs the calculated power.

Minimizing Error Rates

Below is the version of the optimal_alpha function used in this blog. Yes, I am defining a function inside another function and this could all look a lot prettier – but it works for now. I plan to clean up the code when I archive my blog posts on how to justify alpha level in a journal, and will make an R package when I do.
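For readers outside R, the same logic can be sketched in Python. This is a hedged re-implementation, not the blog’s actual code: `power_two_sample_t` is my name for a function that mirrors what `pwr.t.test` computes via the noncentral t distribution, and the way `costT1T2` and `prior_H1H0` enter the objective is my assumption about the weighting.

```python
import math
from scipy import optimize
from scipy.stats import nct, t as t_dist

def power_two_sample_t(d, n, alpha):
    """Power of a two-sided two-sample t-test with n participants per group
    (the same noncentral-t computation that pwr.t.test performs)."""
    df = 2 * n - 2
    ncp = d * math.sqrt(n / 2)              # noncentrality parameter
    tcrit = t_dist.ppf(1 - alpha / 2, df)   # critical t value
    return (1 - nct.cdf(tcrit, df, ncp)) + nct.cdf(-tcrit, df, ncp)

def optimal_alpha(power_function, error="minimal", costT1T2=1, prior_H1H0=1):
    """Find the alpha that minimizes the weighted combined error rate
    ('minimal') or equates the weighted error rates ('balance').
    power_function is any callable mapping alpha -> statistical power."""
    beta = lambda a: 1 - power_function(a)  # Type 2 error rate
    if error == "balance":
        # alpha rises and beta falls as alpha increases, so the weighted
        # difference crosses zero exactly once: find that root.
        a = optimize.brentq(
            lambda a: costT1T2 * a - prior_H1H0 * beta(a), 1e-6, 0.999)
    else:
        # Minimize the weighted average of the two error rates
        # (assumed form of the objective).
        a = optimize.minimize_scalar(
            lambda a: (costT1T2 * a + prior_H1H0 * beta(a))
                      / (costT1T2 + prior_H1H0),
            bounds=(1e-6, 0.999), method="bounded").x
    return {"alpha": a, "beta": beta(a)}

# The example used throughout this post: d = 0.5, n = 100 per group.
pf = lambda a: power_two_sample_t(d=0.5, n=100, alpha=a)
res = optimal_alpha(pf)                    # alpha ~ 0.051, beta ~ 0.059
bal = optimal_alpha(pf, error="balance")   # alpha ~ beta ~ 0.055
```

With the default weights the objective reduces to minimizing alpha + beta, which reproduces the roughly 0.051/0.059 solution reported for d = 0.5 and n = 100 per group later in this post.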

The code requires you to specify the power function (in a way that the code returns the power, hence the $power at the end) for your test, where the significance level is a variable ‘x’. In this power function you specify the effect size (such as the smallest effect size you are interested in) and the sample size. In my experience, the sample size is sometimes determined by factors outside the researcher’s control. For example, you are working with existing data, or you are studying a sample that is limited in size (e.g., all students in a school). Other times, people have a maximum sample size they can feasibly collect, and accept the error rates that follow from this feasibility limitation. If your sample size is not limited, you can increase it until you are happy with the error rates.

The code calculates the Type 2 error rate (1 - power) across a range of alpha values. For example, say we want to calculate the optimal alpha level for an independent t-test. Assume our smallest effect size of interest is d = 0.5, and we are planning to collect 100 participants in each group. We would normally calculate power as follows:

pwr.t.test(d = 0.5, n = 100, sig.level = 0.05, type = 'two.sample', alternative = 'two.sided')$power

This analysis tells us that we have 94% power with a 5% alpha level for our smallest effect size of interest, d = 0.5, when we collect 100 participants in each condition.
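The same number can be cross-checked outside R. A hedged Python sketch, assuming SciPy’s noncentral t distribution (`two_sample_power` is a name I am introducing, not part of any library):

```python
import math
from scipy.stats import nct, t as t_dist

def two_sample_power(d, n, alpha):
    # Two-sided two-sample t-test, n participants per group
    df = 2 * n - 2
    ncp = d * math.sqrt(n / 2)
    tcrit = t_dist.ppf(1 - alpha / 2, df)
    return (1 - nct.cdf(tcrit, df, ncp)) + nct.cdf(-tcrit, df, ncp)

print(round(two_sample_power(0.5, 100, 0.05), 4))  # 0.9404, matching pwr.t.test
```
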

If we want to minimize our total error rates, we would enter this power function into our optimal_alpha function (replacing the value 0.05 of the sig.level argument with ‘x’, because we are varying this value to find the lowest combined error rate).

res = optimal_alpha(power_function = "pwr.t.test(d = 0.5, n = 100, sig.level = x, type = 'two.sample', alternative = 'two.sided')$power")

## [1] 0.05101728
## [1] 0.05853977

We see that an alpha level of 0.051 slightly improves the combined error rate: it leads to a Type 2 error rate of 0.059 for a smallest effect size of interest of d = 0.5, for a combined error rate of 0.11. For comparison, lowering the alpha level to 0.005 would lead to a much larger combined error rate of 0.25.
What would happen if we had decided to collect 200 participants per group, or only 50? With 200 participants per group we would have more than 99% power for d = 0.5, and, relatively speaking, a 5% Type 1 error rate with a 1% Type 2 error rate is slightly out of balance. In the age of big data, researchers nevertheless use such suboptimal error rates all the time, due to the mindless choice of a 0.05 alpha level. When power is high, the combined error rate can be made smaller by lowering the alpha level. If we just replace 100 by 200 in the function above, we see that the combined Type 1 and Type 2 error rate is lowest if we set the alpha level to 0.00866. If you collect large amounts of data, you should really consider lowering your alpha level.

If the maximum sample size we were willing to collect was 50 per group, the optimal alpha level to reduce the combined Type 1 and Type 2 error rates is 0.13. This means that we would have a 13% probability of deciding there is an effect when the null hypothesis is true. This is quite high! However, if we had used a 5% Type 1 error rate, the power would have been 69.69%, with a 30.31% Type 2 error rate, while the Type 2 error rate is ‘only’ 16.56% after increasing the alpha level to 0.13. We increase the Type 1 error rate by 8 percentage points to reduce the Type 2 error rate by 13.75 percentage points, which increases the overall efficiency of the decisions we make.
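These trade-off numbers can be verified with the same kind of power computation. A hedged Python sketch, assuming SciPy’s noncentral t distribution (the helper name is mine):

```python
import math
from scipy.stats import nct, t as t_dist

def two_sample_power(d, n, alpha):
    # Two-sided two-sample t-test, n participants per group
    df = 2 * n - 2
    ncp = d * math.sqrt(n / 2)
    tcrit = t_dist.ppf(1 - alpha / 2, df)
    return (1 - nct.cdf(tcrit, df, ncp)) + nct.cdf(-tcrit, df, ncp)

# d = 0.5, 50 participants per group
p05 = two_sample_power(0.5, 50, 0.05)  # ~0.6969 -> Type 2 error ~30.3%
p13 = two_sample_power(0.5, 50, 0.13)  # ~0.8344 -> Type 2 error ~16.6%
```
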

This example relies on the pwr.t.test function in R, but any power function can be used. For example, the code to minimize the combined error rates for the power analysis for an equivalence test would be:

res = optimal_alpha(power_function = "powerTOSTtwo(alpha = x, N = 200, low_eqbound_d = -0.4, high_eqbound_d = 0.4)")

Balancing Error Rates

You can choose to minimize the combined error rates, but you can also decide that it makes more sense to you to balance the error rates. For example, you might think a Type 1 error is just as problematic as a Type 2 error, and therefore want to design a study that has balanced error rates for the smallest effect size of interest (e.g., a 5% Type 1 error rate and a 5% Type 2 error rate). Whether to minimize error rates or balance them can be specified in an additional argument to the function. The default is to minimize, but by adding error = "balance" an alpha level is returned such that the Type 1 error rate equals the Type 2 error rate.

res = optimal_alpha(power_function = "pwr.t.test(d = 0.5, n = 100, sig.level = x, type = 'two.sample', alternative = 'two.sided')$power", error = "balance")

## [1] 0.05488516
## [1] 0.05488402

Repeating our earlier example, the alpha level is 0.055, such that the Type 2 error rate, given the smallest effect size of interest and the sample size, is also 0.055. Even though this does not minimize the overall error rates, I feel it is a justification strategy for your alpha level that often makes sense. If both Type 1 and Type 2 errors are equally problematic, we design a study where we are just as likely to make either mistake for the effect size we care about.

Relative costs and prior probabilities

So far we have assumed that a Type 1 error and a Type 2 error are equally problematic. But you might believe Cohen (1988) was right, and that Type 1 errors are exactly four times as bad as Type 2 errors. Or you might think they are twice as problematic, or ten times as problematic. However you weigh them, as explained by Mudge et al. (2012) and Miller and Ulrich (2019), you should incorporate those weights into your decisions.

The function has another optional argument, costT1T2, that allows you to specify the relative cost of Type 1 versus Type 2 errors. By default this is set to 1, but you can set it to 4 (or any other value) such that Type 1 errors are four times as costly as Type 2 errors. This will change the weight of Type 1 errors relative to Type 2 errors, and thus also the choice of the best alpha level.

res = optimal_alpha(power_function = "pwr.t.test(d = 0.5, n = 100, sig.level = x, type = 'two.sample', alternative = 'two.sided')$power", error = "minimal", costT1T2 = 4)

## [1] 0.01918735
## [1] 0.1211773

Now, the alpha level that minimizes the weighted Type 1 and Type 2 error rates is 0.019.

Similarly, you can take into account the prior probability that the null hypothesis is true (in which case you can only make a Type 1 error) or that the alternative hypothesis is true (in which case you can only make a Type 2 error). By incorporating these expectations, you can minimize or balance error rates in the long run (assuming your priors are correct). Priors can be specified using the prior_H1H0 argument, which by default is 1 (H1 and H0 are equally likely). Setting it to 4 means you think the alternative hypothesis (and hence Type 2 errors) is four times more likely than the null hypothesis (and Type 1 errors).

res = optimal_alpha(power_function = "pwr.t.test(d = 0.5, n = 100, sig.level = x, type = 'two.sample', alternative = 'two.sided')$power", error = "minimal", prior_H1H0 = 2)

## [1] 0.07901679
## [1] 0.03875676

If you think H1 is twice as likely to be true as H0 (as specified by prior_H1H0 = 2), you need to worry less about Type 1 errors, and the alpha that minimizes the weighted error rates becomes 0.079. It is always difficult to decide upon priors (unless you are Omniscient Jones), but even if you ignore them, you are implicitly deciding that H1 and H0 are equally plausible.


You can’t abandon a practice without an alternative. Minimizing the combined error rate, or balancing error rates, provides two alternatives to the normative practice of setting the alpha level to 5%. Together with the approach of reducing the alpha level as a function of the sample size, I invite you to explore ways to set error rates based on something other than convention. A downside of abandoning mindless statistics is that you need to think about difficult questions. How much worse is a Type 1 error than a Type 2 error? Do you have any idea about the prior probabilities? And what is the smallest effect size of interest? Answering these questions is difficult, but considering them is important for any study you design. The studies you run might very well be more informative, and more efficient. So give it a try.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: L. Erlbaum Associates.
Fiedler, K., Kutzner, F., & Krueger, J. I. (2012). The Long Way From α-Error Control to Validity Proper: Problems With a Short-Sighted False-Positive Debate. Perspectives on Psychological Science, 7(6), 661–669.
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171.
Miller, J., & Ulrich, R. (2019). The quest for an optimal alpha. PLOS ONE, 14(1), e0208631.
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734.

The New Heuristics

You can derive the age of a researcher based on the sample size they were told to use in a two independent group design. When I started my PhD, this number was 15, and when I ended, it was 20. This tells you I did my PhD between 2005 and 2010. If your number was 10, you have been in science much longer than I have, and if your number is 50, good luck with the final chapter of your PhD.
All these numbers are only rarely the sample size you really need. Like a clock stuck at 9:30 in the morning, heuristics are sometimes right, but most often wrong. I think we rely far too often on heuristics for all sorts of important decisions we make when we do research. You can easily test whether you rely on a heuristic, or whether you can actually justify a decision you make. Ask yourself: Why?
I vividly remember talking to a researcher in 2012, a time when it had started to become clear that many of the heuristics we relied on were wrong, and there was a lot of uncertainty about what good research practices looked like. She said: ‘I just want somebody to tell me what to do’. As psychologists, we work in a science where the answer to almost every research question is ‘it depends’. It should not be a surprise that the same holds for how you design a study. For example, Neyman & Pearson (1933) perfectly illustrate how a statistician can explain the choices that need to be made, but in the end, only the researcher can make the final decision:
Due to a lack of training, most researchers do not have the skills to make these decisions. They need help, but do not always have access to someone who can provide it. It is therefore not surprising that articles and books that explain how to use a useful tool provide some heuristics to get researchers started. An excellent example of this is Cohen’s classic work on power analysis. Although you need to think about the statistical power you want, as a heuristic, a minimum power of 80% is recommended. Let’s take a look at how Cohen (1988) introduces this benchmark.
It is rarely ignored. Note that we have a meta-heuristic here. Cohen argues a Type 1 error is 4 times as serious as a Type 2 error, and the Type 1 error is at 5%. Why? According to Fisher (1935) because it is a ‘convenient convention’. We are building a science on heuristics built on heuristics.
There has been a lot of discussion about how we need to improve psychological science in practice, and what good research practices look like. In my view, we will not have real progress when we replace old heuristics by new heuristics. People regularly complain to me about people who use what I would like to call ‘The New Heuristics’ (instead of The New Statistics), or ask me to help them write a rebuttal to a reviewer who is too rigidly applying a new heuristic. Let me give some recent examples.
People who used optional stopping in the past, and have learned this is p-hacking, think you cannot look at the data as they come in (you can, when done correctly, using sequential analyses; see Lakens, 2014). People make directional predictions, but test them with two-sided tests (even though you can pre-register your directional prediction). They think you need 250 participants (as an editor of a flagship journal claimed), even though there is no magical number that leads to high enough accuracy. They think you always need to justify sample sizes based on a power analysis (as a reviewer of a grant proposal claimed when rejecting the proposal), even though there are many ways to justify sample sizes. They argue meta-analysis is not a ‘valid technique’ only because the meta-analytic estimate can be biased (ignoring that meta-analyses have many uses, including the analysis of heterogeneity, and that all tests can be biased). They think all research should be preregistered or published as Registered Reports, even when the main benefit (preventing inflation of error rates for hypothesis tests due to flexibility in the data analysis) is not relevant for all research psychologists do. They think p-values are invalid and should be removed from scientific articles, even though in well-designed controlled experiments they might be the outcome of interest, especially early on in new research lines. I could go on.
Change is like a pendulum, swinging from one side to the other of a multi-dimensional space. People might be too loose, or too strict, too risky, or too risk-averse, too sexy, or too boring. When there is a response to newly identified problems, we often see people overreacting. If you can’t justify your decisions, you will just be pushed from one extreme on one of these dimensions to the opposite extreme. What you need is the weight of a solid justification to be able to resist being pulled in the direction of whatever you perceive to be the current norm. Learning The New Heuristics (for example setting the alpha level to 0.005 instead of 0.05) is not an improvement – it is just a change.
If we teach people The New Heuristics, we will get lost in the Bog of Meaningless Discussions About Why These New Norms Do Not Apply To Me. This is a waste of time. From a good justification it logically follows whether something applies to you or not. Don’t discuss heuristics – discuss justifications.
‘Why’ questions come at different levels. Surface level ‘why’ questions are explicitly left to the researcher – no one else can answer them. Why are you collecting 50 participants in each group? Why are you aiming for 80% power? Why are you using an alpha level of 5%? Why are you using this prior when calculating a Bayes factor? Why are you assuming equal variances and using Student’s t-test instead of Welch’s t-test? Part of the problem I am addressing here is that we do not discuss which questions are up to the researcher, and which are questions on a deeper level that you can simply accept without needing to provide a justification in your paper. This makes it relatively easy for researchers to pretend some ‘why’ questions are on a deeper level, and can be assumed without having to be justified. A field needs a continuing discussion about what we expect researchers to justify in their papers (for example by developing improved and detailed reporting guidelines). This will be an interesting discussion to have. For now, let’s limit ourselves to surface level questions that were always left up to researchers to justify (even though some researchers might not know any better than using a heuristic). In the spirit of the name of this blog, let’s focus on 20% of the problems that will improve 80% of what we do.
My new motto is ‘Justify Everything’ (it also works as a hashtag: #JustifyEverything). Your first response will be that this is not possible. You will think this is too much to ask. This is because you think that you will have to be able to justify everything. But that is not my view on good science. You do not have the time to learn enough to be able to justify all the choices you need to make when doing science. Instead, you could be working in a team of as many people as you need so that within your research team, there is someone who can give an answer if I ask you ‘Why?’. As a rule of thumb, a large enough research team in psychology has between 50 and 500 researchers, because that is how many people you need to make sure one of the researchers is able to justify why research teams in psychology need between 50 and 500 researchers.
Until we have transitioned into a more collaborative psychological science, we will be limited in how much and how well we can justify our decisions in our scientific articles. But we will be able to improve. Many journals are starting to require sample size justifications, which is a great example of what I am advocating for. Expert peer reviewers can help by pointing out where heuristics are used, but justifications are possible (preferably in open peer review, so that the entire community can learn). The internet makes it easier than ever before to ask other people for help and advice. And as with anything in a job as difficult as science, just get started. The #Justify20% hashtag will work just as well for now.

Does Your Philosophy of Science Matter in Practice?

In my personal experience, philosophy of science rarely plays a direct role in how most scientists do research. Here I’d like to explore ways in which your philosophy of science might subtly shift what you focus on when you do research. I’ll discuss several philosophies of science (instrumentalism, constructive empiricism, entity realism, and scientific realism), and explore how each might impact what you see as the most valuable way to make progress in science, how much you value theory-driven or data-driven research, and whether you believe your results should be checked against reality.

We can broadly distinguish philosophies of science (following Niiniluoto, 1999) into three main categories. First, there is the view that there is no truth, known as anarchism. An example can be found in Paul Feyerabend’s ‘Against Method’, where he writes: “Science is an essentially anarchic enterprise: theoretical anarchism is more humanitarian and more likely to encourage progress than its law-and-order alternatives” and “The only principle that does not inhibit progress is: anything goes.” The second category contains pragmatism, in which ‘truth’ is replaced by some surrogate, such as social consensus. For example, Peirce (1878) writes: “The opinion which is fated to be ultimately agreed to by all who investigate, is what we mean by the truth, and the object represented in this opinion is the real. That is the way I would explain reality.” Rorty doubts such a final end-point of consensus can ever be reached, and suggests giving up on the concept of truth and talking instead about an indefinite adjustment of belief. The third category, which we will focus on mostly below, consists of approaches that define truth as some correspondence between language and reality, known as correspondence theories. In essence, these approaches adhere to a dictionary definition of truth as ‘being in accord with fact or reality’. However, these approaches differ in whether they believe scientific theories have a truth value (i.e., whether theories can be true or false), and, if theories have truth value, whether this is relevant for scientific practice.

What do you think? Is anarchy the best way to do science? Is there no truth, but at best an infinite updating of belief with some hope of social consensus? Or is there some real truth that we can get closer to over time?
Scientific Progress and Goals of Science
It is possible to have different philosophies of science because success in science is not measured by whether we discover the truth. After all, how would we ever know for sure that we had discovered the truth? Instead, a more tangible goal for science is to make scientific progress. This lets us set aside philosophical discussions about what truth is, but it requires us to define what progress in science looks like. And for there to be progress in science, science needs to have a goal.
Kitcher (1993, chapter 4) writes: “One theme recurs in the history of thinking about the goals of science: science ought to contribute to “the relief of man’s estate,” it should enable us to control nature—or perhaps, where we cannot control, to predict, and so adjust our behavior to an uncooperative world—it should supply the means for improving the quality and duration of human lives, and so forth.” Truth alone is not a sufficient aim for scientific progress, as Popper (1934/1959) already noted, because then we would just limit ourselves to positing trivial theories (e.g., for psychological science the theory that ‘it depends’), or collect detailed but boring information (the temperature in the room I am now in is 19.3 degrees). Kitcher highlights two important types of progress: conceptual progress and explanatory progress. Conceptual progress comes from refining the concepts we talk about, such that we can clearly specify these concepts, and preferably reach consensus about them. Explanatory progress is improved by getting a better understanding of the causal mechanisms underlying phenomena. Scientists will probably recognize the need for both. We need to clearly define our concepts, and know how to measure them, and we often want to know how things are related, or how to manipulate things.
This distinction between conceptual progress and explanatory progress aligns roughly with a distinction between progress with respect to the entities we study and progress with respect to the theories we build to explain how these entities are related. A scientific theory is defined as a set of testable statements about the relation between observations. As noted before, philosophies of science differ in whether they believe statements about entities and theories are related to the truth, and if they are, whether this matters for how we do science. Let’s discuss four flavors of philosophy of science that differ in how much value they place on whether the way we talk about theories and entities corresponds to an objective truth.
Instrumentalism
According to instrumentalism, theories should be seen mainly as tools to solve practical problems, and not as truthful descriptions of the world. Theories are instruments that generate predictions about things we can observe. Theories often refer to unobservable entities, but these entities do not have truth or falsity, and neither do the theories. Scientific theories should not be evaluated based on whether they correspond to the true state of the world, but based on how well they perform.
One important reason to suspend judgment about whether theories are true or false is because of underdetermination (for an explanation, see Ladyman, 2002). We often do not have enough data to distinguish different possible theories. If it really isn’t possible to distinguish different theories because we would not be able to collect the required data, it is often difficult to say whether one theory is closer to the truth than another theory.
From an instrumentalist view on scientific progress, and assuming that all theories are underdetermined by data, additional criteria to evaluate theories become important, such as simplicity. Researchers might use approximations to make theories easier to implement, for example in computational models, based on the conviction that simpler theories provide more useful instruments, even if they are slightly less accurate about the true state of the world.
Constructive Empiricism
As opposed to instrumentalism, constructive empiricism acknowledges that theories can be true or false. However, it limits belief in theories only insofar as they describe observable events. Van Fraassen, one of the main proponents of constructive empiricism, suggests we can use a theory without believing it is true when it is empirically adequate. He says: “a theory is empirically adequate exactly if what it says about the observable things and events in the world, is true”. Constructive empiricists might decide to use a theory, but do not have to believe it is true. Theories often make statements that go beyond what we can observe, but constructive empiricists limit truth statements to observable entities. Because no truth values are ascribed to unobservable entities that are assumed to exist in the real world, this approach is grouped under ‘anti-realist’ philosophies of science.
Entity Realism
Entity realists are willing to take one step beyond constructive empiricism and acknowledge a belief in unobservable entities when a researcher can demonstrate impressive causal knowledge of an unobservable entity. When knowledge about an unobservable entity can be used to manipulate its behavior, or to manipulate other phenomena, one can believe that it is real. However, entity realists remain skeptical about scientific theories.
Hacking (1982) writes, in a very accessible article, how: “The vast majority of experimental physicists are realists about entities without a commitment to realism about theories. The experimenter is convinced of the existence of plenty of “inferred” and “unobservable” entities. But no one in the lab believes in the literal truth of present theories about those entities. Although various properties are confidently ascribed to electrons, most of these properties can be embedded in plenty of different inconsistent theories about which the experimenter is agnostic.” Researchers can be realists about entities, but anti-realists about models.
Scientific Realism
We can compare the constructive empiricist and entity realist views with scientific realism. For example, Niiniluoto (1999) writes that in contrast to a constructive empiricist:
“a scientific realist sees theories as attempts to reveal the true nature of reality even beyond the limits of empirical observation. A theory should be cognitively successful in the sense that the theoretical entities it postulates really exist and the lawlike descriptions of these entities are true. Thus, the basic aim of science for a realist is true information about reality. The realist of course appreciates empirical success like the empiricist. But for the realist, the truth of a theory is a precondition for the adequacy of scientific explanations.”
For scientific realists, verisimilitude, or ‘truthlikeness’, is treated as the basic epistemic utility of science. It is based on the empirical success of theories. As De Groot (1969) writes: “The criterion par excellence of true knowledge is to be found in the ability to predict the results of a testing procedure. If one knows something to be true, he is in a position to predict; where prediction is impossible, there is no knowledge.” Failures to predict are thus very consequential for a scientific realist.
Progress in Science
There are more similarities than differences among almost all philosophies of science. All approaches hold that a goal of science is progress. Anarchists refrain from specifying what progress looks like. Feyerabend writes: “my thesis is that anarchism helps to achieve progress in any one of the senses one cares to choose” – but progress is still a goal of science. For instrumentalists, the proof is in the pudding: theories are good as long as they lead to empirical progress, regardless of whether they are true. For a scientific realist, theories are better the more verisimilitude they have, or the closer they get to an unknown truth. For all approaches (except perhaps anarchism), conceptual progress and explanatory progress are valued.
Conceptual progress is measured by increased accuracy in how a concept is measured, and by increased consensus on what is measured. Progress in measurement accuracy is easily demonstrated, since it mainly depends on the amount of data that is collected, and can be quantified by the standard error of the measurement. Consensus is perhaps less easily demonstrated, but Meehl (2004) provides some indicators, such as a theory being generally talked about as a ‘fact’, research and technological applications using the theory without any need to study it directly anymore, and the only discussions of the theory at scientific meetings taking place in panels about history or celebrations of past successes. We then wait an (admittedly arbitrary) 50 years to see if there is any change, and if not, we consider the theory accepted by consensus. Although Meehl acknowledged this is a somewhat brute-force approach to epistemology, he believed philosophers of science should be less distracted by exceptions such as Newtonian physics, which was overthrown after 200 years, and acknowledge that something like his approach will probably work in practice most of the time.
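The claim that progress in measurement accuracy mainly depends on the amount of data collected can be made concrete with a small sketch (purely illustrative; the function and numbers are my own): the standard error of a mean shrinks with the square root of the number of measurements.

```python
import math
import random

def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance) / math.sqrt(n)

random.seed(1)
# Simulated measurements of a quantity with true value 19.3 and SD 0.5,
# e.g. repeated readings of the temperature in a room
few = [random.gauss(19.3, 0.5) for _ in range(25)]
many = [random.gauss(19.3, 0.5) for _ in range(2500)]

# Collecting 100 times more data shrinks the standard error roughly tenfold
print(standard_error(few))
print(standard_error(many))
```

Under this operationalization, more accurate measurement is simply a matter of collecting more data, which is why this kind of conceptual progress is comparatively easy to demonstrate.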
Explanatory progress is mainly measured by our ability to predict novel facts. Whether prediction (showing a theoretical prediction is supported by data) should be valued more than accommodation (adjusting a theory to accommodate unexpected observations) is a matter of debate. Some have argued that it doesn’t matter whether a theory is stated before or after the data are observed. Keynes writes: “The peculiar virtue of prediction or predesignation is altogether imaginary. The number of instances examined and the analogy between them are the essential points, and the question as to whether a particular hypothesis happens to be propounded before or after their examination is quite irrelevant.” It seems as if Keynes dismisses practices such as pre-registration, but his statement comes with a strong caveat, namely that researchers are completely unbiased. He writes that the demand “to approach statistical evidence without preconceptions based on general grounds, because the temptation to ‘cook’ the evidence will prove otherwise to be irresistible, has no logical basis and need only be considered when the impartiality of an investigator is in doubt.”
Keynes’ analysis of prediction versus accommodation is limited to the evidence in the data. However, Mayo (2018) convincingly argues that we put more faith in predicted findings than in accommodated findings because the former have passed a severe test. If the data are used when generating a hypothesis (i.e., the hypothesis has no use-novelty), the hypothesis will fit the data, no matter whether the theory is true or false. It is guaranteed to match the data, because the theory was constructed with this aim. A theory that is constructed based on the data has not passed a severe test. When novel data are collected in a well-constructed experiment, a hypothesis is unlikely to pass a test (e.g., yield a significant result) if the hypothesis is false. The strength of not using the data when constructing a hypothesis comes from the fact that the hypothesis has passed a more severe test, and had a higher probability of being proven wrong (but wasn’t).
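The severity argument can be made concrete with a small simulation (entirely illustrative; the setup and names are my own): a directional hypothesis chosen after seeing the data fits that data by construction, whereas a direction fixed in advance can fail when checked against fresh data.

```python
import random

random.seed(42)

def mean_difference(n=20):
    """Observed difference between two group means when the true effect is zero."""
    a = sum(random.gauss(0, 1) for _ in range(n)) / n
    b = sum(random.gauss(0, 1) for _ in range(n)) / n
    return a - b

trials = 1000
hits_accommodated = 0
hits_predicted = 0
for _ in range(trials):
    d = mean_difference()
    # Accommodation: the hypothesized direction is chosen after seeing the
    # data, so it matches the observed difference by construction.
    hypothesized_positive = d > 0
    if hypothesized_positive == (d > 0):
        hits_accommodated += 1
    # Prediction: a direction fixed in advance (here: positive) is checked
    # against fresh data; with no true effect it succeeds about half the time.
    if mean_difference() > 0:
        hits_predicted += 1

print(hits_accommodated / trials)  # 1.0 by construction: the test could not fail
print(hits_predicted / trials)     # around 0.5: the test could have failed
```

The accommodated hypothesis passes every time precisely because it could not have failed, which is why passing tells us nothing about whether the theory is true.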
Does Your Philosophy of Science Matter?
Even if scientists generally agree that conceptual progress and explanatory progress are valuable, and that explanatory progress can be demonstrated by testing theoretical predictions, your philosophy of science likely influences how you weigh the different questions researchers ask when they do scientific research. Research can be more theory-driven, or more exploratory, and it seems plausible that your views on which you value more are in part determined by your philosophy of science.
For example, do you perform research by formalizing strict theoretical predictions, and collect data to corroborate or falsify these predictions to increase the verisimilitude of the theory? Or do you largely ignore theories in your field, and aim to accurately measure relationships between variables? Developing strong theories can be useful for a scientific field, because they facilitate the organization of known phenomena, help to predict what will happen in new situations, and guide new research. Collecting reliable information about phenomena can provide the information needed to make decisions, and provides important empirical information that can be used to develop theories.
For a scientific realist, a main aim is to test whether theories reflect reality. Scientific research starts with specifying a falsifiable theory. The goal of an experiment is to test the theory. If the theory passes the test, it gains verisimilitude; if it fails, it loses verisimilitude and needs to be adjusted. If a theory repeatedly fails to make successful predictions (what Lakatos calls a degenerating research programme), it is eventually abandoned. If the theory proves successful in making predictions, it becomes established knowledge.
For an entity realist like Hacking (1982), experiments provide knowledge about entities, and therefore experiments determine what we believe, not theories. He writes: “Hence, engineering, not theorizing, is the proof of scientific realism about entities.” Van Fraassen similarly stresses the importance of experiments, which are crucial in establishing facts about observable phenomena. He sees a role for theory, but it is quite different from the role it plays in scientific realism. Van Fraassen writes: “Scientists aim to discover facts about the world—about the regularities in the observable part of the world. To discover these, one needs experimentation as opposed to reason and reflection. But those regularities are exceedingly subtle and complex, so experimental design is exceedingly difficult. Hence the need for the construction of theories, and for appeal to previously constructed theories to guide the experimental inquiry.”
Theory-driven versus data-driven
One might be tempted to align philosophies of science along a continuum of how strongly theory-driven they are (or confirmatory), and how strongly data-driven they are (or exploratory). Indeed, Van Fraassen writes: “The phenomenology of scientific theoretical advance may indeed be exactly like the phenomenology of exploration and discovery on the Dark Continent or in the South Seas, in certain respects.” Note that exploratory data-driven research is not devoid of theory – but the role theories play has changed. There are two roles, according to Van Fraassen. First, the outcome of an experiment is ‘filling in the blanks in a developing theory’. Second, as the regularities we aim to uncover become more complex, we need theory to guide experimental design. Often a theory states there must be something, but it is very unclear what this something actually is.
For example, a theory might predict there are individual differences, or contextual moderators, but the scientist needs to discover which individual differences, or which contextual moderators. In this instance, the theory has many holes that need to be filled. As scientists fill in the blanks, there are typically new consequences that can be tested. As Van Fraassen writes: “This is how experimentation guides the process of theory construction, while at the same time the part of the theory that has already been constructed guides the design of the experiments that will guide the continuation”. For example, if we learn that, as expected, individual differences moderate an effect, and the effect is more pronounced for older compared to younger individuals, these experimental results guide theory construction. Van Fraassen goes so far as to say that “experimentation is the continuation of theory construction by other means.”
For a scientific realist, the main goal of experimentation is to test theories, not to construct them. Exploration is still valuable, but it is less prominent in scientific realism. If a theory is at the stage where it predicts something will happen, but is not specific about what this something is, it is difficult to come up with a result that would falsify that prediction (except for cases where it is plausible that nothing would happen, which might be limited to highly controlled randomized experiments). Scientific realism requires well-specified theories. When data do not support theoretical predictions, this should be consequential: it means a theory is less ‘truth-like’ than we thought before.
Subjective or Objective Inferences?
Subjective beliefs in a theory or hypothesis play an important role in science. Beliefs are likely to have strong motivational power, leading scientists to invest time and effort in examining the things they examine. It has been a matter of debate whether subjective beliefs should play a role in the evaluation of scientific facts.
Both Fisher (1935) and Popper (1934/1959) disapproved of introducing subjective probabilities into statistical inferences. Fisher writes: “advocates of inverse probability seem forced to regard mathematical probability, not as an objective quantity measured by observed frequencies, but as measuring merely psychological tendencies, theorems respecting which are useless for scientific purposes.” Popper writes: “We must distinguish between, on the one hand, our subjective experiences or our feelings of conviction, which can never justify any statement (though they can be made the subject of psychological investigation) and, on the other hand, the objective logical relations subsisting among the various systems of scientific statements, and within each of them.” For Popper, objectivity does not reside in theories, which he believes are never verifiable, but in tests of theories: “the objectivity of scientific statements lies in the fact that they can be inter-subjectively tested.”
This concern cuts across statistical approaches. Taper and Lele (2011), who are likelihoodists, write: “We dismiss Bayesianism for its use of subjective priors and a probability concept that conceives of probability as a measure of personal belief.” They continue: “It is not that we believe that Bayes’ rule or Bayesian mathematics is flawed, but that from the axiomatic foundational definition of probability Bayesianism is doomed to answer questions irrelevant to science. We do not care what you believe, we barely care what we believe, what we are interested in is what you can show.” Their approach seems closest to a scientific realist perspective. Although they acknowledge all models are false, they also write: “some models are better approximations of reality than other models” and “we believe that growth in scientific knowledge can be seen as the continual replacement of current models with models that approximate reality more closely.”
Gelman and Shalizi, who use Bayesian statistics but dislike subjective Bayes, write: “To reiterate, it is hard to claim that the prior distributions used in applied work represent statisticians’ states of knowledge and belief before examining their data, if only because most statisticians do not believe their models are true, so their prior degree of belief in all of Θ [the parameter space used to generate a model] is not 1 but 0”. Although subjective Bayesians would argue that having no belief that your models are true is too dogmatic (it means you could never be convinced otherwise, regardless of how much data is collected), it is not unheard of in practice. Physicists know the Standard Model is wrong, but it works. This means they assign a probability of 0 to the Standard Model being true – which violates one of the core assumptions of Bayesian inference (namely that the plausibility assigned to a hypothesis can be represented as a number between 0 and 1). Gelman and Shalizi approach statistical inference from a philosophy of science perhaps closest to constructive empiricism when they write: “Either way, we are using deductive reasoning as a tool to get the most out of a model, and we test the model – it is falsifiable, and when it is consequentially falsified, we alter or abandon it.”
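Why a prior of 0 is dogmatic can be shown in a few lines (a sketch of Bayes’ rule for a hypothesis versus its complement; the setup and numbers are mine): a prior of exactly 0 can never be revised, while even a tiny nonzero prior is quickly overwhelmed by the same evidence.

```python
def update(prior, likelihood_h, likelihood_not_h):
    """Bayes' rule: posterior probability of hypothesis H given one observation."""
    numerator = prior * likelihood_h
    return numerator / (numerator + (1 - prior) * likelihood_not_h)

# A dogmatic prior of exactly 0 can never be revised, no matter how
# strongly and how often the data favor the hypothesis:
p = 0.0
for _ in range(100):
    p = update(p, likelihood_h=0.99, likelihood_not_h=0.01)
print(p)  # 0.0

# Even a tiny nonzero prior is quickly pushed toward 1 by the same evidence:
q = 0.001
for _ in range(10):
    q = update(q, likelihood_h=0.99, likelihood_not_h=0.01)
print(q)  # close to 1
```

This is the formal sense in which believing with certainty that all models are false places you outside the standard Bayesian machinery.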
Both Taper and Lele (2011) and Gelman and Shalizi (2013) stress that models should be tested against reality. Taper and Lele want procedures that are reliable (i.e., unlikely to yield incorrect conclusions in the long run), and that provide good evidence (i.e., when the data are in, they should provide strong relative support for one model over another). They write: “We strongly believe that one of the foundations of effective epistemology is some form of reliabilism. Under reliabilism, a belief (or inference) is justified if it is formed from a reliable process.” Similarly, Gelman and Shalizi write: “the hypothesis linking mathematical models to empirical data is not that the data-generating process is exactly isomorphic to the model, but that the data source resembles the model closely enough, in the respects which matter to us, that reasoning based on the model will be reliable.”
As an example of an alternative viewpoint, consider the discussion about whether optional stopping (repeatedly analyzing data and stopping data collection whenever the data support the predictions) is problematic or not. In frequentist statistics, optional stopping inflates the error rate (and thus, to control the error rate, the alpha level needs to be adjusted when sequential analyses are performed). Rouder (2014) believes optional stopping is no problem for Bayesians. He writes: “In my opinion, the key to understanding Bayesian analysis is to focus on the degree of belief for considered models, which need not and should not be calibrated relative to some hypothetical truth.” It is the responsibility of researchers to choose models they are interested in (although how this should be done is still a matter of debate). Bayesian statistics allows researchers to update their beliefs concerning these models based on the data – irrespective of whether these models have a relation with reality. Although optional stopping increases error rates (for an excellent discussion, see Mayo, 2018), and this reduces the severity of the tests the hypotheses pass (which is why researchers worry about such practices undermining reproducibility), such concerns are not central to a subjective Bayesian approach to statistical inference.
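A small simulation (illustrative only; names and parameters are my own) shows the error-rate inflation: when the null hypothesis is true, peeking at the data after every batch and stopping at the first ‘significant’ result rejects far more often than the nominal 5%.

```python
import random

random.seed(0)

def z_after_each_batch(n_batches=10, batch=10):
    """Yield the z-statistic for the sample mean after each new batch of
    observations, when the null hypothesis (mean 0, SD 1) is actually true."""
    data = []
    for _ in range(n_batches):
        data.extend(random.gauss(0, 1) for _ in range(batch))
        n = len(data)
        mean = sum(data) / n
        yield mean * n ** 0.5  # z = mean / (sigma / sqrt(n)) with sigma = 1

CRITICAL = 1.96  # two-sided 5% critical value
trials = 2000

# Fixed-n test: analyze only once, after all the data are in
fixed_hits = sum(1 for _ in range(trials)
                 if abs(list(z_after_each_batch())[-1]) > CRITICAL)

# Optional stopping: peek after every batch; any peek crossing the
# threshold counts as a 'significant' finding
peek_hits = sum(1 for _ in range(trials)
                if any(abs(z) > CRITICAL for z in z_after_each_batch()))

print(fixed_hits / trials)  # close to the nominal 0.05
print(peek_hits / trials)   # clearly inflated above 0.05
```

This is why frequentist sequential analyses lower the alpha level at each interim look; the subjective Bayesian, who is not calibrating beliefs against error rates, has no corresponding obligation.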
Can You Pick Only One Philosophy of Science?
Ice-cream stores fare well selling cones with more than one scoop. Sure, it can get a bit messy, but sometimes choosing is just too difficult. When it comes to philosophy of science, do you need to pick one approach and stick to it for all problems? I am not sure. Philosophers seem to implicitly suggest this (or at least they don’t typically discuss the pre-conditions to adopt their proposed philosophy of science, and seem to imply their proposal generalizes across fields and problems within fields).
Some viewpoints (such as whether there is a truth or not, and if theories have some relation to truth) seem rather independent of the research context. It is still fine to change your view over time (philosophers of science themselves change their opinion over time!) but they are probably somewhat stable.
Other viewpoints seem to leave more room for flexibility depending on the research you are doing. You might not believe theories in a specific field are good enough to be used as anything but crude verbal descriptions of phenomena. I teach an introduction to psychology course at Eindhoven University of Technology, and near the end of one term a physics student approached me after class and said: “You very often use the word ‘theory’, but many of these ‘theories’ don’t really sound like theories”. If you have ever tried to create computational models of psychological theories, you will have experienced that it typically cannot be done: theories lack sufficient detail. Furthermore, you might feel the concepts used in your research area are not specified enough to really know what we are talking about. If this is the case, you might not aim to test theories (or to explain phenomena) but mainly want to focus on conceptual progress, by improving measurement techniques or accurately estimating effect sizes. Or you might work on more applied problems and believe that a specific theory is just a useful instrument that guides you towards possibly interesting questions, but is not in itself something that can be tested, or that accurately describes reality.
Researchers often cite and use theories in their research, but they are rarely explicit about what these theories mean to them. Do you believe theories reflect some truth about the world, or are they just useful instruments to guide research that need not be believed to be true? Is the goal of your research to test theories, or to construct them? Do you have a strong belief that the unobservable entities you are studying are real, or do you prefer to limit your beliefs to statements about things you can directly observe? Being clear about where you stand on these questions might clarify what different scientists expect scientific progress to look like, and what their goals are when they collect data. It might explain differences in how people respond when a theoretical prediction is not confirmed, or why some researchers prefer to accurately measure the entities they study, while others prefer to test theoretical predictions.
Feyerabend, P. (1993). Against method. London: Verso.
Fisher, R. A. (1935). The design of experiments. Oliver And Boyd; London.
Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38.
Hacking, I. (1982). Experimentation and Scientific Realism. Philosophical Topics, 13(1), 71–87.
Keynes, J. M. (1921). A Treatise on Probability. Cambridge University Press.
Kitcher, P. (1993). The advancement of science: science without legend, objectivity without illusions. New York: Oxford University Press.
Ladyman, J. (2002). Understanding philosophy of science. London; New York: Routledge.
Mayo, D. G. (2018). Statistical inference as severe testing: how to get beyond the statistics wars. Cambridge: Cambridge University Press.
Meehl, P. E. (2004). Cliometric metatheory III: Peircean consensus, verisimilitude and asymptotic method. The British Journal for the Philosophy of Science, 55(4), 615–643.
Niiniluoto, I. (1999). Critical Scientific Realism. Oxford University Press.
Popper, K. R. (1959). The logic of scientific discovery. London; New York: Routledge.
Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review, 21(2), 301–308.
Taper, M. L., & Lele, S. R. (2011). Evidence, Evidence Functions, and Error Probabilities. In P. S. Bandyopadhyay & M. R. Forster (Eds.), Philosophy of Statistics (Vol. 7, pp. 513–532). Amsterdam: North-Holland.

The Only Substitute for Metrics is Better Metrics

Comment on: Mryglod, Olesya, Ralph Kenna, Yurij Holovatch and Bertrand Berche (2014) Predicting the results of the REF using departmental h-index: A look at biology, chemistry, physics, and sociology. LSE Impact Blog 12(6)

“The man who is ready to prove that metaphysical knowledge is wholly impossible… is a brother metaphysician with a rival theory.” – Bradley, F. H. (1893) Appearance and Reality

The topic of using metrics for research performance assessment in the UK has a rather long history, beginning with the work of Charles Oppenheim.

The solution is neither to abjure metrics nor to pick and stick to one unvalidated metric, whether it’s the journal impact factor or the h-index.

The solution is to jointly test and validate, field by field, a battery of multiple, diverse metrics (citations, downloads, links, tweets, tags, endogamy/exogamy, hubs/authorities, latency/longevity, co-citations, co-authorships, etc.) against a face-valid criterion (such as peer rankings).
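As a purely illustrative sketch (synthetic data; all names are mine) of what validating metrics against a face-valid criterion could look like for a single field: compute each candidate metric’s correlation with peer rankings, and retain only the metrics that track the criterion.

```python
import random

random.seed(3)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Synthetic scores for 50 departments in one field: a peer ranking (the
# face-valid criterion), one metric that tracks peer judgment, and one
# that is pure noise
peer_ranking = [random.gauss(0, 1) for _ in range(50)]
citation_metric = [p + random.gauss(0, 0.5) for p in peer_ranking]
noise_metric = [random.gauss(0, 1) for _ in range(50)]

# Validation: keep metrics that correlate with the criterion, drop the rest
print(pearson_r(peer_ranking, citation_metric))  # strong correlation
print(pearson_r(peer_ranking, noise_metric))     # near zero
```

In practice the validation would be done jointly (e.g., with multiple regression) over many real metrics, field by field, but the principle is the same: the criterion disciplines the battery of metrics.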

See also: “On Metrics and Metaphysics” (2008)

Oppenheim, C. (1996). Do citations count? Citation indexing and the Research Assessment Exercise (RAE). Serials: The Journal for the Serials Community, 9(2), 155-161.

Oppenheim, C. (1997). The correlation between citation counts and the 1992 research assessment exercise ratings for British research in genetics, anatomy and archaeology. Journal of documentation, 53(5), 477-487.

Oppenheim, C. (1995). The correlation between citation counts and the 1992 Research Assessment Exercise Ratings for British library and information science university departments. Journal of Documentation, 51(1), 18-27.

Oppenheim, C. (2007). Using the h-index to rank influential British researchers in information science and librarianship. Journal of the American Society for Information Science and Technology, 58(2), 297-301.

Harnad, S. (2001) Research access, impact and assessment. Times Higher Education Supplement 1487: p. 16.

Harnad, S. (2003) Measuring and Maximising UK Research Impact. Times Higher Education Supplement. Friday, June 6 2003

Harnad, S., Carr, L., Brody, T. & Oppenheim, C. (2003) Mandated online RAE CVs Linked to University Eprint Archives: Improving the UK Research Assessment Exercise whilst making it cheaper and easier. Ariadne 35.

Hitchcock, Steve; Woukeu, Arouna; Brody, Tim; Carr, Les; Hall, Wendy and Harnad, Stevan. (2003) Evaluating Citebase, an open access Web-based citation-ranked search and impact discovery service Technical Report, ECS, University of Southampton.

Harnad, S. (2004) Enrich Impact Measures Through Open Access Analysis. British Medical Journal BMJ 2004; 329:

Harnad, S. (2006) Online, Continuous, Metrics-Based Research Assessment. Technical Report, ECS, University of Southampton.  

Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web Usage Statistics as Predictors of Later Citation Impact. Journal of the American Association for Information Science and Technology (JASIST) 57(8) pp. 1060-1072.

Brody, T., Carr, L., Harnad, S. and Swan, A. (2007) Time to Convert to Metrics. Research Fortnight 17-18.

Brody, T., Carr, L., Gingras, Y., Hajjem, C., Harnad, S. and Swan, A. (2007) Incentivizing the Open Access Research Web: Publication-Archiving, Data-Archiving and Scientometrics. CTWatch Quarterly 3(3).

Harnad, S. (2008) Validating Research Performance Metrics Against Peer Rankings. Ethics in Science and Environmental Politics 8 (11) doi:10.3354/esep00088 The Use And Misuse Of Bibliometric Indices In Evaluating Scholarly Performance

Harnad, S. (2008) Self-Archiving, Metrics and Mandates. Science Editor 31(2) 57-59

Harnad, S., Carr, L. and Gingras, Y. (2008) Maximizing Research Progress Through Open Access Mandates and Metrics. Liinc em Revista 4(2).

Harnad, S. (2009) Open Access Scientometrics and the UK Research Assessment Exercise. Scientometrics 79 (1) Also in Proceedings of 11th Annual Meeting of the International Society for Scientometrics and Informetrics 11(1), pp. 27-33, Madrid, Spain. Torres-Salinas, D. and Moed, H. F., Eds. (2007)

Harnad, S. (2009) Multiple metrics required to measure research performance. Nature (Correspondence) 457 (785) (12 February 2009)

Harnad, S; Carr, L; Swan, A; Sale, A & Bosc H. (2009) Maximizing and Measuring Research Impact Through University and Research-Funder Open-Access Self-Archiving Mandates. Wissenschaftsmanagement 15(4) 36-41

Progressive vs Treadwater Fields

There are many reasons why grumbling about attempts to replicate is unlikely in the physical or even the biological sciences, but the main reason is that in most other sciences research is cumulative:

Experimental and observational findings that are worth knowing are those on which further experiments and observations can be built, for an ever fuller and deeper causal understanding of the system under study, whether the solar system or the digestive system. If the finding is erroneous, the attempts to build on it collapse. Cumulative replication is built into the trajectory of research itself, for those findings that are worth knowing.

In contrast, if no one bothers to build anything on it, chances are that a finding was not worth knowing (and so it matters little whether it would replicate or not, if tested again).

Why is it otherwise in many areas of Psychology? Why do the outcomes of so many one-shot, hit-and-run studies keep being reported in textbooks?

Because so much of Psychology is not cumulative explanatory research at all. It is a helter-skelter collection of statistical outcomes that manage to do two things: (1) meet a criterion for statistical significance (i.e., a low probability that they occurred by chance) and (2) lend themselves to an attention-catching interpretation.

No wonder that their authors grumble when replicators spoil the illusion.

Yes, open access, open commentary and crowd-sourcing are needed in all fields, for many reasons, but for one reason more in hit-and-run fields.

Spurning the Better to Keep Burning for the Best

Björn Brembs (as interviewed by Richard Poynder) is not satisfied with “read access” (free online access: Gratis OA): he wants “read/write access” (free online access plus re-use rights: Libre OA).

The problem is that we are nowhere near having even the read-access that Björn is not satisfied with.

So his dissatisfaction is not only with something we do not yet have, but with something that is also an essential component and prerequisite of read/write access. Björn wants more, now, when we don’t even have less.

And alas Björn does not give even a hint of a hint of a practical plan for getting read/write access instead of “just” the read access we don’t yet have.

All he proposes is that a consortium of rich universities should cancel journals and take over.

Before even asking what on earth those universities would/should/could do, there is the question of how their users would get access to all those cancelled journals (otherwise this “access” would be even less than less!). Björn’s reply — doubly alas — uses the name of my eprint-request Button in vain:

The eprint-request Button is only legal, and only works, because authors are providing access to individual eprint requestors for their own articles. If the less-rich universities that were not part of this brave take-over consortium of journal-cancellers began providing automatic Button-access to all those extra-institutional users, their institutional license costs (subscriptions) would skyrocket: publishers set Big-Deal license fees on the basis of the size of each institution’s total usership, which, on Björn’s scheme, would now include all the users of all the cancelling institutions.

So back to the work-bench on that one.

Björn seems to think that OA is just a technical matter, since all the technical wherewithal is already in place, or nearly so. But in fact, the technology for Green Gratis (“read-only”) OA has been in place for over 20 years, and we are still nowhere near having it. (We may, optimistically, be somewhere between 20% and 30%, though certainly not even the 50% that Science-Metrix has optimistically touted recently as the “tipping point” for OA, because much of that is post-embargo, hence Delayed Access (DA), not OA.)

Björn also seems to have proud plans for post-publication “peer review” (which is rather like finding out whether the water you just drank was drinkable on the basis of some crowd-sourcing after you drank it).

Post-publication crowd-sourcing is a useful supplement to peer review, but certainly not a substitute for it.

All I can do is repeat what I’ve had to say so many times across the past 20 years, as each new generation first comes in contact with the access problem and proposes its prima facie solutions (none of which are new: they have all been proposed so many times that they — and their fatal flaws — have each had their own FAQs for over a decade). The watchword here, again, is that the primary purpose of the Open Access movement is to free the peer-reviewed literature from access-tolls — not to free it from peer review. And before you throw out the peer review system, make sure you have a tried, tested, scalable and sustainable system with which to replace it, one that demonstrably yields at least the same quality (and hence usability) as the existing system does.

Till then, focus on freeing access to the peer-reviewed literature such as it is.

And that’s read-access, which is much easier to provide than read-write access. None of the Green (no-embargo) publishers are read-write Green: just read-Green. Insisting on read-write would be an excellent way to get them to adopt and extend embargoes, just as the foolish Finch preference for Gold did (and just as Rick Anderson’s absurd proposal to cancel Green (no-embargo) journals would do).

And, to repeat: after 20 years, we are still nowhere near 100% read-Green, largely because of phobias about publisher embargoes on read-Green. Björn is urging us to insist on even more than read-Green. Another instance of letting the (out-of-reach) Best get in the way of the (within-reach) Better. And that, despite the fact that it is virtually certain that once we have 100% read-Green, the other things we seek — read-write, Fair-Gold, copyright reform, publishing reform, perhaps even peer review reform — will all follow, as surely as day follows night.

But not if we contribute to slowing our passage to the Better (which there is already a tried and tested means of reaching, via institutional and funder mandates) by rejecting or delaying the Better in the name of holding out for a direct sprint to the Best (which no one has a tried and tested means of reaching, other than to throw even more money at publishers for Fool’s Gold). Björn’s speculation that universities should cancel journals, rely on interlibrary loan, and scrap peer-review for post-hoc crowd-sourcing is certainly not a tried and tested means!

As to journal ranking and citation impact factors: They are not the problem. No one is preventing the use of article- and author-based citation counts in evaluating articles and authors. And although the correlation between journal impact factors and journal quality and importance is not that big, it’s nevertheless positive and significant. So there’s nothing wrong with libraries using journal impact factors as one of a battery of many factors (including user surveys, usage metrics, institutional fields of interest, budget constraints, etc.) in deciding which journals to keep or cancel. Nor is there anything wrong with research performance evaluation committees using journal impact factors as one of a battery of many factors (alongside article metrics, author metrics, download counts, publication counts, funding, doctoral students, prizes, honours, and peer evaluations) in assessing and rewarding research progress.

The problem is neither journal impact factors nor peer review: The only thing standing between the global research community and 100% OA (read-Green) is keystrokes. Effective institutional and funder mandates can and will ensure that those keystrokes are done. Publisher embargoes cannot stop them: With immediate-deposit mandates, 100% of articles (final, refereed drafts) are deposited in the author’s institutional repository immediately upon acceptance for publication. At least 60% of them can be made immediately OA, because at least 60% of journals don’t embargo (read-Green) OA; access to the other 40% of deposits can be made Restricted Access, and it is there that the eprint-request Button can provide Almost-OA with one extra keystroke from the would-be user to request it and one extra keystroke from the author to fulfill the request.

That done, globally, and we can leave it to nature (and human nature) to ensure that the “Best” (100% immediate OA, subscription collapse, conversion to Fair Gold, all the re-use rights users need, and even peer-review reform) will soon follow.

But not as long as we continue spurning the Better and just burning for the Best.

Stevan Harnad

Paid-Gold OA, Free-Gold OA & Journal Quality Standards

Peter Suber has pointed out that “About 50% of articles published in peer-reviewed OA journals are published in fee-based journals” (as reported by Laakso & Björk 2012).

Laakso & Bjork also report that “[12% of] articles published during 2011 and indexed in the most comprehensive article-level index of scholarly articles (Scopus) are available OA through journal publishers… immediately…”.

That’s 12% immediate Gold-OA for the (already selective) SCOPUS sample. The percentage is still smaller for the more selective Thomson-Reuters/ISI sample. 

I think it cannot be left out of the reckoning about paid-Gold OA vs. free-Gold OA that: 

(#1) most articles are not published as Gold OA at all today (neither paid-Gold nor free-Gold)

(#2) the articles of the quality that users need and want most are much less likely to be published as Gold OA (whether paid-Gold or free-Gold) today, and, most important,

(#3) the Gold OA articles of the quality that users need and want most today are less likely to be the free-Gold ones than the paid-Gold ones (even though the junk journals on Jeffrey Beall’s “predatory” Gold OA journal list are all paid-Gold).

#2 and #3 are hypotheses, but I think they can be tested objectively.

A test for #2 would be to compare the download and citation counts (not the journal impact factors) for Gold OA (including hybrid Gold) articles vs non-Gold subscription journal articles (excluding the ones that have been made Green OA) within the same subject (and language!) area.

A test for #3 would be to compare the download and citation counts (not the journal impact factors) for paid-Gold (including hybrid Gold) vs free-gold articles within the same subject (and language!) area.

I mention this because I think just comparing the number of paid-Gold vs. free-Gold journals without taking quality into account could be misleading.
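The comparison proposed in test #3 can be sketched concretely. Below is a minimal Python sketch using a rank-based (Mann-Whitney-style) statistic on citation counts rather than journal impact factors. The citation counts are entirely made up for illustration, and the function name `rank_sum_u` is my own; a real test would draw matched samples (same subject area and language) from an index such as Scopus.

```python
# Rank-based comparison of citation counts for paid-Gold vs free-Gold
# articles, as in test #3 above. All numbers are hypothetical.

def rank_sum_u(sample_a, sample_b):
    """U statistic: the number of (a, b) pairs with a > b, counting
    ties as half. U / (len(a) * len(b)) estimates the probability
    that a randomly chosen article from sample_a out-cites one
    from sample_b."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

paid_gold_citations = [12, 7, 30, 4, 19, 8]   # hypothetical counts
free_gold_citations = [3, 5, 1, 9, 2, 6]      # hypothetical counts

u = rank_sum_u(paid_gold_citations, free_gold_citations)
share = u / (len(paid_gold_citations) * len(free_gold_citations))
print(f"P(paid-Gold article out-cites a free-Gold one) = {share:.2f}")
```

The same function, applied to download counts, would serve for test #2 (Gold vs non-Gold, excluding Green OA articles).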

Comparing Carrots and Lettuce

These are comments on Stephen Curry’s “The inexorable rise of open access scientific publishing”.

Our (Gargouri, Lariviere, Gingras, Carr & Harnad) estimate (for publication years 2005-2010, measured in 2011, based on articles published in the c. 12,000 journals indexed by Thomson-Reuters ISI) is 35% total OA in the UK (10% above the worldwide total OA average of 25%): This is the sum of both Green and Gold OA.

Our sample yields a Gold OA estimate much lower than Laakso & Björk’s. Our estimate of about 25% OA worldwide is composed of 22.5% Green plus 2.5% Gold. And the growth rate of neither Gold nor (unmandated) Green is exponential.

There are a number of reasons neither “carrots vs. lettuce” nor “UK vs. non-UK produce” nor L&B estimates vs. G et al estimates can be compared or combined in a straightforward way.

Please take the following as coming from a fervent supporter of OA, not an ill-wisher, but one who has been disappointed across the long years by far too many failures to seize the day — amidst surges of “tipping-point” euphoria — to be ready once again to tout triumph.

First, note that the hubbub is yet again about Gold OA (publishing), even though all estimates agree that there is far less of Gold OA than there is of Green OA (self-archiving), and even though it is Green OA that can be fast-forwarded to 100%: all it takes is effective Green OA mandates (I will return to this point at the end).

So Stephen Curry asks why there is a discrepancy between our (Gargouri et al) estimates of Gold OA in the UK and worldwide (c. <5%) and the estimates of Laakso & Björk (17%). Here are some of the multiple reasons (several of them already pointed out by Richard van Noorden in his comments too):

1. Thomson-Reuters ISI Subset: Our estimates are based solely on articles in the Thomson-Reuters ISI database of c. 12,000 journals. This database is more selective than the SCOPUS database on which L&B’s sample is based. The more selective journals have higher quality standards and are hence the ones that both authors and users prefer.

(Without getting into the controversy about journal citation impact factors, another recent L&B study has shown that the higher the journal’s impact factor, the less likely that the journal is Gold OA. — But let me add that this is now likely to change, because of the perverse effects of the Finch Report and the RCUK OA Policy: Thanks to the UK’s announced readiness to divert UK research funds to double-paying subscription journal publishers for hybrid Gold OA, most journals, including the top journals, will soon be offering hybrid Gold OA — a very pricey way to add the UK’s 6% of worldwide research output to the worldwide Gold OA total: The very same effect could be achieved free of extra cost if RCUK instead adopted a compliance-verification mechanism for its existing Green OA mandates.)

2. Embargoed “Gold OA”: L&B included in their Gold OA estimates “OA” that was embargoed for a year. That’s not OA, and certainly should not be credited to the total OA for any given year — whence it is absent — but to the next year. By that time, the Green OA embargoes of most journals have already expired. So, again, any OA purchased in this pricey way — instead of for a few extra cost-free keystrokes by the author, for Green — is more of a head-shaker than occasion for heady triumph.

3. 1% Annual Growth: The 1% annual growth of Gold OA is not much headway either, if you do the growth curves for the projected date they will reach 100%! (The more heady Gold OA growth percentages are not Gold OA growth as a percentage of all articles published, but Gold OA growth as a percentage of the preceding year’s Gold OA articles.)
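To make point 3 concrete, here is the back-of-the-envelope arithmetic: a sketch assuming a constant gain of one percentage point of all published articles per year (the linear rate both studies report), starting from the two Gold OA shares quoted in this post. The function name is my own illustration.

```python
# Projection: if Gold OA grows by a constant 1 percentage point of
# all published articles per year, when does it reach 100%?
# Starting shares are the estimates quoted above
# (Gargouri et al. ~5%; Laakso & Björk ~17%).

def years_to_full_oa(current_share_pct, annual_gain_pct=1.0):
    """Years until the Gold OA share reaches 100% at a fixed
    percentage-point gain per year."""
    return (100.0 - current_share_pct) / annual_gain_pct

for share in (5.0, 17.0):
    print(f"From {share:.0f}%: {years_to_full_oa(share):.0f} more years")
# From 5%: 95 more years
# From 17%: 83 more years
```

Either way, the projected date for 100% Gold OA at the observed rate is most of a century away, which is the point being made above.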

4. Green Achromatopsia: The relevant data for comparing Gold OA — both its proportion and its growth rate — with Green come from a source L&B do not study, namely, institutions with (effective) Green OA mandates. Here the proportions within two years of mandate adoption (60%+) and the subsequent growth rate toward 100% eclipse not only the worldwide Gold OA proportions and growth rate, but also the larger but still unimpressive worldwide Green OA proportions and growth rate for unmandated Green OA (which is still mostly all there is).

5. Mandate Effectiveness: Note also that RCUK’s prior Green OA mandate was not an effective one (because it had no compliance verification mechanism), even though it may have increased UK OA (35%) by 10% over the global average (25%).

Stephen Curry: “A cheaper green route is also available, whereby the author usually deposits an unformatted version of the paper in a university repository without incurring a publisher’s charge, but it remains to be seen if this will be adopted in practice. Universities and research institutions are only now beginning to work out how to implement the new policy (recently clarified by the RCUK).”

Well, actually RCUK has had Green OA mandates for over a half-decade now. But RCUK has failed to draw the obvious conclusion from its pioneering experiment — which is that the RCUK mandates require an effective compliance-verification mechanism (of the kind that the effective university mandates have — indeed, the universities themselves need to be recruited as the compliance-verifiers).

Instead, taking their cue from the Finch Report — which in turn took its cue from the publisher lobby — RCUK is doing a U-turn from its existing Green OA mandate, and electing to double-pay publishers for Gold instead.

A much more constructive strategy would be for RCUK to build on its belated grudging concession (that although Gold is RCUK’s preference, RCUK fundees may still choose Green) by adopting an effective Green OA compliance verification mechanism. That (rather than the obsession with how to spend “block grants” for Gold) is what the fundees’ institutions should be recruited to do for RCUK.

6. Discipline Differences: The main difference between the Gargouri, Lariviere, Gingras, Carr & Harnad estimates of average percent Gold in the ISI sample (2.5%) and the Laakso & Björk estimates (10.3% for 2010) probably arises because L&B’s sample included all ISI articles per year for 12 years (2000-2011), whereas ours was a sample of 1300 articles per year, per discipline, separately, for each of 14 disciplines, for 6 years (2005-2010: a total of about 100,000 articles).

7. Biomedicine Preponderance? Our sample was much smaller than L&B’s because L&B were just counting total Gold articles, using DOAJ, whereas we were sending out a robot to look for Green OA versions on the Web for each of the 100,000 articles in our sample. It may be this equal sampling across disciplines that leads to our lower estimate of Gold: L&B’s higher estimate may reflect the fact that certain disciplines are both more Gold and publish more articles (in our sample, Biomed was 7.9% Gold). Note that both studies agree on the annual growth rate of Gold (about 1%).

8. Growth Spurts? Our projection does not assume a linear year-to-year growth rate (1%); it detects it. There have so far been no detectable annual growth spurts (of either Gold or Green). (I agree, however, that Finch/RCUK could herald one forthcoming annual spurt of 6% Gold (the UK’s share of world research output) — but that would be a rather pricey (and, I suspect, unscalable and unsustainable) one-off growth spurt.)

9. RCUK Compliance Verification Mechanism for Green OA Deposits: I certainly hope Stephen Curry is right that I am overstating the ambiguity of the RCUK policy!

But I was not at all reassured at the LSHTM meeting on Open Access by Ben Ryan’s rather vague remarks about monitoring RCUK mandate compliance, especially compliance with Green. After all that (and not the failure to prefer and fund Gold) was the main weakness of the prior RCUK OA mandate.

Stevan Harnad