2021 Replicability Report for the Psychology Department at New York University

Introduction

Since 2011, it is an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using department’s websites to identify researchers that belong to the psychology department. This eliminates articles that are from other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large difference in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

New York University

I used the department website to find core members of the psychology department. I found 13 professors and 6 associate professors. Figure 1 shows the z-curve for all 12,365 tests statistics in articles published by these 19 faculty members. I use the Figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,239 (~ 10%) of z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (dashed blue/red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red/white line shows significance for p < .10, which is often used for marginal significance. There is another drop around this level of significance.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The full grey curve is not shown to present a clear picture of the observed distribution. The statistically significant results (including z > 6) make up 20% of the total area under the grey curve. This is called the expected discovery rate because the results provide an estimate of the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, the percentage of significant results (including z > 6) includes 70% of the published results. This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 70% ODR and a 20% EDR provides an estimate of the extent of selection for significance. The difference of 50 percentage points is large. The upper level of the 95% confidence interval for the EDR is 28%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR is similar (70 vs. 72%), but the EDR is a bit lower (20% vs. 28%), although the difference might be largely due to chance.

4. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 20% implies that no more than 20% of the significant results are false positives, however the upper limit of the 95%CI of the EDR, 28%, allows for 36% false positive results. Most readers are likely to agree that this is an unacceptably high risk that published results are false positives. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 3% with an upper limit of the 95% confidence interval of XX%. Thus, without any further information readers could use this criterion to interpret results published by NYU faculty members.

5. The estimated replication rate is based on the mean power of significant studies (Brunner & Schimmack, 2020). Under ideal condition, mean power is a predictor of the success rate in exact replication studies with the same sample sizes as the original studies. However, as NYU professor van Bavel pointed out in an article, replication studies are never exact, especially in social psychology (van Bavel et al., 2016). This implies that actual replication studies have a lower probability of producing a significant result, especially if selection for significance is large. In the worst case scenario, replication studies are not more powerful than original studies before selection for significance. Thus, the EDR provides an estimate of the worst possible success rate in actual replication studies. In the absence of further information, I have proposed to use the average of the EDR and ERR as a predictor of actual replication outcomes. With an ERR of 62% and an EDR of 20%, this implies an actual replication prediction of 41%. This is close to the actual replication rate in the Open Science Reproducibility Project (Open Science Collaboration, 2015). The prediction for results published in 120 journals in 2010 was (ERR = 67% + ERR = 28%)/ 2 = 48%. This suggests that results published by NYU faculty are slightly less replicable than the average result published in psychology journals, but the difference is relatively small and might be mostly due to chance.

6. There are two reasons for low replication rates in psychology. One possibility is that psychologists test many false hypotheses (i.e., H0 is true) and many false positive results are published. False positive results have a very low chance of replicating in actual replication studies (i.e. 5% when .05 is used to reject H0), and will lower the rate of actual replications a lot. Alternative, it is possible that psychologists tests true hypotheses (H0 is false), but with low statistical power (Cohen, 1961). It is difficult to distinguish between these two explanations because the actual rate of false positive results is unknown. However, it is possible to estimate the typical power of true hypotheses tests using Soric’s FDR. If 20% of the significant results are false positives, the power of the 80% true positives has to be (.62 – .2*.05)/.8 = 76%. This would be close to Cohen’s recommended level of 80%, but with a high level of false positive results. Alternatively, the null-hypothesis may never be really true. In this case, the ERR is an estimate of the average power to get a significant result for a true hypothesis. Thus, power is estimated to be between 62% and 76%. The main problem is that this is an average and that many studies have less power. This can be seen in Figure 1 by examining the local power estimates for different levels of z-scores. For z-scores between 2 and 2.5, the ERR is only 47%. Thus, many studies are underpowered and have a low probability of a successful replication with the same sample size even if they showed a true effect.

Area

The results in Figure 1 provide highly aggregated information about replicability of research published by NYU faculty. The following analyses examine potential moderators. First, I examined social and cognitive research. Other areas were too small to be analyzed individually.

The z-curve for the 11 social psychologists was similar to the z-curve in Figure 1 because they provided more test statistics and had a stronger influence on the overall result.

The z-curve for the 6 cognitive psychologists looks different. The EDR and ERR are higher for cognitive psychology, and the 95%CI for social and cognitive psychology do not overlap. This suggests systematic differences between the two fields. These results are consistent with other comparisons of the two fields, including actual replication outcomes (OSC, 2015). With an EDR of 44%, the false discovery risk for cognitive psychology is only 7% with an upper limit of the 95%CI at 12%. This suggests that the conventional criterion of .05 does keep the false positive risk at a reasonably low level or that an adjustment to alpha = .01 is sufficient. In sum, the results show that results published by cognitive researchers at NYU are more replicable than those published by social psychologists.

Position

Since 2015 research practices in some areas of psychology, especially social psychology, have changed to increase replicability. This would imply that research by younger researchers is more replicable than research by more senior researchers that have more publications before 2015. A generation effect would also imply that a department’s replicability increases when older faculty members retire. On the other hand, associate professors are relatively young and likely to influence the reputation of a department for a long time.

The figure above shows that most test statistics come from the (k = 13) professors. As a result, the z-curve looks similar to the z-curve for all test values in Figure 1. The results for the 6 associate professors (below) are more interesting. Although five of the six associate professors are in the social area, the z-curve results show a higher EDR and less selection bias than the plot for all social psychologists. This suggests that the department will improve when full professors in social psychology retire.

The table below shows the meta-statistics of all 19 faculty members. You can see the z-curve for each faculty member by clicking on their name.

Rank   Name ARP EDR ERR FDR
1 Jonathan B. Freeman 84 86 82 1
2 Karen E. Adolph 66 76 56 4
3 Bob Rehder 61 75 47 6
4 Marjorie Rhodes 58 68 48 6
5 David M. Amodio 53 65 40 8
6 Brian McElree 54 59 49 6
7 Todd M. Gureckis 49 75 23 17
8 Emily Balcetis 48 68 28 13
9 Eric D. Knowles 48 60 35 10
10 Jay J. van Bavel 55 66 44 7
11 Tessa V. West 46 55 37 9
12 Madeline E. Heilman 44 66 23 18
13 Catherine A. Hartley 45 70 19 23
14 John T. Jost 44 62 26 15
15 Andrei Cimpian 42 64 20 21
16 Peter M. Gollwitzer 36 54 18 25
17 Yaacov Trope 34 54 14 32
18 Gabriele Oettingen 30 46 14 32
19 Susan M. Andersen 30 47 13 35

Incidental Anchoring Bites the Dust

Update: 6/10/21

After I posted this post, I learned about a published meta-analysis and new studies of incidental anchoring by David Shanks and colleagues that came to the same conclusion (Shanks et al., 2020).

Introduction

“The most expensive car in the world costs $5 million. How much does a new BMW 530i cost?”

According to anchoring theory, information about the most expensive car can lead to higher estimates for the cost of a BMW. Anchoring effects have been demonstrated in many credible studies since the 1970s (Kahneman & Tversky, 1973).

A more controversial claim is that anchoring effects even occur when the numbers are unrelated to the question and presented incidentally (Criticher & Gilovich, 2008). In one study, participants saw a picture of a football player and were asked to guess how likely it is that the player will sack the football player in the next game. The player’s number on jersey was manipulated to be 54 or 94. The study produced a statistically significant result suggesting that a higher number makes people give higher likelihood judgments. This study started a small literature on incidental anchoring effects. A variation on this them are studies that presented numbers so briefly on a computer screen that most participants did not actually see the numbers. This is called subliminal priming. Allegedly, subliminal priming also produced anchoring effects (Mussweiler & Englich (2005).

Since 2011, many psychologists are skeptical whether statistically significant results in published articles can be trusted. The reason is that researchers only published results that supported their theoretical claims even when the claims were outlandish. For example, significant results also suggested that extraverts can foresee where pornographic images are displayed on a computer screen even before the computer randomly selected the location (Bem, 2011). No psychologist, except Bem, believes these findings. More problematic is that many other findings are equally incredible. A replication project found that only 25% of results in social psychology could be replicated (Open Science Collaboration, 2005). So, the question is whether incidental and subliminal anchoring are more like classic anchoring or more like extrasensory perception.

There are two ways to assess the credibility of published results when publication bias is present. One approach is to conduct credible replication studies that are published independent of the outcome of a study. The other approach is to conduct a meta-analysis of the published literature that corrects for publication bias. A recent article used both methods to examine whether incidental anchoring is a credible effect (Kvarven et al., 2020). In this article, the two approaches produced inconsistent results. The replication study produced a non-significant result with a tiny effect size, d = .04 (Klein et al., 2014). However, even with bias-correction, the meta-analysis suggested a significant, small to moderate effect size, d = .40.

Results

The data for the meta-analysis were obtained from an unpublished thesis (Henriksson, 2015). I suspected that the meta-analysis might have coded some studies incorrectly. Therefore, I conducted a new meta-analysis, using the same studies and one new study. The main difference between the two meta-analysis is that I coded studies based on the focal hypothesis test that was used to claim evidence for incidental anchoring. The p-values were then transformed into fisher-z transformed correlations and and sampling error, 1/sqrt(N – 3), based on the sample sizes of the studies.

Whereas the old meta-analysis suggested that there is no publication bias, the new meta-analysis showed a clear relationship between sampling error and effect sizes, b = 1.68, se = .56, z = 2.99, p = .003. Correcting for publication bias produced a non-significant intercept, b = .039, se = .058, z = 0.672, p = .502, suggesting that the real effect size is close to zero.

Figure 1 shows the regression line for this model in blue and the results from the replication study in green. We see that the blue and green lines intersect when sampling error is close to zero. As sampling error increases because sample sizes are smaller, the blue and green line diverge more and more. This shows that effect sizes in small samples are inflated by selection for significance.

However, there is some statistically significant variability in the effect sizes, I2 = 36.60%, p = .035. To further examine this heterogeneity, I conducted a z-curve analysis (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). A z-curve analysis converts p-values into z-statistics. The histogram of these z-statistics shows publication bias, when z-statistics cluster just above the significance criterion, z = 1.96.

Figure 2 shows a big pile of just significant results. As a result, the z-curve model predicts a large number of non-significant results that are absent. While the published articles have a 73% success rate, the observed discovery rate, the model estimates that the expected discovery rate is only 6%. That is, for every 100 tests of incidental anchoring, only 6 studies are expected to produce a significant result. To put this estimate in context, with alpha = .05, 5 studies are expected to be significant based on chance alone. The 95% confidence interval around this estimate includes 5% and is limited at 26% at the upper end. Thus, researchers who reported significant results did so based on studies with very low power and they needed luck or questionable research practices to get significant results.

A low discovery rate implies a high false positive risk. With an expected discovery rate of 6%, the false discovery risk is 76%. This is unacceptable. To reduce the false discovery risk, it is possible to lower the alpha criterion for significance. In this case, lowering alpha to .005 produces a false discovery risk of 5%. This leaves 5 studies that are significant.

One notable study with strong evidence, z = 3.70, examined anchoring effects for actual car sales. The data came from an actual auction of classic cars. The incidental anchors were the prices of the previous bid for a different vintage car. Based on sales data of 1,477 cars, the authors found a significant effect, b = .15, se = .04 that translates into a standardized effect size of d = .2 (fz = .087). Thus, while this study provides some evidence for incidental anchoring effects in one context, the effect size estimate is also consistent with the broader meta-analysis that effect sizes of incidental anchors are fairly small. Moreover, the incidental anchor in this study is still in the focus of attention and in some way related to the actual bid. Thus, weaker effects can be expected for anchors that are not related to the question at all (a player’s number) or anchors presented outside of awareness.

Conclusion

There is clear evidence that evidence for incidental anchoring cannot be trusted at face value. Consistent with research practices in general, studies on incidental and subliminal anchoring suffer from publication bias that undermines the credibility of the published results. Unbiased replication studies and meta-analysis suggest that incidental anchoring effects are either very small or zero. Thus, there exists currently no empirical support for the notion that irrelevant numeric information can bias numeric judgments. More research on anchoring effects that corrects for publication bias is needed.

Why I care about replication studies

In 2009 I attended a European Social Cognition Network meeting in Poland. I only remember one talk from that meeting: A short presentation in a nearly empty room. The presenter was a young PhD student – Stephane Doyen. He discussed two studies where he tried to replicate a well-known finding in social cognition research related to elderly priming, which had shown that people walked more slowly after being subliminally primed with elderly related words, compared to a control condition.

His presentation blew my mind. But it wasn’t because the studies failed to replicate – it was widely known in 2009 that these studies couldn’t be replicated. Indeed, around 2007, I had overheard two professors in a corridor discussing the problem that there were studies in the literature everyone knew would not replicate. And they used this exact study on elderly priming as one example. The best solution the two professors came up with to correct the scientific record was to establish an independent committee of experts that would have the explicit task of replicating studies and sharing their conclusions with the rest of the world. To me, this sounded like a great idea.

And yet, in this small conference room in Poland, there was this young PhD student, acting as if we didn’t need specially convened institutions of experts to inform the scientific community that a study could not be replicated. He just got up, told us about how he wasn’t able to replicate this study, and sat down.

It was heroic.

If you’re struggling to understand why on earth I thought this was heroic, then this post is for you. You might have entered science in a different time. The results of replication studies are no longer communicated only face to face when running into a colleague in the corridor, or at a conference. But I was impressed in 2009. I had never seen anyone give a talk in which the only message was that an original effect didn’t stand up to scrutiny. People sometimes presented successful replications. They presented null effects in lines of research where the absence of an effect was predicted in some (but not all) tests. But I’d never seen a talk where the main conclusion was just: “This doesn’t seem to be a thing”.

On 12 September 2011 I sent Stephane Doyen an email. “Did you ever manage to publish some of that work? I wondered what has happened to it.” Honestly, I didn’t really expect that he would manage to publish these studies. After all, I couldn’t remember ever having seen a paper in the literature that was just a replication. So I asked, even though I did not expect he would have been able to publish his findings.

Surprisingly enough, he responded that the study would soon appear in press. I wasn’t fully aware of new developments in the publication landscape, where Open Access journals such as PlosOne published articles as long as the work was methodologically solid, and the conclusions followed from the data. I shared this news with colleagues, and many people couldn’t wait to read the paper: An article, in print, reporting the failed replication of a study many people knew to be not replicable. The excitement was not about learning something new. The excitement was about seeing replication studies with a null effect appear in print.

Regrettably, not everyone was equally excited. The publication also led to extremely harsh online comments from the original researcher about the expertise of the authors (e.g., suggesting that findings can fail to replicate due to “Incompetent or ill-informed researchers”), and the quality of PlosOne (“which quite obviously does not receive the usual high scientific journal standards of peer-review scrutiny”). This type of response happened again, and again, and again. Another failed replication led to a letter by the original authors that circulated over email among eminent researchers in the area, was addressed to the original authors, and ended with “do yourself, your junior co-authors, and the rest of the scientific community a favor. Retract your paper.”

Some of the historical record on discussions between researchers around between 2012-2015 survives online, in Twitter and Facebook discussions, and blogs. But recently, I started to realize that most early career researchers don’t read about the replication crisis through these original materials, but through summaries, which don’t give the same impression as having lived through these times. It was weird to see established researchers argue that people performing replications lacked expertise. That null results were never informative. That thanks to dozens of conceptual replications, the original theoretical point would still hold up even if direct replications failed. As time went by, it became even weirder to see that none of the researchers whose work was not corroborated in replication studies ever published a preregistered replication study to silence the critics. And why were there even two sides to this debate? Although most people agreed there was room for improvement and that replications should play some role in improving psychological science, there was no agreement on how this should work. I remember being surprised that a field was only now thinking about how to perform and interpret replication studies if we had been doing psychological research for more than a century.
 

I wanted to share this autobiographical memory, not just because I am getting old and nostalgic, but also because young researchers are most likely to learn about the replication crisis through summaries and high-level overviews. Summaries of history aren’t very good at communicating how confusing this time was when we lived through it. There was a lot of uncertainty, diversity in opinions, and lack of knowledge. And there were a lot of feelings involved. Most of those things don’t make it into written histories. This can make historical developments look cleaner and simpler than they actually were.

It might be difficult to understand why people got so upset about replication studies. After all, we live in a time where it is possible to publish a null result (e.g., in journals that only evaluate methodological rigor, but not novelty, journals that explicitly invite replication studies, and in Registered Reports). Don’t get me wrong: We still have a long way to go when it comes to funding, performing, and publishing replication studies, given their important role in establishing regularities, especially in fields that desire a reliable knowledge base. But perceptions about replication studies have changed in the last decade. Today, it is difficult to feel how unimaginable it used to be that researchers in psychology would share their results at a conference or in a scientific journal when they were not able to replicate the work by another researcher. I am sure it sometimes happened. But there was clearly a reason those professors I overheard in 2007 were suggesting to establish an independent committee to perform and publish studies of effects that were widely known to be not replicable.

As people started to talk about their experiences trying to replicate the work of others, the floodgates opened, and the shells fell off peoples’ eyes. Let me tell you that, from my personal experience, we didn’t call it a replication crisis for nothing. All of a sudden, many researchers who thought it was their own fault when they couldn’t replicate a finding started to realize this problem was systemic. It didn’t help that in those days it was difficult to communicate with people you didn’t already know. Twitter (which is most likely the medium through which you learned about this blog post) launched in 2006, but up to 2010 hardly any academics used this platform. Back then, it wasn’t easy to get information outside of the published literature. It’s difficult to express how it feels when you realize ‘it’s not me – it’s all of us’. Our environment influences which phenotypic traits express themselves. These experiences made me care about replication studies.

If you started in science when replications were at least somewhat more rewarded, it might be difficult to understand what people were making a fuss about in the past. It’s difficult to go back in time, but you can listen to the stories by people who lived through those times. Some highly relevant stories were shared after the recent multi-lab failed replication of ego-depletion (see tweets by Tom Carpenter and Dan Quintana). You can ask any older researcher at your department for similar stories, but do remember that it will be a lot more difficult to hear the stories of the people who left academia because most of their PhD consisted of failures to build on existing work.

If you want to try to feel what living through those times must have been like, consider this thought experiment. You attend a conference organized by a scientific society where all society members get to vote on who will be a board member next year. Before the votes are cast, the president of the society informs you that one of the candidates has been disqualified. The reason is that it has come to the society’s attention that this candidate selectively reported results from their research lines: The candidate submitted only those studies for publication that confirmed their predictions, and did not share studies with null results, even though these null results were well designed studies that tested sensible predictions. Most people in the audience, including yourself, were already aware of the fact that this person selectively reported their results. You knew publication bias was problematic from the moment you started to work in science, and the field knew it was problematic for centuries. Yet here you are, in a room at a conference, where this status quo is not accepted. All of a sudden, it feels like it is possible to actually do something about a problem that has made you feel uneasy ever since you started to work in academia.

You might live through a time where publication bias is no longer silently accepted as an unavoidable aspect of how scientists work, and if this happens, the field will likely have a very similar discussion as it did when it started to publish failed replication studies. And ten years later, a new generation will have been raised under different scientific norms and practices, where extreme publication bias is a thing of the past. It will be difficult to explain to them why this topic was a big deal a decade ago. But since you’re getting old and nostalgic yourself, you think that it’s useful to remind them, and you just might try to explain it to them in a 2 minute TikTok video.

History merely repeats itself. It has all been done before. Nothing under the sun is truly new.
Ecclesiastes 1:9

Thanks to Farid Anvari, Ruben Arslan, Noah van Dongen, Patrick Forscher, Peder Isager, Andrea Kis, Max Maier, Anne Scheel, Leonid Tiokhin, and Duygu Uygun for discussing this blog post with me (and in general for providing such a stimulating social and academic environment in times of a pandemic).