Reevaluating the Predictive Validity of the Race Implicit Association Test

Over the past two decades, social psychological research on prejudice has been dominated by the implicit cognition paradigm (Meissner, Grigutsch, Koranyi, Müller, & Rothermund, 2019). This paradigm is based on the assumption that many individuals of the majority group (e.g., White US Americans) have an automatic tendency to discriminate against members of a stigmatized minority group (e.g., African Americans). It is assumed that this tendency is difficult to control because many people are unaware of their prejudices.

The implicit cognition paradigm also assumes that biases vary across individuals of the majority group. The most widely used measure of individual differences in implicit biases is the race Implicit Association Test (rIAT; Greenwald, McGhee, & Schwartz, 1998). Like any other measure of individual differences, the race IAT has to meet psychometric criteria to be a useful measure of implicit bias. Unfortunately, the race IAT has been used in hundreds of studies before its psychometric properties were properly evaluated in a program of validation research (Schimmack, 2021a, 2021b).

Meta-analytic reviews of the literature suggest that the race IAT is not as useful for the study of prejudice as it was promised to be (Greenwald et al., 1998). For example, Meissner et al. (2019) concluded that “the predictive value for behavioral criteria is weak and their incremental validity over and above self-report measures is negligible” (p. 1).

In response to criticism of the race IAT, Greenwald, Banaji, and Nosek (2015) argued that “statistically small effects of the implicit association test can have societally large effects” (p. 553). At the same time, Greenwald (1975) warned psychologists that they may be prejudiced against the null-hypothesis. To avoid this bias, he proposed that researchers should define a priori a range of effect sizes that are close enough to zero to decide in favor of the null-hypothesis. Unfortunately, Greenwald did not follow his own advice, and a clear criterion for a small but practically significant amount of predictive validity is lacking. This is a problem because estimates have decreased over time, from r = .39 (McConnell & Leibold, 2001), to r = .24 in 2009 (Greenwald, Poehlman, Uhlmann, & Banaji, 2009), to r = .148 in 2013 (Oswald, Mitchell, Blanton, Jaccard, & Tetlock, 2013), and to r = .097 in 2019 (Greenwald & Lai, 2020; Kurdi et al., 2019). Without a clear criterion value, it is not clear how this new estimate of predictive validity should be interpreted. Does it still provide evidence for a small but practically significant effect, or does it provide evidence for the null-hypothesis (Greenwald, 1975)?

Measures are not Causes

To justify the interpretation of a correlation of r = .1 as small but important, it is necessary to revisit Greenwald et al.’s (2015) arguments for this claim. Greenwald et al. (2015) interpret this correlation as evidence for an effect of the race IAT on behavior. For example, they write “small effects can produce substantial discriminatory impact also by cumulating over repeated occurrences to the same person” (p. 558). The problem with this causal interpretation of a correlation between two measures is that scores on the race IAT have no influence on individuals’ behavior. This simple fact is illustrated in Figure 1. Figure 1 is a causal model that assumes the race IAT reflects valid variance in prejudice and that prejudice influences actual behaviors (e.g., not voting for a Black political candidate). The model makes it clear that the correlation between scores on the race IAT (i.e., the iat box) and scores on a behavioral measure (i.e., the crit box) does not reflect a causal link (i.e., no path leads from the iat box to the crit box). Rather, the two measured variables are correlated because they both reflect the effect of a third variable: prejudice influences race IAT scores, and prejudice influences the variance in the criterion variable.

There is general consensus among social scientists that prejudice is a problem and that individual differences in prejudice have important consequences for individuals and society. The effect size of prejudice on a single behavior has not been clearly examined, but to the extent that race IAT scores are not perfectly valid measures of prejudice, the simple correlation of r = .1 is a lower limit of the effect size. Schimmack (2021) estimated that no more than 20% of the variance in race IAT scores is valid variance. With this validity coefficient, a correlation of r = .1 implies an effect of prejudice on actual behaviors of .1 / sqrt(.2) = .22.
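This implied effect size follows from the standard correction for attenuation. A minimal R sketch of the arithmetic (the .20 validity estimate is the one from Schimmack, 2021, cited above):

r_observed <- .10             # meta-analytic IAT-criterion correlation
valid_var  <- .20             # assumed proportion of valid variance in race IAT scores
r_observed / sqrt(valid_var)  # ~ .22, implied effect of prejudice on behavior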

Greenwald et al. (2015) correctly point out that effect sizes of this magnitude, r ~ .2, can have practical, real-world implications. The real question, however, is whether predictive validity of .1 justifies the use of the race IAT as a measure of prejudice. This question has to be evaluated in a comparison of predictive validity for the race IAT with other measures of prejudice. Thus, the real question is whether the race IAT has sufficient incremental predictive validity over other measures of prejudice. However, this question has been largely ignored in the debate about the utility of the race IAT (Greenwald & Lai, 2020; Greenwald et al., 2015; Oswald et al., 2013).

Kurdi et al. (2019) discuss incremental predictive validity, but this discussion is not limited to the race IAT and makes the mistake of correcting for random measurement error. As a result, the incremental predictive validity for IATs of b = .14 is a hypothetical estimate for IATs that are perfectly reliable. However, it is well-known that IATs are far from perfectly reliable. Thus, this estimate overestimates the incremental predictive validity. Using Kurdi et al.’s data and limiting the analysis to studies with the race IAT, I estimated incremental predictive validity to be b = .08, 95%CI = .04 to .12. It is difficult to argue that this is a practically significant amount of incremental predictive validity. At the very least, it does not justify the reliance on the race IAT as the only measure of prejudice or the claim that the race IAT is a superior measure of prejudice (Greenwald et al., 2009).

The meta-analytic estimate of b = .1 has to be interpreted in the context of evidence of substantial heterogeneity across studies (Kurdi et al., 2019). Kurdi et al. (2019) suggest that “it may be more appropriate to ask under what conditions the two [race IAT scores and criterion variables] are more or less highly correlated” (p. 575). However, little progress has been made in uncovering moderators of predictive validity. One possible explanation is that previous meta-analyses may have overlooked an important source of variation in effect sizes, namely publication bias. Traditional meta-analyses may be unable to reveal publication bias because they include many articles and outcome measures that did not focus on predictive validity. For example, Kurdi et al.’s meta-analysis included a study by Luo, Li, Ma, Zhang, Rao, and Han (2015). The main focus of this study was to examine the potential moderating influence of oxytocin on neurological responses to pain expressions of Asian and White faces. Like many neurological studies, the sample size was small (N = 32), but the study reported 16 brain measures. For the meta-analysis, correlations were computed across N = 16 participants separately for two experimental conditions. Thus, this study provided as many effect sizes as it had participants. Evidently, power to obtain a significant result with N = 16 and r = .1 is extremely low, and adding these 32 effect sizes to the meta-analysis merely introduced noise. This may undermine the validity of meta-analytic results (Sharpe, 1997). To address this concern, I conducted a new meta-analysis that differs from traditional meta-analyses. Rather than coding as many effects from as many studies as possible, I only included focal hypothesis tests from studies that aimed to investigate predictive validity. I call this a focused meta-analysis.

Focused Meta-Analysis of Predictive Validity

Coding of Studies

I relied on Kurdi et al.’s meta-analysis to find articles. I selected only published articles that used the race IAT (k = 96). The main purpose of including unpublished studies is often to correct for publication bias (Kurdi et al., 2019). However, it is unlikely that only 14 (8%) of the studies that were conducted remained unpublished. Thus, the unpublished studies are not representative and may distort effect size estimates.

Coding of articles in terms of outcome measures that reflect discrimination yielded 60 studies in 45 articles. I examined whether this selection of studies influenced the results by limiting a meta-analysis with Kurdi et al.’s coding of effect sizes to these articles. The weighted average effect size was larger than the reported effect size, a = .167, se = .022, 95%CI = .121 to .212. Thus, Kurdi et al.’s inclusion of a wide range of studies with questionable criterion variables diluted the effect size estimate. However, there remained substantial variability around this effect size estimate using Kurdi et al.’s data, I2 = 55.43%.

Results

The focused coding produced one effect size per study. It is therefore not necessary to model a nested structure of effect sizes, and I used the widely used metafor package to analyze the data (Viechtbauer, 2010). The intercept-only model produced an estimate similar to the results for Kurdi et al.’s coding scheme, a = .201, se = .020, 95%CI = .171 to .249. Thus, focal coding does not substantially change the effect size estimate relative to traditional coding. There was also a similar amount of heterogeneity in the effect sizes, I2 = 50.80%.
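For readers who want to reproduce this step, a minimal metafor sketch follows; the column names yi (Fisher-z transformed correlations) and sei (sampling error, 1/sqrt(N – 3)) are my own, and the data frame dat stands in for the coded studies:

library(metafor)
# random-effects (intercept-only) model: average effect size and I^2
res <- rma(yi = yi, sei = sei, data = dat, method = "REML")
res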

However, results for publication bias differed. Whereas Kurdi et al.’s coding showed no evidence of publication bias, focused coding produced a significant relationship between sampling error and effect sizes, b = 1.83, se = .41, z = 4.54, 95%CI = 1.03 to 2.64. The intercept was no longer significant, a = .014, se = .0462, z = 0.31, 95%CI = -.077 to .105. This would imply that the race IAT has no predictive validity. Adding sampling error as a predictor reduced heterogeneity from I2 = 50.80% to 37.71%. Thus, some portion of the heterogeneity is explained by publication bias.

Stanley (2017) recommends accepting the null-hypothesis when the intercept in the previous model is not significant. However, a better approach is to compare this model to other models. The most widely used alternative model regresses effect sizes on the squared sampling error (Stanley, 2017). This model explained more of the heterogeneity in effect sizes, as reflected in a reduction of unexplained heterogeneity from 50.80% to 23.86%. The intercept for this model was significant, a = .113, se = .0232, z = 4.86, 95%CI = .067 to .158.
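In metafor, these two bias-correction models (often called PET and PEESE) can be fit by adding the sampling error or its square as a moderator. A sketch, assuming the same hypothetical data frame as above:

# PET: regress effect sizes on the standard error
pet   <- rma(yi = yi, sei = sei, mods = ~ sei, data = dat, method = "REML")
# PEESE: regress effect sizes on the squared standard error
peese <- rma(yi = yi, sei = sei, mods = ~ I(sei^2), data = dat, method = "REML")
pet; peese  # the intercepts are the bias-corrected effect size estimates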

Figure 2 shows the effect sizes as a function of sampling error and the regression lines for the three models.

Inspection of Figure 2 provides further evidence in favor of the squared-SE model. The red line (squared sampling error) fits the data better than the blue line (sampling error). In particular for large samples, PET underestimates effect sizes.

The significant relationship between sample size (sampling error) and effect sizes implies that large effects in small studies cannot be interpreted at face value. For example, the most highly cited study of predictive validity had only a sample size of N = 42 participants (McConnell & Leibold, 2001). The squared-sampling-error model predicts an effect size estimate of r = .30, which is close to the observed correlation of r = .39 in that study.

In sum, a focal meta-analysis replicates Kurdi et al.’s (2019) main finding that the average predictive validity of the race IAT is small, r ~ .1. However, the focal meta-analysis also produced a new finding. Whereas the initial meta-analysis suggested that effect sizes are highly variable, the new meta-analysis suggests that a large portion of this variability is explained by publication bias.

Moderator Analysis

I explored several potential moderator variables, namely (a) number of citations, (b) year of publication, (c) whether IAT effects were direct or moderator effects, (d) whether the correlation coefficient was reported or computed based on test statistics, and (e) whether the criterion was an actual behavior or an attitude measure. The only statistically significant result was a weaker correlation in studies that predicted a moderating effect of the race IAT, b = -.11, se = .05, z = 2.28, p = .032. However, this effect would not be significant after correction for multiple comparisons, and heterogeneity remained virtually unchanged, I2 = 27.15%.

During the coding of the studies, the article “Ironic effects of racial bias during interracial interactions” stood out because it reported a counter-intuitive result. In this study, Black confederates rated White participants with higher (pro-White) race IAT scores as friendlier. However, other studies find the opposite effect (e.g., McConnell & Leibold, 2001). If the ironic result was reported because it was statistically significant, it would be a selection effect that is not captured by the regression models, and it would produce unexplained heterogeneity. I therefore also tested a model that excluded all negative effects. As bias is introduced by this selection, the model is not a test of publication bias, but it may be better able to correct for publication bias. The effect size estimate was very similar, a = .133, se = .017, 95%CI = .100 to .166. However, heterogeneity was reduced to 0%, suggesting that selection for significance fully explains the heterogeneity in effect sizes.

In conclusion, moderator analysis did not find any meaningful moderators and heterogeneity was fully explained by publication bias, including publishing counterintuitive findings that suggest less discrimination by individuals with more prejudice. The finding that publication bias explains most of the variance is extremely important because Kurdi et al. (2019) suggested that heterogeneity is large and meaningful, which would suggest that higher predictive validity could be found in future studies. In contrast, the current results suggest that correlations greater than .2 in previous studies were largely due to selection for significance with small samples, which also explains unrealistically high correlations in neuroscience studies with the race IAT (cf. Schimmack, 2021b).

Predictive Validity of Self-Ratings

The predictive validity of self-ratings is important for several reasons. First, it provides a comparison standard for the predictive validity of the race IAT. For example, Greenwald et al. (2009) emphasized that predictive validity for the race IAT was higher than for self-reports. However, Kurdi et al.’s (2019) meta-analysis found the opposite. Another reason to examine the predictive validity of explicit measures is that implicit and explicit measures of racial attitudes are correlated with each other. Thus, it is important to establish the predictive validity of self-ratings to estimate the incremental predictive validity of the race IAT.

Figure 2 shows the results. The sampling-error model shows a non-zero effect size, but sampling error is large, and the confidence interval includes zero, a = .121, se = .117, 95%CI = -.107 to .350. Effect sizes are also extremely heterogeneous, I2 = 62.37%. The intercept for the squared-sampling-error model is significant, a = .176, se = .071, 95%CI = .036 to .316, but the model does not explain more of the heterogeneity in effect sizes than the sampling-error model, I2 = 63.33%. To maintain comparability, I use the squared-sampling-error estimate. This confirms Kurdi et al.’s finding that self-ratings have slightly higher predictive validity, but the confidence intervals overlap. For all practical purposes, the predictive validity of the race IAT and self-reports is similar. Repeating the moderator analyses that were conducted with the race IAT revealed no notable moderators.

Implicit-Explicit Correlations

Only 21 of the 60 studies reported information about the correlation between the race IAT and self-report measures. There was no indication of publication bias, and the effect size estimates of the three models converge on an estimate of r ~ .2 (Figure 3). Fortunately, this result can be compared with estimates from large internet studies (Axt, 2017) and a meta-analysis of implicit-explicit correlations (Hofmann et al., 2005). These estimates are a bit higher, r ~ .25. Thus, using an estimate of r = .2 is conservative for a test of the incremental predictive validity of the race IAT.

Incremental Predictive Validity

It is straightforward to estimate the incremental predictive validity of the race IAT and self-reports on the basis of the correlations between the race IAT, self-ratings, and criterion variables. However, it is a bit more difficult to provide confidence intervals around these estimates. I used a simulated dataset with missing values to reproduce the correlations and sampling error of the meta-analysis. I then regressed the criterion on the implicit and explicit variables. The incremental predictive validity for the race IAT was b = .07, se = .02, 95%CI = .03 to .12. This finding implies that the race IAT on average explains less than 1% of unique variance in prejudiced behavior. The incremental predictive validity of the explicit measure was b = .165, se = .03, 95%CI = .11 to .23. This finding suggests that explicit measures explain between 1 and 4 percent of the variance in prejudiced behaviors.
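The logic of this step can be illustrated with a simplified simulation in R. The correlations below are rounded versions of the meta-analytic estimates reported above (IAT-criterion ~ .11, explicit-criterion ~ .18, IAT-explicit ~ .20); the actual analysis also reproduced the sampling error with missing values:

library(MASS)
S <- matrix(c(1, .20, .11,
              .20, 1, .18,
              .11, .18, 1), nrow = 3)
set.seed(1)
dat <- as.data.frame(mvrnorm(n = 1e5, mu = rep(0, 3), Sigma = S, empirical = TRUE))
colnames(dat) <- c("iat", "explicit", "crit")
coef(lm(crit ~ iat + explicit, data = dat))
# iat ~ .08 and explicit ~ .16: each measure adds little unique variance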

Assuming that there is no shared method variance between implicit and explicit measures and criterion variables and that implicit and explicit measures reflect a common construct, prejudice, it is possible to fit a latent variable model to the correlations among the three indicators of prejudice (Schimmack, 2021). Figure 4 shows the model and the parameter estimates.

According to this model, prejudice has a moderate effect on behavior, b = .307, se = .043. This is consistent with general findings about effects of personality traits on behavior (Epstein, 1973; Funder & Ozer, 1983). The loading of the explicit variable on the prejudice factor implies that .582^2 = 34% of the variance in self-ratings of prejudice is valid variance. The loading of the implicit variable on the prejudice factor implies that .353^2 = 12% of the variance in race IAT scores is valid variance. Notably, similar estimates were obtained with structural equation models of data that are not included in this meta-analysis (Schimmack, 2021). Using data from Cunningham et al. (2001), I estimated .43^2 = 18% valid variance. Using Bar-Anan and Vianello (2018), I estimated .44^2 = 19% valid variance. Using data from Axt (2017), I found .44^2 = 19% valid variance, but 8% of the variance could be attributed to group differences between African American and White participants. Thus, the present meta-analytic results are consistent with the conclusion that no more than 20% of the variance in race IAT scores reflects actual prejudice that can influence behavior.
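A model of this type can be specified in lavaan. The sketch below uses the rounded meta-analytic correlations from above as input, so the estimates will differ slightly from the exact values in Figure 4; the sample size is arbitrary and only affects standard errors:

library(lavaan)
R <- matrix(c(1, .20, .11,
              .20, 1, .18,
              .11, .18, 1), nrow = 3,
            dimnames = list(c("iat", "explicit", "crit"), c("iat", "explicit", "crit")))
model <- '
  prejudice =~ NA*explicit + iat   # latent prejudice factor with free loadings
  prejudice ~~ 1*prejudice         # fix factor variance for identification
  crit ~ prejudice                 # effect of prejudice on behavior
'
fit <- sem(model, sample.cov = R, sample.nobs = 1000)
summary(fit, standardized = TRUE)  # loadings and path close to those in Figure 4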

In sum, incremental predictive validity of the race IAT is low for two reasons. First, prejudice has only modest effects on actual behavior in a specific situation. Second, only a small portion of the variance in race IAT scores is valid.

Discussion

In the 1990s, social psychologists embraced the idea that behavior is often influenced by processes that occur without conscious awareness. This assumption triggered the implicit revolution (Greenwald & Banaji, 2017). The implicit paradigm provided a simple explanation for low correlations between self-ratings of prejudice and implicit measures of prejudice, r ~ .2. Accordingly, many people are not aware of how prejudiced their unconscious is. The Implicit Association Test seemed to support this view because participants showed more prejudice on the IAT than on self-report measures. The first studies of predictive validity also seemed to support this new model of prejudice (McConnell & Leibold, 2001), and the first meta-analysis suggested that implicit bias has a stronger influence on behavior than self-reported attitudes (Greenwald, Poehlman, Uhlmann, & Banaji, 2009, p. 17).

However, the following decade produced many findings that require a reevaluation of the evidence. Greenwald et al. (2009) published the largest test (N = 1057) of predictive validity. This study examined the ability of the race IAT to predict racial bias in the 2008 US presidential election. Although the race IAT was correlated with voting for McCain versus Obama, incremental predictive validity was close to zero and no longer significant when explicit measures were included in the regression model. Then subsequent meta-analyses produced lower estimates of predictive validity and it is no longer clear that predictive validity, especially incremental predictive validity, is high enough to reject the null-hypothesis. Although incremental predictive validity may vary across conditions, no conditions have been identified that show practically significant incremental predictive validity. Unfortunately, IAT proponents continue to make misleading statements based on single studies with small samples. For example, Kurdi et al. claimed that “effect sizes tend to be relatively large in studies on physician–patient interactions” (p. 583). However, this claim was based on a study with just 15 physicians, which makes it impossible to obtain precise effect size estimates about implicit bias effects for physicians.

Beyond Nil-Hypothesis Testing

Just like psychology in general, meta-analyses also suffer from the confusion of nil-hypothesis testing and null-hypothesis testing. The nil-hypothesis is the hypothesis that an effect size is exactly zero. Many methodologists have pointed out that it is rather silly to take the nil-hypothesis at face value because the true effect size is rarely zero (Cohen, 1994). The more important question is whether an effect size is sufficiently different from zero to be theoretically and practically meaningful. As pointed out by Greenwald (1975), effect size estimation has to be complemented with theoretical predictions about effect sizes. However, research on predictive validity of the race IAT lacks clear criteria to evaluate effect size estimates.

As noted in the introduction, there is agreement about the practical importance of statistically small effects for the prediction of discrimination and other prejudiced behaviors. The contentious question is whether the race IAT is a useful measure of dispositions to act in prejudiced ways. Viewed from this perspective, the focus on the race IAT is myopic. The real challenge is to develop and validate measures of prejudice. IAT proponents have often dismissed self-reports as invalid, but the actual evidence shows that self-reports have validity that is at least equal to the validity of the race IAT. Moreover, even distinct self-report measures like the feeling thermometer and the symbolic racism scale have incremental predictive validity. Thus, prejudice researchers should use a multi-method approach. At present, it is not clear that the race IAT can improve the measurement of prejudice (Greenwald et al., 2009; Schimmack, 2021a).

Methodological Implications

This article introduced a new type of meta-analysis. Rather than trying to find as many vaguely related studies as possible and coding as many outcomes as possible, a focused meta-analysis is limited to the main test of the key hypothesis. This approach has several advantages. First, the classic approach creates a large amount of heterogeneity that is unique to a few studies. This noise makes it harder to find real moderators. Second, the inclusion of vaguely related studies may dilute effect sizes. Third, the inclusion of non-focal studies may mask evidence of publication bias, which is present in virtually all literatures. Finally, focused meta-analyses are much easier to do and can produce results much faster than the laborious meta-analyses that psychologists are used to. Even when classic meta-analyses exist, they often ignore publication bias. Thus, an important task for the future is to complement existing meta-analyses with focused meta-analyses to ensure that published effect size estimates are not diluted by irrelevant studies and not inflated by publication bias.

Prejudice Interventions

Enthusiasm about implicit biases has led to interventions that aim to reduce implicit biases. This focus on implicit biases in the real world needs to be reevaluated. First, there is no evidence that prejudice typically operates outside of awareness (Schimmack, 2021a). Second, individual differences in prejudice have only a modest impact on actual behaviors and are difficult to change. Not surprisingly, interventions that focus on implicit bias are not very effective. Rather than focusing on changing individuals’ dispositions, interventions may be more effective in changing situations. In this regard, the focus on internal factors is rather different from the general focus in social psychology on situational factors (Funder & Ozer, 1983). In recent years, it has become apparent that prejudice is often systemic. For example, police training may have a much stronger influence on racial disparities in fatal use of force than individual differences in the prejudice of individual officers (Andersen, Di Nota, Boychuk, Schimmack, & Collins, 2021).

Conclusion

The present meta-analysis of the race IAT provides further support for Meissner et al.’s (2019) conclusion that IATs’ “predictive value for behavioral criteria is weak and their incremental validity over and above self-report measures is negligible” (p. 1). The present meta-analysis provides a quantitative estimate of b = .07. Although researchers can disagree about the importance of small effect sizes, I agree with Meissner that the gains from adding a race IAT to the measurement of prejudice are negligible. Rather than looking for specific contexts in which the race IAT has higher predictive validity, researchers should use a multi-method approach to measure prejudice. The race IAT may be included to further explore its validity, but there is no reason to rely on the race IAT as the single most important measure of individual differences in prejudice.

References

Funder, D.C., & Ozer, D.J. (1983). Behavior as a function of the situation. Journal of Personality and Social Psychology, 44, 107–112.

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., et al. (2019). Relationship between the implicit association test and intergroup behavior: A meta-analysis. American Psychologist, 74, 569–586. doi: 10.1037/amp0000364

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://www.jstatsoft.org/v036/i03.

Incidental Anchoring Bites the Dust

Update: 6/10/21

After I posted this post, I learned about a published meta-analysis and new studies of incidental anchoring by David Shanks and colleagues that came to the same conclusion (Shanks et al., 2020).

Introduction

“The most expensive car in the world costs $5 million. How much does a new BMW 530i cost?”

According to anchoring theory, information about the most expensive car can lead to higher estimates for the cost of a BMW. Anchoring effects have been demonstrated in many credible studies since the 1970s (Tversky & Kahneman, 1974).

A more controversial claim is that anchoring effects occur even when the numbers are unrelated to the question and presented incidentally (Critcher & Gilovich, 2008). In one study, participants saw a picture of a football player and were asked to guess how likely it is that the player would sack the quarterback in the next game. The number on the player’s jersey was manipulated to be 54 or 94. The study produced a statistically significant result suggesting that a higher number makes people give higher likelihood judgments. This study started a small literature on incidental anchoring effects. A variation on this theme are studies that presented numbers so briefly on a computer screen that most participants did not actually see the numbers. This is called subliminal priming. Allegedly, subliminal priming also produced anchoring effects (Mussweiler & Englich, 2005).

Since 2011, many psychologists have become skeptical about whether statistically significant results in published articles can be trusted. The reason is that researchers often published only results that supported their theoretical claims, even when the claims were outlandish. For example, significant results suggested that extraverts can foresee where pornographic images will be displayed on a computer screen even before the computer randomly selects the location (Bem, 2011). No psychologist, except Bem, believes these findings. More problematic is that many other findings are equally incredible. A large replication project found that only 25% of results in social psychology could be replicated (Open Science Collaboration, 2015). So, the question is whether incidental and subliminal anchoring are more like classic anchoring or more like extrasensory perception.

There are two ways to assess the credibility of published results when publication bias is present. One approach is to conduct credible replication studies that are published independent of the outcome of a study. The other approach is to conduct a meta-analysis of the published literature that corrects for publication bias. A recent article used both methods to examine whether incidental anchoring is a credible effect (Kvarven et al., 2020). In this article, the two approaches produced inconsistent results. The replication study produced a non-significant result with a tiny effect size, d = .04 (Klein et al., 2014). However, even with bias-correction, the meta-analysis suggested a significant, small to moderate effect size, d = .40.

Results

The data for the meta-analysis were obtained from an unpublished thesis (Henriksson, 2015). I suspected that the meta-analysis might have coded some studies incorrectly. Therefore, I conducted a new meta-analysis, using the same studies and one new study. The main difference between the two meta-analyses is that I coded studies based on the focal hypothesis test that was used to claim evidence for incidental anchoring. The p-values were then transformed into Fisher-z transformed correlations and sampling error, 1/sqrt(N – 3), based on the sample sizes of the studies.
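A minimal sketch of such a conversion in R; p, sign (+1/-1 for the direction of the effect), and N are hypothetical column names in a data frame of coded studies:

dat$z.stat <- qnorm(1 - dat$p / 2) * dat$sign  # two-sided p-value to signed z-statistic
dat$sei    <- 1 / sqrt(dat$N - 3)              # sampling error of a Fisher-z correlation
dat$yi     <- dat$z.stat * dat$sei             # implied Fisher-z transformed correlation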

Whereas the old meta-analysis suggested that there is no publication bias, the new meta-analysis showed a clear relationship between sampling error and effect sizes, b = 1.68, se = .56, z = 2.99, p = .003. Correcting for publication bias produced a non-significant intercept, b = .039, se = .058, z = 0.672, p = .502, suggesting that the real effect size is close to zero.

Figure 1 shows the regression line for this model in blue and the results from the replication study in green. We see that the blue and green lines intersect when sampling error is close to zero. As sampling error increases because sample sizes are smaller, the blue and green line diverge more and more. This shows that effect sizes in small samples are inflated by selection for significance.

However, there is some statistically significant variability in the effect sizes, I2 = 36.60%, p = .035. To further examine this heterogeneity, I conducted a z-curve analysis (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). A z-curve analysis converts p-values into z-statistics. The histogram of these z-statistics reveals publication bias when z-statistics cluster just above the significance criterion, z = 1.96.
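A z-curve analysis of this kind can be run with the zcurve R package (Bartos & Schimmack, 2021); a minimal sketch, assuming a vector p of focal two-sided p-values:

library(zcurve)
z   <- qnorm(1 - p / 2)  # convert p-values to (absolute) z-statistics
fit <- zcurve(z)         # fit the z-curve mixture model
summary(fit)             # observed and expected discovery rates with confidence intervals
plot(fit)                # histogram of z-statistics with the fitted curve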

Figure 2 shows a big pile of just significant results. As a result, the z-curve model predicts a large number of non-significant results that are absent. While the published articles have a 73% success rate (the observed discovery rate), the model estimates that the expected discovery rate is only 6%. That is, for every 100 tests of incidental anchoring, only 6 studies are expected to produce a significant result. To put this estimate in context, with alpha = .05, 5 studies are expected to be significant based on chance alone. The 95% confidence interval around this estimate includes 5% and is limited to 26% at the upper end. Thus, researchers who reported significant results did so based on studies with very low power, and they needed luck or questionable research practices to get significant results.

A low discovery rate implies a high false positive risk. With an expected discovery rate of 6%, the false discovery risk is 76%. This is unacceptable. To reduce the false discovery risk, it is possible to lower the alpha criterion for significance. In this case, lowering alpha to .005 produces a false discovery risk of 5%. This leaves 5 studies that are significant.
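The link between the expected discovery rate and the false discovery risk is Soric's upper bound, which I assume lies behind the reported figure; with the rounded 6% the formula gives a somewhat higher value, so the 76% presumably reflects the unrounded EDR estimate:

# Soric's (1989) maximum false discovery rate as a function of the discovery rate
soric_fdr <- function(edr, alpha = .05) (1 / edr - 1) * alpha / (1 - alpha)
soric_fdr(.06)  # ~ .82 with the rounded EDR of 6%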

One notable study with strong evidence, z = 3.70, examined anchoring effects for actual car sales. The data came from an actual auction of classic cars. The incidental anchors were the prices of the previous bid for a different vintage car. Based on sales data of 1,477 cars, the authors found a significant effect, b = .15, se = .04 that translates into a standardized effect size of d = .2 (fz = .087). Thus, while this study provides some evidence for incidental anchoring effects in one context, the effect size estimate is also consistent with the broader meta-analysis that effect sizes of incidental anchors are fairly small. Moreover, the incidental anchor in this study is still in the focus of attention and in some way related to the actual bid. Thus, weaker effects can be expected for anchors that are not related to the question at all (a player’s number) or anchors presented outside of awareness.

Conclusion

There is clear evidence that published results on incidental anchoring cannot be trusted at face value. Consistent with research practices in general, studies on incidental and subliminal anchoring suffer from publication bias that undermines the credibility of the published results. Unbiased replication studies and meta-analyses suggest that incidental anchoring effects are either very small or zero. Thus, there is currently no empirical support for the notion that irrelevant numeric information can bias numeric judgments. More research on anchoring effects that corrects for publication bias is needed.

Aber bitte ohne Sanna

Abstract

Social psychologists have failed to clean up their act and their literature. Here I show unusually high effect sizes in non-retracted articles by Sanna, who retracted several articles. I point out that non-retraction does not equal credibility and I show that co-authors like Norbert Schwarz lack any motivation to correct the published record. The inability of social psychologists to acknowledge and correct their mistakes renders social psychology a para-science that lacks credibility. Even meta-analyses cannot be trusted because they do not correct properly for the use of questionable research practices.

Introduction

When I grew up, a popular German Schlager was the song “Aber bitte mit Sahne.” The song is about Germans’ love of desserts with whipped cream. So, when I saw articles by Sanna, I had to think about whipped cream, which is delicious. Unfortunately, articles by Sanna are the exact opposite. In the early 2010s, it became apparent that Sanna had fabricated data. However, unlike the thorough investigation of a similar case in the Netherlands, the extent of Sanna’s fraud remains unclear (Retraction Watch, 2012). The latest count of Sanna’s retracted articles was 8 (Retraction Watch, 2013).

WebOfScience shows 5 retraction notices for 67 articles, which means 62 articles have not been retracted. The question is whether these articles can be trusted to provide valid scientific information. The answer to this question matters because Sanna’s articles are still being cited at a rate of over 100 citations per year.

Meta-Analysis of Ease of Retrieval

The data are also being used in meta-analyses (Weingarten & Hutchinson, 2018). Fraudulent data are particularly problematic for meta-analyses because fraud can produce extreme effect sizes that inflate meta-analytic effect size estimates. Here I report the results of my own investigation that focuses on the ease-of-retrieval paradigm that was developed by Norbert Schwarz and colleagues (Schwarz et al., 1991).

The meta-analysis included 7 studies from 6 articles. Two studies produced independent effect size estimates for 2 conditions for a total of 9 effect sizes.

Sanna, L. J., Schwarz, N., & Small, E. M. (2002). Accessibility experiences and the hindsight bias: I knew it all along versus it could never have happened. Memory & Cognition, 30(8), 1288–1296. https://doi.org/10.3758/BF03213410 [Study 1a, 1b]

Sanna, L. J., Schwarz, N., & Stocker, S. L. (2002). When debiasing backfires: Accessible content and accessibility experiences in debiasing hindsight. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(3), 497–502. https://doi.org/10.1037/0278-7393.28.3.497
[Study 1 & 2]

Sanna, L. J., & Schwarz, N. (2003). Debiasing the hindsight bias: The role of accessibility experiences and (mis)attributions. Journal of Experimental Social Psychology, 39(3), 287–295. https://doi.org/10.1016/S0022-1031(02)00528-0 [Study 1]

Sanna, L. J., Chang, E. C., & Carter, S. E. (2004). All Our Troubles Seem So Far Away: Temporal Pattern to Accessible Alternatives and Retrospective Team Appraisals. Personality and Social Psychology Bulletin, 30(10), 1359–1371. https://doi.org/10.1177/0146167204263784
[Study 3a]

Sanna, L. J., Parks, C. D., Chang, E. C., & Carter, S. E. (2005). The Hourglass Is Half Full or Half Empty: Temporal Framing and the Group Planning Fallacy. Group Dynamics: Theory, Research, and Practice, 9(3), 173–188. https://doi.org/10.1037/1089-2699.9.3.173 [Study 3a, 3b]

Carter, S. E., & Sanna, L. J. (2008). It’s not just what you say but when you say it: Self-presentation and temporal construal. Journal of Experimental Social Psychology, 44(5), 1339–1345. https://doi.org/10.1016/j.jesp.2008.03.017 [Study 2]

When I examined Sanna’s results, I found that all 9 of these effect sizes were extremely large, with effect size estimates larger than one standard deviation. A logistic regression analysis that predicted authorship (with Sanna vs. without Sanna) from effect size showed that the large effect sizes in Sanna’s articles were unlikely to be due to sampling error alone, b = 4.6, se = 1.1, t(184) = 4.1, p = .00004 (1 / 24,642).
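A sketch of such an authorship analysis; the data frame es, with one standardized effect size d per row and an indicator sanna (1 = Sanna co-authored, 0 = other authors), is hypothetical:

fit <- glm(sanna ~ d, family = binomial, data = es)
summary(fit)  # a positive slope means larger effects are more likely to come from Sanna's articles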

These results show that Sanna’s effect sizes are not typical for the ease-of-retrieval literature. As one of his retracted articles used the ease-of-retrieval paradigm, it is possible that these articles are equally untrustworthy. As many other studies have investigated ease-of-retrieval effects, it seems prudent to exclude articles by Sanna from future meta-analyses.

These articles should also not be cited as evidence for specific claims about ease-of-retrieval effects for the specific conditions that were used in these studies. As the meta-analysis shows, there have been no credible replications of these studies and it remains unknown how much ease of retrieval may play a role under the specified conditions in Sanna’s articles.

Discussion

This blog post is also a warning for young scientists and students of social psychology that they cannot trust researchers who became famous with the help of questionable research practices that produced too many significant results. As the reference list shows, several articles by Sanna were co-authored by Norbert Schwarz, the inventor of the ease-of-retrieval paradigm. It is most likely that he was unaware of Sanna’s fraudulent practices. However, he seemed to lack any concern that the results might be too good to be true. After all, he encountered replication failures in his own lab:

“of course, we had studies that remained unpublished. Early on we experimented with different manipulations. The main lesson was: if you make the task too blatantly difficult, people correctly conclude the task is too difficult and draw no inference about themselves. We also had a couple of studies with unexpected gender differences” (Schwarz, email communication, 5/18/21).

So, why was he not suspicious when Sanna only produced successful results? I was wondering whether Schwarz had developed some doubts about these studies with the help of hindsight bias. After all, a decade or more later, we know that Sanna committed fraud for some articles on this topic, we know about replication failures in larger samples (Yeager et al., 2019), and we know that the true effect sizes are much smaller than Sanna’s reported effect sizes (Weingarten & Hutchinson, 2018).

Hi Norbert, 
   thank you for your response. I am doing my own meta-analysis of the literature as I have some issues with the published one by Evan. More about that later. For now, I have a question about some articles that I came across, specifically Sanna, Schwarz, and Small (2002). The results in this study are very strong (d ~ 1).  Do you think a replication study powered for 95% power with d = .4 (based on meta-analysis) would produce a significant result? Or do you have concerns about this particular paradigm and do not predict a replication failure?
Best, Uli (email)

His response shows that he is unwilling or unable to even consider the possibility that Sanna used fraud to produce the results in this article that he co-authored.

Uli, that paper has 2 experiments, one with a few vs many manipulation and one with a facial manipulation.  I have no reason to assume that the patterns won’t replicate. They are consistent with numerous earlier few vs many studies and other facial manipulation studies (introduced by Stepper & Strack,  JPSP, 1993). The effect sizes always depend on idiosyncracies of topic, population, and context, which influence accessible content and accessibility experience. The theory does not make point predictions and the belief that effect sizes should be identical across decades and populations is silly — we’re dealing with judgments based on accessible content, not with immutable objects.  

This response is symptomatic of social psychologists’ response to decades of research that has produced questionable results that often fail to replicate (see Schimmack, 2020, for a review). Even when there is clear evidence of questionable practices, journals are reluctant to retract articles that make false claims based on invalid data (Kitayama, 2020). And social psychologist Daryl Bem would rather be remembered as a loony para-psychologist than as a real scientist (Bem, 2021).

The problem with these social psychologists is not that they made mistakes in the way they conducted their studies. The problem is their inability to acknowledge and correct their mistakes. While they are clinging to their CVs and H-Indices to protect their self-esteem, they are further eroding trust in psychology as a science and forcing junior scientists who want to improve things out of academia (Hilgard, 2021). After all, the key feature of science that distinguishes it from ideologies is the ability to correct itself. A science that shows no signs of self-correction is a para-science and not a real science. Thus, social psychology is currently a para-science (i.e., “Parascience is a broad category of academic disciplines that are outside the scope of scientific study,” Wikipedia).

The only hope for social psychology is that young researchers are unwilling to play by the old rules and start a credibility revolution. However, the incentives still favor conformists who suck up to the old guard. Thus, it is unclear if social psychology will ever become a real science. A first sign of improvement would be to retract articles that make false claims based on results that were produced with questionable research practices. Instead, social psychologists continue to write review articles that ignore the replication crisis (Schwarz & Strack, 2016) as if repression can bend reality.

Nobody should believe them.

Justify Your Alpha by Decreasing Alpha Levels as a Function of the Sample Size

A preprint (“Justify Your Alpha: A Primer on Two Practical Approaches”) that extends and improves the ideas in this blog post is available at: https://psyarxiv.com/ts4r6  
 
Testing whether observed data should surprise us, under the assumption that some model of the data is true, is a widely used procedure in psychological science. Tests against a null model, or against the smallest effect size of interest for an equivalence test, can guide your decisions to continue or abandon research lines. Seeing whether a p-value is smaller than an alpha level is rarely the only thing you want to do, but especially early on in experimental research lines where you can randomly assign participants to conditions, it can be a useful thing.

Regrettably, this procedure is often performed rather mindlessly. To do Neyman-Pearson hypothesis testing well, you should carefully think about the error rates you find acceptable. How often do you want to miss the smallest effect size you care about, if it is really there? And how often do you want to say there is an effect, but actually be wrong? It is important to justify your error rates when designing an experiment. In this post I will provide one justification for setting the alpha level (something we recommended as making more sense than mindlessly using a fixed alpha level).

Papers explaining how to justify your alpha level are very rare (for an example, see Mudge, Baker, Edge, & Houlahan, 2012). Here I want to discuss one of the least known, but easiest, suggestions on how to justify alpha levels in the literature, proposed by Good (1982). The idea is simple and has been supported by many statisticians in the last 80 years: lower the alpha level as a function of your sample size.

The idea behind this recommendation is most extensively discussed in a book by Leamer (1978, p. 92). He writes:

The rule of thumb quite popular now, that is, setting the significance level arbitrarily to .05, is shown to be deficient in the sense that from every reasonable viewpoint the significance level should be a decreasing function of sample size.

Leamer (you can download his book for free) correctly notes that this behavior, an alpha level that is a decreasing function of the sample size, makes sense from both a Bayesian and a Neyman-Pearson perspective. Let me explain.

Imagine a researcher who performs a study that has 99.9% power to detect the smallest effect size the researcher is interested in, based on a test with an alpha level of 0.05. Such a study also has 99.8% power when using an alpha level of 0.03. Feel free to follow along here, by setting the sample size to 204, the effect size to 0.5, alpha or p-value (upper limit) to 0.05, and the p-value (lower limit) to 0.03.

We see that if the alternative hypothesis is true only 0.1% of the observed studies will, in the long run, observe a p-value between 0.03 and 0.05. When the null-hypothesis is true 2% of the studies will, in the long run, observe a p-value between 0.03 and 0.05. Note how this makes p-values between 0.03 and 0.05 more likely when there is no true effect, than when there is an effect. This is known as Lindley’s paradox (and I explain this in more detail in Assignment 1 in my MOOC, which you can also do here).
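The numbers in this example can be checked with base R, assuming the example refers to an independent-samples t-test with 204 participants per group and d = 0.5:

# probability of observing .03 < p < .05 under H1: power at alpha = .05 minus power at alpha = .03
power_at <- function(alpha) power.t.test(n = 204, delta = 0.5, sd = 1, sig.level = alpha)$power
power_at(.05) - power_at(.03)  # ~ 0.001 under the alternative
.05 - .03                      # = 0.02 under the null (p-values are uniform under H0)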

Although you can argue that you are still making a Type 1 error at most 5% of the time in the above situation, I think it makes sense to acknowledge there is something weird about having a Type 1 error of 5% when you have a Type 2 error of 0.1% (again, see Mudge, Baker, Edge, & Houlahan, 2012, who suggest balancing error rates). To me, it makes sense to design a study where error rates are more balanced, and a significant effect is declared for p-values more likely to occur when the alternative model is true than when the null model is true.

Because power increases as the sample size increases, and because Lindley’s paradox (Lindley, 1957; see also Cousins, 2017) can be prevented by lowering the alpha level sufficiently, the idea to lower the significance level as a function of the sample size is very reasonable. But how?

Zellner (1971) discusses how the critical value for a frequentist hypothesis test approaches a limit as the sample size increases (i.e., a critical value of 1.96 for p = 0.05 in a two-sided test) whereas the critical value for a Bayes factor increases as the sample size increases (see also Rouder, Speckman, Sun, Morey, & Iverson, 2009). This difference lies at the heart of Lindley’s paradox, and under certain assumptions comes down to a factor of √n. As Zellner (1971, footnote 19, page 304) writes (K01 is the formula for the Bayes factor):

If a sampling theorist were to adjust his significance level upward as n grows larger, which seems reasonable, zα would grow with n and tend to counteract somewhat the influence of the √n factor in the expression for K01.

Jeffreys (1939) discusses Neyman and Pearson’s work and writes:

We should therefore get the best result, with any distribution of α, by some form that makes the ratio of the critical value to the standard error increase with n. It appears then that whatever the distribution may be, the use of a fixed P limit cannot be the one that will make the smallest number of mistakes.

He discusses the issue more in Appendix B, where he compared his own test (Bayes factors) against Neyman-Pearson decision procedures, and he notes that:

In spite of the difference in principle between my tests and those based on the P integrals, and the omission of the latter to give the increase of the critical values for large n, dictated essentially by the fact that in testing a small departure found from a large number of observations we are selecting a value out of a long range and should allow for selection, it appears that there is not much difference in the practical recommendations. Users of these tests speak of the 5 per cent. point in much the same way as I should speak of the K = 10 point, and of the 1 per cent. point as I should speak of the K = 10^-1 point; and for moderate numbers of observations the points are not very different. At large numbers of observations there is a difference, since the tests based on the integral would sometimes assert significance at departures that would actually give K > 1. Thus there may be opposite decisions in such cases. But they will be very rare.

So even though extremely different conclusions between Bayes factors and frequentist tests will be rare, according to Jeffreys, when the sample size grows, the difference becomes noticeable.

This brings us to Good’s (1982) easy solution. His paper is basically just a single page (I’d love something akin to a Comments, Conjectures, and Conclusions format in Meta-Psychology! – note that Good himself was the section editor, that the section’s instructions started with ‘Please be succinct but lucid and interesting’, and that the paper reads just like a blog post).

He also explains the rationale in Good (1992):

‘we have empirical evidence that sensible P values are related to weights of evidence and, therefore, that P values are not entirely without merit. The real objection to P values is not that they usually are utter nonsense, but rather that they can be highly misleading, especially if the value of N is not also taken into account and is large.’

Based on the observation by Jeffreys (1939) that, under specific circumstances, the Bayes factor against the null-hypothesis is approximately inversely proportional to √N, Good (1982) suggests a standardized p-value to bring p-values into a closer relationship with weights of evidence:
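In symbols (reconstructed here from the worked example below): pstan = min(0.5, p × √(N/100)).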

This formula standardizes the p-value to the evidence against the null hypothesis that would be obtained if the pstan-value were the tail area probability observed in a sample of 100 participants (I think the formula is only intended for between designs – I would appreciate anyone weighing in in the comments if it can be extended to within-designs). When the sample size is 100, the p-value and pstan are identical. But for larger sample sizes pstan is larger than p. For example, a p = .05 observed in a sample size of 500 would have a pstan of 0.11, which is not enough to reject the null-hypothesis in favor of the alternative. Good (1988) demonstrates great insight when he writes: ‘I guess that standardized p-values will not become standard before the year 2000.’


Good doesn’t give a lot of examples of how standardized p-values should be used in practice, but I guess it makes things easier to think about a standardized alpha level (even though the logic is the same, just like you can double the p-value, or halve the alpha level, when you are correcting for 2 comparisons in a Bonferroni correction). So instead of an alpha level of 0.05, we can think of a standardized alpha level:
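αstan = α / √(N/100) = α × √(100/N) (again reconstructed from the example below).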
Again, with 100 participants α and αstan are the same, but as the sample size increases above 100, the alpha level becomes smaller. For example, an α = .05 observed in a sample size of 500 would have an αstan of 0.02236.

So one way to justify your alpha level is by using a decreasing alpha level as the sample size increases. I for one have always thought it rather nonsensical to use an alpha level of 0.05 in all meta-analyses (especially when testing a meta-analytic effect size based on thousands of participants against zero), or in large collaborative research projects such as Many Labs, where analyses are performed on very large samples. If you have thousands of participants, you have extremely high power for most effect sizes original studies could have detected in a significance test. With such a low Type 2 error rate, why keep the Type 1 error rate fixed at 5%, which is so much larger than the Type 2 error rate in these analyses? It just doesn’t make any sense to me. Alpha levels in meta-analyses or large-scale data analyses should be lowered as a function of the sample size. In case you are wondering: an alpha level of .005 would be used when the sample size is 10,000.

When designing a study based on a specific smallest effect size of interest, where you desire to have decent power (e.g., 90%), we run into a small challenge because the power analysis now has two unknowns: the sample size (which is a function of the power, effect size, and alpha), and the standardized alpha level (which is a function of the sample size). Luckily, this is nothing that some R-fu can’t solve with iterative power calculations. [R code to calculate the standardized alpha level, and perform an iterative power analysis, is at the bottom of the post]
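A minimal sketch of such an iterative power analysis using base R's power.t.test (my own illustration, not the code from the original post; the smallest effect size of interest, d = 0.33, and the treatment of N as the total sample size of a two-group design are assumptions):

alpha_stan <- function(alpha, N) alpha / sqrt(N / 100)  # Good's standardized alpha

n <- 100  # starting value: participants per group
for (i in 1:20) {
  a_s   <- alpha_stan(.05, 2 * n)  # total N of the between-subjects design
  n_new <- ceiling(power.t.test(delta = 0.33, sd = 1, sig.level = a_s, power = .90)$n)
  if (n_new == n) break            # stop when the required sample size no longer changes
  n <- n_new
}
c(n_per_group = n, alpha = alpha_stan(.05, 2 * n))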

When we wrote Justify Your Alpha (I recommend downloading the original draft before peer review because it has more words and more interesting references), one of the criticisms I heard most often is that we gave no solutions for how to justify your alpha. I hope this post makes it clear that statisticians have argued that the alpha level should not be a fixed value ever since it was invented. There are already some solutions available in the literature. I like Good’s approach because it is simple. In my experience, people like simple solutions. It might not be a full-fledged decision-theoretical cost-benefit analysis, but it beats using a fixed alpha level. I recently used it in a submission for a Registered Report. At the same time, I think it has never been used in practice, so I look forward to any comments, conjectures, and conclusions you might have.

References

Good, I. J. (1982). C140. Standardized tail-area probabilities. Journal of Statistical Computation and Simulation, 16(1), 65–66. https://doi.org/10.1080/00949658208810607
Good, I. J. (1988). The interface between statistics and philosophy of science. Statistical Science, 3(4), 386–397.
Good, I. J. (1992). The Bayes/Non-Bayes Compromise: A Brief Review. Journal of the American Statistical Association, 87(419), 597. https://doi.org/10.2307/2290192
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x
Leamer, E. E. (1978). Specification Searches: Ad Hoc Inference with Nonexperimental Data (1 edition). New York usw.: Wiley.
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. https://doi.org/10.3758/PBR.16.2.225
Zellner, A. (1971). An introduction to Bayesian inference in econometrics. New York: Wiley.