Replicability Report 2023: Aggressive Behavior

This report was created in collaboration with Anas Alsayed Hasan.
Citation: Hasan, A.A. & Schimmack, U. (2023). Replicability Report 2023: Aggressive Behavior. Replicationindex.com

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can help authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

Aggressive Behavior

Aggressive Behavior is the official journal of the International Society for Research on Aggression.  Founded in 1974, this journal provides a multidisciplinary view of aggressive behavior and its physiological and behavioral consequences on subjects.  Published articles use theories and methods from psychology, psychiatry, anthropology, ethology, and more. So far, Aggressive Behavior has published close to 2,000 articles. Nowadays, it publishes about 60 articles a year in 6 annual issues. The journal has been cited by close to 5000 articles in the literature and has an H-Index of 104 (i.e., 104 articles have received 104 or more citations). The journal also has a moderate impact factor of 3. This journal is run by an editorial board containing over 40 members. The Editor-In-Chief is Craig Anderson. The associate editors are Christopher Barlett, Thomas Denson, Ann Farrell, Jane Ireland, and Barbara Krahé.

Report

Replication reports are based on automatically extracted test-statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).
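
For readers who want to see how this conversion works, the sketch below turns a t-value or F-value into a two-sided p-value and then into the absolute z-score with the same p-value. This is a minimal illustration of the conversion step (using scipy), not the actual extraction pipeline.

```python
# Minimal sketch of converting test statistics into absolute z-scores:
# compute the two-sided p-value of the test and find the |z| that has the
# same p-value under the standard normal distribution.
from scipy import stats

def t_to_z(t, df):
    p = 2 * stats.t.sf(abs(t), df)      # two-sided p-value of the t-test
    return stats.norm.isf(p / 2)        # absolute z-score with the same p-value

def f_to_z(f, df1, df2):
    p = stats.f.sf(f, df1, df2)         # p-value of the F-test
    return stats.norm.isf(p / 2)        # treated as two-sided evidence

print(t_to_z(2.5, 40))                  # about 2.4
print(f_to_z(2.5 ** 2, 1, 40))          # F(1, 40) = t(40)^2, so also about 2.4
```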

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 71%, the expected discovery rate is 45%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.
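
As an illustration of this comparison, the sketch below computes a simple binomial confidence interval for an observed discovery rate, using hypothetical counts that happen to give the 71% reported here. The intervals shown in the figures are produced by z-curve itself; the point is only why non-overlapping ODR and EDR intervals signal selection bias.

```python
# Illustrative only: a binomial (Wilson) confidence interval for the ODR.
# The counts are hypothetical; z-curve reports its own intervals for the
# ODR and the EDR.
from statsmodels.stats.proportion import proportion_confint

n_tests, n_sig = 5000, 3550                       # hypothetical counts, ODR = 71%
lo, hi = proportion_confint(n_sig, n_tests, alpha=0.05, method="wilson")
print(f"ODR = {n_sig / n_tests:.0%}, 95% CI [{lo:.0%}, {hi:.0%}]")
# If the EDR interval (centered here around 45%) falls entirely below this
# interval, the published record contains more significant results than the
# fitted model expects, which is evidence of selection bias.
```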

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result. An EDR of 45% implies that no more than 7% of the significant results are false positives. The 95%CI puts the upper limit for false positive results at 12%. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be relevant (i.e., practically significant). Readers of original articles need to focus on confidence intervals of effect size estimates and take into account that selection for significance inflates effect size estimates. Thus, published results are likely to show the correct direction of a relationship, but may not provide enough information to determine whether a statistically significant result is theoretically or practically important.
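
The bound behind these numbers appears to be Soric's (1989) formula, which relates the discovery rate to a maximum false discovery risk. The sketch below shows the calculation; because it uses the rounded EDR of 45%, the result differs slightly from the 7% reported above.

```python
# Soric's (1989) upper bound on the false discovery risk, derived from the
# discovery rate. It assumes all true effects are tested with perfect power,
# so the actual rate of false positives is lower than this bound.
def soric_fdr(edr, alpha=0.05):
    return (1 / edr - 1) * (alpha / (1 - alpha))

print(f"{soric_fdr(0.45):.1%}")              # ~6% with the rounded EDR of 45%
print(f"{soric_fdr(0.45, alpha=0.01):.1%}")  # lowering alpha to .01 shrinks the bound
```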

Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published. The ERR of 69% suggests that the majority of results published in Aggressive Behavior are replicable, but the EDR allows for a replication rate as low as 45%. Thus, replicability is estimated to range from 45% to 69%. There are currently no large replication studies in this field, making it difficult to compare these estimates to outcomes of empirical replication studies. However, the ERR for the OSC reproducibility project that produced 36% successful actual replications was around 60%, suggesting that roughly 50% of actual replication studies of articles in this journal would be significant. It is unlikely that the success rate would be lower than the EDR of 45%. Given the relatively low risk of type-I errors, most of these replication failures are likely to occur because studies in this journal tend to be underpowered. Thus, replication studies should use larger samples.
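
A naive, single-study version of this idea treats the observed z-score as if it were the true strength of evidence and asks how often an exact replication would be significant again. This is not the z-curve estimator, which models the full distribution and corrects for selection bias, but it illustrates why larger z-scores imply higher replicability and why just-significant results are close to a coin flip.

```python
# Naive per-study replication probability: treat the observed |z| as the
# expected |z| of an exact replication. Illustration only; z-curve's ERR
# corrects for selection bias instead of taking observed z-scores at face value.
from scipy import stats

def naive_replication_prob(abs_z, alpha=0.05):
    crit = stats.norm.isf(alpha / 2)                      # 1.96 for alpha = .05
    return stats.norm.sf(crit - abs_z) + stats.norm.sf(crit + abs_z)

for z in (2.0, 2.8, 4.0):
    print(f"|z| = {z}: {naive_replication_prob(z):.0%}")  # ~52%, ~80%, ~98%
```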

Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. The ODR, EDR, and ERR were regressed on time and time-squared to allow for non-linear relationships. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.52 percentage points per year (SE = .22). The EDR showed no significant trends, p > .30. There were no linear or quadratic time trends for the ERR, p > .10. Figure 2 shows the ODR and EDR to examine selection bias.
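
The sketch below shows the kind of regression used here: annual z-curve estimates regressed on time and time-squared. The yearly values are simulated because the annual estimates are not listed in the text; only the modeling step is illustrated.

```python
# Sketch of the time-trend analysis: regress annual estimates (here simulated
# ODR values) on time and time-squared. Centering time is a common choice to
# reduce collinearity between the linear and quadratic terms.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
years = np.arange(2000, 2023)
odr = np.linspace(0.78, 0.66, years.size) + rng.normal(0, 0.02, years.size)

t = years - years.mean()
X = sm.add_constant(np.column_stack([t, t ** 2]))
fit = sm.OLS(odr, X).fit()
print(fit.params)   # intercept, linear slope (per year), quadratic term
print(fit.bse)      # standard errors for each coefficient
```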

Figure 2

The decrease in the ODR implies that selection bias is decreasing over time. In recent years, the confidence intervals for the ODR and EDR overlap, indicating that there are no longer statistically reliable differences. However, this does not imply that all results are being reported. The main reason for the overlap is the low certainty about the annual EDR. Given the lack of a significant time trend for the EDR, the average EDR across all years implies that there is still selection bias. Finally, automatically extracted test-statistics make it impossible to say whether researchers are reporting more focal or non-focal results as non-significant. To investigate this question, it is necessary to hand-code focal tests (see Limitations section).

Figure 3 shows the false discovery risk (FDR) and the estimated replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

Figure 3

The FDR is based on the EDR that also showed no time trends. Thus, the estimates for all years can be used to obtain more precise estimates than the annual ones. Based on the results in Figure 1, the expected failure rate is 31% and the FDR is 7%. This suggests that replication failures are more likely to be false negatives due to modest power rather than false positive results in original studies. To avoid false negative results in replication studies, these studies should use larger samples.
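
As a rough way to read this comparison (a back-of-envelope calculation for illustration, not a z-curve output), one can ask what share of the expected replication failures could be traced to false positives, under the simplifying assumption that false positives replicate only at the alpha rate.

```python
# Back-of-envelope decomposition of expected replication failures, assuming
# false positives replicate only with probability alpha. Illustration only;
# this is not part of the z-curve output.
alpha = 0.05
efr, fdr = 0.31, 0.07          # expected failure rate and FDR reported above
share_from_false_positives = fdr * (1 - alpha) / efr
print(f"At most ~{share_from_false_positives:.0%} of expected replication failures "
      f"can be attributed to false positives; the rest reflect low power.")
```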

Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary criterion to make decisions about a hypothesis. It was used by authors to conclude that an effect is present, and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide what alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e., there is an effect even if the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Figure 4

Using alpha = .01 lowers the discovery rate by about 15 percentage points. The stringent criterion of alpha = .001 lowers it by another 10 percentage points to around 40% discoveries. This would mean that many published results that were used to make claims no longer have empirical support.
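
The idea behind Figure 4 can be sketched as follows: re-apply stricter alpha levels to the same set of extracted z-scores and count how many results survive. The z-scores below are simulated for illustration; the figure uses the actual extracted test statistics.

```python
# Sketch of recomputing the discovery rate under stricter alpha levels.
# abs_z is simulated; in the report it would be the extracted |z| values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
abs_z = np.abs(rng.normal(loc=2.0, scale=1.2, size=5000))   # hypothetical data

for alpha in (0.05, 0.01, 0.005, 0.001):
    crit = stats.norm.isf(alpha / 2)      # 1.96, 2.58, 2.81, 3.29
    share = (abs_z > crit).mean()
    print(f"alpha = {alpha}: {share:.0%} of results remain significant")
```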

Figure 5 shows the effects of alpha on the false positive risk. Even alpha = .01 is sufficient to ensure a false positive risk of 5% or less. Thus, alpha = .01 seems a reasonable criterion to avoid too many false positive results without discarding too many true positive results. Authors may want to increase statistical power to increase their chances of obtaining a p-value below .01 when their hypotheses are true to produce credible evidence for their hypotheses.

Figure 5

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

Hand-coding of other journals shows that publications of non-significant focal hypothesis tests are still rare. As a result, the ODR for focal hypothesis tests in Aggressive Behavior is likely to be higher and selection bias larger than the present results suggest. Hand-coding of a representative sample of articles in this journal is needed.

Conclusion

The replicability report for Aggressive Behavior shows clear evidence of selection bias, although there is a trend suggesting that selection bias may be decreasing in recent years. The results also suggest that replicability is in a range from 40% to 70%. This replication rate does not deserve to be called a crisis, but it does suggest that many studies are underpowered and require luck to get a significant result. The false positive risk is modest and can be controlled by setting alpha to .01. Finally, time trend analyses show no important changes in response to the open science movement. An important goal is to reduce the selective publishing of studies that worked (p < .05) and the hiding of studies that did not work (p > .05). Preregistration or registered reports can help to address this problem. Given concerns that most published results in psychology are false positives, the present results are reassuring and suggest that most results with p-values below .01 are true positive results.

Replicability Report 2023: Cognition & Emotion

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can help authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

Cognition & Emotion

The study of emotions largely disappeared from psychology after the Second World War and during the reign of behaviorism, or was limited to facial expressions. The study of emotional experiences reemerged in the 1980s. Cognition & Emotion was established in 1987 as an outlet for this research.

So far, the journal has published close to 3,000 articles. The average number of citations per article is 46. The journal has an H-Index of 155 (i.e., 155 articles have 155 or more citations). These statistics show that Cognition & Emotion is an influential journal for research on emotions.

Nine articles have more than 1,000 citations. The most highly cited article is a theoretical article by Paul Ekman arguing for basic emotions (Ekman, 1992).

Report

Replication reports are based on automatically extracted test-statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).

A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.97) as a vertical red line. It also shows a dotted vertical red line at z = 1.65 because results with z-scores between 1.65 and 1.97 are often interpreted as evidence for an effect using a more liberal alpha criterion, alpha = .10, a one-sided test, or with qualifiers (e.g., marginally significant). Thus, values in this range cannot be interpreted as reporting of results that failed to support a hypothesis.

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics, but omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6.

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 68%, the expected discovery rate is 34%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result.

An EDR of 34% implies that up to 10% of the significant results could be false positives. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be relevant (i.e., practically significant). Readers of statistical results in this journal need to examine the range of plausible effect sizes (i.e., confidence intervals) to see whether results have practical significance. Unfortunately, these estimates are inflated by selection bias, especially when the evidence is weak and the confidence interval already includes effect sizes close to zero.

Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published.

The ERR of 70% suggests that most results published in this journal are replicable, but the EDR allows for a replication rate as low as 34%. Thus, replicability is estimated to range from 34% to 70%. There is no representative sample of replication studies from this journal to compare this estimate with the outcome of actual replication studies. However, a journal with lower ERR and EDR estimates, Psychological Science, had an actual replication rate of 41%. Thus, it is plausible to predict a higher actual replication rate than this for Cognition & Emotion.

Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. Confidence intervals were created by regressing the estimates on time and time-squared to examine non-linear relationships.

Figure 2 shows the ODR and EDR to examine selection bias. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.79 percentage points per year (SE = .10). The EDR showed no significant linear, b = .23, SE = .41, or non-linear, b = -.10, SE = .07, trends.

Figure 2

The decreasing ODR implies that selection bias is decreasing, but it is not clear whether this trend also applies to focal hypothesis tests (see limitations section). The lack of an increase in the EDR implies that researchers continue to conduct studies with low statistical power and that the non-significant results often remain unpublished. To improve credibility of this journal, editors could focus on power rather than statistical significance in the review process.

Figure 3 shows the false discovery risk (FDR) and the estimated replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

Figure 3

There was a significant linear trend for the ERR, b = .24, SE = .11, indicating an increase in the ERR. The increase in the ERR implies fewer replication failures in the later years. However, because the FDR is not decreasing, a larger portion of these replication failures could be false positives.

Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary criterion to make decisions about a hypothesis. It was used by authors to conclude that an effect is present and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide what alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e., there is an effect even if the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Figure 4

Lowering alpha to .01 reduces the observed discovery rate by 20 to 30 percentage points. It is also interesting that the ODR decreases more with alpha = .05 than for other alpha levels. This suggests that changes in the ODR are in part caused by fewer p-values between .05 and .01. These significant results are more likely to result from unscientific methods and often do not replicate.

Figure 5 shows the effects of alpha on the false positive risk. Lowering alpha to .01 reduces the false positive risk to less than 5%. Thus, readers can use this criterion to reduce the false positive risk to an acceptable level.

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

For the journal Cognition & Emotion a small set of articles were hand-coded as part of a study on the effects of open science reforms on the credibility of psychological science. Figure 6 shows the z-curve plot and results for 117 focal hypothesis tests.

Figure 6

The main difference between manually and automatically coded data is a much higher ODR (95%) for manually coded data. This finding shows that selection bias for focal hypothesis tests is much more severe than the automatically extracted data suggest.

The point estimate of the EDR, 37%, is similar to the EDR for automatically extracted data, 34%. However, due to the small sample size, the 95%CI for the manually coded data is wide, and it is impossible to draw firm conclusions about the EDR, although hand-coded results from other journals with larger samples show similar estimates.

The ERR estimates are also similar and the 95%CI for hand-coded data suggests that the majority of results are replicable.

Overall, these results suggest that automatically extracted results are informative, but underestimate selection bias for focal hypothesis tests.

Conclusion

The replicability report for Cognition & Emotion shows clear evidence of selection bias, but also a relatively low risk of false positive results that can be further reduced by using alpha = .01 as a criterion to reject the null-hypothesis. There are no notable changes in credibility over time. Editors of this journal could improve credibility by reducing selection bias. The best way to do so would be to evaluate the strength of evidence rather than using alpha = .05 as a dichotomous criterion for acceptance. Moreover, the journal needs to publish more articles that fail to support theoretical predictions. The best way to do so is to accept articles that preregistered predictions and failed to confirm them or to invite registered reports that publish articles independent of the outcome of a study. Readers can set their own level of alpha depending on their appetite for risk, but alpha = .01 is a reasonable criterion because it (a) maintains a false positive risk below 5% and (b) eliminates p-values between .01 and .05 that are often obtained with unscientific practices and fail to replicate.

Link to replicability reports for other journals.

Replicability Report 2023: Acta Psychologica

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can help authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

Acta Psychologica

Acta Psychologica is an old psychological journal that was founded in 1936. The journal publishes articles from various areas of psychology, but cognitive psychological research seems to be the most common area.

So far, Acta Psychologica has published close to 6,000 articles. Nowadays, it publishes about 150 articles a year in 10 annual issues. Over the past 30 years, articles have an average citation rate of 24.48 citations, and the journal has an H-Index of 116 (i.e., 116 articles have received 116 or more citations). The journal has an impact factor of 2 which is typical of most empirical psychology journals.

So far, the journal has published 4 articles with more than 1,000 citations, but all of these articles were published in the 1960s and 1970s. The most highly cited article in the 2000s examined the influence of response categories on the psychometric properties of survey items (Preston & Colman, 2000; 947 citations).

Given the multidisciplinary nature of the journal, the journal has a team of editors. The current editors are Mohamed Alansari, Martha Arterberry, Colin Cooper, Martin Dempster, Tobias Greitemeyer, Matthieu Guitton, and Nhung T Hendy.

Report

Replication reports are based on automatically extracted test-statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1: Z-Curve Plot

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).

A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.97) as a vertical red line. It also shows a dotted vertical red line at z = 1.65 because results with z-scores between 1.65 and 1.97 are often interpreted as evidence for an effect using a more liberal alpha criterion, alpha = .10, a one-sided test, or with qualifiers (e.g., marginally significant). Thus, values in this range cannot be interpreted as reporting of results that failed to support a hypothesis.

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics, but omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6.

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 73%, the expected discovery rate is 44%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in Acta Psychologica.

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result.

An EDR of 44% implies that no more than 5% of the significant results are false positives. The 95%CI puts the upper limit for false positive results at 9%. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be relevant (i.e., practically significant). Readers of original articles need to focus on confidence intervals of effect size estimates and take into account that selection for significance inflates effect size estimates. Thus, published results are likely to show the correct direction of a relationship, but may not provide enough information to determine whether a statistically significant result is theoretically or practically important.

Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published.

The ERR of 72% suggests that the majority of results published in Acta Psychologica are replicable, but the EDR allows for a replication rate as low as 44%. Thus, replicability is estimated to range from 44% to 72%. Actual replications of cognitive research suggest that 50% of results produce a significant result again (Open Science Collaboration, 2015). Taking the low false positive risk into account, most replication failures are likely to be false negatives due to insufficient power in the original and replication studies.

Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. Confidence intervals were created by regressing the estimates on time and time-squared to examine non-linear relationships.

Figure 2 shows the ODR and EDR to examine selection bias. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.62 percentage points per year (SE = .13). The EDR showed no significant trends, p > .20.

Figure 2

The decrease in the ODR implies that selection bias is decreasing over time. Despite the lack of a significant time trend for the EDR, the last years show an increase, and in 2022 there is no evidence of selection bias. This trend may reflect changes in editorial practices in response to the open science movement, but at present the evidence is inconclusive.

Figure 3 shows the false discovery risk (FDR) and the estimated replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

There were no time trends in the ERR (a time trend for the FDR is implied by tests of the EDR). Thus, the estimates for the full period (Figure 1) can be used to get more precise estimates. The same is true for the comparison of the expected failure rate and the FDR. Here, precise estimates are valuable because FDR estimates are imprecise. Based on the results in Figure 1, the EFR is 27% and the FDR is 7%. This suggests that replication failures are more likely to be false negatives. Importantly, this conclusion only holds if the replication study had the same power as the original study. Replication studies with much larger sample sizes overpower the results of original studies because they provide much better estimates of population effect sizes.

Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary criterion to make decisions about a hypothesis. It was used by authors to conclude that an effect is present and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide what alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e.., there is an effect even if the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the false positive risk and the observed discovery rate (lower alpha implies fewer significant results).


Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).


Replicability Reports of Psychology Journals

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability reports use z-curve to provide information about psychological journals. This information can help authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

List of Journals with Links to Replicability Report

Psychological Science (2000-2022)

Acta Psychologica (2000-2022)

Replicability Report 2023: Psychological Science

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability reports use z-curve to provide information about psychological journals. This information can help authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

Psychological Science

Psychological Science is often called the flagship journal of the Association for Psychological Science. It publishes articles from all areas of psychology, but most articles are experimental studies.

The journal started in 1990. So far, it has published over 5,000 articles with an average citation rate of 90 citations per article. The journal currently has an H-Index of 300 (i.e., 300 articles have received 300 or more citations).

Ironically, the most cited article (3,800 citations) is a theoretical article that illustrated how easy it is to produce statistically significant results with statistical tricks that capitalize on chance, increase the risk of a false discovery, and inflate effect size estimates (Simmons, Nelson, & Simonsohn, 2011). This article is often cited as evidence that published results lack credibility. The impact of this article also suggests that most researchers are now aware that selective publishing of significant results is harmful.

Stephen Lindsay was the editor of Psychological Science from 2015 to 2019. During his tenure, Psychological Science introduced open-science badges to improve the replicability of published results.

Report

Replication reports are based on automatically extracted test-statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1 shows a z-curve plot for all articles from 1997-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the nil-hypothesis that there is no statistical relationship (i.e., the effect size is zero and the expected z-score is zero).

A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.97) as a vertical red line. It also shows a dotted vertical red line at z = 1.65 because results with z-scores between 1.65 and 1.97 are often interpreted as evidence for an effect using a more liberal alpha criterion, alpha = .10, a one-sided test, or with qualifiers (e.g., marginally significant). Thus, values in this range cannot be interpreted as reporting of evidence that failed to support a hypothesis.

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics, but omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6.

Z-curve fits a statistical model to the distribution of these z-scores (Figure 1). The grey curve shows the predicted distribution of z-scores. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results.

Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 68%, the expected discovery rate is 34%. Thus, there are roughly two times more significant results than one would expect based on the distribution of significant results. Improvement would require an increase in the EDR, a decrease in the ODR, or both.

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result.

An EDR of 34% implies that no more than 10% of the significant results are false positives. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be relevant (i.e., practically significant). Readers of statistical results in Psychological Science need to examine the range of plausible effect sizes (i.e., confidence intervals) to see whether results have practical significance. Unfortunately, these estimates are inflated by selection bias, especially when the evidence is weak and the confidence interval already includes effect sizes close to zero.

Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published.

The ERR of 67% suggests that the majority of results published in Psychological Science is replicable, compared with an estimate of 41% based on actual replications (Open Science Collaboration, 2015). In contrast, the estimate based on actual replications is higher than the EDR of 34%. Thus, replicability is estimated to range from 34% to 67%.

Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. The credibility statistics (ODR, EDR, ERR) are plotted in Figure 2 with 95% confidence intervals based on a regression model with linear and quadratic predictors. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.45 percentage points per year (SE = .09). The ERR and EDR showed significant quadratic trends, b = .28, SE = .05 and b = .13, SE = .02, respectively. Figure 2 shows that credibility (EDR, ERR) decreased in the first decade of this century and then increased in the wake of the replication crisis. The positive trend since 2015 can be attributed to the reforms by Stephen Lindsay that were maintained by the current editor Patricia Bauer.

Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary criterion to make decisions about a hypothesis. It was used by authors to conclude that an effect is present and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide what alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e., there is an effect even if the p-value is above alpha).

During the dark period from 2005 to 2015, the EDR could have been as low as 10% and the false discovery risk as high as 40%. This level is unacceptably high. Figure 3 shows the false discovery risk for different levels of alpha.

The results show that even reducing alpha to .01 reduces the false discovery risk considerably. This finding is consistent with evidence that results with p-values between .05 and .01 often fail to replicate. Setting alpha to .005, as suggested by several authors, ensures a false positive risk below 5% even during the dark period of Psychological Science.

Due to the improvement, alpha = .01 achieves the same goal in the later years without discarding as many results as the .005 criterion would.

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

For the journal Psychological Science, hand-coded data are available from coding by Motyl et al. (2017) and my own lab. The datasets were combined and analyzed with z-curve (Figure 4).

The ODR of 84% is higher than the ODR of 68% for automatic extraction. The EDR of 34% is identical to the estimate for automatic extraction. The ERR of 61% is 8 percentage points lower than the ERR for automatic extraction. Given the period effects on z-curve estimates, I also conducted a z-curve analysis for automatically extracted tests for the matching years (2003, 2004, 2010, 2016, 2020). The results were similar, ODR = 73%, EDR = 25%, and ERR = 64%. Thus, automatically extracted results produce estimates similar to those based on hand-coded data. The main difference is that non-significant results are less likely to be focal tests.

Conclusion

The replicability report for Psychological Science shows (a) clear evidence of selection bias, (b) unacceptably high false positive risks at the conventional criterion for statistical significance, and (c) modest replicability. However, time trend analyses show that the credibility of published results decreased in the beginning of this century, but improved since 2015. Further improvements are needed to eliminate selection bias and increase the expected discovery rate by increasing power (reducing sampling error). Reducing sampling error is also needed to produce strong evidence against theoretical predictions that are important for theory development. The present results can be used as a benchmark for further improvements that can increase the credibility of results in psychological science (e.g., more Registered Reports that publish results independent of outcomes). The results can also help readers of psychological science to choose significance criteria that match their personal preferences for risk and their willingness to “err on the side of discovery” (Bem, 2004).

The Relationship between Positive Affect and Negative Affect: It’s Complicated

About 20 years ago, I was an emotion or affect researcher. I was interested in structural models of affect, which was a hot research topic in the 1980s (Russell, 1980; Watson & Tellegen, 1985; Diener & Iran-Nejad, 1986; Shaver et al., 1987). In the 1990s, a consensus emerged that the structure of affect has a two-dimensional core, but a controversy remained about the basic dimensions that create the two-dimensional space. One model assumed that Positive Affect and Negative Affect are opposite ends of a single dimension (like hot and cold are opposite ends of a bipolar temperature dimension). The other model assumed that Positive Affect and Negative Affect are independent dimensions. This controversy was never resolved, probably because neither model is accurate (Schimmack & Grob, 2000).

When Seligman was pushing positive psychology as a new discipline in psychology, I was asked to write a chapter for a Handbook of Methods in Positive Psychology. This was a strange request because it is questionable whether Positive Psychology is really a distinct discipline and there are no distinct methods to study topics under the umbrella term positive psychology. Nevertheless, I obliged and wrote a chapter about the relationship between Positive Affect and Negative Affect that questions the assumption that positive emotions are a new and previously neglected topic and the assumption that Positive Affect can be studied separately from Negative Affect. The chapter basically summarized the literature on the relationship between PA and NA up to that point, including some mini meta-analyses that shed light on moderators of the relationship between PA and NA.

As with many handbooks that are expensive and not easily available as electronic documents, the chapter had very little impact on the literature. Web of Science shows only 25 citations. As the topic is still unresolved, I thought I would make the chapter available as a free text in addition to the Google Books option, which is a bit harder to navigate.

Here is a PDF version of the chapter.

Key points

  • The correlation between PA and NA varies as a function of items, response formats, and other method factors.
  • Pleasure and displeasure are not opposite ends of a single bipolar dimension.
  • Pleasure and displeasure are not independent.

Psychological Science and Real World Racism

The prompt for this essay is my personal experience with accusations of racism in response to my collaboration with my colleague Judith Andersen and her research team, who investigated the influence of race on shooting errors in police officers’ annual certification (Andersen, Di Nota, Boychuk, Schimmack, & Collins, 2023a). Our article was heavily criticized as racially insensitive and racially biased (Williams et al., 2023). We responded to the specific criticisms of our article (Andersen, Di Nota, Boychuk, Schimmack, & Collins, 2023b). This essay takes a broader perspective on the study of race-related topics in psychological science. It is also entirely based on my own experiences and views, and I do not speak for my colleagues.

Science

The term science is used to distinguish claims that are backed up by scientific facts from claims that are based on other evidence or belief systems. For people who believe in science, these claims have a stronger influence on their personal belief systems than other claims. Take “flat-earth theorists” as an example. Most educated people these days believe that the Earth is round and point to modern astronomy as a science that supports this claim. However, some people seriously maintain the belief that the earth is flat (https://en.wikipedia.org/wiki/Behind_the_Curve). Debates between individuals or groups who do and do not “follow the science” are futile. In this regard, believing in science is like a religion. This article is addressed to readers who “believe in science.”

What does it mean to believe in science? A fundamental criterion that distinguishes science from other belief systems is falsifiability. At some point, empirical evidence has to be able to correct pre-existing beliefs. For this to happen, the evidence has to be strong. For example, there should be little doubt about the validity of the measures (e.g., thermometers are good measures of temperature) and the replicability of the results (different research teams obtain the same results). When these preconditions are fulfilled, scientific discoveries are made and knowledge is gained (e.g., better telescopes produce new discoveries in astronomy, microscopes showed the influence of bacteria on diseases, etc.). The success of Covid-19 vaccines (if you believe in science) was possible due to advances in microbiology. The modern world we live in would not exist without actions by individuals who believe in science.

Psychological Science

Psychological science emerged in the late 19th century as an attempt to use the scientific method to study human experiences and behavior. The biggest success stories in psychological science can be found in areas that make it possible to conduct tightly controlled laboratory studies. For example, asking people to name the ink color of color words shows a robust effect: it is harder to name the color when the word and the color do not match (e.g., saying “green” when the word purple is printed in green).

Psychological science of basic phenomena like perception and learning has produced many robust scientific findings. Many of these findings are so robust because they are universal; that is, shared by all humans. This is consistent with other evidence that humans are more alike than different from each other and that peripheral differences like height, hair texture, and pigmentation are superficial differences and not symptoms of clearly distinguishable groups of humans (different races).

Social Psychology

Social psychology emerged as a sub-discipline of psychological science in the 1950s. A major goal of social psychology was to use the methods of psychological science to study social behaviors with bigger social implications than the naming of colors. The most famous studies from the 1950s tried to explain the behavior of Germans during World War II who were involved in the Holocaust. The famous Milgram experiments, for example, showed that social pressure can have a strong influence on behavior. Asch showed that conformity pressure can make people say things that are objectively false. These studies are still powerful today because they used actual behaviors as the outcome. In Milgram’s studies, participants were led to believe that they gave electric shocks to another person who screamed in pain.

From the beginning, social psychologists were also interested in prejudice (Allport, 1954), at a time when the United States was segregated and blatantly racist. White Americans’ racial attitudes were easy to study because White Americans openly admitted that they did not consider White and Black Americans to be equal. For example, in the 1950s, nearly 100% of Americans disapproved of interracial marriages, which were also illegal in some states at that time.

It was more difficult to study the influence of racism on behavior. To ensure that behavior is influenced by an individual’s race and not some other factor (psychology jargon for a cause), it is necessary to keep all other causes constant, randomly assign participants to the two conditions, and show a difference in outcome. My search for studies of this type revealed only a handful of studies with small student samples that showed no evidence of prejudice (e.g., Genthner & Taylor, 1973). There are many reasons why these studies may have failed to produce evidence of prejudice. For example, participants knew that they were in a study and that their behaviors were observed, which may have influenced how they behaved. Most important is the fact that the influence of prejudice on behavior was not a salient topic in social psychology.

This changed in the late 1980s (at a time when I became a student of psychology), when social psychologists became interested in unconscious processes that were called implicit processes (Devine, 1989). The novel idea was that racial biases can influence behavior outside of conscious awareness. Thus, some individuals might claim that they have no prejudices, but their behaviors show otherwise. Twenty years later, this work led to the claim that most White people have racial biases that influence their behavior even if they do not want to (Banaji & Greenwald, 2013).

Notably, in the late 1980s, 40% of US Americans still opposed interracial marriages, showing that consciously accessible, old-fashioned racism was still prevalent in the United States. However, the primary focus of social psychologists was not the study of prejudice but the study of unconscious/implicit processes; implicit prejudice was just one of many implicit topics under investigation.

While the implicit revolution led to hundreds of studies that examined White people’s behaviors in response to Black and White persons, the field also made an important methodological change. Rather than studying real behaviors toward real people, most studies examined how fast participants can press a button in response to a stimulus (e.g., a name, a face, or simply the words Black/White) on a computer screen. The key problem with this research is that button presses on computer screens are not the same as button presses on dating profiles or pressing the trigger of a gun during a use-of-force situation.

This does not mean that these studies are useless, but it is evident that they cannot produce scientific evidence about the influence of race on behavior in the real world. In the jargon of psychological science, these studies lack external validity (i.e., the results cannot be generalized from button presses in computer tasks to real world behaviors).

Psychological Science Lacks Credibility

Psychology faces many challenges to be recognized as a science equal to physics, chemistry, or biology. One major challenge is that the behaviors of humans vary a lot more than the behaviors of electrons, atoms, or cells. As a result, many findings in social psychology are general trends that explain only a small portion of the variability in behavior (e.g., some White people are in interracial relationships). To deal with this large amount of variability (noise, randomness), psychologists rely on statistical methods that aim to detect small effects on the variability in behavior. Since the beginning of psychological science, the method used to find these effects has been null-hypothesis significance testing, or simply significance testing (Is p < .05?). Although this method has been criticized for decades, it continues to be taught to undergraduate students and is used to make substantive claims in research articles.

The problem with significance testing is that it is designed to confirm researchers’ hypotheses, but it cannot falsify them. Thus, the statistical tool cannot serve the key function of science to inform researchers that their ideas are wrong. As researchers are human and humans already have a bias to find evidence that supports their beliefs, significance testing is an ideal tool for scientists to delude themselves that their claims are supported by scientific evidence (p < .05), when their beliefs are wrong.

Awareness of this problem increased after a famous social psychologist, Daryl Bem, used NHST to convince readers that humans have extrasensory perception and can foresee future events (Bem, 2011). Attesting to the power of confirmation bias, Bem still believes in ESP, but the broader community has realized that the statistical practices in social psychology are unscientific and that decades of published research lacks scientific credibility. It did not help that a replication project found that only 25% of published results in the most prestigious journals of social psychology could be replicated.

Despite growing awareness about the lack of credible scientific evidence, claims about prejudice and racism in textbooks, popular books, and media articles continue to draw on this literature because there is no better evidence (yet). The general public and undergraduate students make the false assumption that social psychologists are like astronomers who are interpreting the latest pictures from the new space telescope. Social psychologists are mainly presenting their own views as if they were based on scientific evidence, when there is no scientific evidence to support these claims. This explains why social psychologists often vehemently disagree about important issues. There is simply no shared empirical evidence that resolves these conflicts.

Thus, the disappointing and honest answer is that social psychology simply cannot provide scientific answers to real world questions about racial biases in behavior. Few studies actually examined real behavior, studies of button presses on computers have little ecological validity, and published results are often not replicable.

The Politicization of Psychological Science

In the absence of strong and unambiguous scientific evidence, scientists are no different from other humans and confirmation biases will influence scientists’ beliefs. The problem is that the general public confuses their status as university professors and researchers with expertise that is based on superior knowledge. As a result, claims by professors and researchers in journal articles or in books, talks, or newspaper interviews are treated as if they deserve more weight than other views. Moreover, other people may refer to the views of professors or their work to claim that their own views are scientific because they echo those printed in scientific articles. When these claims are not backed by strong scientific evidence, scientific articles become weaponized in political conflicts.

A scientific article on racial biases in use of force errors provides an instructive example. In 2019, social psychologist Joseph Cesario and four graduate students published an article on racial disparities in use of force errors by police (a.k.a., unnecessary killings of US civilians). The article passed peer-review at a prestigious scientific journal, the Proceedings of the National Academy of Sciences (PNAS). Like many journals these days, PNAS asks authors to provide a Public Significance Statement.

The key claim in the significance statement is that the authors found “no evidence of anti-Black or anti-Hispanic disparities across shootings.” Scientists may look at this statement and realize that it is not equivalent to the claim that “there is no racial bias in use of force errors.” First of all, the authors clearly say that they did not find evidence. This leaves the possibility that other people looking at the same data might have found evidence. Among scientists it is well known that different analyses can produce different results. Scientists also know the important distinction between the absence of evidence and evidence of the absence of an effect. The significance statement does not say that the results show that there are no racial biases, only that the authors did not find evidence for biases. However, significance statements are not written for scientists, and it is easy to see how these statements could be (unintentionally or intentionally) misinterpreted as saying that science shows that there are no racial biases in police killings of innocent civilians.

And this is exactly what happened. Black-Lives-Anti-Matter Heather Mac Donald used this research as “scientific evidence” to support the claim that the liberal left is fighting an unjustified “War on Cops.” Her bio on Wikipedia shows that she received degrees in English, without any indication that she has a background in science. Yet, the Wall Street Journal allowed her to summarize the evidence in an opinion article with the title “The myth of systemic police racism.” Thus, a racially biased and politically motivated non-scientist was able to elevate her opinion by pointing to the PNAS article as evidence that her opinion is the truth.

In this particular case, the journal was forced to retract the article after post-publication peer review revealed statistical errors in the paper and it became clear that the significance statement was misleading. An editorial reviewed this case study of politicized science in great detail (Massey & Waters, 2020).

Although this editorial makes it clear that mistakes were made, it doesn’t go far enough in admitting the mistakes that were made by the journal editors. Most important, even if the authors had not made mistakes, it would be wrong to allow for any generalized conclusions in a significance statement. The clearest significance statement would be that “This is only one study of the issue with limitations and the evidence is insufficient to draw conclusions based on this study alone.” But journals are also motivated to exaggerate the importance of articles to increase their prestige.

The editorial also fails to acknowledge that the authors, reviewers, and editor were White and that it is unlikely that the article would have made misleading statements if African American researchers were involved in the research, peer-review, or the editorial decision process. To African Americans the conclusion that there is no racial bias in policing is preposterous, while it seemed plausible to the White researchers who gave the work the stamp of approval. Thus, this case study also illustrates the problems of systemic racism in psychology that African Americans are underrepresented and often not involved in research that directly affects them and their community.

My Colleague’s Research with Police Officers

My colleague, Judith Andersen, is a trained health psychologist with a focus on stress and health. One area of her research is how police officers cope with stress and traumatic experiences they encounter in their work. This research put her in a unique position to study racial biases in the use of force with actual police officers (in contrast, many social psychologists studied shooting games with undergraduate students). Getting the cooperation of police departments and individual officers to study such a highly politicized topic is not easy, and without cooperation there are no participants, no data, and no scientific evidence. A radical response to this reality would be to reject any data that require police officers’ consent. That is a principled response, but it is not a fair criticism of researchers who conduct such studies, note the requirement of consent as a potential limitation, and refrain from making bold statements that their data settle a political issue.

The actual study is seemingly simple. Officers have to pass a use-of-force test for certification to keep their service weapon on duty. To do so, officers go through a series of three realistic scenarios with their actual service weapon and do not know whether “shoot” or “don’t shoot” is the right response. Thus, they may fail the test if they fail to shoot in scenarios where shooting is the right response. The novel part of the study was to create two matched scenarios with a White or Black suspect and to randomly assign participating officers to these scenarios. Holding all other possible causes constant makes it possible to see whether shooting errors are influenced by the race of a suspect.

After several journals, including PNAS, showed no interest in this work, it was eventually accepted for publication by the editor of The Canadian Journal of Behavioural Science. The journal also requires a Significance statement and we provided one.

Scientists might notice that our significance statement is essentially identical to Johnson et al.’s fateful significance statement. In plain English, we did not find evidence of racial biases in shooting errors. The problem is that significance testing often leads to the confusion of lack of evidence with evidence of no bias. To avoid this misinterpretation, we made it clear that our results cannot be interpreted as evidence that there are no biases. To do so, we emphasized that the shooting errors in the sample did show a racial bias. However, we could not rule out that this bias was unique to this sample and that the next sample might show no bias or even the opposite bias. We also pointed out that the bias in this sample might be smaller than the actual bias and that the actual bias might fully account for the real-world disparities. In short, our significance statement is an elaborate, jargony way of saying “our results are inconclusive and have no real-world significance.”

It is remarkable that the editor published our article because 95% of articles in psychology present a statistically significant result that justifies a conclusion. This high rate of successful studies, however, is a problem because selective publishing of only significant results undermines the credibility of published results. Even crazy claims like mental time travel are supported by statistically significant results. Only the publication of studies that failed to replicate these results helps us to see that the original results were wrong. It follows that journals have to publish articles with inconclusive results to be credible and researchers have to be allowed to present inconclusive results to ensure that conclusive results are trustworthy. It also follows that not all scientific articles are in need of media attention and publicity. The primary aim of scientific journals is communication among scientists and maintaining a record of scientific results. Even Nadal or Federer did not win every tournament. So, scientists should be allowed to publish articles that are not winners, and nobody should trust scientists who only publish articles that confirm their predictions.

It is also noteworthy that our results were inconclusive because the sample size was too small to draw stronger conclusions. However, it was the first study of its kind and it was already a lot of effort to get even these data. The primary purpose of publishing a study like this is to stimulate interest and to provide an example for future studies. Eventually, the evidence base grows and more conclusive results could be obtained. Ultimately it is up to the general public and policy makers to fund this research and to require participation of police departments in studies of racial bias. It would be foolish to criticize our study because it didn’t produce conclusive results in the first investigation. Even if the study had produced statistically significant results, replication studies would be needed before any conclusions can be drawn.

Social Activism in Science

Williams et al. (2023) wrote a critical commentary on our article with the title “Performative Shooting Exercises Do Not Predict Real-World Racial Bias in Police Officers.” We were rather surprised by this criticism because our main finding was basically a non-significant, inconclusive result. Apparently, this was not the result that we were supposed to get, or we should not have reported results that contradict Williams et al.’s beliefs. Williams et al. start with the strong belief that any well-designed scientific study must find evidence for racial biases in shooting errors; otherwise there must be a methodological flaw. They are not shy to communicate this belief in their title. Our study of shooting errors during certification is called performative, and the exercises “do not predict real world racial biases in police officers.” The question is how Williams et al. (2023) know the real-world racial biases of police officers to make this claim.

The answer is that they do not know anything more than anybody else about the real racial biases of police officers (you are invited to read the commentary and see whether I missed that crucial piece of information). Their main criticism is that we made unjustified assumptions about the external validity of the certification task: “The principal flaw with Andersen et al.’s (2023) paper is unscientific assumptions around the validity of the recertification shooting test.” That is, the bias that we observed in the certification task is taken at face value as information about the bias in real-world shooting situations.

The main problem with this criticism is that we never made the claim that biases in the certification task can be used to draw firm conclusions about biases in the real world. We even pointed out that we observed biases and that our results are consistent with the assumption that all of the racial disparities in real-world shootings are caused by racial biases in the immediate shooting decisions. As it turns out, Williams et al.’s critique is unscientific because it makes unscientific claims in the title and misrepresents our work. Our real sin was to be scientific and to publish inconclusive results that do not fit into the narrative of anti-police leftwing propaganda.

It is not clear why the authors were allowed to make many false and hurtful statements in their commentary, but personally I think it is better to have this example of politicization in the open to show that left-wing and right-wing political activists are trying to weaponize science to elevate their beliefs to the status of truth and knowledge.

Blatant examples of innocent African Americans killed by police officers (Wikipedia) are a reason to conduct scientific studies, but these incidents cannot be used to evaluate the scientific evidence. And without good science, resources might be wasted on performative implicit bias training sessions that only benefit the trainers and do not protect the African American community.

Conclusion

The simple truth remains that psychological science has done little to answer real-world questions about race. Although prejudice and intergroup relationships are core topics of social psychology, the research is often too removed from the real world to be meaningful. Unfortunately, incentives reward professors for using the inconclusive evidence selectively to confirm their own beliefs and then presenting these beliefs as scientific claims. These pseudo-scientific claims are then weaponized by like-minded ideologues. This creates the illusion that we have scientific evidence, an illusion that is contradicted by the fact that opposing camps both cite science to believe they are right, just as opposing armies can pray to the same God for victory.

To change this, stakeholders in science, like government funding organizations, need to change the way money is allocated. Rather than giving grants to White researchers at elite universities to do basic (a.k.a., irrelevant) research on button presses of undergraduate students, money should be given to diverse research teams with a mandate to answer practical, real-world questions. The reward structure at universities also has to change. Collecting real-world data from 150 police officers is 100 times more difficult than collecting 20 brain measures from undergraduate students. Yet, a publication in a neuroscience journal is seen as more scientific and prestigious than an article in a journal that addresses real-world problems that are by nature of interest to smaller communities.

Finally, it is important to recognize that a single study cannot produce conclusive answers to important and complex questions. All the major modern discoveries in the natural (real) sciences are made by teams. Funders need to provide money for teams that work together on a single important question rather than funding separate labs who work against each other. This is not new and has been said many times before, but so far there is little evidence of change. As a result, we have more information about galaxies millions of years ago than about our own behaviors and the persistent problem of racism in modern society. Don’t look to the scientists to provide a solution. Real progress has and will come from social activists and political engagement. And with every generation, more old racists will be replaced by a more open new generation. This is the way.

Who Holds Meta-Scientists Accountable?

“Instead of a scientist entering the environment in search of the truth, we find the rather unflattering picture of a charlatan trying to make the data come out in a manner most advantageous to his or her already-held theories” (Fiske & Taylor, 1984, p. 88).

A critical examination of Miller and Ulrich’s article “Optimizing Research Output: How Can Psychological Research Methods Be Improved?” (https://doi.org/10.1146/annurev-psych-020821-094927)

Introduction

Meta-science has become a big business over the past decade because science has become a big business. With the increase in scientific production and the minting of Ph.D. students, competition has grown and researchers are pressured to produce ever more publications to compete with each other. At the same time, academia still pretends that it plays by the rules of English lords, with “peer”-review and a code of honor. Even outright fraud is often treated like jaywalking.

The field of meta-science puts researchers’ behaviors under the microscope and often reveals shady practices and shoddy results. However, meta-scientists are subject to the same pressures as the scientists they examine. They get paid, promoted, and funded based on the quantity of their publications, and citations. It is therefore reasonable to ask whether meta-scientists are any more trustworthy than other scientists. Sadly, that is not the case. Maybe this is not surprising because they are human like everybody else. Maybe the solution to human biases will be artificial intelligence programs. For now, the only way to reduce human biases is to call them out whenever you see them. Meta-scientists do not need meta-meta-scientists to hold them accountable, just like meta-scientists are not needed to hold scientists accountable. In the end, scientists hold each other accountable by voicing scientific criticism and responding to these criticisms. The key problem is that open exchange of arguments and critical discourse is often lacking because insiders use peer-review and other hidden power structures to silence criticism.

Here I want to use the chapter “Optimizing Research Output: How Can Psychological Research Methods Be Improved?” by Jeff Miller and Rolf Ulrich as an example of biased and unscientific meta-science. The article was published in the Annual Review of Psychology, a series that publishes invited review articles. One of the editors is Susan Fiske, a social psychologist who once called critical meta-scientists like me “method terrorists” because they make her field look bad. So far, this series has published several articles on the replication crisis in psychology with titles like “Psychology’s Renaissance.” I was never asked to write or review any of these articles, although I have been invited to review articles on this topic by several editors of other journals. However, Miller and Ulrich did cite some of my work and I was curious to see how they cited it.

Consistent with the purpose of the series, Miller and Ulrich claim that their article provides “a (mostly) nontechnical overview of this ongoing metascientific work.” (p. 692). They start with a discussion of possible reasons for low replicability.

2. WHY IS REPLICABILITY SO POOR?

They state that “there is growing consensus that the main reason for low replication rates is that many original published findings are spurious” (p. 693).

To support this claim they point out that psychology journals mostly publish statistically significant results (Sterling, 1959; Sterling et al., 1995), and then conclude that “current evidence of low replication rates tends to suggest that many published findings are FPs [false positives] rather than TPs [true positives].” This claim is simply wrong because it is very difficult to distinguish false positives from true positives that were tested with very low power to produce a significant result. They do not mention attempts to estimate the false positive rate (Jager & Leek, 2014; Gronau et al., 2016; Schimmack & Bartos, 2021). These methods typically show low to moderate estimates of the false positive rate and do not justify the claim that most replication failures occur when an article reported a false positive result.

Miller and Ulrich now have to explain how false positive results can enter the literature in large numbers when the alpha criterion of .05 is supposed to keep most of these results out of publications. They propose that many “FPs [false positives] may reflect honest research errors at many points during the research process” (p. 694). This argument ignores the fact that concerns about shady research practices first emerged when Bem (2011) published a 9-study article that seemed to provide evidence for pre-cognition. According to Miller and Ulrich, we have to believe that Bem made 9 honest errors in a row that miraculously produced evidence for his cherished hypothesis that pre-cognition is real. If you believe this is possible, you do not have to read further and I wish you a good life. However, if you share my skepticism, you might feel relieved that there is actually meta-scientific evidence that Bem used shady practices to produce his evidence (Schimmack, 2018).

3. STATISTICAL CAUSES OF FALSE POSITIVES

Honest mistakes alone cannot explain a high percentage of false positive results in psychology journals. Another contributing factor has to be that psychologists test a lot more false hypotheses than true hypotheses. Miller and Ulrich suggest that in social psychology only 1 out of 10 hypothesis tests tests a true hypothesis. Research programs with such a high rate of false hypotheses are called high-risk. However, this description does not fit the format of typical social psychology articles that have lengthy theory sections and often state “as predicted” in the results section, often repeatedly for similar studies. Thus, there is a paradox. Either social psychology is risky and results are surprising, or it is theory-driven and results are predicted. It cannot be both.

Miller and Ulrich ignore the power of replication studies to reveal false positive results. This is not only true in articles with multiple replication studies, but across different articles that publish conceptual replication studies of the same theoretical hypothesis. How is it possible that all of these conceptual replication studies produced significant results, when the hypothesis is false? The answer is that researchers simply ignored replication studies that failed to produce the desired results. This selection bias, also called publication bias, is well-known and never called an honest mistake.

All of this gaslighting serves to present social psychologists as honest and competent researchers. High false positive rates and low replication rates happen “for purely statistical reasons, even if researchers use only the most appropriate scientific methods.” This is bullshit. Competent researchers would not hide non-significant results and continue to repeatedly test false hypotheses, while writing articles that claim all of the evidence supports their theories. Replication failures are not an inevitable statistical phenomenon. They are man-made in the service of self-preservation during early career stages and ego-preservation during later ones.

4. SUGGESTIONS FOR REDUCING FALSE POSITIVES

Conflating false positives and replication failures, Miller and Ulrich review suggestions to improve replication rates.

4.1. Reduce the α Level

One solution to reducing false positive results is to lower the significance threshold. An influential article called for alpha to be set to .005 (i.e., only 1 out of 200 tests of a true null hypothesis can produce a false positive result). However, Miller and Ulrich falsely cite my 2012 article in support of this suggestion. This ignores that my article made a rather different recommendation, namely to conduct fewer studies with a higher probability of providing evidence for a true hypothesis. This would also reduce the false positive rate without having to lower the alpha criterion. Apparently, they didn’t really read or understand my article.

4.2 Eliminate Questionable Research Practices

A naive reader might think that eliminating shady research practices should help to increase replication rates and to reduce false positive rates. For example, if all results had to be published, researchers would think twice about the probability of obtaining a significant result. Which sane researcher would test their cherished hypothesis twice with 50% power, that is, a 50% probability of finding evidence for it? Just like flipping a coin twice, the chance of getting at least one embarrassing non-significant result would be 75%. Moreover, if they had to publish all of their results, it would be easy to detect hypotheses with low replication rates and either give up on them or increase sample sizes to detect small effect sizes. Not surprisingly, consumers of scientific research (e.g., undergraduate students) assume that results are reported honestly, and scientific integrity statements often imply that this is the norm.
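The arithmetic in this example is easy to verify; here is a minimal sketch using the 50% power figure from the text (the larger study counts are just extra illustrations):

```python
# Probability of at least one non-significant result when every study has 50% power.
power = 0.5
for k in (2, 3, 5):  # number of studies; k = 2 matches the example in the text
    print(f"{k} studies: {1 - power ** k:.0%} chance of at least one non-significant result")
```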

However, Miller and Ulrich try to spin this topic in a way that suggests shady practices are not a problem. They argue that shady practices are not as harmful as some researchers have suggested, citing my 2020 article, “because QRPs also increase power by making it easier to reject null hypotheses that are false as well as those that are true (e.g., Ulrich & Miller 2020).” Let’s unpack this nonsense in more detail.

Yes, questionable research practices increase the chances of obtaining a significant result independent of the truth of the hypothesis. However, if researchers test only 1 true hypothesis for every 9 false hypotheses, QRPs have a much more severe effect on the rate of significant results for false hypotheses (i.e., when the null-hypothesis is true). A false hypothesis starts with a low probability of a significant result when researchers are honest, namely 5% with the standard criterion of significance. In contrast, a true hypothesis can have anywhere between 5% and 100% power, limiting the room for shady practices to inflate the rate of significant results when the hypothesis is true. In short, the effects of shady practices are not equal, and false hypotheses benefit more from shady practices than true hypotheses.
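A small simulation can make this asymmetry concrete. The sketch below is my own illustration, not an analysis from Miller and Ulrich: it uses a single QRP (optional stopping, i.e., testing repeatedly while adding participants) and arbitrary choices for the effect size, sample sizes, and number of interim tests. The relative inflation of the significance rate is much larger for the false hypothesis, which starts at the nominal 5%.

```python
# Illustrative simulation (assumed parameters): significance rates with and
# without optional stopping, for a false hypothesis (d = 0) and a true one (d = 0.5).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def sig_rate(d, optional_stopping, n_start=20, n_step=10, n_max=50,
             alpha=0.05, sims=2000):
    """Share of simulated two-sample experiments that reach p < alpha,
    either with a single test at n_max or with repeated tests while
    adding participants (optional stopping)."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(0.0, 1.0, n_max)
        b = rng.normal(d, 1.0, n_max)
        ns = range(n_start, n_max + 1, n_step) if optional_stopping else [n_max]
        if any(stats.ttest_ind(a[:n], b[:n]).pvalue < alpha for n in ns):
            hits += 1
    return hits / sims

for d in (0.0, 0.5):  # false hypothesis (null true) vs. true hypothesis
    honest = sig_rate(d, optional_stopping=False)
    qrp = sig_rate(d, optional_stopping=True)
    print(f"d = {d}: honest = {honest:.1%}, optional stopping = {qrp:.1%}")
```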

The second problem is that Miller and Ulrich conflate false positives and replication failures. Shady practices in original studies will also produce replication failures when the hypothesis is true. The reason is that shady practices lead to inflated effect size estimates, while the outcome of an honest replication study is based on the true population effect size. As the true effect size is often 50% smaller than the inflated estimates in published articles, replication studies with similar sample sizes are bound to produce non-significant results (Open Science Collaboration, 2015). Again, this is true even if the hypothesis is true (i.e., the effect size is not zero).
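This mechanism can also be shown with a minimal simulation; the true effect size and sample size below are arbitrary assumptions, not estimates from any particular study. Selection for significance roughly doubles the average published effect size in this illustration, and exact replications with the same sample size then succeed at a rate close to the low true power.

```python
# Illustrative sketch (assumed parameters): selection for significance inflates
# published effect sizes, so same-n replications often fail even though the
# true effect is not zero.
import numpy as np

rng = np.random.default_rng(2)
d_true, n, sims = 0.3, 30, 20000       # per-group n of 30 gives roughly 20% power
se = np.sqrt(2 / n)                     # approximate standard error of d

orig_d = rng.normal(d_true, se, sims)   # observed effects in "original" studies
published = orig_d[(orig_d / se) > 1.96]        # only significant positive results get published

rep_d = rng.normal(d_true, se, published.size)  # exact replications with the same n
rep_success = (rep_d / se) > 1.96

print(f"true d: {d_true}")
print(f"mean published d: {published.mean():.2f}")            # inflated relative to 0.3
print(f"replication success rate: {rep_success.mean():.1%}")  # close to the ~20% true power
```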

4.3 Increase Power

As Miller and Ulrich point out, increasing power has been a recommendation to improve psychological science (or a recommendation for psychology to become a science) for a long time (Cohen, 1962). However, they point out that this recommendation is not very practical because “it is very difficult to say what sample sizes are needed to attain specific target power levels, because true effect sizes are unknown” (p. 698). This argument against proper planning of sample sizes is false for several reasons.

First, I advocated for higher power in the context of multi-study papers. Rather than conducting 5 studies with 20% power, researchers should use their resources to conduct one study with 80% power. The main reason researchers do not do this is that the single study might still not produce a significant result and they are allowed to hide underpowered studies that failed to produce a significant result. Thus, the incentive structure that rewards publication of significant results rewards researchers who conduct many underpowered studies and only report those that worked. Of course, Miller and Ulrich avoid discussing this reason for the lack of proper power analysis to maintain the image that psychologists are honest researchers with the best intentions.

Second, researchers do not need to know the population effect size to plan sample sizes. One way to plan future studies is to base the sample size on previous studies. This is of course what researchers have been doing only to find out that results do not replicate because the original studies used shady practices to produce significant results. Many graduate students who left academia spent years of their Ph.D. trying to replicate published findings and failed to do so. However, all of these failures remain hidden so that power analyses based on published effect sizes lead to more underpowered studies that do not work. Thus, the main reason why it is difficult to plan sample sizes is that the published literature reports inflated effect sizes that imply small samples are sufficient to have adequate power.

Finally, it is possible to plan studies with the minimal effect size of interest. These studies are useful because a non-significant result implies that the hypothesis is not important even if the strict nil-hypothesis is false. The effect size is just so small that it doesn’t really matter, and studying such tiny effects would require extremely large samples. Nobody would be interested in doing studies of these irrelevant effects that require large resources. However, to know that the population (true) effect size is too small to matter, it is important to conduct studies that are able to estimate small effect sizes precisely. In contrast, Miller and Ulrich warn that sample sizes could be too large because large samples “provide high power to detect effects that are too small to be of practical interest” (p. 698). This argument is rooted in the old statistical approach of ignoring effect sizes and being satisfied with the conclusion that the effect size is not zero, p < .05, what Cohen called nil-hypothesis testing and others have called a statistical ritual. Sample sizes are never too large because larger samples provide more precision in the estimation of effect sizes, which is the only way to establish that a true effect size is too small to be important. A study that defines the minimum effect size of interest and uses this effect size as the null-hypothesis can determine whether the effect is relevant or not.
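As an illustration of planning a sample around a smallest effect size of interest, the following sketch uses the standard power calculation in statsmodels; the effect sizes and alpha levels are hypothetical choices for illustration, not values recommended by Miller and Ulrich.

```python
# Illustrative sketch: per-group sample sizes needed to detect a hypothetical
# smallest effect size of interest with 80% power at two alpha levels.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d_min in (0.2, 0.5):          # hypothetical smallest effect sizes of interest
    for alpha in (0.05, 0.005):
        n_per_group = analysis.solve_power(effect_size=d_min, alpha=alpha, power=0.80)
        print(f"d_min = {d_min}, alpha = {alpha}: n per group ≈ {n_per_group:.0f}")
```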

4.4. Increase the Base Rate

Increasing the base rate means testing more true hypotheses. Of course, researchers do not know a priori which hypotheses are true or not. Otherwise, the study would not be necessary (actually, many studies in psychology test hypotheses where the null-hypothesis is false a priori, but that is a different issue). However, hypotheses can be more or less likely to be true based on existing knowledge. For example, exercise is likely to reduce weight, but counting backwards from 100 to 1 every morning is not likely to reduce weight. Many psychological studies are at least presented as tests of theoretically derived hypotheses. The better the theory, the more often a hypothesis is true and a properly powered study will produce a true positive result. Thus, theoretical progress should increase the percentage of true hypotheses that are tested. Moreover, good theories would even make quantitative predictions about effect sizes that can be used to plan sample sizes (see previous section).

Yet, Miller and Ulrich conclude that “researchers have little direct control over their base rates” (p. 698). This statement is not only inconsistent with the role of theory in the scientific process, it is also inconsistent with the nearly 100% success rate in published articles that always show the predicted results, if only because the prediction was made after the results were observed rather than from an a priori theory (Kerr, 1998).

In conclusion, Miller and Ulrich’s review of recommendations is abysmal and only serves to exonerate psychologists from the justified accusation that they are playing a game that looks like science but is not science: researchers are rewarded for publishing significant results that provide no evidence for their hypotheses, because even false hypotheses produce significant results with the shady practices that psychologists use.

5. OBJECTIONS TO PROPOSED CHANGES

Miller and Ulrich start this section with the statement that “although the above suggestions for reducing FPs all seem sensible, there are several reasonable objections to them” (p. 698). Remember one of the proposed changes was to curb the use of shady practices. According to Miller and Ulrich there is a reasonable objection to this recommendation. However, what would be a reasonable objection to the request that researchers should publish all of their data, even those that do not support their cherished theory? Every undergraduate student immediately recognizes that selective reporting of results undermines the essential purpose of science. Yet, Miller and Ulrich want readers to believe that there are reasonable objections to everything.

“Although to our knowledge there have been no published objections to the idea that QRPs should be eliminated to hold the actual Type 1 error rate at the nominal α level, even this suggestion comes with a potential cost. QRPs increase power by providing multiple opportunities to reject false null hypotheses as well as true ones” (p. 699).

Apparently, academic integrity only applies to students, but not to their professors when they go into the lab. Dropping participants, removing conditions, dependent variables, or entire studies, or presenting exploratory results as if they were predicted a priori are all ok because these practices can help to produce a significant result even when the nil-hypothesis is false (i.e., there is an effect).

This absurd objection has several flaws. First, it is based on the old and outdated assumption that the only goal of studies is to decide whether there is an effect or not. However, even Miller and Ulrich earlier acknowledged that effect sizes are important. Sometimes effect sizes are too small to be practically important. What they do not tell their readers is that shady practices produce significant results by inflating effect sizes, which can lead to the false impression that the true effect size is large, when it is actually tiny. For example, the effect size of an intervention to reduce implicit bias on the Implicit Association Test was d = .8 in a sample of 30 participants and shrank to d = .08 in a sample of 3,000 participants (cf. Schimmack, 2012). What looked like a promising intervention when shady practices were used, turned out to be a negligible effect in an honest attempt to investigate the effect size.

The other problem is of course that shady practices can produce significant results when a hypothesis is true and when a hypothesis is false. If all studies are statistically significant, statistical significance no longer distinguishes between true and false hypotheses (Sterling, 1959). It is therefore absurd to suggest that shady practices can be beneficial because they can produce true positive results. The problem with shady practices is the same as with a liar: liars sometimes say something true and sometimes they lie, but you don’t know when they are honest and when they are lying.

9. CONCLUSIONS

The conclusion merely solidifies Miller and Ulrich’s main point that there are no simple recommendations to improve psychological science. Even the value of replications can be debated.

“In a research scenario with a 20% base rate of small effects (i.e., d = 0.2), for example, a researcher would have the choice between either running a certain number of large studies with α = 0.005 and 80% power, obtaining results that are 97.5% replicable, or running six times as many small studies with α = 0.05 and 40% power, obtaining results that are 67% replicable. It is debatable whether choosing the option producing higher replicability would necessarily result in the fastest scientific progress.”
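The quoted percentages can be reproduced under one reading of “replicable,” namely as the probability that a significant result reflects a true effect (the positive predictive value). That reading is my assumption; the sketch below simply checks the arithmetic with the quoted base rate, power, and alpha values.

```python
# Check of the quoted figures, assuming "replicable" means the positive
# predictive value of a significant result.
def ppv(base_rate: float, power: float, alpha: float) -> float:
    true_pos = base_rate * power
    false_pos = (1 - base_rate) * alpha
    return true_pos / (true_pos + false_pos)

print(f"{ppv(0.20, 0.80, 0.005):.1%}")  # ~97.6%, matching the quoted 97.5%
print(f"{ppv(0.20, 0.40, 0.05):.1%}")   # ~66.7%, matching the quoted 67%
```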

Fortunately, we have a real example of scientific progress to counter Miller and Ulrich’s claim that fast science leads to faster scientific progress. The lesson comes from molecular genetics research. When it became possible to measure variability in the human genome, researchers were quick to link variations in one specific gene to variation in phenotypes. This candidate-gene research produced many significant results. However, unlike journals in psychological science, journals in this area of research also published replication failures, and it became clear that discoveries could often not be replicated. This entire approach has been replaced by collaborative projects that rely on very large data sets and many genetic predictors to find relationships. Most important, researchers reduced the criterion for significance from .05 to .00000005 to increase the ratio of true positives to false positives. The need for large samples slows down this research, but at least this approach has produced some solid findings.

In conclusion, Miller and Ulrich pretend to engage in a scientific investigation of scientific practices and a reasonable discussion of their advantages and disadvantages. However, in reality they are gaslighting their readers and fail to point out a simple truth about science. Science is built on trust, and trust requires honest and trustworthy behavior. The replication crisis in psychology has revealed that psychological science is not trustworthy because researchers use shady practices to support their cherished theories. While they pretend to subject their theories to empirical tests, the tests are a sham and rigged in their favor. The researcher always wins because they have control over the results that are published. As long as these shady practices persist, psychology is not a science. Miller and Ulrich disguise this fact in a seemingly scientific discussion of trade-offs, but there is no trade-off between honesty and lying in science. Only scientists who report all of their data and analysis decisions can be trusted. This seems obvious to most consumers of science, but it is not the current reality. Psychological scientists who are fed up with the dishonest reporting of results in psychology journals created the term Open Science to call for transparent reporting and open sharing of data, but these aspects of science are integral to the scientific method. There is no such thing as closed science, where researchers go to their lab and then present a gold nugget and claim to have created it in their lab. Without open and transparent sharing of the method, nobody should believe them. The same is true for contemporary psychology. Given the widespread use of shady practices, it is necessary to be skeptical and to demand evidence that shady practices were not used.

It is also important to question the claims of meta-psychologists. Do you really think it is ok to use shady practices because they can produce significant results when the nil-hypothesis is false? This is what Miller and Ulrich want you to believe. If you see a problem with this claim, you may wonder what other claims are questionable and not in the best interest of science and consumers of psychological research. In my opinion, there is no trade-off between honest and dishonest reporting of results. One is science, the other is pseudo-science. But hey, that is just my opinion and the way the real sciences work. Maybe psychological science is special.

Gaslighting about Replication Failures

This blog post is a review of a manuscript that hopefully will never be published, but it probably will be. In that case, it is a draft for a PubPeer comment. As the ms. is under review, I cannot share the actual ms., but the review makes clear what the authors are trying to do.

Review

I assume that I was selected as a reviewer for this manuscript because the editor recognized my expertise in this research area.  While most of my work on replicability has been published in the form of blog posts, I have also published a few peer-reviewed publications that are relevant to this topic. Most important, I have provided estimates of replicability for social psychology using the most advanced method to do so, z-curve (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020), using the extensive coding by Motyl et al. (2017) (see Schimmack, 2020).  I was surprised that this work was not mentioned.

In contrast, Yeager et al.’s (2019) replication study of 12 experiments is cited, and as I recall 11 of the 12 studies replicated successfully. So, it is not clear why this study is cited as evidence that replication attempts often produce “pessimistic results.”

While I agree that many explanations have been offered for replication failures, I do not agree that listing all of these explanations is impossible or that it is reasonable to focus on only some of them, especially if the main reason is left out. Namely, the main reason for replication failures is that original studies are conducted with low statistical power and only those that achieve significance are published (Sterling et al., 1995; Schimmack, 2020).  Omitting this explanation undermines the contribution of this article.

The listed explanations are

(1) original articles making use of questionable research practices that result in Type I errors

This explanation conflates two problems. QRPs are used to get significance when the power to do so is low, but we do not know whether the population effect size is zero (in which case the original result was a Type I error) or above zero (in which case the replication failure was a Type II error).

(2) original research’s pursuit of counterintuitive findings that may have lower a priori probabilities and thus poor chances at replication

This explanation assumes that there are a lot of Type I errors, but we don’t really know whether the population effect sizes are zero or not. So, this is not a separate explanation, but rather an explanation of why we might have many Type I errors, assuming that we do have many Type I errors, which we do not know.

(3) the presence of unexamined moderators that produce differences between original and replication research (Dijksterhuis, 2014; Simons et al., 2017),

This citation ignores that empirical tests of this hypothesis have failed to provide evidence for it (van Bavel et al., 2016).

(4) specific design choices in original or replication research that produce different conclusions (Bouwmeester et al., 2017; Luttrell et al., 2017; Noah et al., 2018).

This argument is not different from (3). Replication failures are attributed to moderating factors that are always possible because exact replications are impossible.

To date, discussions of possible explanations for poor replication have generally been presented as distinct accounts for poor replication, with little attempt being made to organize them into a coherent conceptual framework. 

This claim ignores my detailed discussion of the various explanations including some not discussed by the authors (Schooler decline effect; Fiedler, regression to the mean; Schimmack, 2020).

The selection of journals is questionable. Psychological Science is not a general (meta-)psychological journal. Instead, there are two journals, The Journal of General Psychology and Meta-Psychology, that contain relevant articles.

The authors then introduce Cook and Campbell’s typology of validity and try to relate it to accounts of replication failures based on some work by Fabrigar et al. (2020). This attempt is flawed because validity is a broader construct than replicability or reliability.  Measures can be reliable and correlations can be replicable even if the conclusions drawn from these findings are invalid. This is Intro Psych level stuff.

Statistical conclusion validity is concerned with the question of “whether or not two or more variables are related.”  This is of course nothing else than the distinction between true and false conclusions based on significant or non-significant results.  As noted above, even statistical conclusion validity is not directly related to replication failures because replication failures do not tell us whether the population effect size is zero or not.  Yet, we might argue that there is a risk of false positive conclusions when statistical significance is achieved with QRPs and these results do not replicate. So, in some sense statistical conclusion validity is tied to the replication crisis in experimental social psychology.

Internal validity is about the problem of inferring causality from correlations. This issue has nothing to do with the replication crisis because replication failures can occur in experiments and correlational studies. The only indirect link to internal validity is that experimental social psychology prided itself on the use of between-subject experiments to maximize internal validity and minimize demand effects, but often used ineffective manipulations (priming) that required QRPs to get significance especially in the tiny samples that were used because experiments are more time-consuming and labor intensive. In contrast, survey studies often are more replicable because they have larger samples.  But the key point remains, it would be absurd to explain replication failures directly as a function of low internal validity.

Construct validity is falsely described as “the degree to which the operationalizations used in the research effectively capture their intended constructs.” The problem here is the term operationalization. Once a construct is operationalized with some procedure, it is defined by that procedure (intelligence is what the IQ test measures) and there is no way to challenge the validity of the construct. In contrast, measurement implies that constructs exist independently of one specific procedure and that it is possible to examine how well a measure reflects variation in the construct (Cronbach & Meehl, 1955). That said, there is no relationship between construct validity and replicability because systematic measurement error can produce spurious correlations between measures in correlational studies that are highly replicable (e.g., socially desirable responding). In experiments, systematic measurement error will attenuate effect sizes, but it will do so equally in original and replication studies. Thus, low construct validity also provides no explanation for replication failures.

External validity is defined as “the degree to which an effect generalizes to different populations and contexts.” This validation criterion is also only loosely related to replication failures, namely when there are concerns about contextual sensitivity or hidden moderators. A replication study in a different population or context might fail because the population effect size varies across populations or contexts. While this is possible, there is little evidence that contextual sensitivity is a major factor.

In short, invoking validity in explanations of replication failures or the replication crisis is a red herring. Replicability is necessary but not sufficient for good science.

It is therefore not surprising that the authors found most discussions of replication failures focus on statistical conclusion validity. Any other finding would make no sense. It is just not clear why we needed a text analysis to reveal this.

However, the authors seem to be unable to realize that the other types of validity are not related to replication failures when they write “What does this study add? Identifies that statistical conclusion validity is over-emphasized in replication analysis”

Over-emphasized???  This is an absurd conclusion based on a failure to make a clear distinction between replicability/reliability and validity.  

Ulrich Schimmack

Replicability of Research in Frontiers in Psychology

Summary

The z-curve analysis of results in this journal shows (a) that many published results are based on studies with low to modest power, (b) selection for significance inflates effect size estimates and the discovery rate of reported results, and (c) there is no evidence that research practices have changed over the past decade. Readers should be careful when they interpret results and recognize that reported effect sizes are likely to overestimate real effect sizes, and that replication studies with the same sample size may fail to produce a significant result again. To avoid misleading inferences, I suggest using alpha = .005 as a criterion for valid rejections of the null-hypothesis. Using this criterion, the risk of a false positive result is below 2%. I also recommend computing a 99% confidence interval rather than the traditional 95% confidence interval for the interpretation of effect size estimates.

Given the low power of many studies, readers also need to avoid the fallacy of interpreting non-significant results as evidence for the absence of an effect. With 50% power, the results can easily switch in a replication study, so that a significant result becomes non-significant or a non-significant result becomes significant. However, selection for significance makes it more likely that a significant result becomes non-significant in a replication than the other way around.
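To make the 50% power scenario concrete, here is a minimal simulation sketch (this is not the z-curve method itself; the one-sample design, the sample size of 50, and an effect size chosen to give roughly 50% power are illustrative assumptions):

```python
# Minimal simulation sketch (not part of the z-curve method): with roughly 50% power,
# how often does a significant original result stay significant in an exact replication?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, alpha, runs = 50, 0.05, 20_000
d = stats.norm.ppf(1 - alpha / 2) / np.sqrt(n)   # assumed effect size, chosen so power is ~50%

def sig(effect):
    # one-sample t-tests on `runs` simulated studies with n observations each
    x = rng.normal(effect, 1, size=(runs, n))
    return stats.ttest_1samp(x, 0, axis=1).pvalue < alpha

orig, rep = sig(d), sig(d)
print("power (share significant):", orig.mean())                               # ~ .49
print("P(replication significant | original significant):", rep[orig].mean())  # also ~ .49
```

About half of the significant results flip to non-significant in the replication, which is the point made above.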

The average power of studies in a heterogeneous journal like Frontiers in Psychology provides only circumstantial evidence for the evaluation of individual results. When other information is available (e.g., a z-curve analysis of a discipline, author, or topic), it may be more appropriate to use that information.

Report

Frontiers in Psychology was created in 2010 as a new online-only journal for psychology. It covers many different areas of psychology, although some areas have specialized Frontiers journals, like Frontiers in Behavioral Neuroscience.

The business model of Frontiers journals relies on publication fees paid by authors, while published articles are freely available to readers.

The number of articles published in Frontiers in Psychology has increased quickly, from 131 in 2010 to 8,072 in 2022 (source: Web of Science). With over 8,000 articles published in 2022, Frontiers in Psychology is an important outlet for psychological researchers to publish their work. Many specialized print journals publish fewer than 100 articles a year. Thus, Frontiers in Psychology offers a broad and large sample of psychological research that is equivalent to a composite of 80 or more specialized journals.

Another advantage of Frontiers in Psychology is that it has a relatively low rejection rate compared to specialized journals with limited journal space. While high rejection rates may allow journals to prioritize exceptionally good research, articles published in Frontiers in Psychology are more likely to reflect the common research practices of psychologists.

To examine the replicability of research published in Frontiers in Psychology, I downloaded all published articles as PDF files, converted the PDF files to text files, and extracted test statistics (F-, t-, and z-tests) from the published articles. Although this method does not capture all published results, there is no a priori reason to expect that results reported in this format differ from other results. More importantly, changes in research practices, such as higher power due to larger samples, would be reflected in all statistical tests.

As Frontiers in Psychology only started shortly before the replication crisis in psychology increased awareness about the problems of low statistical power and selection for significance (publication bias), I was not able to examine replicability before 2011. I also found little evidence of changes in the years from 2010 to 2015. Therefore, I use this time period as the starting point and benchmark for later years.

Figure 1 shows a z-curve plot of results published from 2010 to 2014. All test-statistics are converted into z-scores. Z-scores greater than 1.96 (the solid red line) are statistically significant at alpha = .05 (two-sided) and typically used to claim a discovery (rejection of the null-hypothesis). Sometimes even z-scores between 1.65 (the dotted red line) and 1.96 are used to reject the null-hypothesis either as a one-sided test or as marginal significance. Using alpha = .05, the plot shows 71% significant results, which is called the observed discovery rate (ODR).
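For readers who want to see what the conversion into z-scores looks like in practice, here is a small sketch of the general approach (the four test statistics below are made up for illustration and are not taken from any article):

```python
# Sketch of the conversion described above: turn reported test statistics into absolute
# z-scores via their two-sided p-values, then count the share above 1.96 (the ODR).
from scipy import stats

def z_from_t(t, df):
    p = 2 * stats.t.sf(abs(t), df)   # two-sided p-value of a t-test
    return stats.norm.isf(p / 2)     # absolute z-score with the same p-value

def z_from_F(F, df1, df2):
    p = stats.f.sf(F, df1, df2)      # F-tests are one-sided by construction
    return stats.norm.isf(p / 2)

z_scores = [z_from_t(2.3, 48), z_from_t(1.7, 28), z_from_F(5.6, 1, 120), z_from_F(2.1, 2, 90)]
odr = sum(z > 1.96 for z in z_scores) / len(z_scores)
print([round(z, 2) for z in z_scores], "ODR =", odr)
```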

Visual inspection of the plot shows a peak of the distribution right at the significance criterion. It also shows that the frequency of z-scores drops sharply on the left side of the peak, where results do not reach the criterion for significance. This wonky distribution cannot be explained by sampling error. Rather, it shows a selective bias to publish significant results by means of questionable practices such as not reporting failed studies or inflating effect sizes by means of statistical tricks. To quantify the amount of selection bias, z-curve fits a model to the distribution of significant results and estimates the distribution of non-significant results (i.e., the grey curve in the range of non-significant results). The discrepancy between the observed distribution and the expected distribution shows the file drawer of missing non-significant results. Z-curve estimates that only 31% of all conducted tests were expected to produce a significant result. This is called the expected discovery rate (EDR). Thus, there are more than twice as many significant results as the statistical power of studies justifies (71% vs. 31%). Confidence intervals around these estimates show that the discrepancy is not just due to chance, but reflects active selection for significance.

Using a formula developed by Soric (1989), it is possible to estimate the false discovery risk (FDR), that is, the probability that a significant result was obtained without a real effect (a type-I error). The estimated FDR is 12%. This may not seem alarming, but the risk varies as a function of the strength of evidence (the magnitude of the z-score). Z-scores that correspond to p-values close to p = .05 have a higher false positive risk, and large z-scores have a smaller false positive risk. Moreover, even true results are unlikely to replicate when significance was obtained with inflated effect sizes. The most optimistic estimate of replicability is the expected replication rate (ERR) of 69%. This estimate, however, assumes that a study can be replicated exactly, including the same sample size. Actual replication rates are often lower than the ERR and tend to fall between the EDR and ERR. Thus, the predicted replication rate is around 50%. This is slightly higher than the replication rate in the Open Science Collaboration replication of 100 studies, which was 37%.
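Soric's formula is simple enough to check by hand; the sketch below applies it to the EDR reported above (the function name is mine and not part of any package):

```python
# Soric's (1989) upper bound on the false discovery risk, applied to the EDR above.
def soric_fdr(edr, alpha=0.05):
    """Maximum false discovery risk implied by a discovery rate (Soric, 1989)."""
    return (1 / edr - 1) * alpha / (1 - alpha)

print(round(soric_fdr(0.31), 3))   # ~0.117, i.e., the 12% estimate reported above
```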

Figure 2 examines how things have changed in the next five years.

The observed discovery rate decreased slightly, but statistically significantly, from 71% to 66%. This shows that researchers reported more non-significant results. The expected discovery rate increased from 31% to 40%, but the overlapping confidence intervals imply that this is not a statistically significant increase at the alpha = .01 level (if two 95% confidence intervals do not overlap, the difference is significant at roughly alpha = .01). Although smaller, the difference between the ODR of 66% and the EDR of 40% is statistically significant and shows that selection for significance continues. The ERR estimate did not change, indicating that significant results are not obtained with more power. Overall, these results show only modest improvements, suggesting that most researchers who publish in Frontiers in Psychology continue to conduct research in the same way as they did before, despite ample discussions about the need for methodological reforms such as a priori power analysis and reporting of non-significant results.
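The rule of thumb in parentheses can be verified with a quick calculation for the idealized case of two independent estimates with equal standard errors (an assumption I am adding for illustration; it is not part of the z-curve analysis):

```python
# Quick check of the confidence-interval rule of thumb for two independent estimates
# with equal standard errors.
import numpy as np
from scipy import stats

z95 = stats.norm.isf(0.025)                # half-width of a 95% CI in SE units (~1.96)
z_diff = 2 * z95 / np.sqrt(2)              # z-score of the difference when the CIs just touch
p_diff = 2 * stats.norm.sf(z_diff)
print(round(z_diff, 2), round(p_diff, 4))  # ~2.77 and p ~ .0056, i.e., roughly alpha = .01
```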

The results for 2020 show that the increase in the EDR was a statistical fluke rather than a trend. The EDR returned to the level of 2010-2015 (29% vs. 31%), but the ODR remained lower than in the beginning, showing slightly more reporting of non-significant results. The file drawer remains large, with an ODR of 66% and an EDR of only 29%.

The EDR results for 2021 look better again, but the difference from 2020 is not statistically significant. Moreover, the results for 2022 show a lower EDR that matches the EDR at the beginning.

Overall, these results show that results published in Frontiers in Psychology are selected for significance. While the observed discovery rate is in the upper 60%s, the expected discovery rate is around 35%. Thus, the ODR is nearly twice as high as the power of the studies to produce these results. Most concerning is that a decade of meta-psychological discussions about research practices has not produced any notable changes in the amount of selection bias or the power of studies to produce replicable results.

How should readers of Frontiers in Psychology articles deal with this evidence that some published results were obtained with low power and inflated effect sizes that will not replicate? One solution is to retrospectively change the significance criterion. Comparisons of the evidence in original studies and replication outcomes suggest that studies with a p-value below .005 tend to replicate at a rate of 80%, whereas studies with just significant p-values (.050 to .005) replicate at a much lower rate (Schimmack, 2022). Demanding stronger evidence also reduces the false positive risk. This is illustrated in the last figure that uses results from all years, given the lack of any time trend.

In this figure, the solid red line is moved to z = 2.8, the value that corresponds to p = .005 (two-sided). Using this more stringent criterion for significance, only 45% of the z-scores are significant. Another 25% were significant with alpha = .05 but are no longer significant with alpha = .005. Because power decreases when alpha is set to a more stringent, lower level, the EDR is also reduced, to only 21%. Thus, there is still selection for significance. However, the more effective significance filter also selects for more studies with high power, and the ERR remains at 72%, even with alpha = .005 for the replication study. If the replication study used the traditional alpha level of .05, the ERR would be even higher, which explains the finding that the actual replication rate for studies with p < .005 is about 80%.

The lower alpha also reduces the risk of false positive results, even though the EDR is reduced. The FDR is only 2%. Thus, the null-hypothesis is unlikely to be true. The caveat is that the standard null-hypothesis in psychology is the nil-hypothesis and that the population effect size might be too small to be of practical significance. Thus, readers who interpret results with p-values below .005 should also evaluate the confidence interval around the reported effect size, using the more conservative 99% confidence interval that corresponds to alpha = .005 rather than the traditional 95% confidence interval. In many cases, this confidence interval is likely to be wide and provide insufficient information about the strength of an effect.
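The numbers in this and the previous paragraph can be reproduced with a few lines of code; the effect size and standard error in the confidence-interval example are hypothetical values chosen only to illustrate how wide a 99% interval can be:

```python
# Reproducing the numbers above: the z-value for p = .005 (two-sided), Soric's false
# discovery risk with alpha = .005 and an EDR of 21%, and a 99% CI for a hypothetical
# effect size estimate (d = .35, SE = .12 are made-up values for illustration).
from scipy import stats

print(round(stats.norm.isf(0.005 / 2), 2))     # ~2.81, the solid red line in the figure

def soric_fdr(edr, alpha):
    return (1 / edr - 1) * alpha / (1 - alpha)

print(round(soric_fdr(0.21, 0.005), 3))        # ~0.019, the ~2% false discovery risk

d, se = 0.35, 0.12
half = stats.norm.isf(0.01 / 2) * se           # 99% CI half-width (z ~ 2.58)
print(round(d - half, 2), round(d + half, 2))  # a wide interval, as noted in the text
```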

Which Social Psychologists Can you Trust?

Social psychology has an open secret. For decades, social psychologists conducted experiments with low statistical power (i.e., even if the predicted effect is real, their studies had only a low chance of detecting it with p < .05), but their journals were filled with significant (p < .05) results. To achieve significant results, social psychologists used so-called questionable research practices that most lay people or undergraduate students consider to be unethical. The consequences of these shady practices became apparent in the past decade when influential results could not be replicated. The famous reproducibility project estimated that only 25% of published significant results in social psychology are replicable. Most undergraduate students who learn about this fact are shocked and worry about the credibility of the results in their social psychology textbooks.

Today, there are two types of social psychologists. Some are actively trying to improve the credibility of social psychology by adopting open science practices such as preregistration of hypotheses, sharing open data, and publishing non-significant results rather than hiding these findings. However, other social psychologists are actively trying to deflect criticism. Unfortunately, it can be difficult for lay people, journalists, or undergraduate students to make sense of articles that make seemingly valid arguments but only serve to protect the image of social psychology as a science.

As somebody who has followed the replication crisis in social psychology for the past decade, I can provide some helpful information. In this blog post, I want to point out that Duane T. Wegener and Leandre R. Fabrigar have made numerous false arguments against critics of social psychology, and that their latest article, “Evaluating Research in Personality and Social Psychology: Considerations of Statistical Power and Concerns About False Findings,” ignores the replication crisis in social psychology and the core problem of selectively publishing significant results from underpowered studies.

The key point of their article is that “statistical power should be de-emphasized in comparison to current uses in research evaluation” (p. 1105).

To understand why this is a strange recommendation, it is important to understand that power is simply the probability of producing evidence for an effect when an effect exists. When the criterion for evidence is a p-value below .05, power is the probability of obtaining this desired outcome. One advantage of high power is that researchers get the correct result. In contrast, a study with low power is likely to produce a wrong result, a so-called type-II error: the study tested a correct hypothesis, but the results fail to provide sufficient support for it. Because these failures can have many causes (low power or a wrong theory), they are difficult to interpret and to publish. Often these studies remain unpublished, the published record is biased, and resources are wasted. Thus, high power is a researcher's friend. To make a comparison, if you could gamble on a slot machine with a 20% chance of winning or one with an 80% chance of winning, which machine would you pick? The answer is simple: everybody would pick the machine with the higher chance of winning. The problem is only that researchers have to invest more resources in a single study to increase power. They may not have enough money or time to do so. So, they are more like desperate gamblers. You need a publication, you don't have enough resources for a well-powered study, so you run a low-powered study and hope for the best. Of course, many desperate gamblers lose and are then even more desperate. That is where the analogy ends. Unlike gamblers in a casino, researchers are their own dealers and can use a number of tricks to get the desired outcome (Simmons et al., 2011). Suddenly, a study with only 20% power (the chance of winning honestly) can have a chance of winning of 80% or more.
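To make the 20% versus 80% comparison concrete, here is a rough power calculation using a normal approximation for a two-sided, two-sample test; the effect size of d = .4 and the group sizes are illustrative assumptions, not values from Wegener and Fabrigar's article:

```python
# Rough power calculation (normal approximation, two-sided two-sample test).
import numpy as np
from scipy import stats

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power for a standardized mean difference d with n per group."""
    z_crit = stats.norm.isf(alpha / 2)
    ncp = d * np.sqrt(n_per_group / 2)   # expected z-score of the test statistic
    return stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

print(round(power_two_sample(0.4, 20), 2))    # ~0.24: the desperate gambler's machine
print(round(power_two_sample(0.4, 100), 2))   # ~0.81: close to the conventional 80% benchmark
```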

This brings us to the second advantage of high-powered studies. Power determines the outcome of a close replication study. If a researcher conducted a study with 20% power and found some tricks to get significance, the probability of replicating the result honestly is again only 20%. Many unsuspecting graduate students have wasted precious years trying to build on studies that they were not able to replicate. Unless they quickly learned the dark art of obtaining significant results with low power, they did not have a competitive CV to get a job. Thus, selective publishing of underpowered studies is demoralizing and rewards cheating.

None of this is a concern for Wegener and Fabrigar, who do not cite influential articles about the use of questionable research practices (John et al., 2012) or my own work that uses estimates of observed power to reveal those practices (Schimmack, 2012; see also Francis, 2012). Instead, they suggest that “problems with the overuse of power arise when the pre-study concept of power is used retrospectively to evaluate completed research” (p. 1115). The only problem that arises from estimating the actual power of completed studies, however, is the detection of questionable practices that produce more reported significant results (often 100%) than the power of the studies would lead one to expect. Of course, for researchers who want to use QRPs to produce inflated evidence for their theories, this is a problem. However, for consumers of research, the detection of questionable results is desirable so that they can ignore this evidence in favor of honestly reported results based on properly powered studies.

The bulk of Wegener and Fabrigar's article discusses the relationship between power and the probability of false positive results. A false positive result occurs when a statistically significant result is obtained in the absence of a real effect. The standard criterion of statistical significance, p < .05, implies that a researcher who tests 100 false hypotheses (no real effects) is expected to obtain 95 non-significant results and 5 false positive results. This may sound sufficient to keep false positive results at a low level. However, the false positive risk is a conditional probability based on a significant result. If a researcher conducts 100 studies, obtains 5 significant results, and interprets these results as real effects, the researcher has a false positive rate of 100% because 5 significant results are expected by chance alone. An honest researcher would conclude from a series of studies with only 5 out of 100 significant results that they found no evidence for a real effect.

Now let's consider a researcher who conducted 100 studies and obtained 24 significant results. As 24 is a lot more than the 5 studies expected by chance alone, the researcher can conclude that at least some of the 24 significant results are caused by real effects. However, it is also possible that some of these results are false positives. Soric (1989; not cited by Wegener and Fabrigar) derived a simple formula to estimate the false discovery risk. The formula makes the assumption that studies of real effects have 100% power to detect the effect. As a result, there are zero studies of real effects that fail to produce a significant result. This assumption makes it possible to estimate the maximum percentage of false positive results.

In this simple example, we have 4 false positive results and 20 studies with evidence for a real effect. Thus, the false positive risk is 4 / 24 = 17%. While 17% is a lot more than 5%, it is still fairly low and does not warrant claims that “most published results are false” (Ioannidis, 2005). Yet, it is also not very reassuring if 17% of published results might be false positives (e.g., imagine 17% of cancer treatments actually did not work). Moreover, based on a single study, we do not know which of the 24 results are true and which are false. With a probability of 17% (about 1 in 6), trusting a result is like playing Russian roulette. The solution to this problem is to conduct a replication study. In our example, the 20 true effects will produce significant results again because they were obtained with 100% power to do so. In contrast, each of the 4 false positive results has only a 5% chance of being significant again, and the chance of obtaining a false positive twice in a row is 5/100 * 5/100 = 25 / 10,000 = 0.25%. So, with high-powered studies, a single replication study can separate true and false original findings.
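The arithmetic of this example can be written out in a few lines (a sketch of Soric's logic, using my own variable names):

```python
# Worked version of the example above: 100 studies, 24 significant results, and Soric's
# assumption that true effects are tested with 100% power.
alpha, n_studies, n_significant = 0.05, 100, 24

# Solve n_significant = true_effects * 1.0 + (n_studies - true_effects) * alpha
true_effects = (n_significant - alpha * n_studies) / (1 - alpha)                  # 20.0
false_positives = n_significant - true_effects                                    # 4.0
print(true_effects, false_positives, round(false_positives / n_significant, 2))  # 20.0 4.0 0.17

# In an exact replication, the 20 true effects are significant again (100% power),
# while each false positive has only an alpha = 5% chance of being significant again.
```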

Things look different in a world with low-powered studies. Let's assume that studies have only 25% power to produce a significant result, which is in accordance with the success rate of replication studies in social psychology (Open Science Collaboration, 2015).

In this scenario, there is only 1 false positive result and the false positive risk is only 1 out of 21, ~5%. Of course, researchers do not know this and have to wonder whether some of the 21 significant results are false positives. When they conduct a replication study, only about 5 of the 21 significant results are expected to replicate (25% of the 20 true effects, plus a 5% chance for the single false positive). Thus, a single replication study does not help to distinguish true and false findings. This leads to confusion and the need for additional studies to separate true and false findings, but low power will produce inconsistent results again and again. The consequences can be seen in the actual literature in social psychology. Many literatures are a selected set of inconsistent results that do not advance theories.
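A short simulation makes the contrast with the high-powered world explicit; the split into 80 true effects and 20 false hypotheses is an assumption chosen only to reproduce the 21 expected significant results in the example above:

```python
# Simulation sketch of the low-power scenario: 80 true effects tested with 25% power and
# 20 false hypotheses tested at alpha = .05, each followed by one exact replication.
import numpy as np

rng = np.random.default_rng(7)
n_true, n_false, power, alpha, runs = 80, 20, 0.25, 0.05, 10_000

orig_true = rng.random((runs, n_true)) < power    # significant originals among true effects
orig_false = rng.random((runs, n_false)) < alpha  # false positive originals
rep_true = rng.random((runs, n_true)) < power     # replications of the true effects
rep_false = rng.random((runs, n_false)) < alpha   # replications of the false hypotheses

discoveries = orig_true.sum(1) + orig_false.sum(1)
fdr = orig_false.sum(1) / discoveries
replicated = (orig_true & rep_true).sum(1) + (orig_false & rep_false).sum(1)

print(round(discoveries.mean(), 1))                  # ~21 significant results per 100 studies
print(round(fdr.mean(), 2))                          # ~0.05 false discovery risk
print(round((replicated / discoveries).mean(), 2))   # only ~0.24 of discoveries replicate
```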

In sum, high powered studies quickly separate true and false findings, whereas low powered studies produce inconsistent results that make it difficult to separate true and false findings (Maxwell, 2004, not cited by Wegener & Fabrigar).

Actions speak louder than Words

Over the past decade, my collaborators and I have developed powerful statistical tools to estimate the power of studies that were conducted (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022; Schimmack, 2012). In combination with Soric's (1989) formula, estimates of actual power can also be used to estimate the real false positive risk. Below, I show some results when this method is applied to social psychology. I focus on the journal Personality and Social Psychology Bulletin (PSPB) for several reasons. First, Wegener and Fabrigar were co-editors of this journal right after concerns about questionable research practices and low power became a highly discussed topic and some journal editors changed policies to increase the replicability of published results (e.g., Steven Lindsay at Psychological Science). Examining the power of studies published in PSPB when Wegener and Fabrigar were editors provides objective evidence about their actions in response to concerns about replication failures in social psychology. Another reason to focus on PSPB is that Wegener and Fabrigar published their defense of low-powered research in this journal, suggesting a favorable attitude towards their position by the current editors. We can therefore examine whether the current editors changed standards or not. Finally, PSPB was edited from 2017 to 2021 by Chris Crandall, who has been a vocal defender of results obtained with questionable research practices on social media.

Let’s start with the years before concerns about replication failures became openly discussed. I focus on the years 2000 to 2012.

Figure 1 shows a z-curve plot of automatically extracted statistical results published in PSPB from 2000 to 2012. All statistical results are converted into z-scores. A z-curve plot is a histogram that shows the distribution of these z-scores. One important aspect of a z-curve plot is the percentage of significant results. All z-scores greater than 1.96 (the solid vertical red line) are statistically significant with p < .05 (two-sided). Visual inspection shows a lot more significant results than non-significant results. More precisely, the percentage of significant results (i.e., the observed discovery rate, ODR) is 71%.

Visual inspection of the histogram also shows a strange shape of the distribution of z-scores. While the peak of the distribution is at the point of significance, the shape of the distribution shows a rather steep drop of z-scores just below z = 1.96. Moreover, some of these z-scores are still used to claim support for a hypothesis, often called marginal significance. Only z-scores below 1.65 (p < .10, two-sided, or p < .05, one-sided; the dotted red line) are usually interpreted as non-significant results. The distribution shows that these results are less likely to be reported. This wonky distribution of z-scores suggests that questionable research practices were used.

Z-curve analysis makes it possible to estimate statistical power based on the distribution of statistically significant results only. Without going into the details of this validated method, the results suggest that the power of the studies (i.e., the expected discovery rate, EDR) would only produce 23% significant results. Thus, the actual percentage of 71% significant results is inflated by questionable practices. Moreover, the 23% estimate is consistent with the fact that only 25% of unbiased replication studies produced a significant result (Open Science Collaboration, 2015). With 23% significant results, Soric's formula yields a false positive risk of 18%. That means roughly 1 out of 5 published results could be a false positive result.

In sum, while Wegener and Fabrigar do not mention replication failures and questionable research practices, the present results confirm the explanation of replication failures in social psychology as a consequence of using questionable research practices to inflate the success rate of studies with low power (Schimmack, 2020).

Figure 2 shows the z-curve plot for results published during Wegener and Fabrigar's reign as editors. The results are easily summarized. There is no significant change. Social psychologists continued to publish ~70% significant results with only ~20% power to do so. Wegener and Fabrigar might argue that there was not enough time to change practices in response to concerns about questionable practices. However, their 2022 article provides an alternative explanation. They do not consider it a problem when researchers conduct underpowered studies. Rather, the problem arises when researchers like me estimate the actual power of studies and reveal the massive use of questionable practices.

The next figure shows the results for Chris Crandall's years as editor. While the percentage of significant results remained at 70%, the power to produce these results increased to 32%. However, there is uncertainty about this increase, and the lower limit of the 95% CI is still only 21%. Even if there was an increase, it would not imply that Chris Crandall caused this increase. A more plausible explanation is that some social psychologists changed their research practices and some of this research was published in PSPB. In other words, Chris Crandall and his editorial team did not discriminate against studies with improved power.

It is too early to evaluate the new editorial team led by Michael D. Robinson, but for the sake of completeness, I am also posting the results for the last two years. The results show a further increase in power to 48%. Even the lower limit of the confidence interval is now 36%. Thus, articles published in PSPB are becoming more powerful, much to the dismay of Wegener and Fabrigar, who believe that “the recent overemphasis on statistical power should be replaced by a broader approach in which statistical and conceptual forms of validity are considered together” (p. 1114). In contrast, I would argue that even an average power of 48% is ridiculously low. An average power of 48% implies that many studies have even less than 48% power.

Conclusion

More than 50 years ago, famous psychologists Amos Tversky and Daniel Kahneman (1971) wrote “we refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis” (p. 110). Wegener and Fabrigar prove them wrong. Not only are they willing to conduct these studies, they even propose that doing so is scientific and that demanding more power can have many negative side-effects. Similar arguments have been made by other social psychologists (Finkel, Eastwick, & Reis, 2017).

I am siding with Kahneman, who realized too late that he placed too much trust in questionable results produced by social psychologists and compared some of this research to a train wreck (Kahneman, 2017). However, there is no consensus among psychologists, and readers of social psychological research have to make up their own minds. This blog post only points out that social psychology lacks clear scientific standards and a proper mechanism to ensure that theoretical claims rest on solid empirical foundations. Researchers are still allowed to use questionable research practices to present overly positive results. At this point, the credibility of results depends on researchers' willingness to embrace open science practices. While many young social psychologists are motivated to do so, Wegener and Fabrigar's article shows that they are facing resistance from older social psychologists who are trying to defend the status quo of underpowered research.

Publication Politics: When Your Invited Submission is Disinvited

I am not the first and I will not be the last to point out that the traditional peer-review process is biased. After all, who would take on the thankless job of editing a journal if it did not come with the influence and power to select articles you like and to reject articles you don't like? Authors can only hope that they find an editor who favors their story during the process of shopping a paper around. This is a long and frustrating process. My friend Rickard Carlsson created a new journal that operates differently, with a transparent review process and virtually no rejection rate. Check out Meta-Psychology. I published two articles there that reported results based on math and computer simulations. Nobody challenged their validity, but other journals rejected the work based on politics (AMMPS rejection).

The biggest event in psychology, especially social psychology, in the past decade (2011-2020) was the growing awareness of the damage caused by selective publishing of significant results. It has long been known that psychology journals nearly exclusively publish statistically significant results (Sterling, 1959). This made it impossible to publish studies with non-significant results that could correct false positive results. It was long assumed that this was not a problem because false positive results are rare. What changed over the past decade was that researchers published replication failures that cast doubt on numerous classic findings in social psychology such as unconscious priming or ego-depletion.

Many, if not most, senior social psychologists have responded to the replication crisis in their field with a variety of defense mechanisms, such as repression or denial. Some have responded with intellectualization/rationalization and were able to publish their false arguments to dismiss replication failures in peer-reviewed journals (Bargh, Baumeister, Gilbert, Fiedler, Fiske, Nisbett, Stroebe, Strack, Wilson, etc., to name the most prominent ones). In contrast, critics had a harder time making their voices heard. Most of my work on this topic has been published in blog posts, in part because I don't have the patience and frustration tolerance to deal with reviewer comments. However, this is not the only reason, and in this blog post I want to share what happened when Moritz Heene and I were invited by Christoph Klauer to write an article on this topic for the German journal Psychologische Rundschau.

For readers who do not know Christoph: he is a very smart social psychologist who worked as an assistant professor with Hubert Feger when I was an undergraduate student. I respect his intelligence and his work, such as his work on the Implicit Association Test.

Maybe he invited us to write a commentary because he knew me personally. Maybe he respected what we had to say. In any case, we were invited to write an article and I was motivated to get an easy ‘peer-reviewed’ publication, even if nobody outside of Germany cares about a publication in this journal.

After submitting our manuscript, I received the following response in German.
I used http://www.DeepL.com/Translator (free version) to share an English version.

Thu 2016-04-14 3:50 AM

Dear Uli,

Thank you very much for the interesting and readable manuscript. I enjoyed reading it and can agree with most of the points and arguments. I think this whole debate will be good for psychology (and hopefully social psychology as well), even if some are struggling at the moment. In any case, the awareness of the harmfulness of some previously widespread habits and the realization of the importance of replication has, in my impression, increased significantly among very many colleagues in the last two to three years.

Unfortunately, for formal reasons, the manuscript does not fit so well into the planned special issue. As I said, the aim of the special issue is to discuss topics around the replication question in a more fundamental way than is possible in the current discussions and forums, with some distance from the current debates. The article fits very well into the ongoing discussions, with which you and Mr. Heene are explicitly dealing with, but it misses the goal of the special issue. I’m sorry if there was a misunderstanding.

That in itself would not be a reason for rejection, but there is also the fact that a number of people and their contributions to the ongoing debates are critically discussed. According to the tradition of the Psychologische Rundschau, each of them would have to be given the opportunity to respond in the issue. Such a discussion, however, would go far beyond the intended scope of the thematic issue. It would also pose great practical difficulties, because of the German language, to realize this with the English-speaking authors (Ledgerwood; Feldman Barrett; Hewstone, however, I think can speak German; Gilbert). For example, you would have to submit the paper in an English version as well, so that these authors would have a chance to read the criticisms of their statements. Their comments would then have to be translated back into German for the readers of Psychologische Rundschau.

All this, I am afraid, is not feasible within the scope of the special issue in terms of the amount of space and time available. Personally, as I said, I find most of your arguments in the manuscript apt and correct. From experience, however, it is to be expected that the persons criticized will have counter-arguments, and the planned special issue cannot and should not provide such a continuation of the ongoing debates in the Psychologische Rundschau. We currently have too many discussion forums in the Psychologische Rundschau, and I do not want to open yet another one.

I ask for your understanding and apologize once again for apparently not having communicated the objective of the special issue clearly enough. I hope you and Mr. Heene will not hold this against me, even though I realize that you will be disappointed with this decision. However, perhaps the manuscript would fit well in one of the Internet discussion forums on these issues or in a similar setting, of which there are several and which are also emerging all the time. For example, I think the Fachgruppe Allgemeine Psychologie is currently in the process of setting up a new discussion forum on the replicability question (although there was also a deadline at the end of March, but perhaps the person responsible, Ms. Bermeitinger from the University of Hildesheim, is still open for contributions).

I am posting this letter now because the forced resignation of Fiedler as editor of Perspectives on Psychological Science made it salient how political publishing in psychology journals is. Many right-wing media outlets commented on this event to support their anti-woke, pro-doze culture wars. They want to maintain the illusion that current science (I focus on psychology here) is free of ideology and only interested in searching for the truth. This is BS. Psychologists are human beings and show in-group bias. When most psychologists in power are old, White men, they will favor old, White men who are like them. Like all systems that work for the people in power, they want to maintain the status quo. Fiedler abused his power to defend the status quo against criticisms of a lack of diversity. He also published several articles to defend (social) psychology against accusations of shoddy practices (questionable research practices).

I am also posting it here because a very smart psychologist stated in private that he agreed with many of the critical comments that we made about replication-crisis deniers. As science is a social game, it is understandable that he never commented on this topic in public. (If he doesn't like that I am making his comments public, he can say that he was just being polite and didn't really mean what he wrote.)

I published a peer-reviewed article on the replication crisis and the shameful response by many social psychologists several years later (Schimmack, 2020). A new generation of social psychologists is trying to correct the mistakes of the previous generation, but as so often, they do so without the support or even against the efforts of the old guard that cannot accept that many of their cherished findings may die with them. But that is life.

Klaus Fiedler is a Victim – of His Own Arrogance

One of the bigger stories in Psychological (WannaBe) Science was the forced resignation of Klaus Fiedler from his post as editor-in-chief at the prestigious journal “Perspectives on Psychological Science.” In response to his humiliating eviction, Klaus Fiedler declared “I am the victim.”

In an interview, he claimed that his actions that led to the vote of no confidence by the Board of Directors of the Association for Psychological Science (APS) were “completely fair, respectful, and in line with all journal standards.” In contrast, the Board of Directors listed several violations of editorial policies and standards.

The APS board listed the following complaints.

  • accept an article criticizing the original article based on three reviews that were also critical of the original article and did not reflect a representative range of views on the topic of the original article; 
  • invite the three reviewers who reviewed the critique favorably to themselves submit commentaries on the critique; 
  • accept those commentaries without submitting them to peer review; and, 
  • inform the author of the original article that his invited reply would also not be sent out for peer review. The EIC then sent that reply to be reviewed by the author of the critical article to solicit further comments.

As bystanders, we have to decide whether these accusations by several board members are accurate or whether they are trumped-up charges that misrepresent the facts and Fiedler is an innocent victim. Even without specific knowledge about this incident and the people involved, bystanders are probably forming an impression about Fiedler and his accusers. First, it is a natural human response to avoid embarrassment after a public humiliation. Thus, Fiedler's claims of no wrongdoing have to be taken with a grain of salt. On the other hand, APS board members could also have motives to distort the facts, although these are less obvious.

To understand the APS board's response to Fiedler's actions, it is necessary to take into account that Fiedler's questionable editorial decisions affected Steven Roberts, an African American scholar, who had published an article about systemic racism in psychology in the same journal under a previous editor (Roberts et al., 2020). Fiedler's decision to invite three White critical reviewers to submit their criticisms as additional commentaries was perceived by Roberts as racially biased. When he made his concerns public, over 1,000 bystanders agreed and signed an open letter asking for Fiedler's resignation. In contrast, an opposing open letter received far fewer signatures. While some of the signatories on both sides have their own biases because they know Fiedler as a friend or foe, most of the signatories did not know anything about Fiedler and reacted to Roberts' description of his treatment. Fiedler never denied that this account was an accurate description of events. He merely claims that his actions were “completely fair, respectful, and in line with journal standards.” Yet, nobody else has supported Fiedler's claim that it is entirely fair and acceptable to invite three White-ish reviewers to submit their reviews as commentaries and to accept these commentaries without peer review.

I conducted an informal and unrepresentative poll that confirmed my belief that inviting reviewers to submit a commentary is rare.

What is even more questionable is that all three reviews agreed with Hommel's critical commentary on Roberts' target article. It is not clear why reviews of a commentary needed to be published as additional commentaries if these reviews agreed with Hommel's commentary. The main point of reviews is to determine whether a submission is suitable for publication. If Hommel's commentary was so deficient that all three reviewers were able to make additional points that were missing from his commentary, his submission should have been rejected, with or without a chance of resubmission. In short, Fiedler's actions were highly unusual and questionable, even if they were not racially motivated.

Even if Fiedler thought that his actions were fair and unbiased when he was acting, the response by Roberts, over 1,000 signatories, and the APS board of directors could have made him realize that others viewed his behavior differently, and maybe recognize that his actions were not as fair as he assumed. He could even have apologized for his actions, or at least for the harm they caused, however unintentional. Yet, he chose to blame others for his resignation: “I am the victim.” I believe that Fiedler is indeed a victim, but not in the way he perceives the situation. Rather than blaming others for his disgraceful resignation, he should blame himself. To support my argument, I will propose a mediation model and provide a case study of Fiedler's response to criticism as empirical support.

From Arrogance to Humiliation

A well-known biblical proverb states that arrogance is the cause of humiliation (“Hochmut kommt vor dem Fall”). I am proposing a mediation model of this assumed relationship. Fiedler is very familiar with mediation models (Fiedler, Harris, & Schott, 2018). A mediation model is basically a causal chain. I propose that arrogance may lead to humiliation because it breeds ignorance. Figure 1 shows ignorance as the mediator. That is, arrogance makes it more likely that somebody discounts valid criticism. In turn, individuals may act in ways that are not adaptive or socially acceptable. This leads to either personal harm or damage to a person's reputation. Arrogance and ignorance also shape the response to social rejection. Rather than making an internal attribution that elicits feelings of embarrassment, an emotion that repairs social relationships, arrogant and ignorant individuals will make an external attribution (blame) that leads to anger, an emotion that further harms social relationships.

Fiedler’s claim that his actions were fair and that he is the victim makes it clear that he made an external attribution. He blames others, but the real problem is that Fiedler is unable to recognize when he is wrong and criticism is justified. This attributional bias is well known in psychology and called a self-serving attribution. To enhance one’s self-esteem, some individuals attribute successes to their own abilities and blame others for their failures. I present a case-study of Fiedler’s response to the replication crisis as evidence that his arrogance blinds him to valid criticism.

Replicability and Regression to the Mean

In 2011, social psychology was faced with emerging evidence that many findings, including fundamental findings like unconscious priming, cannot be replicated. A major replication project found that only 25% of social psychology studies produced a significant result again in an attempt to replicate the original study. These findings have triggered numerous explanations for the low replication rate in social psychology (OSC, 2015; Schimmack, 2020; Wiggins & Christopherson, 2019).

Explanations for the replication crisis in social psychology can be divided into two camps. One camp believes that replication failures reveal major problems with the studies that social psychologists conducted for decades. The other camp argues that replication failures are a normal part of science and that published results can be trusted even if they failed to replicate in recent replication studies. A notable difference between these two camps is that defenders of the credibility of social psychology tend to be established and prominent figures in social psychology. As a result, they also tend to be old, male, and White. However, these surface characteristics are only correlated with views about the replication crisis. The main causal factor is likely to be the threat that replication failures pose to eminent social psychologists' reputations and legacies. Rather than becoming famous names along with Allport, their names may be used to warn future generations about the dark days when social psychologists invented theories based on unreliable results.

Consistent with the stereotype of old, White, male social psychologists, Fiedler has become an outspoken critic of the replication movement and has tried to normalize replication failures. After the credibility of psychology was challenged in news outlets, the board of the German Psychological Society (DGPs) issued a (whitewashing) statement that tried to reassure the public that psychology is a science. The web page has been deleted, but a copy of the statement is preserved here (Stellungnahme). This official statement triggered outrage among some members, and DGPs created a discussion forum (also deleted now). Fiedler participated in this discussion with the claim that replication failures can be explained by a statistical phenomenon known as regression to the mean. He repeated this argument in an email to a reporter that was shared by Mickey Inzlicht in the International Social Cognition Network group (ISCON) on Facebook. This post elicited many commentaries that were mostly critical of Fiedler's attempt to cast doubt on the scientific validity of the replication project. The ISCON post and the comments were deleted (when Mickey left Facebook), but they were preserved in my Google inbox. Here is the post and the most notable comments.

Michael Inzlicht shares Fiedler’s response to the outcome of the Reproducibility Project that only 25% of significant results in social psychology could be replicated (i.e., produced a p-value below .05).


August 31 at 9:46am

Klaus Fiedler has granted me permission to share a letter that he wrote to a reporter (Bruce Bowers) in response to the replication project. This letter contains Klaus's words only and the only part I edited was to remove his phone number. I thought this would be of interest to the group.

Dear Bruce:

Thanks for your email. You can call me tomorrow but I guess what I have to say is summarized in this email.

Before I try to tell it like it is, I ask you to please attend to my arguments, not just the final evaluations, which may appear unbalanced. So if you want to include my statement in your article, maybe along with my name, I would be happy not to detach my evaluative judgment from the arguments that in my opinion inevitably lead to my critical evaluation.

First of all I want to make it clear that I have been a big fan of properly conducted replication and validation studies for many years – long before the current hype of what one might call a shallow replication research program. Please note also that one of my own studies has been included in the present replication project; the original findings have been borne out more clearly than in the original study. So there is no self-referent motive for me to be overly critical.

However, I have to say that I am more than disappointed by the present report. In my view, such an expensive, time-consuming, and resource-intensive replication study, which can be expected to receive so much attention and to have such a strong impact on the field and on its public image, should live up (at least) to the same standards of scientific scrutiny as the studies that it evaluates. I’m afraid this is not the case, for the following reasons …

The rationale is to plot the effect size of replication results as a function of original results. Such a plot is necessarily subject to regression toward the mean. On a-priori-grounds, to the extent that the reliability of the original results is less than perfect, it can be expected that replication studies regress toward weaker effect sizes. This is very common knowledge. In a scholarly article one would try to compare the obtained effects to what can be expected from regression alone. The rule is simple and straightforward. Multiply the effect size of the original study (as a deviation score) with the reliability of the original test, and you get the expected replication results (in deviation scores) – as expected from regression alone. The informative question is to what extent the obtained results are weaker than the to-be-expected regressive results.

To be sure, the article’s muteness regarding regression is related to the fact that the reliability was not assessed. This is a huge source of weakness. It has been shown (in a nice recent article by Stanley & Spence, 2014, in PPS) that measurement error and sampling error alone will greatly reduce the replicability of empirical results, even when the hypothesis is completely correct. In order not to be fooled by statistical data, it is therefore of utmost importance to control for measurement error and sampling error. This is the lesson we took from Frank Schmidt (2010). It is also very common wisdom.

The failure to assess the reliability of the dependent measures greatly reduces the interpretation of the results. Some studies may use single measures to assess an effect whereas others may use multiple measures and thereby enhance the reliability, according to a principle well-known since Spearman & Brown. Thus, some of the replication failures may simply reflect the naïve reliance on single-item dependent measures. This is of course a weakness of the original studies, but a weakness different from non-replicability of the theoretically important effect. Indeed, contrary to the notion that researchers perfectly exploit their degrees of freedom and always come up with results that overestimate their true effect size, they often make naïve mistakes.

By the way, this failure to control for reliability might explain the apparent replication advantage of cognitive over social psychology. Social psychologists may simply often rely on singular measure, whereas cognitive psychologists use multi-trial designs resulting in much higher reliability.

The failure to consider reliability refers to the dependent measure. A similar failure to systematically include manipulation checks renders the independent variables equivocal. The so-called Duhem-Quine problem refers to the unwarranted assumption that some experimental manipulation can be equated with the theoretical variable. An independent variable can be operationalized in multiple ways. A manipulation that worked a few years ago need not work now, simply because no manipulation provides a plain manipulation of the theoretical variable proper. It is therefore essential to include a manipulation check, to make sure that the very premise of a study is met, namely a successful manipulation of the theoretical variable. Simply running the same operational procedure as years before is not sufficient, logically.

Last but not least, the sampling rule that underlies the selection of the 100 studies strikes me as hard to tolerate. Replication teams could select their studies from the first 20 articles published in a journal in a year (if I correctly understand this sentence). What might have motivated the replication teams’ choices? Could this procedure be sensitive to their attitude towards particular authors or their research? Could they have selected simply studies with a single dependent measure (implying low reliability)? – I do not want to be too suspicious here but, given the costs of the replication project and the human resources, does this sampling procedure represent the kind of high-quality science the whole project is striving for?

Across all replication studies, power is presupposed to be a pure function of the size of participant samples. The notion of a truly representative design in which tasks and stimuli and context conditions and a number of other boundary conditions are taken into account is not even mentioned (cf. Westfall & Judd).

Comments

Brent W. Roberts, 10:02am Sep 4
This comment just killed me “What might have motivated the replication teams’ choices? Could this procedure be sensitive to Their attitude towards Particular authors or Their research?” Once again, we have an eminent, high powered scientist impugning the integrity of, in this case, close to 300, mostly young researchers. What a great example to set.

Daniel Lakens, 12:32pm Sep 4
I think the regression to the mean comment just means: if you start from an extreme initial observation, there will be regression to the mean. He will agree there is publication bias – but just argues the reduction in effect sizes is nothing unexpected – we all agree with that, I think. I find his other points less convincing – there is data about researchers expectencies about whether a study would replicate. Don’t blabla, look at data. The problem with moderators is not big – original researchers OKéd the studies – if they can not think of moderators, we cannot be blamed for not including others checks. Finally, it looks like our power was good, if you examine the p-curve. Not in line with the idea we messed up. I wonder why, with all commentaries I’ve seen, no one takes the effort to pre-register their criticisms, and then just look at the studies and data, and let us know how much it really matters?

Felix Cheung, 2:11pm Sep 4
I don’t understand why the regression to mean cannot be understood in a more positive light when the “mean” in regression to the mean refers to the effect sizes of interests. If that’s the case, then regressing to mean would mean that we are providing more accurate estimates of the effect sizes.

Joachim Vandekerckhove, 2:15pm Aug 31
The dismissive “regression to the mean” argument either simply takes publication bias as given or assumes that all effect sizes are truly zero. Either of those assumptions makes for an interesting message to broadcast, I feel.

Michael Inzlicht, 2:54pm Aug 31
I think we all agree with this, Jeff, but as Simine suggested, if the study in question is a product of all the multifarious biases we’ve discussed and cannot be replicated (in an honest attempt), what basis do we have to change our beliefs at all? To me the RP – plus lots of other stuff that has come to light in the past few years – makes me doubt the evidentiary basis of many findings, and by extension, many theories/models. Theories are based on data…and it turns out that data might not be as solid as we thought.

Jeff Sherman, 2:58pm Aug 31
Michael, I don’t disagree. I think RP–plus was an important endeavor. I am sympathetic to Klaus’s lament that the operationalizations of the constructs weren’t directly validated in the replications.

Uli Schimmack, 11:15am Sep 1
This is another example that many psychologists are still trying to maintain the illusion that psychology doesn’t have a replicability problem.
A recurrent argument is that human behavior is complex and influenced by many factors that will produce variation in results across seemingly similar studies.
Even if this were true, it would not explain why all original studies find significant effects. If moderators can make effects appear or disappear, original studies should produce just as many non-significant results as replication studies. If psychologists were really serious about moderating factors, non-significant results would be highly important to understand under what conditions an effect does not occur. The publication of only significant results in psychology (since Sterling, 1959) shows that psychologists are not really serious about moderating factors and that moderators are only invoked post hoc to explain away failed replications of significant results.
Just like Klaus Fiedler’s illusory regression to the mean, these arguments are hollow and only reveal the motivated biases of their proponents to deny a fundamental problem in the way psychologists collect, analyze, and report their research findings.
If a 25% replication rate for social psychology is not enough to declare a crisis then psychology is really in a crisis and psychologists provide the best evidence for the validity of Freud’s theory of repression. Has Daniel Kahneman commented on the reproducibility-project results?

Garriy Shteynberg, 10:33pm Sep 7
Again, I agree that there is publication bias and that it matters even in a world where all H0 are false (as you show in your last comment). Now, do you see that in that very world, regression to the mean will still occur? Also, in the spirit of the dialogue, try to refrain from claiming what others do not know. I am sure you realize that making such truth claims on very little data is at best severely underpowered.

Uli Schimmack, 10:38pm Sep 7
Garriy Shteynberg Sorry, but I always said that regression to the mean occurs when there is selection bias, but without selection bias it will not occur. That is really the issue here, and I am not sure what point you are trying to make. We agree that studies were selected and that the low replication rate is a result of this selection and regression to the mean. If you have any other point to make, you have to make it clearer.

Malte Elson, 3:38am Sep 8
Garriy Shteynberg would you maybe try me instead? I followed your example of the perfect discipline with great predictions and without publication bias. What I haven’t figured out is what would cause regression to the mean to occur in only one direction (decreased effect sizes at the replication level). The predictions are equally great at both levels since they are exactly the same. Why would effect sizes in the original publications be systematically larger if there was no selection at that level?

Marc Halusic, 12:53pm Sep 1
Even if untold moderators affect the replicability of a study that describes a real effect, it would follow that any researcher who cannot specify the conditions under which an effect will replicate does not understand that effect well enough to interpret it in the discussion section.

Maxim Milyavsky, 11:16am Sep 3
I am not sure whether Klaus meant that regression to the mean by itself can explain the failure of replication, or regression to the mean given a selection bias. I think that without selection bias regression to the mean cannot count as an alternative explanation. If it could, every subsequent experiment would yield a smaller effect than the previous one, which sounds absurd. I assume that Klaus knows that. So, probably he admits that there was a selection bias. Maybe he just wanted to say – it’s nobody’s fault. Nobody played with data, people were just publishing effects that “worked”. Yet, what sounds puzzling to me is that he does not see any problem in this process.

– Mickey shared some of the responses with Klaus and posted Klaus’s responses to the comments. Several commentators tried to defend Klaus by stating that he would agree with the claim that selection for significance is necessary to see an overall decrease in effect sizes. However, Klaus Fiedler doubles down on the claim that this is not necessary, even though the implication would be that effect sizes shrink every time a study is replicated, which is “absurd” (Maxim Milyavsky), although even this absurd claim has been made (Schooler, 2011).

Michael Inzlicht, September 2 at 1:08pm

More from Klaus Fiedler. He has asked me to post a response to a sample of the replies I sent him. Again, this is unedited, directly copied and pasted from a note Klaus sent me. (Also not sure if I should post it here or in the other, much longer, conversation).

Having read the echo to my earlier comment on the Nosek report, I got the feeling that I should add some more clarifying remarks.

(1) With respect to my complaints about the complete failure to take regressiveness into account, some folks seem to suggest that this problem can be handled simply by increasing the power of the replication study and that power is a sole function of N, the number of participants. Both beliefs are mistaken. Statistical power is not just a function of N, but also depends on treating stimuli as a random factor (cf. recent papers by Westfall & Judd). Power is 1 – β, the probability that a theoretical hypothesis, which is true, will actually be borne out in a study. This probability not only depends on N. It also depends on the appropriateness of selected stimuli, task parameters, instructions, boundary conditions etc. Even with 1000 participants per cell, measurement and sampling error can be high, for instance, when a test includes weakly selected items, or not enough items. It is a cardinal mistake to reduce power to N.

(2) The only necessary and sufficient condition for regression (to the mean or toward less pronounced values) is a correlation less than one. This was nicely explained and proven by Furby (1973). We all “learned” that lesson in the first semester, but regression remains a counter-intuitive thing. When you plot effect sizes in the replication studies as a function of effect sizes in the original studies and the correlation between corresponding pairs is < 1, then there will be regression. The replication findings will be weaker than the original ones. One can refrain from assuming that the original findings have been over-estimations. One might represent the data the other way around, plotting the original results as a function of given effects in the replication studies, and one will also see regression. (Note in this connection that Etz’ Bayesian analysis of the replication project also identified quite a few replications that were “too strong”). For a nice illustration of this puzzling phenomenon, you may also want to read the Erev, Wallsten & Budescu (1994) paper, which shows both overconfidence and underconfidence in the same data array.

(3) I’m not saying that regression is easy to understand intuitively (Galton took many years to solve the puzzle). The very fact that people are easily fooled by regression is the reason why controlling for expected regression effects is standard in the kind of research published here. It is almost a prototypical example of what Don Campbell (1996) had in mind when he tried to warn the community against drawing erroneous inferences.

(4) I hope it is needless to repeat that controlling for the reliability of the original studies is essential, because variation in reliability affects the degree of regressiveness. It is particularly important to avoid premature interpretations of seemingly different replication results (e.g., for cognitive and social psychology) that could reflect nothing but unequal reliability.

(5) My critical remark that the replication studies did not include manipulation checks was also met with some spontaneous defensive reactions. Please note that the goal to run so-called “exact” replications (I refrain from discussing this notion here) does not prevent replication researchers from including additional groups supposed to estimate the effectiveness of a manipulation under the current conditions. (Needless to add that a manipulation check must be more than a compliant repetition of the instruction).

(6) Most importantly perhaps, I would like to reinforce my sincere opinion that methodological and ethical norms have to be applied to such an expensive, pretentious and potentially very consequential project even more carefully and strictly than they are applied to ordinary studies. Hardly any one of the 100 target studies could have a similarly strong impact, and call for a similar degree of responsibility, as the present replication project.

Kind regards, Klaus

This response elicited an even more heated discussion. Unfortunately, only some of these comments were mailed to my inbox. I must have made a very negative comment about Klaus Fiedler that elicited a response by Jeff Sherman, the moderator of the group. Eventually, I was banned from the group and created the Psychological Methods Discussion Group, which became the main group for critical discussion of psychological science.

Uli Schimmack, 2:36pm Sep 2
Jeff Sherman The comparison extends to the official (German-language) statement regarding the results of the OSF-replication project. It does not mention that publication bias is at least a factor that contributed to the outcome, nor does it mention any initiatives to improve the way psychologists conduct their research. It would be ironic if a social psychologist objected to a comparison that is based on general principles of social behavior.
I think I don’t have to mention that the United States of America pride themselves on freedom of expression that even allows Nazis to publish their propaganda, which German law does not allow. In contrast, censorship was used by socialist Germany to stay in power. So, please feel free to censor my post and send me into Psychological Methods exile.

Jeff Sherman, 2:49pm Sep 2
Uli Schimmack I am not censoring the ideas you wish to express. I am saying that opinions expressed on this page must be expressed respectfully.
Calling this a freedom of speech issue is a red herring. Ironic, too, given that one impact of trolling and bullying is to cause others to self-censor.
I am working on a policy statement. If you find the burden unbearable, you can choose to not participate.

Uli Schimmack, 2:53pm Sep 2
Jeff Sherman Klaus is not even part of this. So, how am I bullying him? Plus, I don’t think Klaus is easily intimidated by my comment. And, as a social psychologist, how do you explain that Klaus doubled down when every comment pointed out that he ignores the fact that regression to the mean can only produce a decrease in the average if the original sample was selected to be above the mean?

This discussion led to a letter to the DGPs board by Moritz Heene that expressed outrage about the whitewashing of the replication results in their official statement.

From: Moritz Heene
To: Andrea Abele-Brehm, Mario Gollwitzer, & Fritz Strack
Subject: DGPs statement on the replication project
Date: Wed, 02 Sep 2015

[The letter was written in German; the following is an English translation.]

Dear members of the DGPs board,

First of all, thank you for your effort to make the results of the OSF replication project clearer to the public. In view of this DGPs statement, however, I would like to express my personal objection to it: as a member of the DGPs, I do not see a balanced perspective expressed in this statement in any way; on the contrary, I find it very one-sided. I rather regard this statement as a euphemistic treatment of the replication problem in psychology, to put it mildly; I am disappointed by it and had expected more.
My points of criticism of your statement:

1. On the argument that 68% of the studies were replicated: The test behind this figure checks whether the replication effect lies within the confidence interval around the original effect, i.e., whether the two are significantly different from each other, according to the authors’ logic. Let us generously set aside that this is not a test of the difference between the effect sizes, since the confidence interval is placed around the original observed effect, not around the difference. More important is that this is a poor measure of replicability, because the original effects are upward biased (this can be seen in the original paper as well), and let us not forget publication bias (see the density distribution of the p-values in the original paper). Assuming that the original effect sizes are the population effect sizes is a truly heroic assumption, especially given the positive bias of the original effects. Incidentally: an open letter by Klaus Fiedler published on Facebook argues that regression to the mean produced, and can explain, the on average lower effect sizes in the OSF project. This argument may be partially correct, but it implies that the original effects were extreme (i.e., biased, because they were selectively published), since that is precisely the characteristic of this regression effect: results that were extreme in a first measurement “tend” toward the mean in a second measurement. The fact that the original effects show a clear positive bias is ignored, or not even mentioned, in your statement.

Incidentally, the 68% replicability argument has also been openly criticized in a similar way by the lead author in response to your statement:

https://twitter.com/BrianNosek/status/639049414947024896

In short: picking out precisely this statistic from the OSF study as support, in order to tell the public that everything in psychology is basically fine, is in my view “cherry picking” of results.

2. The moderator argument is ultimately untenable because, first, it was tested intensively, in particular in OSF Project 3. The result is summarized, among other places, here:

https://hardsci.wordpress.com/2015/09/02/moderator-interpretations-of-the-reproducibility-project/

See, among others:
In Many Labs 1 and Many Labs 3 (which I reviewed here), different labs followed standardized replication protocols for a series of experiments. In principle, different experimenters, different lab settings, and different subject populations could have led to differences between lab sites. But in analyses of heterogeneity across sites, that was not the result. In ML1, some of the very large and obvious effects (like anchoring) varied a bit in just how large they were (from “kinda big” to “holy shit”). Across both projects, more modest effects were quite consistent. Nowhere was there evidence that interesting effects wink in and out of detectability for substantive reasons linked to sample or setting. A longer summary can be found here:

https://hardsci.wordpress.com/2015/03/12/an-open-review-of-many-labs-3-much-to-learn

The authors put the interpretation so well that I’ll quote them at length here [emphasis added]:
A common explanation for the challenges of replicating results across samples and settings is that there are many seen and unseen moderators that qualify the detectability of effects (Cesario, 2014). As such, when differences are observed across study administrations, it is easy to default to the assumption that it must be due to features differing between the samples and settings. Besides time of semester, we tested whether the site of data collection, and the order of administration during the study session moderated the effects. None of these had a substantial impact on any of the investigated effects. This observation is consistent with the first “Many Labs” study (Klein et al., 2014) and is the focus of the second (Klein et al., 2015). The present study provides further evidence against sample and setting differences being a default explanation for variation in replicability. That is not to deny that such variation occurs, just that direct evidence for a given effect is needed to demonstrate that it is a viable explanation.
Second, you write in your statement: “Such findings rather show that psychological processes are often context-dependent and that their generalizability needs to be investigated further. The replication of an American study may yield different results when it is conducted in Germany or in Italy (or vice versa). Similarly, different characteristics of the sample (gender composition, age, level of education, etc.) can affect the result. This context dependence is not a sign of a lack of replicability, but rather a sign of the complexity of psychological phenomena and processes.”
No, that is precisely what these new findings do not show, because this is a (post-hoc) interpretation that is not supported by the moderators assessed in the new OSF project, since such moderator analyses were not even carried out. Moreover, the postulated context dependence was not found in OSF Project #3. What was found as a source of variation between labs was simply sampling variation, of the kind one has to expect in statistics anyway. So I see no empirical basis at all for your claim, even though such a basis should exist in a science that calls itself empirical.
What I clearly miss as a concluding statement is that psychology (and social psychology in particular) should no longer accept selectively published and “underpowered” studies in the future. That would have hit the core of the problem somewhat better.
Kind regards,
Moritz Heene

Moritz Heene received the following response from one of the DGPs board members.

From: Mario Gollwitzer
To: Moritz Heene
Subject: Re: DGPs statement on the replication project
Date: Thu, 03 Sep 2015 10:19:28 +0200

Dear Moritz,

many thanks for your email; it is one of many responses we have received to our press release from Monday, and we think it is very good that it has apparently sparked a discussion among the DGPs membership. We believe that this discussion should be conducted openly; we have therefore decided to set up a kind of discussion forum on our DGPs homepage for our press release (and the Science study and the whole replication project). We are currently working on building the page. I would welcome it if you took part there as well, gladly with your critical stance toward our press release.

I can well understand your arguments, and I agree with you that the number “68%” does not reflect a “replication rate”. That was a statement open to misunderstanding.

Apart from that, however, our goal with this press release was to add something constructive to, and set something constructive against, the negative, partly gloating and destructive reactions of many media outlets to the Science study. We by no means wanted to “sugarcoat” the results of the study or to spread a message along the lines of “all is well, business as usual”! Rather, we wanted to argue that replication attempts like this offer a chance to gain knowledge, a chance that should be used. That is the constructive message that we would also like to see represented a bit more strongly in the media.

Unlike you, however, I am convinced that it is entirely possible that the differences between an original study and its replications come about through an (unknown) set of (partly known, partly unknown) moderator variables (and their interactions). “Sampling variation”, too, is nothing other than a collective term for such moderator effects. Some of these effects are central to gaining knowledge about a psychological phenomenon, others are not. The task is to describe and explain the central effects better. In this I also see a value of replications, especially of conceptual replications.

Apart from that, however, I completely agree with you that one cannot rule out that some of the non-replicable but published effects (not only in social psychology, by the way, but in all disciplines) are false positives, for which there are a number of reasons (selective publishing, questionable analysis practices, etc.) that are highly problematic. These issues are, of course, being debated intensely elsewhere. In our press release, however, we wanted to set this discussion aside for the time being and focus specifically on the new Science study.

Many thanks again for your email. Such reactions are an important mirror of our work.

Warm regards, Mario

After the DGPs created a discussion forum, Klaus Fiedler, Moritz Heene and I shared our exchange of views openly on this site. The website is no longer available, but Moritz Heene saved a copy. He also shared our contribution on The Winnower.

RESPONSE TO FIEDLER’S POST ON THE REPLICATION
We would like to address the two main arguments in Dr. Fiedler’s post on https://www.dgps.de/index.php?id=2000735

1) that the notably lower average effect sizes in the OSF-project are a statistical artifact of regression to the mean,

2) that low reliability contributed to the lower effect sizes in the replication studies.

Response to 1): As noted in Heene’s previous post, Fiedler’s regression to the mean argument (results that were extreme in a first assessment tend to be closer to the mean in a second assessment) implicitly assumes that the original effects were biased; that is, they are extreme estimates of population effect sizes because they were selected for publication. However, Fiedler does not mention the selection of original effects, which leads to a false interpretation of the OSF-results in Fiedler’s commentary:

“(2) The only necessary and sufficient condition for regression (to the mean or toward less pronounced values) is a correlation less than one. … One can refrain from assuming that the original findings have been over-estimations.” (Fiedler)

It is NOT possible to avoid the assumption that original results are inflated estimates because selective publication of results is necessary to account for the notable reduction in observed effect sizes.

a) Fiedler is mistaken when he cites Furby (1973) as evidence that regression to the mean can occur without selection. “The only necessary and sufficient condition for regression (to the mean or toward less pronounced values) is a correlation less than one. This was nicely explained and proven by Furby (1973)” (Fiedler). It is noteworthy that Furby (1973) explicitly mentions a selection above or below the population mean in the example: “Now let us choose a certain aggression level at Time 1 (any level other than the mean)”.

The math behind regression to the mean further illustrates this point. The expected amount of regression to the mean is defined as (1 – r)(mu – M), where r = correlation between the first and second measurement, mu = population mean, and M = mean of the selected group (sample at time 1). For example, if r = .80 (thus, less than 1, as assumed by Fiedler) and the observed mean in the selected group (M) equals the population mean (mu) (e.g., M = .40, mu = .40, and mu – M = .40 – .40 = 0), no regression to the mean will occur because (1 – .80)(.40 – .40) = .20*0 = 0. Consequently, a correlation less than 1 is not a necessary and sufficient condition for regression to the mean. The effect occurs only if the correlation is less than 1 and the sample mean differs from the population mean. [Strictly, the shrinkage of a group mean is governed by the raw-score regression slope, r times the ratio of the standard deviations at time 2 and time 1: if the second set of estimates varies less than the first (e.g., because of larger samples), the mean of a selected group can move toward the population mean even when the correlation is 1, while individual scores maintain their position relative to other scores.]

b) The regression to the mean effect can be positive or negative. If M < mu and r < 1, the second observations would be higher than the first observations, and the trend towards the mean would be positive. On the other hand, if M > mu and r < 1, the regression effect is negative. In the OSF-project, the regression effect was negative, because the average effect size in the replication studies was lower than the average effect size in the original studies. This implies that the observed effects in the original studies overestimated the population effect size (M > mu), which is consistent with publication bias (and possibly p-hacking).

Thus, the lower effect sizes in the replication studies can be explained as a result of publication bias and regression to the mean. The OSF-results make it possible to estimate how much publication bias inflates observed effect sizes in original studies. We calculated that for social psychology the average effect size fell from Cohen’s d = .6 to d = .2. This implies that the original estimates were inflated by 200% (three times as large as the replication estimate). It is therefore not surprising that the replication studies produced so few significant results, because the increase in sample size did not compensate for the large decrease in effect sizes.
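This logic is easy to check with a small simulation. The sketch below (Python; the true effect size, sampling error, and selection threshold are made-up illustrative values, not estimates from the OSF data) shows that when original and replication estimates come from the same population, their means agree, and that a drop in the mean appears only once the original estimates are selected for being large (i.e., significant):

```python
import numpy as np

rng = np.random.default_rng(1)

n_studies = 10_000
true_d = 0.2   # assumed common population effect size (illustrative)
se = 0.2       # assumed sampling error of each study's estimate (illustrative)

# Effect size estimates from "original" studies and their exact "replications"
original = true_d + rng.normal(0, se, n_studies)
replication = true_d + rng.normal(0, se, n_studies)

# Without selection: both means estimate the population effect size (~0.20)
print(original.mean().round(2), replication.mean().round(2))

# With selection for significance (crudely: only large original estimates are "published")
published = original > 0.4
print(original[published].mean().round(2))     # inflated, roughly 0.5
print(replication[published].mean().round(2))  # regresses back to roughly 0.20
```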

Regarding Fiedler’s second point 2)

In a regression analysis, the observed regression coefficient (b) for an observed measure with measurement error is a function of the true relationship (bT) and an inverse function of the amount of measurement error (1 – error = reliability; Rel(X)):

b = bT * Rel(X)

(Interested readers can obtain the mathematical proof from Dr. Heene).

The formula implies that an observed regression coefficient (and other observed effect sizes) is always smaller than the true coefficient that could have been obtained with a perfectly reliable measure, when the reliability of the measure is less than 1. As noted by Dr. Fiedler, unreliability of measures will reduce the statistical power to obtain a statistically significant result. This statistical argument cannot explain the reduction in effect sizes in the replication studies because unreliability has the same influence on the outcome in the original studies and the replication studies. In short, the unreliability argument does not provide a valid explanation for the low success rate in the OSF-replication project.
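The attenuation formula above can be checked with a short simulation as well; the true coefficient and reliability below are arbitrary illustrative values. The point relevant to the argument is that the same attenuation factor applies to an original study and to its exact replication, so unreliability by itself cannot produce a mean difference between the two:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

b_true = 0.5                      # assumed true regression coefficient (illustrative)
x_true = rng.normal(size=n)       # latent predictor with variance 1
y = b_true * x_true + rng.normal(size=n)

error_var = 2 / 3                 # measurement error variance chosen to give Rel(X) = 0.6
x_obs = x_true + rng.normal(0, np.sqrt(error_var), size=n)
rel_x = 1 / (1 + error_var)       # Rel(X) = var(true score) / var(observed score)

b_obs = np.cov(x_obs, y)[0, 1] / np.var(x_obs)
print(round(b_obs, 2), round(b_true * rel_x, 2))  # both ~0.30, i.e., b = bT * Rel(X)
```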

REFERENCES
Furby, L. (1973). Interpreting regression toward the mean in developmental research. Developmental Psychology, 8(2), 172-179. doi:10.1037/h0034145

On September 5, Klaus Fiedler emailed me to start a personal discussion over email.

From: klaus.fiedler [klaus.fiedler@psychologie.uni-heidelberg.de]
Sent: September-05-15 7:17 AM
To: Uli Schimmack; kf@psychologie.uni-heidelberg.de
Subject: iscon gossip

Dear Uli … in German … dear Uli,

You may know that I am not registered on Facebook, but I occasionally get notes from the chat sent to me by others. You are the only one I am writing to briefly. You had written that my comments were wrong and that I therefore no longer deserve any respect.

You are a methodologically motivated and skilled colleague, and I would therefore be very grateful if you could tell me in what way my points are incorrect. What is wrong:

— that the regression trap exists?
— that a state-of-the-art study of the form retest = f(test) has to control for regression?
— that regression is a function of reliability?
— that a high participant N alone by no means fixes this problem?
— that a missing manipulation check undermines the central premise that the independent variable was established at all?
— that a failure to control for measurement + sampling error undermines the interpretation of the results?

Or is the point that scientific scrutiny no longer counts when “young people” are fighting for a “good cause”?

Sorry, the last question drifts a bit into the polemical. That was not how it was meant. I really want to know why I am wrong; then I would also be happy to set the record straight. After all, I did not claim that I have empirical data that would illuminate the comparison of cognitive and social psychology (although it is true that this comparison can only be made if one controls for reliability and the effectiveness of the manipulations). What motivates me is merely the goal that meta-science, too (and especially meta-science), should be subject to the same strict standards as the research it evaluates (and often carelessly damages).

As for social psychology, you have surely noticed that I am one of its critics too … Perhaps we can talk about that sometime …

Best wishes from Heidelberg, Klaus

I responded to this email and asked him directly to comment on selection bias as a reasonable explanation for the low replicability of social psychology results.

Dear Klaus Fiedler,

Moritz Heene and I have written a response to your comments posted on the DGPS website, which is waiting for moderation.
I cc Moritz so that he can send you the response (in German), but I will try to answer your question myself.

First, I don’t think it was good that Mickey posted your comments. I think it would have been better to communicate directly with you and have a chance to discuss these issues in an exchange of arguments. It is also unfortunate that I mixed my response to the official DGPs statement with your comments. I see some similarities, but you expressed a personal opinion and did not use the authority of an official position to speak for all psychologists when many psychologists disagree with the statement, which led to the post-hoc creation of a discussion forum to find out about members’ opinions on this issue.

Now let me answer your question. First, I would like to clarify that we are trying to answer the same question. To me the most important question is why the reproducibility of published results in psychology journals is so low (it is only 8% for social psychology, see my post https://replicationindex.wordpress.com/2015/08/26/predictions-about-replicat
ion-success-in-osf-reproducibility-project/ )?

One answer to this question is publication bias. This argument has been made since Sterling (1959). Cohen (1962) estimated the replication rate at 60% based on his analysis of typical effect sizes and sample sizes in the Journal of Abnormal and Social Psychology (now JPSP). The 60% estimate has been replicated by Sedlmeier and Gigerenzer (1989). So, with this figure in mind we could have expected that 60 out of 100 randomly selected results in JPSP would replicate. However, the actual success rate for JPSP is much lower. How can we explain this?

For the past five years I have been working on a better method to estimate post-hoc power, starting with my Schimmack (2012) Psychological Methods paper, followed by publications on my R-Index website. Similar work has been conducted by Simonsohn (p-curve) and by Wicherts and colleagues (p-uniform). The problem with the 60% estimate is that it uses reported effect sizes, which are inflated. After correcting for this inflation, the estimated power for social psychology studies in the OSF-project is only 35%. This still does not explain why only 8% were replicated, and I think it is an interesting question how much moderators or mistakes in the replication studies explain this discrepancy. However, a replication rate as low as 35% is entirely predicted from the published results after taking power and publication bias into account.

In sum, it is well established and known that selection of significant results distorts the evidence in the published literature and that this creates a discrepancy between the published success rate (95%) and the replication rate (let’s say less than 50% to be conservative). I would be surprised if you would disagree with my argument that (a) publication bias is present and (b) that publication bias at least partially contributes to the low rate of successful replications in the OSF-project.

A few days later, I sent a reminder email.

Dear Klaus Fiedler,

I hope you received my email from Saturday in reply to your email “iscon gossip”. It would be nice if you could confirm that you received it and let me know whether you are planning to respond to it.

Best regards,
Uli Schimmack

Klaus Fiedler responds without answering my question about the fact that regression to the mean can only explain a decrease in the mean effect sizes if the original values were inflated by selection for significance.

Hi:

as soon as my time permits, I will have a look. Just a general remark in response to your email: I do not understand what argument applies to my critical evaluation of the Nosek report. What you are telling me in the email does not apply to my critique.

Or do you contest that

  • a state-of the art study of retest = f(original test) has to tackle the regression beast
  • reliability of the dependent measure has to be controlled
  • manipulation check is crucial to assess the effective variation of the independent variable
  • the sampling of studies was suboptimal

If you disagree, I wonder if there is any common ground in scientific methodology.

I am not sure if I want to contribute to Facebook debates … As you can see, the distance from a scientific argument to personal attacks is so short that I do not believe in the value of such a forum.

Kind regards, Klaus

P.S. If I have a chance to read what you have posted, I may send a reply to the DGPs. By the way, I just sent my comments to Andrea Abele-Brehm. I did not ask her to publicize it. But that’s OK.

As in a chess game, I am pressing my advantage – Klaus Fiedler is clearly alone and wrong with his immaculate regression argument – in a follow-up email.

Dear Klaus Fiedler,

I am waiting for a longer response from you, but to answer your question: I find it hard to see how my comments are irrelevant, as they challenge direct quotes from your response.

My main concern is that you appear to neglect the fact that regression to the mean can only occur when selection occurred in the original set of studies.

Moritz Heene and I responded to this claim and find that it is invalid.  If the original studies were not a selected set of studies, the average observed effect size would be an estimate of the average population effect size, and there would be no reason to expect a dramatic decrease in effect sizes in the OSF replication studies.  Let’s just focus on this crucial point.

You can either maintain that selection is not necessary and try to explain how regression to the mean can occur without selection or you can concede that selection is necessary and explain how the OSF replication study should have taken selection into account.  At a minimum, it would be interesting to hear your response to our quote of Furby (1973) that shows he assumed selection, while you cite Furby as evidence that selection is not necessary.

Although we may not be able to settle all disputes, we should be able to determine whether Furby assumed selection or not.

Here are my specific responses to your questions. 

– a state-of-the-art study of retest = f(original test) has to tackle the regression beast   [we can say that it tackled it by examining how much selection contributed to the original results, by seeing how much the means regressed towards a lower mean of population effect sizes.]

Result:  there was a lot of selection and a lot of regression.

– reliability of the dependent measure has to be controlled

in a project that aims to replicate original studies exactly, reliability is determined by the methods of the original study

– manipulation check is crucial to assess the effective variation of the independent variable

sure, we can question how good the replication studies were, but adding additional manipulation checks might also introduce concerns that the study is not an exact replication.  Nobody is claiming that the replication studies are conclusive, but no study can assure that it was a perfect study.

– the sampling of studies was suboptimal

how so?  The year was selected at random.  To take the first studies in a year was also random.  Moreover, it is possible to examine whether the results are representative of other studies in the same journals, and they are; see my blog.

You may decide that my responses are not satisfactory, but I would hope that you answer at least one of my questions: Do you maintain that the OSF-results could have been obtained without selection of results that overestimate the true population effect sizes (a lot)?

Sincerely,

Uli Schimmack

Moritz Heene comments.

Thanks, Uli! Don’t let them get away with tactically ignoring these facts.
BTW, since we share the same scientific rigor, as far as I can see, we could ponder a possible collaboration study. Just an idea. [This led to the statistical examination of Kahneman’s book Thinking, Fast and Slow]

Regards, Moritz

Too busy to really think about the possibility that he might have been wrong, Fiedler sends a terse response.

Klaus Fiedler

Very briefly … in a mad rush this morning: This is not true. A necessary and sufficient condition for regression is r < 1. So if the correlation between the original results and the replications is less than unity, there will be regression. Draw a scatter plot and you will easily see. An appropriate reference is Furby (1973 or 1974).

I try to clarify the issue in another attempt.

Dear Klaus Fiedler,

The question is what you mean by regression. We are talking about the mean at time 1 and time 2.

Of course, there will be regression of individual scores, but we are interested in the mean effect size in social psychology (which also determines power and percentage of significant results given equal N).

It is simply NOT true that the mean will change systematically unless there is systematic selection of observations.

As regression to the mean is defined by (1 – r)*(mu – M), the formula implies that a selection effect (mu – M not equal to 0) is necessary. Otherwise the whole term becomes 0.

There are three ways to explain mean differences between two sets of exact replication studies: (1) the original set was selected to produce significant results; (2) the replication studies are crappy and failed to reproduce the same conditions; (3) random sampling error (which can be excluded because the difference in the OSF project is highly significant).

In the case of the OSF replication studies, selection occurred because the published results were selected to be significant from a larger set of results with non-significant results.

If you see another explanation, it would be really helpful if you would elaborate on your theory.

Sincerely,
Uli Schimmack

Moritz Heene joins the email exchange and makes a clear case that Fiedler’s claims are statistically wrong.

Dear Klaus Fiedler, dear Uli,

Just to add another clarification:

Once again, Furby (1973, p.173, see attached file) explicitly mentioned selection: “Now let us choose a certain aggression level at Time 1 (any level other than the mean) and call it x’ “.

Furthermore, regression to the mean is defined by (1- r)*(mu – M). See Shepard and Finison (1983, p.308, eq. [1]): “The term in square brackets, the product of two factors, is the estimated reduction in BP [blood pressure] due to regression.”

Now let us fix terms:

Definition of necessity and sufficiency

Necessity:
~p –> ~q , with “~” denoting negation

So, if r is not smaller than 1, then regression to the mean does not occur.

This is true as can be verified by the formula.

Sufficiency:
p –> q

So, if r is smaller than 1, then regression to the mean does occur. This is not true, as can be verified by the formula, as explained in our reply on https://www.dgps.de/index.php?id=2000735#c2001225 and in Ulrich’s previous email.

Sincerely,

Moritz Heene

I sent another email to Klaus to see whether he is going to respond.

Dear Dr. Fiedler,

May I still expect an answer from you, or should I assume that you have decided not to respond to my request?

Best regards, Uli Schimmack

Klaus Fiedler does respond.

Dear Ullrich:

Yes, I was indeed very, very busy over two weeks, working for the Humboldt foundation, for two conferences where I had to play leading roles, the Leopoldina Academy, and many other urgent jobs. Sorry but this is simply so.
I now received your email reminder to send you my comments on what you and Moritz Heene have written. However, it looks like you have already committed yourself publicly (I was sent this by colleagues who are busy on facebook):
Fiedler was quick to criticize the OSF-project and Brian Nosek for making the mistake to ignore the well-known regression to the mean effect. This silly argument ignores that regression to the mean requires that the initial scores are selected, which is exactly the point of the OSF-replication studies.

Look, this passage shows that there is apparently a deep misunderstanding about the “silly argument”. Let me briefly try to explain once more what my critique of the Science article (not Brian Nosek personally – this is not my style) referred to.
At the statistical level, I was simply presupposing that there is common ground on the premise that regressiveness is ubiquitous; it is not contingent on selected initial scores. Take a scatter plot of 100 bi-variate points (jointly distributed in X and Y). If r(X,Y) < 1 (disregarding sign), regressing Y on X will result in a regression slope less than 1. The variance of predicted Y scores will be reduced. I very much hope we all agree that this holds for every correlation, not just those in which X is selected. If you don’t believe it, I can easily demonstrate it with random (i.e., non-selective) vectors x and y.
Across the entire set of data pairs, large values of X will be underestimated in Y, and small values of X will be overestimated. By analogy, large original findings can be expected to be much smaller in the replication. However, when we regress X on Y, we can also expect to see that large Y scores (i.e., strong replication effects) have been weaker in the original. The Bayes factors reported by Alexander Etz in his “Bayesian reproducibility project”, although not explicit about reverse regression, strongly suggest that there are indeed quite a few cases in which replication results have been stronger than the original ones. Etz’ analysis, which nicely illustrates how a much more informative and scientifically better analysis than the one provided by Nosek might look, also reinforces my point that the report published in Science is very weak. By the way, the conclusions are markedly different from Nosek’s, showing that most replication studies were equivocal. The link (that you have certainly found yourself) is provided below.

We have known since Rulon (1941 or so) and even since Galton (1886 or so) that regression is a tricky thing, and here I get to the normative (as opposed to the statistical, tautological) point of my critique, which is based on the recommendations of such people as Don Campbell, Daniel Kahneman & Amos Tversky, Ido Erev, Tom Wallsten & David Budescu and many others, who have made it clear that the interpretation of retesting or replication studies will be premature and often mistaken if one does not take the vicissitudes of regression into account. A very nice historical example is Erev, Wallsten & Budescu’s 1994 Psych. Review article on overconfidence. They make it clear that you find very strong evidence for both overconfidence and underconfidence in the same data array when you regress either accuracy on confidence or confidence on accuracy, respectively. Another wonderful demonstration is Moore and Small’s 2008 Psych. Review analysis of several types of self-serving biases.

So, while my statistical point is analytically true (because regression slope with a single predictor is always < 1; I know there can be suppressor effects with slopes > 1 in multiple regression), my normative point is also well motivated. I wonder if the audience of your Internet allusion to my “silly argument” has a sufficient understanding of the “regression trap” so that, as you write:

Everybody can make up their own mind and decide where they want to stand, but the choices are pretty clear. You can follow Fiedler, Strack, Baumeister, Gilbert, Bargh and continue with business as usual or you can change. History will tell what the right choice will be.

By the way, why do you put me in the same pigeonhole as Fritz, Roy, Dan, and John? The role I am playing is completely different, and it definitely does not aim at business as usual. My very comment on the Nosek article is driven by my deep concerns about the lack of scientific scrutiny in such a prominent journal, in which there is apparently no state-of-the-art quality control. A replication project is the canonical case of a scientific interpretation that strongly calls for awareness of the regression trap. That is, the results are only informative if one takes into account what shrinkage of strong effects could be expected by regression alone. Regressiveness imposes an upper limit on the possible replication success, which ought to be considered as a baseline for the presentation of the replication results.

To do that, it is essential to control for reliability. (I know that the reliability of individual scores within a study is not the same as the reliability of the aggregate study results, but they are of course related). I also continue to believe, strongly, that a good replication project ought to control for the successful induction of the independent variable, as evident in a manipulation check (maybe in an extra group), and that the sampling of the 100 studies itself was suboptimal. If Brian Nosek (or others) come up with a convincing interpretation of this replication project, then it is fine. However, the present analysis is definitely not convincing. It is rather a symptom of shallow science.

So, as you can see, the comments that you and Moritz Heene have sent me do not really affect these considerations. And, because there is obviously no common ground between the two of us, not even about the simplest statistical constraints, I have decided not to engage in a public debate with you. I’m afraid hardly anybody in this Facebook circle will really invest time and work to read the literature necessary to judge the consequences of the regression trap, in order to make an informed judgment. And I do not want to nourish the malicious joy of an audience that apparently likes personal insults and attacks, detached from scientific arguments.

Kind regards, Klaus

P.S. As you can see, I CC this email to myself and to Joachim Krueger, who spontaneously sent me a similar note on the Nosek article and the regression trap.

http://scholarlycommons.law.northwestern.edu/cgi/viewcontent.cgi?article=7482&context=jclc

I made another attempt to talk about selection bias and ended pretty much with a simple yes/no question, like a prosecutor questioning a hostile witness.

Dear Klaus,

I don’t understand why we cannot even agree about the question that regression to the mean is supposed to answer.  

Moritz Heene and I are talking about the mean difference in effect sizes (the intercept, not the slope, in a regression).  According to the Science article, the effect sizes in the replication studies were, on average, 50% lower than the effect sizes in the original studies. My own analysis for social psychology shows a drop from d = .6 to d = .2, which suggests that results published in original articles are inflated by 200%.   Do you believe that regression to the mean can explain this finding?  Again, this is not a question about the slope, so please try to provide an explanation that can account for mean differences in effect sizes.

Of course, you can just say that we know that a published significant result is inflated by publication bias.  After all, power is never 100%, so if you select 100% significant results for publication, you cannot expect 100% successful replications.  The percentage that you can expect is determined by the true power of the set of studies (this has nothing to do with regression to the mean; it is simply power + publication bias).   However, the OSF-reproducibility project did take power into account and increased sample sizes to address the problem. They are also aware that the replication studies will not produce 100% successes if the replication studies were planned with 90% power.

The problem that I see with the OSF-project is that they were naïve to use the observed effect sizes to conduct their power analyses. As these effect sizes were strongly inflated by publication bias, the true power was much lower than they thought it would be.  For social psychology, I calculated the true power of the original studies to be only 35%.  Increasing sample sizes from 90 to 120 does not make much of a difference with power this low.   If your point is simply to say that the replication studies were underpowered to reject the null-hypothesis, I agree with you.  But the reason for the low power is that reported results in the literature are not credible and strongly influenced by bias.  Published effect sizes in social psychology are, on average, 1/3 real and 2/3 bias.  Good luck finding the false positive results with evidence like this.

Do you disagree with any of my arguments about power,  publication bias, and the implication that social psychological results lack credibility?  

Best regards,

Uli

Klaus Fiedler’s response continues to evade the topic of selection bias, which undermines the credibility of published results with a replication rate of 25%, but he acknowledges for the first time that regression works in both directions and cannot explain mean changes without selection bias. A small simulation (below) makes this point concrete; Fiedler’s reply follows after it.
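Here is a minimal simulation sketch in Python (numpy/scipy). The true effect of d = .2, the per-group sample size of 40, and the number of simulated studies are illustrative assumptions, not the actual OSC data: without selection, the mean effect size in replications equals the mean in originals; once only significant originals are "published", their mean is inflated and the unselected replications fall back to the true value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n_per_group, n_studies = 0.2, 40, 10000   # illustrative assumptions

def run_study():
    """Simulate one two-group study; return the observed effect size and p-value."""
    treat = rng.normal(true_d, 1.0, n_per_group)
    ctrl = rng.normal(0.0, 1.0, n_per_group)
    t, p = stats.ttest_ind(treat, ctrl)
    d_obs = treat.mean() - ctrl.mean()             # ~ standardized d, since the population SD is 1
    return d_obs, p

originals = np.array([run_study() for _ in range(n_studies)])
replications = np.array([run_study() for _ in range(n_studies)])

# Regression alone (no selection): the mean effect size does not shrink.
print("mean d, all originals:    %.2f" % originals[:, 0].mean())
print("mean d, all replications: %.2f" % replications[:, 0].mean())

# Selection for significance: only 'published' originals (p < .05) get replicated.
published = originals[:, 1] < .05
print("mean d, significant originals: %.2f" % originals[published, 0].mean())
print("mean d, their replications:    %.2f" % replications[published, 0].mean())
```

The selected originals come out strongly inflated while their replications land near the true value; plotting the same data the other way around (regression "in both directions") changes nothing about these means.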

Dear Uli, Moritz and Krueger:

I’m afraid it’s getting very basic now … we are talking about problems which are not really there … very briefly, just for the sake of politeness

First, as already clarified in my letter to Uli yesterday, nobody will come to doubt that every correlation < 1 will produce regression in both directions. The scatter plot does not have to be somehow selected. Let’s talk about (or simulate) a bi-variate random sample. Given r < 1, if you plot Y as a function of X (i.e., “given” X values), the regression curve will have a slope < 1, that is, Y values corresponding to high X values will be smaller and Y values corresponding to low X values will be higher. In one word, the variance in Y predictions (in what can be expected in Y) will shrink. If you regress X on Y, the opposite will be the case in the same data set. That’s the truism that I am referring to.

Of course, regression is always a conditional phenomenon. Assuming a regression of Y on X: If X is (very) high, the predicted Y analogue is (much) lower. If X is (very) low, the predicted Y analogue is (much) higher. But this conditional IF phrase does not imply any selectivity. The entire sample is drawn randomly. By plotting Y as a function of given X levels (contaminated with error and unreliability), you conditionalize Y values on (too) high or (too) low X values. But this is always the case with regression.

If I correctly understand the point, you simply equate the term “selective” with “conditional on” or “given”. But all this is common sense, or isn’t it. If you believe you have found a mathematical or Monte-Carlo proof that a correlation (in a bivariate distribution) is 1 and there is no regression (in the scatter plot), then you can probably make a very surprising contribution to statistics and numerical mathematics.

Of course, regression is a multiplicative function of unreliability and extremity. So points have to be extreme to be regressive. But I am talking about the entire distribution …

Best, Klaus

… who is now going back to work, sorry.

At this point, Moritz Heene is willing to let it go. There is really no point in arguing with a dickhead – a slightly wrong translation of the German term “Dickkopf” (bull-headed, stubborn).

Dear Uli,

Sorry, a quick reply in German:
Given Fiedler's email below, I consider it a "fruitless endeavour" to keep discussing this. He does not engage with our (formally correct!) arguments at all, and by now he has arrived at "you are not even worth discussing with." That he demonstrably miscites Ferby (1973) is not worth a mention to him either. I am not going to discuss this with him any longer, because he simply refuses to see it and therefore no longer even mentions our mathematically correct arguments (tactical ignorance).

One of the big problems in psychology is that its problems are dreadfully basic to refute. The "hidden-moderator argument," for example, can still be refuted at the pub with a blood-alcohol level of 1.3 per mille. Yet it keeps reappearing in articles by Strack, Stroebe, and others.

I agreed with him and decided to write a blog post about this fruitless discussion. I didn't do so until now, when the PoPS scandal reminded me of Fiedler's "I am never wrong" attitude.

Hello Moritz,

Yes, the discussion is over.
Now I will write a blog post with the emails to show the kind of spiteful (? is that even a word) arguments that are being used.

Zero respect for Klaus Fiedler.

Best, Uli

I communicated our decision to end the discussion to Klaus Fiedler in a final email.

Dear Klaus,

Last email  from me to you. 

It is sad that you are not even trying to answer my questions about the results of the reproducibility project.

I am also going back to work now, where my work is to save psychology from psychologists like you, who continue to deny that psychology has been facing a crisis for 50 years, make some quick bogus statistical arguments to undermine the credibility of the OSF-reproducibility project, and then go back to work as usual.

History will decide who wins this argument.

Disappointed (which implies that I had expected more from you when I started this attempt at a scientific discussion), Uli

Klaus Fiedler replied with his last email.

Dear Uli:

no, sorry, that is not my intention … and not my position. I would like to share with you my thoughts about reproducibility … and I am not at all happy with the (kernel of truth) of the Nosek report. However, I believe the problems are quite different from those focused on in the current debate, and in the premature consequences drawn by Nosek, Simonsohn, and others. You may have noticed that I have published a number of relevant articles, arguing that what we are lacking is not better statistics and larger subject samples but a better, broader methodology. Why should we two (including Moritz and Joachim and others) not share our thoughts, and I would also be willing to read your papers. Sure. For the moment, we have been debating only my critique of the Nosek report. My point was that in such a report of replications plotted against originals,

  • an informed interpretation is not possible unless one takes regression into account
  • one has to control for reliability as a crucial moderator
  • one has to consider manipulation checks
  • one has to contemplate sampling of studies

Our “debate” about 2+2=4 (I agree that’s what it was) does not affect this critique. I do not believe that I am at variance with your mathematical sketch, but it does not undo the fact that in a bivariate distribution of 100 bivariate points, the devil is lurking in the regression trap.

So please distinguish between the two points: (a) Nosek’s report does not live up to appropriate standards; but (b) I am not unwilling to share with you my thoughts about replicability. (By the way, I met Ioannidis some weeks ago and I never saw as clearly as now that he, like Fanelli, whom I also met, believe that all behavioral science is unreliable and invalid)

Kind regards, Klaus

More Gaslighting about the Replication Crisis by Klaus Fiedler

Klaus Fiedler and Norbert Schwarz are both German-born influential social psychologists. Norbert Schwarz migrated to the United States but continued to collaborate with German social psychologists like Fritz Strack. Klaus Fiedler and Norbert Schwarz have only one peer-reviewed joint publication, titled “Questionable Research Practices Revisited.” This article is a response to John, Loewenstein, and Prelec’s (2012) influential article that coined the term “questionable research practices” (QRPs). In the original article, John et al. (2012) conducted a survey and found that many researchers admit to using QRPs and also consider these practices acceptable (i.e., not a violation of ethical norms about scientific integrity). John et al.’s (2012) results provide a simple explanation for the outcome of the reproducibility project. Researchers use QRPs to get statistically significant results in studies with low statistical power. This inflates effect sizes. When these studies are replicated WITHOUT QRPs, effect sizes are closer to the real effect sizes and lower than the inflated estimates in the original studies. As a result, the average effect size shrinks and the percentage of significant results decreases. All of this was already clear when Moritz Heene and I debated Fiedler.
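To see why this matters for replication planning, here is a minimal power sketch in Python (scipy, normal approximation for a two-sided two-sample t-test). The values d = .6 and d = .2 echo the estimates discussed in the emails above; everything else is an illustrative assumption.

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample t-test (normal approximation)."""
    se = sqrt(2.0 / n_per_group)                  # standard error of the observed d
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - d / se) + norm.cdf(-z_crit - d / se)

d_published, d_true = 0.6, 0.2                    # inflated published estimate vs. assumed true effect

# A replication planned for 90% power against the published effect size ...
n = 10
while power_two_sample(d_published, n) < 0.90:
    n += 1
print(f"n per group for 90% power at d = {d_published}: {n}")

# ... has far less power if the true effect is only d = .2:
print(f"actual power at d = {d_true} with the same n: {power_two_sample(d_true, n):.2f}")
```

Under these illustrative assumptions, a replication powered at 90% against the published d = .6 has only about 20% power against a true d = .2, which is why replications planned on inflated estimates come back "failed" even when an effect exists.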

Fiedler and Schwarz’s article had one purpose, namely to argue that John et al.’s (2012) article did not provide credible evidence for the use of QRPs. The article does not make any connection between the use of QRPs and the outcome of the reproducibility project.

“The resulting prevalence estimates are lower by order of magnitudes. We conclude that inflated prevalence estimates, due to problematic interpretation of survey data, can create a descriptive norm (QRP is normal) that can counteract the injunctive norm to minimize QRPs and unwantedly damage the image of behavioral sciences, which are essential to dealing with many societal problems” (Fiedler & Schwarz, 2016, p. 45).

Indeed, the article has been cited to claim that “questionable research practices” are not always questionable and that “QRPs may be perfectly acceptable given a suitable context and verifiable justification (Fiedler & Schwarz, 2016; …)” (Rubin & Dunkin, 2022).

To be clear about what this means: Rubin and Dunkin claim that it is perfectly acceptable to run multiple studies and publish only those that worked, to drop observations to increase effect sizes, and to switch outcome variables after looking at the results. No student would agree that these practices are scientific or trust results based on such practices. However, Fiedler and other social psychologists want to believe that they did nothing wrong when they engaged in these practices to publish.

Fiedler triples down on Immaculate Regression

I assumed everybody had moved on from the heated debates in the wake of the reproducibility project, but I was wrong. Only a week ago, I discovered an article by Klaus Fiedler, co-authored with one of his students, that repeats the regression-trap claims in an English-language peer-reviewed journal under the title “The Regression Trap and Other Pitfalls of Replication Science—Illustrated by the Report of the Open Science Collaboration” (Fiedler & Prager, 2018).

ABSTRACT
The Open Science Collaboration’s 2015 report suggests that replication effect sizes in psychology are modest. However, closer inspection reveals serious problems.

A more general aim of our critical note, beyond the evaluation of the OSC report, is to emphasize the need to enhance the methodology of the current wave of simplistic replication science.

Moreover, there is little evidence for an interpretation in terms of insufficient statistical power.

Again, it is sufficient to assume a random variable of positive and negative deviations (from the overall mean) in different study domains or ecologies, analogous to deviations of high and low individual IQ scores. One need not attribute such deviations to “biased” or unfair measurement procedures, questionable practices, or researcher expectancies.

Yet, when concentrating on a domain with positive deviation scores (like gifted students), it is permissible—though misleading and unfortunate—to refer to a “positive bias” in a technical sense, to denote the domain-specific enhancement.

Depending on the selectivity and one-sided distribution of deviation scores in all these domains, domain-specific regression effects can be expected.

How about the domain of replication science? Just as psychopathology research produces overall upward regression, such that patients starting in a crisis or a period of severe suffering (typically a necessity for psychiatric diagnoses) are better off in a retest, even without therapy (Campbell, 1996), research on scientific findings must be subject to an opposite, downward regression effect. Unlike patients representing negative deviations from normality, scientific studies published in highly selective journals constitute a domain of positive deviations, of well-done empirical demonstrations that have undergone multiple checks on validity and a very strict review process. In other words, the domain of replication science, major empirical findings, is inherently selective. It represents a selection of the most convincing demonstrations of obtained effect sizes that should exceed most everyday empirical observations. Note once more that the emphasis here is not on invalid effects or outliers but on valid and impressive effects, which are, however, naturally contaminated with overestimation error (cf. Figure 2).

The domain-specific overestimation that characterizes all science is by no means caused by publication bias alone. [!!!!! the addition of alone here is the first implicit acknowledgement that publication bias contributes to the regression effect!!!!]

To summarize, it is a moot point to speculate about the reasons for more or less successful replications as long as no evidence is available about the reliability of measures and the effectiveness of manipulations.

In the absence of any information about the internal and external validity (Campbell, 1957) of both studies, there is no logical justification to attribute failed replications to the weakness of scientific hypotheses or to engage in speculations about predictors of replication success.

A recent simulation study by Stanley and Spence (2014) highlights this point, showing that measurement error and sampling error alone (Schmidt, 2010) can greatly reduce the replication success of empirical tests of correct hypotheses in studies that are not underpowered.

Our critical comments on the OSC report highlight the conclusion that the development of such a methodology is sorely needed.

Final Conclusion

Fiedler’s illusory regression account of the replication crisis has been known to me since 2015. It was not part of the official record. However, his articles with Schwarz in 2016 and Prager in 2018 are part of his official CV. The articles show a clear motivated bias against Open Science and the reforms initiated by social psychologists to fix their science. He was fired because he demonstrated the same arrogant dickheadedness in interactions with a Black scholar. Does this mean he is a racist? No, he also treats White colleagues with the same arrogance, but when he treated Roberts like this he abused his position as gate-keeper at an influential journal. I think APS made the right decision to fire him, but they were wrong to hire him in the first place. The past editors of PoPS have shown that old White eminent psychologists are unable to navigate the paradigm shift in psychology towards credibility, transparency, and inclusivity. I hope APS will learn a lesson from the reputational damage caused by Fiedler’s actions and search for a better editor who represents the values of contemporary psychologists.

P.S. This blog post is about Klaus Fiedler, the public figure and his role in psychological science. It has nothing to do with the human being.

P.P.S. Like Klaus, I have also had the experience of being forced out of an editorial position. I was co-founding editor of Meta-Psychology and made some controversial comments about another journal that led to a negative response. To save the new journal, I resigned. It was for the better, and Rickard Carlsson is doing a much better job alone than we could have done together. It hurt a little, but life goes on. Reputations are not made by a single incident, especially if you can admit to mistakes.

Implicit Bias ≠ Unconscious Bias

Preface

The journal Psychological Inquiry publishes theoretical articles that are accompanied by commentaries. In a recent issue, prominent implicit cognition researchers discussed the meaning of the term implicit. This blog post differs from the commentaries by researchers in the field by providing an outsider perspective and by focusing on the importance of communicating research findings clearly to the general public. This purpose of definitions was largely ignored by researchers, who are more focused on communicating with each other than with the general public. I will show that this outsider perspective favors a definition of implicit bias in terms of the actual research that has been conducted under the umbrella of implicit social cognition rather than a definition that renders 30 years of research useless with a simple stroke of a pen. If social cognition researchers want to communicate about implicit bias as empirical scientists, they have to define implicit bias as effects of automatically activated information (associations, stereotypes, attitudes) on behavior. This is what they have studied for 30 years. Defining implicit bias as unconscious bias is not helpful because 30 years of research have failed to provide any evidence that people can act in a biased way without awareness. Although unconscious biases may occur, there is currently no scientific evidence to inform the public about them. While the existing research on automatically activated stereotypes and attitudes has problems, the topic remains important. As the term implicit bias has caught on, it can be used in communications with the public, but it should be made clear that implicit does not mean unconscious.

Introduction

Psychologists are notoriously sloppy with language. This leads to misunderstandings and unnecessary conflicts among scientists. However, the bigger problem is a break-down in communication with the general public. This is particularly problematic in social psychology because research on social issues can influence public discourse and ultimately policy decisions.

One of the biggest case-studies of conceptual confusion that had serious real-world consequences is the research on implicit cognition that created the popular concept of implicit bias. Although the term implicit bias is widely used to talk about racism, the term lacks clear meaning.

The Stanford Encyclopedia of Philosophy defines implicit bias as a tendency to “act on the basis of prejudice and stereotypes without intending to do so.” However, lack of intention (not wanting to) is only one of several meanings of the term implicit. Another meaning of the word implicit is automatic activation of thoughts. For example, a Scientific American article describes implicit bias as a “tendency for stereotype-confirming thoughts to pass spontaneously through our minds.” Notably, this definition of implicit bias clearly implies that people are aware of the activated stereotype. The stereotype-confirming thought is in people’s mind and not activated in some other area of the brain that is not accessible to consciousness. This definition also does not imply that implicit bias results in biased behavior because awareness makes it possible to control the influence of activated stereotypes on behavior.

Merriam Webster Dictionary offers another definition of implicit bias as “a bias or prejudice that is present but not consciously held or recognized.” In contrast to the first two meanings of implicit bias, this definition suggests that implicit bias may occur without awareness; that is implicit bias = unconscious bias.

The different definitions of implicit bias lead to very different explanations of biased behavior. One explanation assumes that implicit biases can be activated and guide behavior without awareness and individuals who act in a biased way may either fail to recognize their biases or make up some false explanation for their biased behaviors after the fact. This idea is akin to Freud’s notion of a powerful, autonomous unconscious (the Id) that can have subversive effects on behavior that contradict the values of a conscious, moral self (Super-Ego). Given the persistent influence of Freud on contemporary culture, this idea of implicit bias is popular and reinforced by the Project Implicit website that offers visitors tests to explore their hidden (hidden = unconscious) biases.

The alternative interpretation of implicit bias is less mysterious and more mundane. It means that our brain constantly retrieves information from memory that is related to the situation we are in. This process does not have a filter that retrieves only the information we want. As a result, we sometimes have unwanted thoughts. For example, even individuals who do not want to be prejudiced will sometimes have unwanted stereotypes and associated negative feelings pop into their mind (Scientific American). No psychoanalysis or implicit test is needed to notice that our memory has stored stereotypes. In safe contexts, we may even laugh about them (Family Guy). In theory, awareness that a stereotype was activated also makes it possible to make sure that it does not influence behavior. This may even be the main reason for our ability to notice what our brain is doing. Rather than reacting reflexively to a situation, awareness makes it possible to respond more flexibly. When implicit is defined as automatic activation of a thought, the distinction between implicit and explicit bias becomes minor and academic because the processes that retrieve information from memory are automatic. The only difference between implicit and explicit retrieval of information is whether the process is triggered spontaneously by something in our environment or by a deliberate search for information.

After more than 30 years of research on implicit cognitions (Fazio, Sanbonmatsu, Powell, & Kardes, 1986), implicit social cognition researchers increasingly recognize the need for clearer definitions of the term implicit (Gawronski, Ledgerwood, & Eastwick, 2022a), but there is little evidence that they can agree on a definition (Gawronski, Ledgerwood, & Eastwick, 2022b). Gawronski et al. (2022a, 2022b) propose to limit the meaning of implicit bias to unconscious biases; that is, individuals are unaware that their behavior was influenced by activation of negative stereotypes or affects/attitudes: “instances of bias can be described as implicit if respondents are unaware of the effect of social category cues on their behavioral response” (p. 140). I argue that this definition is problematic because there is no scientific evidence to support the hypothesis that prejudice is unconscious. Thus, the term cannot be used to communicate scientific results that have been obtained by implicit cognition researchers over the past three decades, because these studies did not study unconscious bias.

Implicit Bias Is Not Unconscious Bias

Gawronski et al. note that their decision to limit the term implicit to mean unconscious is arbitrary: “A potential objection against our arguments might be that they are based on a particular interpretation of implicit in IB that treats the term as synonymous with unconscious” (p. 145). Gawronski et al. argue in favor of their definition because “unconscious biases have the potential to cause social harm in ways that are fundamentally different from conscious biases that are unintentional and hard-to-control” (p. 146). The key words in this argument are “have the potential,” which means that there is no scientific evidence showing different effects of biases with and without awareness. Thus, the distinction is merely a theoretical, academic one without actual real-world implications. Gawronski et al. agree with this assessment when they point out that existing implicit cognition research “provides no information about IB [implicit bias] if IB is understood as an unconscious effect of social category cues on behavioral responses.” It seems bizarre to define the term implicit bias in a way that makes all of the existing implicit cognition research irrelevant. A more reasonable approach would be to define implicit bias in a way that is more consistent with the work of implicit bias researchers. As several commentators pointed out, the most widely used meaning of implicit is automatic activation of information stored in memory about social groups. In fact, Gawronski himself has used the term implicit in this sense and has repeatedly pointed out that implicit does not mean unconscious (i.e., without awareness) (Appendix 1).

Defining the term implicit as automatic activation makes sense because the standard experimental procedure to study implicit cognition is based on presenting stimuli (words, faces, names) related to a specific group and examining how these stimuli influence behaviors such as the speed of pressing a button on a keyboard. The activation of stereotypic information is automatic because participants are not told to attend to these stimuli, or are even told to ignore them. Sometimes the stimuli are also presented in subtle ways to make it less likely that participants consciously attend to them. The question is always whether these stimuli activate stereotypes and attitudes stored in memory and how activation of this information influences behavior. If behavior is influenced by the stimuli, it suggests that stereotypic information was activated, with or without awareness. The evidence from studies like these provides the scientific basis for claims about implicit bias. Thus, implicit bias is, in effect, operationally defined as systematic effects of automatically activated information about groups on behavior.
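Stated as code, this operational definition amounts to nothing more than a difference in responses between prime conditions. The sketch below is a hypothetical illustration in Python (numpy); the group labels, effect size, and two-condition design are my own assumptions, not the procedure of any particular study.

```python
import numpy as np

rng = np.random.default_rng(7)
n_trials = 400

# Hypothetical trial-level data from a sequential priming task:
# which social group was primed on each trial, and the response time in milliseconds.
prime = rng.choice(["group_A", "group_B"], size=n_trials)
rt_ms = rng.normal(550, 80, size=n_trials)
rt_ms[prime == "group_B"] += 25   # build in a small effect so there is something to detect

# The 'implicit bias' score is simply the mean response-time difference between
# prime conditions: an effect of automatically activated information on behavior,
# measured without asking whether participants were aware of it.
effect = rt_ms[prime == "group_B"].mean() - rt_ms[prime == "group_A"].mean()
print(f"priming effect: {effect:.1f} ms")
```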

The aim of implicit bias research is to study real-world instances of prejudice under controlled laboratory conditions. A recent incident of racism shows how activation of stereotypes can have harmful consequences for victims and perpetrators of racist behavior.

University of Kentucky student who repeatedly hurled racist slur at Black student permanently banned from campus

The question of consciousness is secondary. What is important is how individuals can prevent harmful consequences of prejudice. What can individuals do to avoid storing negative stereotypes and attitudes in the first place? What can individuals do to weaken stored memories and attitudes? What can individuals do to make it less likely that stereotypes are activated? What can individuals do to control the influence of attitudes when they are activated? All of these questions are important and are related to the concept of implicit as automatic activation of attitudes. The only reason to emphasize unconscious processes would be a scenario in which individuals are unable to control the influence of information that affects behavior without awareness. However, given the lack of evidence that unconscious biases exist, it is currently unnecessary to focus on this scenario. Clearly, many instances of bias occur with awareness (“White teacher in Texas fired after telling students his race is ‘the superior one’”).

Unfortunately, it may be surprising for some readers to learn that implicit does not mean unconscious because the term implicit bias has been popularized in part to make a distinction between well-known forms of bias and prejudice and a new form of bias that can influence behavior even when individuals are consciously trying to be unbiased. These hidden biases occur against individuals’ best intentions because they exist in a blind spot of consciousness. This meaning of implicit bias was popularized by Banaji and Greenwald (2013), who also founded the Project Implicit website that provides individuals with feedback about their hidden biases; akin to psychoanalysts who can recover repressed memories.

Gawronski et al. (2022b) point out that Greenwald and Banaji’s theory of unconscious bias evolved independently of research by other implicit bias researchers who focused on automaticity and were less concerned about the distinction between conscious and unconscious biases. Gawronski’s definition of implicit bias as unconscious bias favors Banaji and Greenwald’s school of thought (hidden bias) over other research programs (automatically activated biases). The problem with this decision is that Greenwald and Banaji recently walked back their claims about unconscious biases and no longer maintain that the effects they studied were obtained without awareness (Implicit = Indirect & Indirect ≠ Unconscious; Greenwald & Banaji, 2017). The reversal of their theoretical position is evident in their statement that “even though the present authors find themselves occasionally lapsing to use implicit and explicit as if they had conceptual meaning [unconscious vs. conscious], they strongly endorse the empirical understanding of the implicit–explicit distinction” (p. 892). It is puzzling to see Gawronski arguing for a definition that is based on a theory that the authors no longer endorse. Given the lack of scientific evidence that stereotypes regularly lead to biases without awareness, this might be the time to agree on a definition that matches the actual research by implicit cognition researchers, and the most fitting definition would be automatic activation of stereotypes and attitudes, not unconscious causes of behavior.

Gawronski et al. (2022a) also falsely imply that implicit cognition researchers have ignored the distinction between conscious and unconscious biases. In reality, numerous studies have tried to demonstrate that implicit biases can occur without awareness. To study unconscious biases, social cognition researchers have relied heavily on an experimental procedure known as subliminal priming. In a subliminal priming study, a stimulus (prime) is presented very briefly, outside the focus of attention, and/or with a masking stimulus. If a manipulation check shows that individuals have no awareness of the prime and the prime nevertheless influences behavior, the effect appears to occur without awareness. Several studies suggested that racial primes can influence behavior without awareness (Bargh et al., 1996; Davis, 1989).

However, the credibility of these results has been demolished by the replication crisis in social psychology (Open Science Collaboration, 2015; Schimmack, 2020). Priming research has been singled out as the field with the biggest replication problems (Kahneman, 2012). When asked to replicate their own findings, leading priming researchers like Bargh refused to do so. Thus, while subliminal priming studies started the implicit revolution (Greenwald & Banaji, 2017), the revolution imploded over the past decade when doubts about the credibility of the original findings increased.

Unfortunately, researchers within the field of implicit bias research often ignore the replication crisis and cite questionable evidence as if it provided solid evidence for unconscious biases. For example, Gawronski et al. (2022b) suggest that unconscious biases may contribute to racial disparities in use-of-force errors such as the high-profile killing of Philando Castile. To make this case, they use a (single) study of 58 White undergraduate students (Correll, Wittenbrink, Crawford, & Sadler, 2015, Study 3). The study asked participants to make shoot vs. no-shoot decisions in a computer task (game) that presented pictures of White or Black men holding a gun or another object. Participants were instructed to make one quick decision within 630 milliseconds and another decision without time restriction. Gawronski et al. suggest that failures to correct an impulsive error when given ample time to do so constitute evidence of unconscious bias. They summarized the results as evidence that “unconscious effects on basic perceptual processes play a major role in tasks that more closely resemble real-world settings” (p. 226).

Fact-checking reveals that this characterization of the study and its results is at least misleading, if not outright false. First, it is important to realize that the critical picture was presented for only 175 ms and immediately replaced by another picture to wipe out visual memory. Although this is not a strictly subliminal presentation of stimuli, it is clearly a suboptimal one. As a result, participants sometimes had to guess what the object was. They also had no other information to tell whether their initial perception was correct or incorrect. The fact that participants’ performance improved without time pressure may be due to response errors under time pressure, and this improvement was evident independently of the race of the men in the pictures.

Without time pressure, participants shot 85% of armed Black men and 83% of armed White men. For unarmed men, participants shot 28% of Black men and 25% of White men. The statistical comparison of these differences showed only weak evidence of systematic bias. The comparison for unarmed men produced a p-value that was just significant with the standard criterion of alpha = .05, F(1,53) = 6.65, p = .013, but not with the more stringent criterion of alpha = .005 that is used to predict a high chance of replication. The same is true for the second comparison, F(1,53) = 4.96, p = .031. To my knowledge, this study has not been replicated, and Gawronski et al.’s claim rests entirely on this single study.
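The reported p-values can be checked directly from the F statistics and their degrees of freedom; a short sketch in Python (scipy), assuming nothing beyond the numbers reported above.

```python
from scipy import stats

# Reported: F(1,53) = 6.65, p = .013 and F(1,53) = 4.96, p = .031
for F in (6.65, 4.96):
    p = stats.f.sf(F, 1, 53)   # right-tail probability of the F(1, 53) distribution
    print(f"F(1,53) = {F}: p = {p:.3f}")
```

Both calculations reproduce the reported values (approximately .013 and .030), and both are above the stricter alpha = .005 threshold, that is, not significant by that criterion.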

Even if these effects could be replicated in the laboratory, they do not provide any information about unconscious biases in the real world because the study lacks ecological validity. To make claims about the real world, it is necessary to study police officers in simulations of real-world scenarios (Andersen, Di Nota, Boychuk, Schimmack, & Collins, 2021). This research is rare, difficult, and has not yet produced conclusive results. Andersen et al. (2021) found a small racial bias, but the sample was too small to provide meaningful information about the amount of racial bias in the real world. Most importantly, however, real-world scenarios provide ample information to see whether a suspect is Black or White and armed or not. The real decision is often whether use of force is warranted or not. Racial biases in these shooting errors are important, but they are not unconscious biases.

Contrary to Gawronski et al., I do not believe that social cognition researchers’ focus on automatic biases rather than unconscious biases was a mistake. The real mistake was the focus on reaction times in artificial computer tasks rather than studying racial biases in the real world. As a result, thirty years of research on automatic biases has produced little insight into racial biases in the real world. Moving the field towards the study of unconscious biases would be a mistake. Instead, social cognition researchers need to focus on outcome variables that matter.

Conclusion

The term implicit bias can have different meanings. Gawronski et al. (2022a) proposed to limit the meaning of the term to unconscious bias. I argue that this definition of implicit bias is not useful because most studies of implicit cognition are studies in which racial stereotypes and attitudes toward stigmatized groups are automatically activated. In contrast, priming studies that tried to distinguish between conscious and unconscious activation of this information have been discredited during the replication crisis and there exists no credible empirical evidence to suggest that unconscious biases exist or contribute to real-world behavior. Thus, funding a new research agenda focusing on unconscious biases may waste resources that are better spent on real-world studies of racial biases. Evidently, this conclusion diverges from the conclusion of implicit cognition researchers who are interested in continuing their laboratory studies, but they have failed to demonstrate that their work makes a meaningful contribution to society. To make research on automatic biases more meaningful, implicit bias research needs to move from artificial outcomes like reaction times on computer tasks to actual behaviors.

Appendix 1

Implicit Cognition Research Focusses on Automatic (Not Unconscious) Processes

Gawronski & Bodenhausen (2006), WOS/11/22 1,537

“If eras of psychological research can be characterized in terms of general ideas, a major theme of the current era is probably the notion of automaticity” (p. 692)

This perspective is also dominant in contemporary research on attitudes, in which deliberate, “explicit” attitudes are often contrasted with automatic, “implicit” attitudes (Greenwald & Banaji, 1995; Petty, Fazio, & Briñol, in press; Wilson, Lindsey, & Schooler, 2000; Wittenbrink & Schwarz, in press).

“We assume that people generally do have some degree of conscious access to their automatic affective reactions and that they tend to rely on these affective reactions in making evaluative judgments (Gawronski, Hofmann, & Wilbur, in press; Schimmack & Crites, 2005)” (p. 696).

Conrey, Sherman, Gawronski, Hugenberg, & Groom (2005) , WOS/11/22

“The distinction between automatic and controlled processes now occupies a central role in many areas of social psychology and is reflected in contemporary dual-process theories of prejudice and stereotyping (e.g., Devine, 1989)” (p. 469)

“Specifically, we argued that performance on implicit measures is influenced by at least four different processes: the automatic activation of an association (association activation), the ability to determine a correct response (discriminability), the success at overcoming automatically activated associations (overcoming bias), and the influence of response biases that may influence responses in the absence of other available guides to response (guessing)” (p. 482)

Gawronski & DeHouwer (2014), WOS 11/22 240

“other researchers assume that the two kinds of measures tap into distinct memory representations, such that explicit measures tap into conscious representations whereas implicit measures tap into unconscious representations (e.g., Greenwald & Banaji, 1995). Although the conceptualizations are relatively common in the literature on implicit measures, we believe that it is conceptually more appropriate to classify different measures in terms of whether the to-be-measured psychological attribute influences participants’ responses on the task in an automatic fashion (De Houwer, Teige-Mocigemba, Spruyt, & Moors, 2009).” (p. 283)

Hofmann, Gawronski, Le, & Schmitt, PSPB, 2005, WoS/11/22

“These [implicit] measures—most of them based on reaction times in response compatibility tasks (cf. De Houwer, 2003)—are intended to assess relatively automatic mental associations that are difficult to gauge with explicit self-report measures”. (p. 1369)

Gawronski, Hofmann, & Wilbur (2006), WoS/11/22 200

“A common explanation for these findings is that the spontaneous behavior assessed in these studies is difficult to control, and thus more likely to be influenced by automatic evaluations, such as they are reflected in indirect attitude measures” (p. 492)

“there is no empirical evidence that people lack conscious awareness of indirectly assessed attitudes per se” (p. 496)

Gawronski, LeBel, & Peters, PoPS (2007) WOS/11/22 187

“The central assumption in this model is that indirect measures provide a proxy for the activation of associations in memory” (p. 187)

Gawronski & LeBel, JESP (2008) WOS/11/22

“We argue that implicit measures provide a proxy for automatic associations in memory, which may or may not influence verbal judgments reflected in self-report measures” (p. 1356)

Deutsch, Gawronski, & Strack, JPSP (2006), WOS/11/22 122

“Phenomena such as stereotype and attitude activation can be readily reconstructed as instance-based automaticity. For example, perceiving a person of a stereotyped group or an attitude object may be sufficient to activate well-practiced stereotypic or evaluative associations in memory” (p. 386)

Implicit measures are important even if they do not assess unconscious processes.

Hofmann, Gawronski, Le, & Schmitt, PSPB, 2005, WoS/11/22

” Arguably one of the most important contributions in social cognition research within the last decade was the development of implicit measures of attitudes, stereotypes, self-concept, and self-esteem (e.g., Fazio, Jackson, Dunton, & Williams, 1995; Greenwald, McGhee, & Schwartz, 1998; Nosek & Banaji, 2001; Wittenbrink, Judd, & Park, 1997).” (p. 1369)

Gawronski & DeHouwer (2014), WOS 11/22 240

“For the decade to come, we believe that the field would benefit from a stronger focus on underlying mechanisms with regard to the measures themselves as well as their capability to predict behavior (see also Nosek, Hawkins, & Frazier, 2011).” (p. 303)