Estimating the False Discovery Risk of Psychology Science

Abstract

Since 2011, the credibility of psychological science has been in doubt. A major concern is that questionable research practices could have produced many false positive results, and it has been suggested that most published results are false. Here we present an empirical estimate of the false discovery risk using a z-curve analysis of randomly selected p-values from a broad range of journals that span most disciplines in psychology. The results suggest that no more than a quarter of published results could be false positives. We also show that the false positive risk can be reduced to less than 5% by using alpha = .01 as the criterion for statistical significance. This remedy can restore confidence in the direction of published effects. However, published effect sizes cannot be trusted because the z-curve analysis shows clear evidence of selection for significance that inflates effect size estimates.

Introduction

Several events in the early 2010s led to a credibility crisis in psychology. When journals selectively publish only statistically significant results, statistical significance loses its, well, significance. Every published focal hypothesis test will be statistically significant, and it is unclear which of these results are true positives and which are false positives.

A key article that contributed to the credibility crisis was Simmons, Nelson, and Simonsohn’s (2011) article “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.”

The title made the bold statement that it is easy to obtain statistically significant results even when the null-hypothesis is true. This led to concerns that many, if not most, published results are indeed false positive results. Many meta-psychological articles quoted Simmons et al.’s (2011) article to suggest that there is a high risk or even a high rate of false positive results in the psychological literature, including my own 2012 article.

“Researchers can use questionable research practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result” (Schimmack, 2012, p. 552, 248 citations)

The Appendix lists citations from influential meta-psychological articles that imply a high false positive risk in the psychological literature. Only one article suggested that fears about high false positive rates may be unwarranted (Stroebe & Strack, 2014). In contrast, other articles have suggested that false positive rates might be as high as 50% or more (Szucs & Ioannidis, 2017).

There have been two noteworthy attempts at estimating the false discovery rate in psychology. Szucs and Ioannidis (2017) automatically extracted p-values from five psychology journals and estimated the average power of the extracted t-tests. They then used this power estimate in combination with the assumption that psychologists discover one true, non-zero effect for every 13 true null-hypotheses to suggest that the false discovery rate in psychology exceeds 50%. The problem with this estimate is that it relies on the questionable assumption that psychologists test only a very small percentage of true hypotheses.

The other article tried to estimate the false positive rate based on 70 of the 100 studies that were replicated in the Open Science Collaboration project (Open Science Collaboration, 2015). The statistical model estimated that psychologists test 93 true null-hypotheses for every 7 true effects (true positives), and that true effects are tested with 75% power (Johnson et al., 2017). This yields a false positive rate of about 50%. The main problem with this study is the reliance on a small, unrepresentative sample of studies that focused heavily on experimental social psychology, a field that triggered concerns about the credibility of psychology in general (Schimmack, 2020). Another problem is that point estimates based on a small sample are unreliable.

To provide new and better information about the false positive risk in psychology, we conducted a new investigation that addresses three limitations of the previous studies. First, we used hand-coding of focal hypothesis tests, rather than automatic extraction of all test-statistics. Second, we sampled from a broad range of journals that cover all areas of psychology rather than focusing narrowly on experimental psychology. Third, we used a validated method to estimate the false discovery risk based on an estimate of the expected discovery rate (Bartos & Schimmack, 2021). In short, the false discovery risk is a monotonically decreasing function of the discovery rate (i.e., the percentage of p-values below .05) (Soric, 1989).

Z-curve relies on the observation that false positives and true positives produce different distributions of p-values. To fit a model to distributions of significant p-values, z-curve transforms p-values into absolute z-scores. We illustrate z-curve with two simulation studies. The first simulation is based on Simmons et al.’s (2011) scenario in which the combination of four questionable research practices inflates the false positive risk from 5% to 60%. In our simulation, we assumed an equal number of true null-hypotheses (effect size d = 0) and true hypotheses with small to moderate effect sizes (d = .2 to .5). The use of questionable research practices also increases the chances of getting a significant result for true hypotheses. In our simulation, the probability of obtaining significance with a true H0 was 58%, whereas the probability of obtaining significance with a true H1 was 93%. Given the 1:1 ratio of H0 and H1 that were tested, this yields a false discovery rate of 39%.
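
To make the arithmetic behind this number concrete, the following sketch simply combines the two significance probabilities reported above with the 1:1 ratio of tested hypotheses (it is an illustration of the calculation, not the simulation code itself):

```python
# Illustrative calculation, using the significance probabilities from the
# simulated QRP scenario described above (not re-derived here).
p_sig_given_h0 = 0.58   # probability of p < .05 when H0 is true, after QRPs
p_sig_given_h1 = 0.93   # probability of p < .05 when H1 is true, after QRPs
share_h0 = 0.5          # 1:1 ratio of true null-hypotheses to true effects

false_discoveries = share_h0 * p_sig_given_h0
true_discoveries = (1 - share_h0) * p_sig_given_h1
fdr = false_discoveries / (false_discoveries + true_discoveries)

print(round(fdr, 2))  # ~0.38, i.e. the roughly 39% false discovery rate above
```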

Figure 1 shows that questionable research practices produce a steeply declining z-curve. Based on this shape, z-curve estimates an expected discovery rate of 5%, with a 95%CI ranging from 5% to 10%. This translates into an estimate of the false discovery risk of 100%, with a 95%CI ranging from 46% to 100% (Soric, 1989). The reason why z-curve provides a conservative estimate of the false discovery risk is that p-hacking changes the shape of the distribution in a way that produces even more z-values just above 1.96 than mere selection for significance would produce. In other words, p-hacking destroys evidential value even when true hypotheses are being tested. It is not necessary to simulate scenarios in which even more true null-hypotheses are being tested because this would make the z-curve even steeper. Thus, Figure 1 provides a prediction for our z-curve analyses of actual data, if psychologists heavily rely on Simmons et al.’s recipe to produce significant results.
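
The link between the expected discovery rate (EDR) and the false discovery risk is Soric’s (1989) bound, FDR = (1/EDR − 1) × α/(1 − α). The sketch below is an illustrative calculation with the Figure 1 estimates, not z-curve’s actual bootstrap procedure, so the numbers need not match the reported confidence limits exactly:

```python
def soric_fdr(edr: float, alpha: float = 0.05) -> float:
    """Soric's (1989) upper bound on the false discovery rate for a given
    expected discovery rate (EDR) and significance criterion alpha."""
    return min(1.0, (1 / edr - 1) * (alpha / (1 - alpha)))

# Figure 1 scenario: EDR point estimate of 5%, upper CI limit of 10%.
print(round(soric_fdr(0.05), 2))  # 1.0 -> a false discovery risk of 100%
print(round(soric_fdr(0.10), 2))  # ~0.47, close to the reported 46% lower limit
```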

Figure 2 is based on a simulation of Johnson et al.’s (2017) scenario with a 9% discovery rate (about 9 significant results for every 100 hypothesis tests), a false discovery rate of 50%, and power of 75% to detect true effects. Johnson et al. did not assume or model p-hacking.

The z-curve for this scenario also shows a steep decline that can be attributed to the high percentage of false positive results. However, there is also a notable tail of z-values greater than 3 that reflects the influence of true hypotheses tested with adequate power. In this scenario, the expected discovery rate is higher, with a 95%CI ranging from 7% to 20%. This translates into a 95%CI for the false discovery risk ranging from 21% to 71% (Soric, 1989). This interval contains the true value of 50%, although the point estimate of 34% underestimates the true value. Thus, we recommend using the upper limit of the 95%CI as an estimate of the maximum false discovery rate that is consistent with the data.
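
Plugging the endpoints of the expected discovery rate interval into the same Soric bound reproduces the reported interval for the false discovery risk (again an illustrative calculation; z-curve obtains its confidence interval by bootstrapping):

```python
# Figure 2 scenario: EDR interval from 7% to 20% (values from the text),
# plugged into the Soric bound with alpha = .05.
for edr in (0.20, 0.07):
    print(round((1 / edr - 1) * (0.05 / 0.95), 2))  # ~0.21 and ~0.70
```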

We now turn to real data. Figure 3 shows a z-curve analysis of Kühberger, Fritz, and Scherndl’s (2014) data. The authors conducted an audit of psychological research by randomly sampling 1,000 English-language articles published in the year 2007 that were listed in PsycINFO. This audit produced 344 significant p-values that could be subjected to a z-curve analysis. The results differ notably from the simulated scenarios. The expected discovery rate is higher and implies a much smaller false discovery risk of only 9%. However, due to the small set of studies, the confidence interval is wide and allows for nearly 50% false positive results.

To produce a larger set of test-statistics, my students and I hand-coded over 1,000 randomly selected articles from a broad range of journals (Schimmack, 2021). These data were combined with Motyl et al.’s (2017) coding of social psychology journals. The time period spans the years 2008 to 2014, with a focus on the years 2009 and 2010. This dataset produced 1,715 significant p-values. The estimated false discovery risk is similar to the estimate for Kühberger et al.’s (2014) studies. Although the point estimate of the false discovery risk is a bit higher, 12%, the upper bound of the 95%CI is lower because the confidence interval is tighter.

Given the similarity of the results, we combined the two datasets to obtain an even more precise estimate of the false discovery risk based on 2,059 significant p-values. However, the upper limit of the 95%CI decreased only slightly from 30% to 26%.

The most important conclusion from these findings is that concerns about false positive results have rested on exaggerated assumptions about their prevalence in psychology journals. The present results suggest that at most a quarter of published results are false positives and that actual z-curves look very different from those implied by the influential simulation studies of Simmons et al. (2011). Our empirical results show no evidence that massive p-hacking is a common practice.

However, a false positive rate of 25% is still unacceptably high. Fortunately, there is an easy solution to this problem because the false discovery rate depends on the significance threshold. Based on their pessimistic estimates, Johnson et al. (2017) suggested lowering alpha to .005 or even .001. However, these stringent criteria would render most published results statistically non-significant. We suggest lowering alpha to .01. Figure 6 shows the rationale for this recommendation by fitting z-curve with alpha = .01 (i.e., the red vertical line that represents the significance criterion is moved from 1.96 to 2.58).

Lowering alpha to .01 lowers the percentage of significant results from 83% (not counting marginally significant, p < .1, results) to 53%. Thus, the expected discovery rate decreases, but the more stringent criterion for significance lowers the false discovery risk to 4%, and even the upper limit of the 95%CI is just 4%.
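
The reason a seemingly small change in alpha has such a large effect can be read off Soric’s formula: for a given expected discovery rate, the bound scales with α/(1 − α), so moving from .05 to .01 shrinks it by roughly a factor of five. The short calculation below illustrates this scaling and backs out the expected discovery rate implied by the 4% figure (the 4% itself comes from refitting z-curve at the stricter threshold, not from this arithmetic):

```python
# Scaling of Soric's bound when alpha is lowered from .05 to .01,
# holding the expected discovery rate fixed.
scale = (0.01 / 0.99) / (0.05 / 0.95)
print(round(scale, 2))  # ~0.19, i.e. roughly a five-fold reduction of the bound

# The 4% risk reported above corresponds, via FDR = (1/EDR - 1) * alpha/(1 - alpha),
# to an expected discovery rate of about 20% at the .01 threshold.
edr_implied = 1 / (1 + 0.04 * (0.99 / 0.01))
print(round(edr_implied, 2))  # ~0.2
```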

It is likely that discovery rates vary across journals and disciplines (Schimmack, 2021). In the future, it may be possible to make more specific recommendations for different disciplines or journals based on their discovery rates. Journals that publish riskier hypothesis tests or studies with modest power would need a more stringent significance criterion to maintain an acceptable false discovery risk.

An alpha level of .01 is also supported by Simmons et al.’s (2011) simulation studies of p-hacking. Massive p-hacking that inflates the false positive risk from 5% to 61% produces only 22% false positives with alpha = .01. Milder forms of p-hacking inflate the false positive risk less, producing only an 8% probability of obtaining a p-value below .01. Ideally, open science practices like pre-registration will curb the use of questionable practices in the future. Increasing sample sizes will also help to lower the false positive risk. A z-curve analysis of new studies can be used to estimate the current false discovery risk and may eventually show that even the traditional alpha level of .05 is able to maintain a false discovery risk below 5%.

While the present results may be considered good news relative to the scenario that most published results cannot be trusted, they do not change the fact that some areas of psychology have a replication crisis (Open Science Collaboration, 2015). The z-curve results show clear evidence of selection for significance, which leads to inflated effect size estimates. Studies suggest that effect sizes are often inflated by more than 100% (Open Science Collaboration, 2015). Thus, published effect size estimates cannot be trusted even if p-values below .01 show the correct sign of an effect. The present results also imply that effect size meta-analyses that did not correct for publication bias produce inflated effect size estimates. For these reasons, many meta-analyses have to be reexamined with statistical tools that correct for publication bias.

Appendix

“Given that these publishing biases are pervasive across scientific practice, it is possible that false positives heavily contaminate the neuroscience literature as well, and this problem may affect at least as much, if not even more so, the most prominent journals” (Button et al., 2013; 3,316 citations).

“In a theoretical analysis, Ioannidis estimated that publishing and analytic practices make it likely that more than half of research results are false and therefore irreproducible” (Open Science Collaboration, 2015, aac4716-1)

“There is increasing concern that most current published research findings are false. (Ioannidis, 2005, abstract)” (Cumming, 2014, p. 7, 1,633 citations).

“In a recent article, Simmons, Nelson, and Simonsohn (2011) showed how, due to the misuse of statistical tools, significant results could easily turn out to be false positives (i.e., effects considered significant whereas the null hypothesis is actually true).” (Leys et al., 2013, p. 765, 1,406 citations)

“During data analysis it can be difficult for researchers to recognize P-hacking or data dredging because confirmation and hindsight biases can encourage the acceptance of outcomes that fit expectations or desires as appropriate, and the rejection of outcomes that do not as the result of suboptimal designs or analyses. Hypotheses may emerge that fit the data and are then reported without indication or recognition of their post hoc origin. This, unfortunately, is not scientific discovery, but self-deception. Uncontrolled, it can dramatically increase the false discovery rate” (Munafò et al., 2017, p. 2, 1,010 citations)

“Just how dramatic these effects can be was demonstrated by Simmons, Nelson, and Simonsohn (2011) in a series of experiments and simulations that showed how greatly QRPs increase the likelihood of finding support for a false hypothesis.” (John et al., 2012, p. 524, 877 citations)

“Simonsohn’s simulations have shown that changes in a few data-analysis decisions can increase the false-positive rate in a single study to 60%” (Nuzzo, 2014, 799 citations).

“the publication of an important article in Psychological Science showing how easily researchers can, in the absence of any real effects, nonetheless obtain statistically significant differences through various questionable research practices (QRPs) such as exploring multiple dependent variables or covariates and only reporting these when they yield significant results (Simmons, Nelson, & Simonsohn, 2011)” (Pashler & Wagenmakers, 2012, p. 528, 736 citations)

“Even seemingly conservative levels of p-hacking make it easy for researchers to find statistically significant support for nonexistent effects. Indeed, p-hacking can allow researchers to get most studies to reveal significant relationships between truly unrelated variables (Simmons et al., 2011).” (Simonsohn, Nelson, & Simmons, 2014, p. 534, 656 citations)

“Recent years have seen intense interest in the reproducibility of scientific results and the degree to which some problematic, but common, research practices may be responsible for high rates of false findings in the scientific literature, particularly within psychology but also more generally” (Poldrack et al., 2017, p. 115, 475 citations)

“especially in an environment in which multiple comparisons or researcher dfs (Simmons, Nelson, & Simonsohn, 2011) make it easy for researchers to find large and statistically significant effects that could arise from noise alone” (Gelman & Carlin, 2014)

“In an influential recent study, Simmons and colleagues demonstrated that even a moderate amount of flexibility in analysis choice—for example, selecting from among two DVs or optionally including covariates in a regression analysis—could easily produce false-positive rates in excess of 60%, a figure they convincingly argue is probably a conservative estimate (Simmons et al., 2011).” (Yarkoni & Westfall, 2017, p. 1103, 457 citations)

“In the face of human biases and the vested interest of the experimenter, such freedom of analysis provides access to a Pandora’s box of tricks that can be used to achieve any desired result (e.g., John et al., 2012; Simmons, Nelson, & Simonsohn, 2011)” (Wagenmakers et al., 2012, p. 633, 425 citations)

“Simmons et al. (2011) illustrated how easy it is to inflate Type I error rates when researchers employ hidden degrees of freedom in their analyses and design of studies (e.g., selecting the most desirable outcomes, letting the sample size depend on results of significance tests).” (Bakker et al., 2012, p. 545, 394 citations).

“Psychologists have recently become increasingly concerned about the likely overabundance of false positive results in the scientific literature. For example, Simmons, Nelson, and Simonsohn (2011) state that “In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not” (p. 1359)” (Maxwell, Lau, & Howard, 2015, p. 487)

“Moreover, the highest impact journals famously tend to favor highly surprising results; this makes it easy to see how the proportion of false positive findings could be even higher in such journals.” (Pashler & Harris, 2012, p. 532, 373 citations)

“There is increasing concern that many published results are false positives [1,2] (but see [3]).” (Head et al., 2015, p. 1, 356 citations)

“Quantifying p-hacking is important because publication of false positives hinders scientific progress” (Head et al., 2015, p. 2, 356 citations).

“To be sure, methodological discussions are important for any discipline, and both fraud and dubious research procedures are damaging to the image of any field and potentially undermine confidence in the validity of social psychological research findings. Thus far, however, no solid data exist on the prevalence of such research practices in either social or any other area of psychology.” (Stroebe & Strack, 2014, p. 60, 291 citations)

“Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature” (Szucs & Ioannidis, 2017, p. 1, 269 citations)

“Notably, if we consider the recent estimate of 13:1 H0:H1 odds [30], then FRP exceeds 50% even in the absence of bias” (Szucs & Ioannidis, 2017, p. 12, 269 citations)

“In all, the combination of low power, selective reporting, and other biases and errors that have been well documented suggest that high FRP can be expected in cognitive neuroscience and psychology. For example, if we consider the recent estimate of 13:1 H0:H1 odds [30], then FRP exceeds 50% even in the absence of bias.” (Szucs & Ioannidis, 2017, p. 15, 269 citations)

“Many prominent researchers believe that as much as half of the scientific literature—not only in medicine, but also in psychology and other fields—may be wrong [11,13–15]” (Smaldino & McElreath, 2016, p. 2, 251 citations).

“Researchers can use questionable research practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result” (Schimmack, 2012, p. 552, 248 citations)

“A more recent article compellingly demonstrated how flexibility in data collection, analysis, and reporting can dramatically increase false-positive rates (Simmons, Nelson, & Simonsohn, 2011).” (Dick et al., 2015, p. 43, 208 citations)

“In 2011, we wrote “False-Positive Psychology” (Simmons et al. 2011), an article reporting the surprisingly severe consequences of selectively reporting data and analyses, a practice that we later called p-hacking. In that article, we showed that conducting multiple analyses on the same data set and then reporting only the one(s) that obtained statistical significance (e.g., analyzing multiple measures but reporting only one) can dramatically increase the likelihood of publishing a false-positive finding. Independently and nearly simultaneously, John et al. (2012) documented that a large fraction of psychological researchers admitted engaging in precisely the forms of p-hacking that we had considered. Identifying these realities—that researchers engage in p-hacking and that p-hacking makes it trivially easy to accumulate significant evidence for a false hypothesis—opened psychologists’ eyes to the fact that many published findings, and even whole literatures, could be false positive.” (Nelson, Simmons, & Simonsohn, 2018, 204 citations).

“As Simmons et al. (2011) concluded—reflecting broadly on the state of the discipline—“it is unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis” (p. 1359)” (Earp & Trafimow, 2015, p. 4, 200 citations)

“The second, related set of events was the publication of articles by a series of authors (Ioannidis 2005, Kerr 1998, Simmons et al. 2011, Vul et al. 2009) criticizing questionable research practices (QRPs) that result in grossly inflated false positive error rates in the psychological literature” (Shrout & Rodgers, 2018, p. 489, 195 citations).

“Let us add a new dimension, which was brought up in a seminal publication of Simmons, Nelson & Simonsohn (2011). They stated that researchers actually have so much flexibility in deciding how to analyse their data that this flexibility allows them to coax statistically significant results from nearly any data set” (Forstmeier, Wagenmakers, & Parker, 2017, p. 1945, 173 citations)

“Publication bias (Ioannidis, 2005) and flexibility during data analyses (Simmons, Nelson, & Simonsohn, 2011) create a situation in which false positives are easy to publish, whereas contradictory null findings do not reach scientific journals (but see Nosek & Lakens, in press)” (Lakens & Evers, 2014, p. 278, 139 citations)

“Recent reports hold that allegedly common research practices allow psychologists to support just about any conclusion (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011).” (Koole & Lakens, 2012, p. 608, 139 citations)

“Researchers then may be tempted to write up and concoct papers around the significant results and send them to journals for publication. This outcome selection seems to be widespread practice in psychology [12], which implies a lot of false positive results in the literature and a massive overestimation of ES, especially in meta-analyses” (

“Researcher df, or researchers’ behavior directed at obtaining statistically significant results (Simonsohn, Nelson, & Simmons, 2013), which is also known as p-hacking or questionable research practices in the context of null hypothesis significance testing (e.g., O’Boyle, Banks, & Gonzalez-Mulé, 2014), results in a higher frequency of studies with false positives (Simmons et al., 2011) and inflates genuine effects (Bakker et al., 2012).” (van Assen, van Aert, & Wicherts, p. 294, 133 citations)

“The scientific community has witnessed growing concern about the high rate of false positives and unreliable results within the psychological literature, but the harmful impact of false negatives has been largely ignored” (Vadillo, Konstantinidis, & Shanks, p. 87, 131 citations)

“Much of the debate has concerned habits (such as “phacking” and the filedrawer effect) which can boost the prevalence of false positives in the published literature (Ioannidis, Munafò, Fusar-Poli, Nosek, & David, 2014; Simmons, Nelson, & Simonsohn, 2011).” (Vadillo, Konstantinidis, & Shanks, p. 87, 131 citations)

“Simmons, Nelson, and Simonsohn (2011) showed that researchers without scruples can nearly always find a p < .05 in a data set if they set their minds to it.” (Crandall & Sherman, 2014, p. 96, 114 citations)

Science is self-correcting: JPSP-PPID is not

With over 7,000 citations at the end of 2021, Ryff and Keyes’s (1995) article is one of the most highly cited articles in the Journal of Personality and Social Psychology. A trend analysis shows that citations are still increasing, with over 800 citations in the past two years.

Most of these citations are references to the use of Ryff’s measure of psychological well-being that uncritically accept Ryff’s assertion that her PWB measure is a valid measure of psychological well-being. The abstract implies that the authors provided empirical support for Ryff’s theory of psychological well-being.

Contemporary psychologists contrast Ryff’s psychological well-being (PWB) with Diener’s (1984) subjective well-being (SWB). In an article with over 1,000 citations, Ryff and Keyes (2002) tried to examine how PWB and SWB are empirically related. This attempt resulted in a two-factor model that postulates that SWB and PWB are related, but distinct forms of well-being.

The general acceptance of this model shows that most psychologists lack proper training in the interpretation of structural equation models (Borsboom, 2006), although graphic representations of these models make SEM accessible to readers who are not familiar with matrix algebra. To interpret an SEM model, it is only necessary to know that boxes represent measured variables, ovals represent unmeasured constructs, directed straight arrows represent an assumption that one construct has a causal influence on another construct, and curved bidirectional arrows imply an unmeasured common cause.

Starting from the top, we see that the model implies that an unmeasured common cause produces a strong correlation between two unmeasured variables that are labelled Psychological Well-Being and Subjective Well-Being. These labels imply that the constructs PWB and SWB are represented by unmeasured variables. The direct causal arrows from these unmeasured variables to the measured variables imply that PWB and SWB can be measured because the measured variables reflect the unmeasured variables to some extent. This is called a reflective measurement model (Borsboom et al., 2003). For example, autonomy is a measure of PWB because .38^2 = 14% of the variance in autonomy scores reflects PWB. Of course, this makes autonomy a poor indicator of PWB because the remaining 86% of the variance does not reflect the influence of PWB. This variance in autonomy is caused by other unmeasured influences and is called unique variance, residual variance, or disturbance. It is often omitted from SEM figures because it is assumed that this variance is simply irrelevant measurement error. I added it here because Ryff and users of her measure clearly do not think that 86% of the variance in the autonomy scale is just measurement error. In fact, the scale scores of autonomy are often used as if they are a 100% valid measure of autonomy. The proper interpretation of the model is therefore that autonomy is measured with high validity, but that variation in autonomy is only a poor indicator of psychological well-being.

Examination of the factor loadings (i.e., the numbers next to the arrows from PWB to the six indicators) shows that personal relationships has the highest validity as a measure of PWB, but even for personal relationships, the amount of PWB variance is only .66^2 = 44%.

In a manuscript (doc) that was desk-rejected by JPSP, we challenged this widely accepted model of PWB. We argued that the reflective model does not fit Ryff’s own theory of PWB. In a nutshell, Ryff’s theory of PWB is one of many list-theories of well-being (Sumner, 1996). The theory lists a number of attributes that are assumed to be necessary and sufficient for high well-being.

This theory of well-being implies a different measurement model in which arrows point from the measured variables to the construct of PWB. In psychometrics, these models are called formative measurement models. There is nothing unobserved about formative constructs. They are merely a combination of the measured constructs. The simplest way to integrate information about the components of PWB is to average them. If assumptions about importance are added, the construct could be a weighted average. This model is shown in Figure 2.
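
As a concrete illustration of the formative logic, a PWB score under this model is just a (possibly weighted) composite of the observed scale scores; nothing is estimated from the correlations among them. The snippet below is a hypothetical sketch (the scores and weights are placeholders, not values from Ryff’s data):

```python
import numpy as np

# Hypothetical scale scores on the six PWB components for one respondent.
scores = {
    "autonomy": 4.2,
    "environmental_mastery": 3.8,
    "personal_growth": 5.0,
    "positive_relations": 4.5,
    "purpose_in_life": 4.0,
    "self_acceptance": 3.9,
}

# Unweighted formative composite: a simple average of the components.
pwb_unweighted = np.mean(list(scores.values()))

# Weighted composite: the weights encode assumptions about importance
# (arbitrary placeholder values, not empirical estimates).
weights = {name: 1.0 for name in scores}
weights["positive_relations"] = 2.0

pwb_weighted = np.average(
    [scores[name] for name in scores],
    weights=[weights[name] for name in scores],
)

print(round(pwb_unweighted, 2), round(pwb_weighted, 2))
```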

The key problem for this model is that it makes no predictions about the pattern of correlations among the measured variables. For example, Ryff’s theory does not postulate whether an increase in autonomy produces an increase in personal growth or a decrease in personal relations. At best, the distinction between PWB and SWB might imply that changes in PWB components are independent of changes in SWB components, but this assumption is highly questionable. For example, some studies suggest that positive relationships improve subjective well-being (Schimmack & Lucas, 2010).

To conclude, JPSP has published two highly cited articles that fitted a reflective measurement model to PWB indicators. In the desk-rejected manuscript, Jason Payne and I presented a new model that is grounded in theories of well-being and that treats PWB dimensions like autonomy and positive relations as possible components of a good life. Our model also clarified the confusion about Diener’s (1984) model of subjective well-being.

Ryff et al.’s (2002) two-factor model of well-being was influenced by Ryan and Deci’s (2001) distinction between two broad traditions in well-being research: “one dealing with happiness (hedonic well-being), and one dealing with human potential (eudaimonic well-being; Ryan & Deci, 2001; see also Waterman, 1993)” (Ryff et al., 2002, p. 1007). We argued that this dichotomy overlooks another important distinction between well-being theories, namely the distinction between subjective and objective theories of well-being (Sumner, 1996). The key difference is that objective theories aim to specify universal aspects of a good life that are based on philosophical analyses of the good life. In contrast, subjective theories reject the notion that universal criteria of a good life exist and leave it to individuals to create their own evaluation standards of a good life (Cantril, 1965). Unfortunately, Diener’s tripartite model of SWB is difficult to classify because it combines objective and subjective indicators. Whereas life-evaluations like life-satisfaction judgments are clearly subjective indicators, the amounts of positive affect and negative affect imply a hedonistic conception of well-being. Diener never resolved this contradiction (Busseri & Sadava, 2011), but his writing made it clear that he stressed subjectivity as an essential component of well-being.

It is therefore incorrect to characterize Diener’s concept of SWB as a hedonic or hedonistic conception of well-being. The key contribution of Diener was to introduce psychologists to subjective conceptions of well-being and to publish the most widely used subjective measure of well-being, namely the Satisfaction with Life Scale. In my opinion, the inclusion of PA and NA in the tripartite model was a mistake because it does not allow individuals to choose what they want to do with their lives. Even Diener himself published articles that suggested positive affect and negative affect are not essential for all people (Suh, Diener, Oishi, & Triandis, 1998). At the very least, it remains an empirical question how important positive affect and negative affect are for subjective life evaluations and whether other aspects of a good life are even more important. At least, this question can be empirically tested by examining how much eudaimonic and hedonic measures of well-being contribute to variation in subjective measures of well-being. This question leads to a model in which life-satisfaction judgments are a criterion variable and the other variables are predictor variables.

The most surprising finding was that environmental mastery was a strong unique predictor and a much stronger predictor than positive affect or negative affect (direct effect, b = .66).

In our model, we also allowed for the possibility that PWB attributes influence subjective well-being by increasing positive affect or decreasing negative affect. The total effect is a very strong relationship, b = .78, with more than 50% of the variance in life-satisfaction being explained by a single PWB dimension, namely environmental mastery.

Other noteworthy findings were that none of the other PWB attributes made a positive (direct or indirect) contribution to life-satisfaction judgments. Autonomy was even a negative predictor. The effects of positive affect and negative affect were statistically significant, but small. This suggests that PA and NA are meaningful indicators of subjective well-being because they reflect a good life, but it provides no evidence for hedonic theories of well-being that suggest positive affect increases well-being no matter how it is elicited.

These results are dramatically different from the published model in JPSP. In that model an unmeasured construct, SWB, causes variation in Environmental Mastery. In our model, environmental mastery is a strong cause of the only subjective indicator of well-being, namely life-satisfaction judgments. Whereas the published model implies that feeling good makes people have environmental mastery, our model suggests that having control over one’s life increases well-being. Call us crazy, but we think the latter model makes more sense.

So, why was our ms. desk rejected without peer-review from experts in well-being research? I post the full decision letter below, but I want to highlight the only comment about our actual work.

A related concern has to do with a noticeable gap between your research question, theoretical framework, and research design. The introduction paints your question in broad strokes only, but my understanding is that you attempt to refine our understanding of the structure of well-being, which could be an important contribution to the literature. However, the introduction does not provide a clear rationale for the alternative model presented. Perhaps even more important, the cross-sectional correlational study of one U.S. sample is not suited to provide strong conclusions about the structure of well-being. At the very least, I would have expected to see model comparison tests to compare the fit of the presented model with those of alternative models. In addition, I would have liked to see a replication in an independent sample as well as more critical tests of the discriminant validity and links between these factors, perhaps in longitudinal data, through the prediction of critical outcomes, or by using behavioral genetic data to establish the genetic and environmental architecture of these factors. Put another way, independent of the validity of the Ryff / Keyes model, the presented theory and data did not convince me that your model is a better presentation of the structure of well-being.

Bleidorn’s comments show that even prominent personality researchers lack a basic understanding of psychometrics and construct validation. For example, it is not clear how longitudinal data can provide answers to questions about construct validity. Examining change is of course useful, but without a valid measure of a construct it is not clear what change in scale scores means. Construct validation precedes studies of stability and change. Similarly, it is only meaningful to examine nature and nurture questions once a phenotype has been clearly defined. Bleidorn completely ignores our distinction between hedonic and subjective well-being and the fact that we are the first to examine the relationship between PWB attributes and life-satisfaction.

As psychometricians have pointed out, personality psychologists often ignore measurement questions and are content to treat averaged self-report ratings as operationalized constructs that require no further validation. We think that this blind empiricism is preventing personality psychology from making real progress. It is depressing to see that even the new generation of personality psychologists shows no interest in improving the construct validity of foundational constructs. Fortunately, JPSP-PPID publishes only about 50 articles a year, and there are other outlets to publish our work. Unfortunately, JPSP has a reputation for publishing only the best work, but this prestige is not warranted by the actual quality of the published articles. For example, the obsession with longitudinal data is not warranted given evidence that about 80% of the variance in personality measures is stable trait variance that does not change. Repeatedly measuring this trait variance does not add to our understanding of stable traits.

Conclusion

To conclude, JPSP has published two cross-sectional articles on the structure of well-being that continue to be highly cited. We find major problems with the models in these articles, but JPSP is not interested in publishing a criticism of them. To reiterate, the main problem is that Diener’s SWB model is treated as if it were an objective hedonic theory of well-being, when the core aspect of the model is that well-being is subjective and not objective. We thought at least the main editor, Rich Lucas, a former Diener student, would understand this point, but expectations are the mother of disappointment. Of course, we could be wrong about some minor or major issues, but the lack of interest in these foundational questions shows just how far psychology is from being a real science. A real science develops valid measures before it examines real questions. Psychologists invent measures and study their measures without evidence that their measures reflect important constructs like well-being. Not surprisingly, psychology has produced no consensual theory of well-being that could help people live better lives. This does not stop psychologists from making proclamations about ways to lead a happy or good life. The problem is that these recommendations are all contingent on researchers’ preferred definition of well-being and the measures associated with that tradition/camp/belief system. In this way, psychology is more like (other) religions and less like a science.

Decision Letter

I am writing about your manuscript “Two Concepts of Wellbeing: The Relation Between Psychological and Subjective Wellbeing”, submitted for publication in the Journal of Personality and Social Psychology (JPSP). I have read the manuscript carefully myself, as has the lead Editor at JPSP, Rich Lucas. We read the manuscript independently and then consulted with each other about whether the manuscript meets the threshold for full review. Based on our joint consultation, I have made the decision to reject your paper without sending it for external review. The Editor and I shared a number of concerns about the manuscript that make it unlikely to be accepted for publication and that reduce its potential contribution to the literature. I will elaborate on these concerns below. Due to the high volume of submissions and limited pages available to JPSP, we must limit our acceptances to manuscripts for which there is a general consensus that the contribution is of an important and highly significant level. 
 

  1. Most importantly, papers that rely solely on cross-sectional designs and self-report questionnaire techniques are less and less likely to be accepted here as the number of submissions increases. In fact, such papers are almost always rejected without review at this journal. Although such studies provide an important first step in the understanding of a construct or phenomenon, they have some important limitations. Therefore, we have somewhat higher expectations regarding the size and the novelty of the contribution that such studies can make. To pass threshold at JPSP, I think you would need to expand this work in some way, either by using longitudinal data or by going further in your investigation of the processes underlying these associations. I want to be clear; I agree that studies like this have value (and I also conduct studies using these methods myself), it is just that many submissions now go beyond these approaches in some way, and because competition for space here is so high, those submissions are prioritized.
  2. A related concern has to do with a noticeable gap between your research question, theoretical framework, and research design. The introduction paints your question in broad strokes only, but my understanding is that you attempt to refine our understanding of the structure of well-being, which could be an important contribution to the literature. However, the introduction does not provide a clear rationale for the alternative model presented. Perhaps even more important, the cross-sectional correlational study of one U.S. sample is not suited to provide strong conclusions about the structure of well-being. At the very least, I would have expected to see model comparison tests to compare the fit of the presented model with those of alternative models. In addition, I would have liked to see a replication in an independent sample as well as more critical tests of the discriminant validity and links between these factors, perhaps in longitudinal data, through the prediction of critical outcomes, or by using behavioral genetic data to establish the genetic and environmental architecture of these factors. Put another way, independent of the validity of the Ryff / Keyes model, the presented theory and data did not convince me that your model is a better presentation of the structure of well-being.
  3. The use of a selected set of items rather than the full questionnaires raises concerns about over-fitting and complicate comparisons with other studies in this area. I recommend using complete questionnaires and – should you decide to collect more data – additional measures of well-being to capture the universe of well-being content as best as you can. 
  4. I noticed that you tend to use causal language in the description of correlations, e.g. between personality traits and well-being measures. As you certainly know, the data presented here do not permit conclusions about the temporal or causal influence of e.g., neuroticism on negative affect or vice versa and I recommend changing this language to better reflect the correlational nature of your data.     

In closing, I am sorry that I cannot be more positive about the current submission. I hope my comments prove helpful to you in your future research efforts. I wish you the very best of luck in your continuing scholarly endeavors and hope that you will continue to consider JPSP as an outlet for your work.

Sincerely,
Wiebke Bleidorn, PhD
Associate Editor
Journal of Personality and Social Psychology: Personality Processes and Individual Differences

Estimating the False Positive Risk in Psychological Science

Abstract: At most one-quarter of published significant results in psychology journals are false positive results. This is surprising news after a decade of false positive paranoia. However, the low false positive rate is not a cause for celebration. It mainly reflects the low a priori probability that the nil-hypothesis is true (Cohen, 1994). To produce meaningful results, psychologists need to maintain low false positive risks when they test stronger hypotheses that specify a minimum effect size.

Introduction

Like many other sciences, psychological science relies on null-hypothesis significance testing as its main statistical approach to drawing inferences from data. This approach can be dated back to Fisher’s first manual for empirical researchers on how to conduct statistical analyses. If the observed test-statistic produces a p-value below .05, the null-hypothesis can be rejected in favor of the alternative hypothesis that the population effect size is not zero. Many criticisms of this statistical approach have failed to change research practices.

Cohen (1994) wrote a sarcastic article about NHST with the title “The Earth Is Round (p < .05).” In this article, Cohen made the bold claim, “my work on power analysis has led me to realize that the nil-hypothesis is always false.” In other words, population effect sizes are unlikely to be exactly zero. Thus, rejecting the nil-hypothesis with a p-value below .05 only tells us something we already know. Moreover, when sample sizes are small, we often end up with p-values greater than .05 that do not allow us to reject a false null-hypothesis. I cite this article only to point out that in the 1990s, meta-psychologists were concerned with low statistical power because it produces many false negative results. In contrast, significant results were considered to be true positive findings. Although often meaningless (e.g., the amount of explained variance is greater than zero), they were not wrong.

Since then, psychology has undergone a radical shift in concerns about false positive results (i.e., significant p-values when the nil-hypothesis is true). I conducted an informal survey on social media. Only 23.7% of Twitter respondents echoed Cohen’s view that false positive results are rare (less than 25%). The majority (52.6%) of respondents assumed that more than half of all published significant results are false positives.

The results were a bit different for the poll in the Psychological Methods Discussion Group on Facebook. Here the majority opted for 25 to 50 percent false positive results.

The shift from the 1990s to the 2020s can be explained by the replication crisis in social psychology, which has attracted a lot of attention and has been generalized to all areas of psychology (Open Science Collaboration, 2015). Arguably, the most influential article that contributed to concerns about false positive results in psychology is Simmons, Nelson, and Simonsohn’s (2011) article titled “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” which has been cited 3,203 times. The key contribution of this article was to show that the questionable research practices that psychologists use to obtain p-values below .05 (e.g., using multiple dependent variables) can increase the risk of a false positive result from 5% to over 60%. Moreover, anonymous surveys suggested that researchers often engage in these practices (John et al., 2012). However, even massive use of QRPs will not produce a massive number of false positive results if most tested hypotheses are true (i.e., most null-hypotheses are false). In this case, QRPs will inflate the effect size estimates (that nobody pays attention to, anyways), but the rate of false positive results will remain low.

Some scientists have argued that researchers are much more likely to make false assumptions (e.g., the Earth is flat) than Cohen envisioned. Ioannidis (2005) famously declared that most published research findings are false. He based this claim on hypothetical scenarios that produce more than 50% false positive results when 90% of studies test a true null-hypothesis. This assumption is a near complete reversal of Cohen’s assumption that we can nearly always assume that the effect size is not zero. The problem is that the actual ratio of true and false hypotheses is unknown. Thus, estimates of false positive rates are essentially projective tests of gullibility and cynicism.

To provide psychologists with scientific information about the false positive risk in their science, we need a scientific method that can estimate the false discovery risk based on actual data rather than hypothetical scenarios. There have been several attempts to do so. So far, the most prominent is Jager and Leek’s (2014) estimate of the false discovery rate in medicine. They obtained an estimate of 14%. Simulation studies showed some problems with their estimation model, but the superior z-curve method replicated the original result with a false discovery risk of 13%. This result is much more in line with Cohen’s view that most null-hypotheses are false (typically, effect sizes are not zero) than with Ioannidis’s claim that the null-hypothesis is true in 90% of all significance tests.

In psychology, the focus has been on replication rates. The shocking finding was that only 25% of significant results in social psychology could be replicated in an honest and unbiased attempt to reproduce the original study (Open Science Collaboration, 2015). This low replication rate leaves ample room for false positive results, but it is unclear how many of the non-significant results were caused by a true null-hypothesis and how many were caused by low statistical power to detect an effect size greater than zero. Thus, this project provides no information about the false positive risk in psychological science.

Another noteworthy project used a representative sample of test results in social psychology journals (Motyl et al., 2017). This project produced over 1,000 p-values that were examined with the statistical tools available at that time. The key result was that there was clear evidence of publication bias. That is, focal hypothesis tests nearly always rejected the null-hypothesis, a finding that has been observed since the beginning of social psychology (Sterling, 1959). However, the actual power of studies to do so was much lower, a finding that is consistent with Cohen’s (1962) seminal analysis of power. The results provided no information about the false positive risk. Yet, this valuable dataset could be analyzed with statistical tools that estimate the false discovery risk (Schimmack, 2021). However, the number of significant p-values was too small to produce an informative estimate of the false discovery risk (k = 678; 95%CI = .09 to .82).

Results

A decade after the “False Positive Psychology” article rocked psychological science, it remains unclear to what extent false positive results contribute to replication failures in psychology. To answer this question, we report the results of a z-curve analysis of 1,857 significant p-values that were obtained from hand-coding a representative sample of studies that were published between 2009 and 2014. The years 2013 and 2014 were included to incorporate Motyl et al.’s data. All other coding efforts focused on the years 2009 and 2010, before concerns about replication failures could have changed research practices. In marked contrast to previous initiatives, the aim was to cover all areas of psychology. To obtain a broad range of disciplines in psychology, a list of 120 journals was compiled (Schimmack, 2021). These journals are the top journals of their disciplines, with high impact factors. Students had some freedom in picking journals of their choice. For each journal, articles were selected based on a fixed sampling scheme that coded articles 1, 3, 6, and 10 for every set of 10 articles (1, 3, 6, 10, 11, 13, …). The project is ongoing and the results reported below should be considered preliminary. Yet, they provide a first estimate of the false discovery risk in psychological science.

The results replicate many other findings that focal statistical tests are selected because they reject the null-hypothesis. Eighty-one percent of all tests had a p-value below .05. When marginally significant results are included as well, the observed discovery rate increases to 90%. However, the statistical power of studies does not warrant such high success rates. The z-curve estimate of mean power before selection for significance is only 31%; 95%CI = 19% to 37%. This statistic is called the expected discovery rate (EDR) because mean power is equivalent to the long-run percentage of significant results. Based on an insight by Soric (1989), we can use the EDR to quantify the maximum percentage of results that can be false positives, using the formula: FDR = (1/EDR – 1)*(alpha/(1-alpha)). The point estimate of the EDR of 31% corresponds to a point estimate of the False Discovery Risk of 12%. The 95%CI ranges from 8% to 28%. It is important to distinguish between the risk and rate of false positives. Soric’s method assumes that true hypotheses are tested with 100% power. This is an unrealistic assumption. When power is lower the false positive rate will be lower than the false positive risk. Thus, we can conclude from these results that it is unlikely that more than 25% of published significant results in psychology journals are false positive results.
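
In code, the step from the expected discovery rate to the false discovery risk looks as follows. This is an illustrative recomputation of the point estimate; the confidence interval reported above comes from z-curve’s bootstrap rather than from plugging the CI endpoints into the formula:

```python
def soric_fdr(edr: float, alpha: float = 0.05) -> float:
    """Soric's (1989) maximum false discovery rate for a given
    expected discovery rate (EDR) and significance criterion alpha."""
    return min(1.0, (1 / edr - 1) * (alpha / (1 - alpha)))

print(round(soric_fdr(0.31), 2))  # ~0.12 -> the 12% point estimate reported above
print(round(soric_fdr(0.19), 2))  # ~0.22 -> the risk implied by the lower EDR estimate
```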

One concern about these results is that the number of test statistics differed across journals and that Motyl et al.’s large set of results from social psychology could have biased the results. We therefore also analyzed the data by journal and then computed the mean FDR and its 95%CI. This approach produced an even lower FDR estimate of 11%, 95%CI = 9% to 15%.

While an FDR of less than 25% may seem like good news in a field that is suffering from false positive paranoia, it is still too high to ensure that published results can be trusted. Fortunately, there is a simple solution to this problem because Soric’s formula shows that the false discovery risk depends on alpha. Lowering alpha to .01 is sufficient to produce a false discovery risk below 5%. Although this seems like a small adjustment, it results in the loss of 37% of significant results, namely those with p-values between .01 and .05. This recommendation is consistent with two papers that have argued against the blind use of Fisher’s alpha level of .05 (Benjamin et al., 2017; Lakens et al., 2018). The cost of lowering alpha further to .005 would be the loss of another 10% of significant findings (ODR = 47%).

Limitations and Future Directions

No study is perfect. As many women know, the first time is rarely the best time (Higgins et al., 2010). Similarly, this study has some limitations that need to be addressed in future studies.

The main limitation of this study is that the coded statistical tests may not be representative of psychological science. However, the random sampling from journals and the selection of a broad range of journals suggests that sampling bias has a relatively small effect on the results. A more serious problem is that there is likely to be heterogeneity across disciplines or even journals within disciplines. Larger samples are needed to test those moderator effects.

Another problem is that z-curve estimates of the EDR and FDR make assumptions about the selection process that may differ from the actual selection process. The best way to address this problem is to promote open science practices that reduce the selective publishing of statistically significant results.

Eventually, it will be necessary to conduct empirical tests with a representative sample of results published in psychology, akin to the reproducibility project (Open Science Collaboration, 2015). As a first step, studies can be replicated with the original sample sizes. Results that are successfully replicated do not require further investigation. Replication failures need to be followed up with studies that can provide evidence for the null-hypothesis using equivalence testing with a minimum effect size that would be relevant (Lakens, Scheel, & Isager, 2018). This is the only way to estimate the false positive risk by means of replication studies.
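
To illustrate the kind of follow-up test this would require, the sketch below runs a two one-sided tests (TOST) equivalence procedure for a two-group comparison against hypothetical equivalence bounds of d = ±0.1. The data, sample sizes, and bounds are placeholders chosen for the example; dedicated implementations exist (e.g., the TOSTER package in R or statsmodels in Python):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder data for a 'replication' in which the true effect is zero.
n = 200
group1 = rng.normal(loc=0.0, scale=1.0, size=n)
group2 = rng.normal(loc=0.0, scale=1.0, size=n)

# Hypothetical smallest effect size of interest: d = 0.1, converted to a raw
# mean difference using the pooled standard deviation.
sd_pooled = np.sqrt((group1.var(ddof=1) + group2.var(ddof=1)) / 2)
bound = 0.1 * sd_pooled

# Two one-sided tests: shifting group1 by the bound turns each one-sided test
# into an ordinary two-sample t-test against a zero difference.
p_lower = stats.ttest_ind(group1 + bound, group2, alternative="greater").pvalue
p_upper = stats.ttest_ind(group1 - bound, group2, alternative="less").pvalue
p_tost = max(p_lower, p_upper)

# p_tost < .05 would indicate that the difference is significantly within the
# equivalence bounds, i.e., evidence that any effect is practically negligible.
print(round(p_lower, 3), round(p_upper, 3), round(p_tost, 3))
```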

Implications: What Would Cohen Say

The finding that most published results are not false may sound like good news for psychology. However, Cohen would merely point out that a low rate of false positive results simply reflects the fact that the nil-hypothesis is rarely true. If some hypotheses were true and others were false, NHST (without QRPs) could be used to distinguish between them. However, if most effect sizes are greater than zero, not much is learned from statistical significance. The problem is not p-values or dichotomous thinking. The problem is that nobody tests the riskier hypothesis that an effect size exceeds a minimum size and decides in favor of the null-hypothesis when the data show that the population effect size is not exactly zero but practically meaningless (e.g., experimental ego-depletion effects are less than 1/10th of a standard deviation). Even specifying H0 as r < .05 or d < .01 would lower discovery rates and increase the false discovery risk, while increasing the value of a statistically significant result.

Cohen’s clear distinction between the null-hypothesis and the nil-hypothesis made it clear that nil-hypothesis testing is a ritual with little scientific value, while null-hypothesis testing is needed to advance psychological science. The past decade has been a distraction by suggesting that nil-hypothesis testing is meaningful as long as open science practices are used to prevent false positive results. However, open science practices do not change the fundamental problem of nil-hypothesis testing that Cohen and others identified more than two decades ago. It is often said that science is self-correcting, but psychologists have not corrected the way they formulate their hypotheses. If psychology wants to be a science, researchers need to specify hypotheses that are worthy of empirical falsification. I am getting too old and cynical (much like my hero Cohen in the 1990s) to believe in change in my lifetime, but I can write this message in a bottle and hope that one day a new generation will find it and do something with it.

Open Science: Inside Peer-Review at PSPB

We submitted to PSPB a ms. that showed problems with the validity of the race IAT as a measure of African Americans’ unconscious attitudes (Schimmack & Howard, 2020). After waiting patiently for three months, we received the following decision letter from the acting editor, Dr. Corinne Moss-Racusin, at Personality and Social Psychology Bulletin. She assures us that she independently read our manuscript carefully – twice; once before and once after reading the reviews. This is admirable. Yet it is surprising that her independent reading of our manuscript places her in strong agreement with the reviewers. Somebody with less research experience might feel devastated by the independent evaluation by three experts that our work is “of low quality.” Fortunately, it is possible to evaluate the contribution of our manuscript from another independent perspective, namely the strength of the science.

The key claim of our ms. is simple. John Jost, Brian Nosek, and Mahzarin Banaji wrote a highly cited article that contained the claim that a large percentage of members of disadvantaged groups have an implicit preference for the out-group. As recently as 2019, Jost repeated this claim and used the term self-hatred to refer to implicit preferences for the in-group (Jost, 2019).

We expressed our doubt about this claim when the disadvantaged group is African Americans. Our main concern was that any claims about African Americans’ implicit preferences require a valid measure of African Americans’ preferences. The claim that a large number of African Americans have an implicit preference for the White outgroup rests entirely on results obtained with the Implicit Association Test (Jost, Nosek, & Banaji, 2004). However, since the 2004 publication, the validity of the race IAT as a measure of implicit preferences has been questioned in numerous publications, including my recent demonstration that implicit and explicit measures of prejudice lack discriminant validity (Schimmack, 2021). Even the authors of the IAT no longer support the claim that the race IAT measures some implicit, hidden attitudes (Greenwald & Banaji, 2017). Aside from revisiting Jost et al.’s (2004) findings in light of doubts about the race IAT, we also conducted the first attempt at validating the race IAT for Black participants. Apparently, reading the article twice did not help Corinne Moss-Racusin to notice this new empirical contribution, even though it is highlighted in Figure 2. The key finding here is that we were able to identify an in-group preference factor because several explicit and implicit measures showed convergent validity (ig). For example, the evaluative priming task showed some validity with a factor loading of .42 in the Black sample. However, the race IAT failed to show any relationship with the in-group factor (p > .05). It was also unrelated to the out-group factor. Thus, the race IAT lacks convergent validity as a measure of in-group and out-group preferences among African Americans in this sample. Neither the two reviewers nor Corinne Moss-Racusin challenged this finding. They do not even comment on it. Instead, they proclaim that this research is of low quality. I beg to differ. Based on any sensible understanding of the scientific method, it is unscientific to make claims about African Americans’ preferences based on a measure that has not been validated. It is even more unscientific to double down on a false claim when evidence is presented that the measure lacks validity.

Of course, one can question whether PSPB should publish this finding. After all, PSPB prides itself on being the flagship journal of the Society for Personality and Social Psychology (Robinson et al., 2021). Maybe valid measurement of African Americans’ attitudes is not relevant enough to meet the high standards of a 20% acceptance rate. However, Robinson et al. (2021) launched a diversity initiative in response to awareness that psychology has a diversity problem.

Maybe it will take some time before PSPB can find some associate editors to handle manuscripts that address diversity issues and are concerned with the well-being of African Americans. Meanwhile, we are going to find another outlet to publish our critique of Jost and colleagues’ unscientific claim that many African Americans hold negative views of their in-group that they are not aware of and that can only be revealed by their scores on the race IAT.

Editorial Decision Letter from Corinne Moss-Racusin

Re: “The race Implicit Association Test is Biased: Most African Americans Have Positive Attitudes towards their In-Group” (MS # PSPB-21-365)

Dear Dr. Schimmack:

Thank you for submitting your manuscript for consideration to Personality and Social Psychology Bulletin. I would like to apologize for the slight delay in getting this decision out to you. Both of my very young children have been home with me for the past month, due to Covid exposures at their schools. As their primary caregiver, this has created considerable difficulties. I appreciate your understanding as we all work to navigate these difficult and unprecedented times.

I have now obtained evaluations of the paper from two experts who are well-qualified to review work in this area.  Furthermore, I read your paper carefully and independently, both before and after looking at the reviews.

I found the topic of your work to be important and timely—indeed, I read the current paper with great interest. Disentangling in-group and out-group racial biases, among both White and Black participants (within the broader context of exploring System Justification Theory) is a compelling goal. Further, I strongly agree with you that exploring whether Black participants’ in-group attitudes have been systematically misrepresented by the (majority White) scientific community is of critical importance.

Unfortunately, as you will see, both reviewers have significant, well-articulated concerns that prevent them from supporting publication of the manuscript. For example, reviewer 1 stated that “Overall, I found this article to be of low quality. It argues against an argument that researchers haven’t made and landed on conclusions that their data doesn’t support.” Further, reviewer 2 (whose review is appropriately signed) wrote clearly that, “The purpose of this submission, it seems to me, is not to illuminate anything, really, and indeed very little, if anything, is illuminated. The purpose of the paper, it seems, is to create the appearance of something scandalous and awful and perhaps even racist in the research literature when, in fact, the substantive results obtained here are very similar to what has been found before. And if the authors really want to declare that the race-based IAT is a completely useless measure, they have a lot more work to do than re-analyzing previously published data from one relatively small study.”

See Reviewer 2’s comments and my response here

My own reading of your paper places me in strong agreement with the reviewer’s evaluations. I am sorry to report that I will not be able to accept your paper for publication in PSPB.

The reviewers’ comments are, in my several years of experience as an editor, unusually thorough and detailed. Thus, I will not reiterate them here.  Nevertheless, issues of primary concern involved both conceptual and empirical aspects of the manuscript. Although some of these issues might be addressed, to some degree, with some considerable re-thinking and re-writing, many cannot be addressed without more data and theoretical overhaul.

I was struck by the degree to which claims appear to stray quite far from both the published literature and the data at hand. As just one example, the section on “African American’s Resilience in a Culture of Oppression” (pp. 5-6) cites no published work whatsoever. Rather, you note that your skepticism regarding key components of SJT is based on “the lived experience of the second author,” which you then summarize. While individual case studies such as this can certainly be compelling, there are clear questions pertaining to generalizability and scientific merit, and the inability to independently validate or confirm this anecdotal evidence. While you do briefly acknowledge this, you proceed to make broad claims—such as “No one in her family or among her Black friends showed signs that they preferred to be White or like White people more than Black people. In small towns, the lives of Black and White people are more similar than in big cities. Therefore, the White out-group was not all that different from the Black in-group,” again without citing any evidence. I found it problematic to ground these bold claims and critiques largely in anecdote. Further, this raises serious concerns—as reviewer 2 articulates in some detail—that the current work may distort the current state of the science by exaggerating or mischaracterizing the nature of existing claims.

Let me say this clearly: I am strongly in favor of work that attempts to refine existing theoretical perspectives, and/or critique established methods, measures, and paradigms. I am not an IAT “purist” by any stretch, nor has my own recent work consistently included implicit measures. Indeed, as noted above, I read the current work with great interest and openness. Unfortunately, like both reviewers, I cannot support its publication in the current form.

I would sincerely encourage you to consider whether the future of this line of work could involve 1. Additional experiments, 2. Larger and more diverse samples, 3. True and transparent collaboration (whether “adversarial” or not) with colleagues from different ideological/empirical perspectives, and 4. Ensuring that claims align much more closely to what is narrowly warranted by the data at hand. Unfortunately, as it stands, the potential contributions of this work appear to be far overshadowed by its problematic elements.

I understand that you will likely be disappointed by my decision, but I urge you to pay careful attention to the reviewers’ constructive comments, as they may help you revise this manuscript or design further research.  Please understand that my decision was rendered with the recognition that the page limitations of the journal dictate that only a small percentage of submitted manuscripts can be accepted.  PSPB receives more than 700 submissions per year, but only publishes approximately 125 papers each year.  Papers without major flaws are often not accepted by PSPB because the magnitude of the contribution is not sufficient to warrant publication.  With careful revision, I think this paper might be appropriate for a more specialized journal, and I wish you success in finding an appropriate outlet for your work.

I am sorry that I cannot provide a more favorable response to your submission.  However, I do hope that you will again consider PSPB as your research progresses.

Sincerely,
Dr. Corinne Moss-Racusin
Associate Editor
Personality and Social Psychology Bulletin

P.S. I asked Dr. Corinne Moss-Racusin to clarify her comments and her views about the validity of the race IAT as a measure of African Americans’ unconscious preferences. She declined to comment further.

John Jost’s Open Peer Review

After a desk-rejection from JPSP, my co-author and I submitted our ms. to PSPB (see blog https://replicationindex.com/2021/07/28/the-race-implicit-association-test-is-biased/). After several months, we received the expected rejection. But it was not all in vain. We received a signed review by John Jost and, for the sake of open science, I am pleased to share it with everybody. My comments are highlighted in bold.

Warning. The content may be graphic and is not suitable for all audiences.

Back in July 2021, the authors sent me a draft of the present paper. I am glad that they did so, because it gave us an opportunity to exchange our opinions and interpretations and to try to correct any misunderstanding or misinterpretations. Unfortunately, however, I see that in the present submission many of those misinterpretations (including false and misleading statements) remain. Thus, I am forced to conclude, reluctantly, that we are not dealing with misunderstandings here but with strategic misrepresentations that seem willful. To be honest, this saddens me, because I thought we could make progress through mutual dialogue. But I don’t see how it serves the goals of science to engage in hyperbole and dismissiveness and to misrepresent so egregiously the views of professional colleagues.

For all of these reasons, and those enumerated below, I am afraid that I cannot support publication of this paper in PSPB.
John Jost

(1) On p. 3 the authors write: “IAT scores close to zero for African Americans have been interpreted as evidence that “sizable proportions of members of disadvantaged groups – often 40% to 50% or even more exhibit implicit (or indirect) biases against their own group and in favor of more advantaged groups” (Jost, 2019, p. 277). This is not true. We did not “interpret” the mean-level scores in terms of frequency distributions (or vice versa). We looked at both. So these are two separate observations; one observation was not used to explain the other. For African Americans the mean-level scores were close to zero (no preference) and, using a procedure described in the note to Figure 1 for Jost et al. (2004 p. 898), we concluded that 39.3% exhibited a pro-White/anti-Black bias. (The 40-50% figure comes from other intergroup comparisons included in the original article).

It is not important how you arrived at a precise percentage of unconsciously self-hating African Americans. We used this quote to make clear that you treated the race IAT as a perfectly valid measure of unconscious bias to arrive at the conclusion that a large percentage of African Americans (and a much larger percentage than of White Americans) have a preference for the White out-group over the Black in-group. This is the key claim of your article, and this is the claim that we challenge. At issue is the validity of the race IAT, which is required to make valid claims about African Americans, not the statistical procedure used to estimate a percentage.

(2) On p. 4, the authors write: “Jost et al.’s (2004) claims about African Americans follow a long tradition of psychological research on African Americans by mostly White psychologists. Often this research ignores the lived experience of African Americans, which often leads to false claims…” There are two very big problems with this section of the paper, which I have already pointed out to the authors (and they have apparently chosen to ignore them).
(a) The first is that this is an ad hominem critique, directed at me because of a personal characteristic, namely my race. For centuries philosophers have rejected this as a fallacious form of reasoning: whether something is true or false has nothing to do with the personal characteristics of the person making this claim. Furthermore, the senior author (Uli Schimmack) is obviously wielding this critique in bad faith; he, too, is White, so if he took his own objection seriously he would refrain from making any claims about the psychology of African Americans, but he obviously has not refrained from doing so in this submission or in other forums.

It is a general observation that White researchers have speculated about African Americans’ self-esteem and mental states, often without consulting African Americans (see our quote of Adams). And I, Ulrich Schimmack, did collaborate with my African American wife on this paper to avoid this very mistake.

(b) The second problem with this claim, which I have also already pointed out to the authors, is that the very same hypotheses about internalization of inferiority advanced by Jost et al. (2004) in the article in question were, in fact, made by a number of Black scholars, including W.E.B. DuBois, Frantz Fanon, Steven Biko, and Kenneth and Mamie Clark. These influences are discussed in considerable detail in my 2020 book, A Theory of System Justification.

Kenneth and Mamie Clark are the authors of the famous doll studies from the 1940s. Are we supposed to believe that nothing has changed over the past 80 years and that we can use a study with children in the 1940s to make claims about adult African Americans’ attitudes in 2014? What kind of social psychologist would ignore the influence of situations and culture on attitudes?

(3) On the next page the authors write: “Just like White theorists’ claims about self-esteem, Jost et al.’s claims about African Americans’ unconscious are removed from African Americans’ own understanding of their culture and identity and disconnected from other findings that are in conflict with the theory’s predictions. The only empirical support for the theory is the neutral score of African Americans on the race IAT.” Now, this claim is absurd. The book cited above describes hundreds of studies providing empirical support for the theory that have nothing to do with the IAT.

Over the past 10 years, we have seen this gaslighting again and again. When one study is criticized, it is defended by pointing to the vast literature of other studies that also support the claim. There may be other evidence, but it is not clear how this other evidence could reveal something about the unconscious. The whole appeal of the IAT was that it shows something that explicit measures cannot show. In fact, explicit ratings often show stronger in-group favoritism among African Americans. To dismiss this finding, Jost has to appeal to the unconscious, which supposedly reveals a hidden preference for the White out-group.

(4) They go on: “We are skeptical about the claim that most African-Americans secretly favor the outgroup based on the lived experience of the second author” (p. 5). But this was not our claim. As noted above, we found that 39.3% of African Americans (not “most”) exhibited a pro-White/anti-Black bias on the IAT. But, of course, the theory is about relations among variables, not about the specific percentage of Black people who do X, Y, or Z (which is, of course, affected by historical factors, among many other things).

Back to the game with percentages. We do not care whether you wrote 40% or 50%. We care about the fact that you make claims about African Americans’ unconscious based on an invalid measure.

(5) On p. 6 the authors write: “the mean score of African Americans on the race IAT may be shifted towards a pro-White bias because negative cultural stereotypes persist in US American culture. The same influence of cultural stereotypes would also enhance the pro-White bias for White Americans. Thus, an alternative explanation for the greater in-group bias for White Americans than for African Americans on the race IAT is that attitudes and cultural stereotypes act together for White Americans, whereas they act in opposite directions for African Americans” (p. 6).

As noted above, in July 2021 I wrote to the authors in an attempt to clarify that, from the perspective of SJT, the effects of “cultural stereotypes” in no way support “an alternative explanation” for out-group favoritism, because stereotypes (since the very first article by Jost & Banaji, 1994) have been considered to be system-justifying devices. Here is what I wrote to them: You describe the influence of “cultural stereotypes” as some kind of an alternative to system justification processes, but they are not. The theory started as a way of understanding the origins and consequences of cultural stereotypes. None of this contradicts SJT at all: “The nature of the task may activate cultural stereotypes that are normally not activated when African Americans interact with each other. As a result, the mean score of African Americans on the race IAT may be shifted towards a pro-White bias because negative cultural stereotypes persist in US American culture. The same influence of cultural stereotypes would also enhance the pro-White bias for White Americans.” Yes, this is perfectly consistent with SJT. In fact, it is part of our point. And the purpose of SJT is not to explain what happens “when African Americans interact with each other,” although it may shed some light on intragroup dynamics. I think of the scene in Spike Lee’s (a Black film director, as you well know) movie, School Daze, when the light-skinned and dark-skinned African Americans are fighting/dancing with each other. There is plenty of system justification going on there, it seems to me.

We may (or may not disagree) in our interpretation of the social dynamics in School Daze, but I feel that the authors are now willfully misrepresenting system justification theory on the issue of “cultural stereotypes,” even after I explicitly sought to clarify their misrepresentation months ago: The activation of cultural stereotypes IS part of what we are trying to understand in terms of SJT.

Jost ignores that many other social psychologists have raised concerns about the validity of the race IAT because it may conflate knowledge of negative stereotypes with endorsement of these stereotypes and attitudes (Olson & Fazio, DOI: 10.1037/0022-3514.86.5.653). For anybody who cares, please ask yourself why Jost does not address the key point of our criticism, namely the use of race IAT scores to make inferences about African Americans’ unconscious without evidence that it can measure conscious or unconscious preferences of African Americans.

(6) It has been a while since I read the Bar-Anan and Nosek (2014) article, but my memory for it is incompatible with the claim that those authors were foolish enough to simply assume that the most valid implicit measures was the one that produced the biggest difference between Whites and Blacks in terms of in-group bias, as the present authors claim (pp. 7-8). As I recall, Bar-Anan and Nosek made a series of serious and comprehensive comparisons between the IAT and other tasks and concluded on the basis of those comparisons, not the one graphed in Figure 1 here, that the validity of the IAT was superior. I feel that, in addition to seriously representing my own work, they are also seriously misrepresenting the work of Bar-Anan and Nosek. Those authors should also have the opportunity to review and/or respond to the present claims being made about the (in) validity of the IAT.

Would you kill Dumbledore if he asked you to?

So, the reviewer relies on his foggy memory to question our claim instead of retrieving a pdf file and checking for himself. New York University should be proud of this display of scholarship. I hope Jost made sure to get his Publons credit. Here is the relevant section from Bar-Anan and Nosek (2014 p. 675; https://link.springer.com/article/10.3758/s13428-013-0410-6).

(7) One methodological improvement of this paper over the previous draft that I saw is that this version now includes other implicit measures, including the single category IAT. However, the hypothesis stated on p. 9, allegedly on behalf of SJT, is incorrect: “System justification theory predicts a score close to zero that would reflect an overall neutral attitude and at least 50% of participants who may hold negative views of the in-group.” This is wrong on several counts and indicates a real lack of familiarity with SJT, which predicts that (to varying degrees) people are motivated to hold favorable attitudes toward themselves (ego justification), their in-group (group justification), and toward the overarching social system (system justification). This last motive—in a departure from the first two—implies that, based on the strength of system justification tendencies, advantaged group members’ attitudes toward the in-group will become more favorable and disadvantaged group members’ attitudes toward the in-group will become less favorable. As noted above, SJT is not about making predictions about absolute scores or frequency counts—these are all subject to historical and many other contextual factors. It would be foolish to predict that African Americans have a neutral (near zero) attitude toward their own group or that 50% have a negative attitude. This is not what the theory says at all. Unless you have separate individual-level estimates of ego, group, and system justification scores, the most one could hypothesize is that, on the single category IAT, African Americans would have a more favorable evaluation of the out-group than European Americans would, and European Americans would have a more favorable evaluation of the in-group than African Americans would. Note that I am writing this before looking at the results.

We are interested in African Americans’ and White Americans’ attitudes towards their in-groups and out-groups. If System Justification Theory (SJT) makes no clear predictions about these attitudes, we do not care about SJT. However, we do care about an article that has been cited over 1,000 times, that claims that many African Americans have unconscious negative attitudes towards their in-group, and that supports this claim by computing the percentage of African Americans who scored above zero on a White-Black IAT (i.e., slower responses when African American is paired with good than when African American is paired with bad). We show that the race IAT lacks convergent validity with other implicit measures and that other implicit measures show different results. Thus, Jost has to justify why we should focus on the race IAT results and ignore the results from the other implicit tasks. So far, he has avoided talking about our actual empirical results.

(8) On pp. 10-11 the authors concede: “The model was developed iteratively using the data. Thus, all results are exploratory and require validation in a separate sample. Due to the small number of Black participants, it was not possible to cross-validate the model with half of the sample. Moreover, tests of group differences have low power and a study with a larger sample of African Americans is needed to test equivalence of parameters… models with low coverage (many missing data) may overestimate model fit. A follow-up study that administers all tasks to all participants should be conducted to provide a stronger test of the model.” These seem like serious limitations that, in the absence of replication with much larger samples, undermine the very strong conclusions the authors wish to draw.

So Jost can make strong claims (40% of African Americans have unconscious negative attitudes towards their group) based on an unvalidated measure, but when we actually show that the measure lacks validity, we need to replicate our findings first? This is not how science works. Rather, Jost needs to explain why other implicit measures, including the single category IAT, do not show the same pattern as the race IAT that was used in the 2004 article.

(9) There is a peculiar paragraph on p. 13 in the “Results” sections, even though it goes well beyond the reporting of results: “Most important is the finding that race IAT scores for African Americans were unrelated to the attitudes towards the in-group and out-group factors. Thus, scores on the race IAT do not appear to be valid measures of African Americans’ attitudes. This finding has important implications for Jost et al.’s (2004) reliance on race IAT scores to make inferences about African Americans’ unconscious attitudes towards their in-group. This interpretation assumed that race IAT scores do provide valid information about African American’s attitudes towards the ingroup, but no evidence for this assumption was provided. The present results show 20 years later that this fundamental assumption is wrong. The race-IAT does not provide information about African Americans’ attitudes towards the in-group as reflected in other implicit measures.”

First of all, I don’t know if one can conclude, even in principle, that the race IAT is invalid for African Americans on the basis of a single study carried out with approximately 200 African American participants. There have been dozens, if not more, studies conducted (see Essien et al., 2020, JPSP), so it seems that any attempt to claim invalidity across the board should be based on a far more comprehensive analysis of larger data sets. Second, if I understand the specific methodological claim here it is that African Americans’ race IAT scores are not correlated with whatever the common factor is that is shared by the other implicit attitude measures (AMP, evaluative priming, and SC-IAT) and one explicit attitude measure (feeling thermometer). At most, it seems to me that one could conclude, on the basis of this, that the race IAT is measuring something different than the other things. This is not all that surprising; indeed, the IAT was supposed to measure something different from feeling thermometers. It seems like a stretch to conclude that the IAT is invalid and the other measures are valid simply because they appear to be measuring somewhat or even completely different things.

Third, the hyperbolic and misleading language implies that something about the IAT is a “fundamental assumption” of SJT, but this is false. The IAT was simply considered to be the best implicit measure at that time (20 years ago), so that is what we used. But it is silly to assume that hypotheses, especially “fundamental” ones, should be forever tied to specific operationalizations. Fourth, the attacking, debunking nature of this paragraph—against the IAT as a methodological instrument and against SJT as a theoretical framework—makes it clear that the authors are not really very interested in the dynamics of ingroup and outgroup favoritism among members of advantaged and disadvantaged groups (measured in different ways). It’s as if the real issue doesn’t even come up here.

Finally, we get to the substantive issue. First, let’s get the gaslighting out of the way. There have not been dozens of studies trying to validate the race IAT for African Americans. There have been zero. This is not surprising because there have also been no serious attempts to validate the race IAT for White respondents or IATs in general (Schimmack, 2021; https://journals.sagepub.com/doi/abs/10.1177/1745691619863798). The key problem is that social psychologists are poorly trained in psychometrics (i.e. the science of psychological measurement and construct validation; Schimmack, 2021, https://open.lnu.se/index.php/metapsychology/article/view/1645).

Now on to the substantive issue. We are the first to show that among African Americans, several implicit measures (e.g., evaluative priming, AMP, single category IAT) show some (modest) convergent validity with each other. Not surprisingly, they also show convergent validity with explicit measures because all measures mostly reflect a common attitude (rather than one conscious and one unconscious attitude) (Schimmack, 2021; https://journals.sagepub.com/doi/abs/10.1177/1745691619863798). All of these measures show as much (or more) positivity in in-group attitudes for African Americans as for White Americans. This is an interesting finding because positive attitudes on explicit measures were dismissed by Jost. But now several implicit measures show the same result. Thus, it is not a simple rating bias. Now the race IAT and its variants are the odd ones with a different pattern. Why? That remains to be examined, but to make claims about African Americans’ attitudes we would need to know the answer to this question. Maybe it is just a method artifact? Just raising this possibility is a noteworthy contribution to science.

(10) Eventually, a few pages later, the authors get around to telling us what they really found with respect to the actual research question: “Also expected was the finding that out-group attitudes of African Americans, d = .42, 95%CI , are more favorable than out-group attitudes of White Americans, d = .20, 95%CI.” So, um, African Americans exhibited more favorable attitudes toward Whites than Whites exhibited toward African Americans. This is precisely what system justification theory would have predicted, as I noted above (before looking at the results). It is, perhaps, an interesting discovery — if it is replicated with larger samples — that out-group attitudes are unrelated to in-group attitudes for both groups and that in-group attitudes were equally positive for both groups. But, with respect to the key question of out-group favoritism, the authors actually obtained support for SJT but refuse to even acknowledge it. Is this really what science is about? On the contrary, they draw this outrageous conclusion: “Thus, support for the system justification theory rests on a measurement artifact.” In point of fact, when the authors return to the comparative ingroup vs. outgroup measure they arrive at a conclusion that is virtually the same as Jost et al. (2004): “White Americans’ scores on the race IAT are systematically biased towards a pro-White score, d = .78, whereas African Americans’ scores are only slightly biased towards a pro-Black score, d = -.19.” Yes, advantaged groups tend to show reasonably strong in-group favoritism, whereas disadvantaged groups tend to show weak in-group favoritism, with substantial proportions showing out-group favoritism. This is precisely what we found 20 years ago. The authors and I already had this exchange back in July, but their paper contains the same misleading statements as before. Here is our exchange:

You write: “Proponents of system justification theory might argue that attitudes towards the in-group have to be evaluated in relative terms. Viewed from this perspective, the results still show relatively more in-group favoritism for White Americans, d = .62 – .20 = .42, than African Americans, d = .54 – .40 = .14. However, out-group attitudes contribute more to this difference, d = .40 – .20 = .20, than in-group differences, d = .62 – .54 = .08. Thus, one reason for the difference in relative preferences is that African Americans attitudes towards Whites are more positive than White Americans’ attitudes towards African Americans.”
My response: Yes, this is key. We are talking about the ways in which people respond to relative status, power, and wealth, etc. rankings within a given social system (or society). The fact that “African Americans attitudes towards Whites are more positive than White Americans’ attitudes towards African Americans” is supportive of SJT.

Oh boy, sorry if you had to read all of this. Does it make sense to make a distinction between in-group attitudes and out-group attitudes? I hope we can agree that it does. Would we be surprised if Black girls liked White dolls more than White girls liked Black dolls? Not really, and it would not tell us anything about internalized stereotypes. The important and classic doll study did not care about the comparison of out-group attitudes. The issue was whether Black children preferred White dolls over Black dolls, and Jost et al. (2004) claimed that many African Americans internalized negative stereotypes of their group and positive stereotypes of Whites, so that they have a relatively greater preference for White over Black. The problem is that the race IAT confounds in-group and out-group attitudes and that measures which avoid this confound, like the single category IAT, do not show the same result.

(11) Another huge problem with this whole research program is that it ignores completely the strongest piece of evidence for SJT in this context, namely that the degree of out-group favoritism among disadvantaged groups is positively associated with support for the status quo, measured in terms of political conservatism and individual difference measures of system-justifying beliefs (e.g., see Ashburn-Nardo et al., 2003; Essien et al., 2020; Jost et al., 2004). If Blacks’ responses on the IAT were random or meaningless, I see no reason why they would be consistently correlated with other measures of system justification. But the voluminous literature shows that they are (Essien et al., 2020). Although I have pointed this out to the authors before, they have simply ignored the issue once again, even though this is a key piece of evidence that supports the SJT interpretation of implicit attitudes about advantaged and disadvantaged groups.

Back to gaslighting. Let’s say there are some studies that show this pattern. How does Jost explain the pattern of results in the present study? He doesn’t. That is the point.

(12) All of the above problems are repeated in the General Discussion, so there is no need to address them again point by point. But I will say that other key issues that the authors and I discussed in July are also ignored in the present submission: I wrote: This statement is interesting but far too categorical, in my opinion: “It would be a mistake to interpret this difference in evaluations of the out-group as evidence that African Americans have internalized negative stereotypes about their in-group.” First, it is not an either/or situation, as if people either love their group or hate it. This is not how people are. There are multiple, conflicting motives involving ego, group, and system justification, and ambivalence is part of what interests us as system justification theorists. Second, there is plenty of other evidence suggesting that—again, to some degree—African Americans and other groups “internalize” negative stereotypes. Are you really suggesting that there are NO psychological consequences for African Americans living in a society in which they are systematically devalued? I’m still waiting for an answer to that last question. The purpose of this submission, it seems to me, is not to illuminate anything, really, and indeed very little, if anything, is illuminated. The purpose of the paper, it seems, is to create the appearance of something scandalous and awful and perhaps even racist in the research literature when, in fact, the substantive results obtained here are very similar to what has been found before. And if the authors really want to declare that the race-based IAT is a completely useless measure, they have a lot more work to do than re-analyzing previously published data from one relatively small study.

With the confidence of a peer-reviewer in the role of an expert, Jost feels confident enough to lie when he writes “In fact, the substantive results obtained here are very similar to what has been found before.” Really? Nobody has examined convergent validity of various implicit measures among African Americans before. Bar-Anan and Nosek collected the data, but they didn’t analyze them. Instead, they simply concluded that the race IAT is the best measure because it shows the strongest differences between groups. Here we show that implicit measures that can be scored to distinguish in-group and out-group attitudes do not show that African Americans hold negative views of their in-group. Does it matter? Yes it does. Where do African Americans want to live? Who do they want to marry? Would they want other African Americans as colleagues? The answers to these questions depend on their in-group attitudes. So, if Jost cared about African Americans rather than about his theory that made him famous, he might be a bit more interested in our results. However, Jost just displays the same level of curiosity about disconfirming and distressing evidence as many of his colleagues; that is, none. Instead, he fights like a cornered animal to defend his system of ideas against criticism. You might even call this behavior system justification.

Psychology Intelligence Agency

I always wanted to be James Bond, but being 55 now it is clear that I will never get a license to kill or work for a government intelligence agency. However, the world has changed and there are other ways to spy on dirty secrets of evil villains.

I have started to focus on the world of psychological science, which I know fairly well because I was a psychological scientist for many years. During my time as a psychologist, I learned about many of the dirty tricks that psychologists use to publish articles to further their careers without advancing understanding of human behavior, thoughts, and feelings.

However, so far the general public, government agencies, and funding agencies that hand out taxpayers’ money to psychological scientists have not bothered to monitor the practices of psychological scientists. They still believe that psychological scientists can control themselves (e.g., peer review). As a result, bad practices persist because the incentives favor behaviors that lead to the publication of many articles, even if these articles make no real contribution to science. I therefore decided to create my own Psychology Intelligence Agency (PIA). Of course, I cannot give myself a license to kill, and I have no legal authority to enforce laws that do not exist. However, I can gather intelligence (information) and share this information with the general public. This is less James Bond and more like the CIA, which also shares some of its intelligence with the public (the CIA World Factbook), or the website Retraction Watch, which keeps track of article retractions.

Some of the projects that I have started are:

Replicability Rankings of Psychology Journals
Keeping track of the power (expected discovery rate, expected replication rate) and the false discovery risk of test results published in over 100 psychology journals from 2010 to 2020.

Personalized Criteria of Statistical Significance
It is problematic to use the standard criterion of significance (alpha = .05) when this criterion leads to few discoveries because researchers test many false hypotheses or test true hypotheses with low power. When discovery rates are low, alpha should be set to a lower value (e.g., .01, .005, .001). Here I use estimates of authors’ discovery rates to recommend an appropriate alpha level for interpreting their results.

Quantitative Book Reviews
Popular psychology books written by psychological scientists (e.g., Nobel Laureate Daniel Kahneman) reach a wide audience and are assumed to be based on solid scientific evidence. Using statistical examinations of the sources cited in these books, I provide information about the robustness of the scientific evidence to the general public. (see also “Before you know it“)

Citation Watch
Science is supposed to be self-correcting. However, psychological scientists often cite outdated references that fit their theory without citing newer evidence that their claims may be false (a practice known as cherry-picking citations). Citation Watch reveals these bad practices by linking articles with misleading citations to articles that question the claims supported by the cherry-picked citations.

Whether all of this intelligence gathering will have a positive effect depends on how many people actually care about the scientific integrity of psychological science and the credibility of empirical claims. Fortunately, some psychologists are willing to learn from past mistakes and are improving their research practices (Bill von Hippel).


What would Cohen say to 184 Significance Tests in 1 Article

I was fortunate enough to read Jacob Cohen’s articles early on in my career to avoid many of the issues that plague psychological science. One of his important lessons was that it is better to test a few (or, better, one) hypotheses in one large sample (Cohen, 1990) than to conduct many tests in small samples.

The reason is simple. Even if a theory makes a correct prediction, sampling error may produce a non-significant result, especially in small samples where sampling error is large. This type of error is known as a type-II error, beta, or a false negative. The probability of obtaining the desired and correct outcome of a significant result when a hypothesis is true is called power. The problem with testing multiple hypotheses is that the cumulative or total power of finding evidence for all correct hypotheses decreases with the number of tests. Even if a single test has 80% power (i.e., the probability of a significant result for a correct hypothesis is 80 percent), the probability of providing evidence for all 10 correct hypotheses is only .8^10 = .11, or 11%. The expected value is that 2 of the 10 tests produce a type-II error (Schimmack, 2012).
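
A two-line calculation in R makes the point, using the 80% power value from the example above:

```r
power <- 0.80
k     <- 10
power^k          # probability that all 10 true hypotheses reach significance (~ .11)
k * (1 - power)  # expected number of type-II errors among the 10 tests (2)
```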

Cohen (1962) also noted that the average power of statistical tests is well below 80%. For a medium/average effect size, power was around 50%. Now imagine that a researcher tests 10 true hypotheses with 50% power. The expected value is that 5 tests produce a significant result (p < .05) and 5 tests produce a type-II error (p > .05). The interpretation of the article will focus on the significant results, but they were essentially selected by a coin flip. A new set of studies would produce a different set of 5 significant results.
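
A small simulation (hypothetical data, assuming 50% power for all ten true effects, as in the example above) shows how arbitrary the resulting set of significant results is:

```r
set.seed(123)
power <- 0.50
k     <- 10
# Which of the 10 true hypotheses reach significance in two independent "articles"
article1 <- rbinom(k, size = 1, prob = power)
article2 <- rbinom(k, size = 1, prob = power)
rbind(article1, article2)  # the two sets of "discoveries" typically differ
mean(article1)             # in the long run, the success rate estimates average power
```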

To avoid type-II errors, researchers could conduct a priori power analyses to ensure that they have enough power. However, this is rarely done, with the explanation that a priori power analysis requires knowledge of the population effect size, which is unknown. However, it is possible to estimate the typical power of studies by keeping track of the percentage of significant results. Because power determines the rate of significant results, the rate of significant results is an estimate of average power. The main problem with this simple method of estimating power is that researchers often do not report all of their results. Especially before the replication crisis became apparent, psychologists tended to publish only significant results. As a result, it is largely unknown how much power actual studies in psychology have and whether power has increased since Cohen (1962) estimated it to be around 50%.

Here I illustrate a simple way to estimate the actual power of studies with a recent multi-study article that reported a total of 184 significance tests (more were reported in a supplement, but these were not coded)! Evidently, Cohen’s important insights remain neglected, especially in journals that pride themselves on rigorous examination of hypotheses (Kardas, Kumar, & Epley, 2021).

Figure 2 shows the first rows of the coding spreadsheet (Spreadsheet).

Each row shows one specific statistical test. The column “H0 rejected” reflects how the authors interpreted a result. Broadly, this decision is based on the p < .05 rule, but sometimes authors are willing to treat values just above .05 as sufficient evidence, which is often called marginal significance. The column “p < .05” strictly follows the p < .05 rule. The averages in the top row show that there are 77% significant results using the authors’ rules and 71% using the strict p < .05 rule. This shows that 6% of the p-values were interpreted as marginally significant.

All test statistics or point estimates with confidence intervals are converted into exact two-sided p-values. The two-sided p-values are then converted into z-scores using the inverse normal formula, z = -qnorm(p/2). Observed power is then estimated for the standard criterion of significance, alpha = .05, which corresponds to a z-score of 1.96. The formula for observed power is pnorm(z, 1.96). The top row shows that mean observed power is 69%. This is close to the 71% significance rate with the strict p < .05 rule, but somewhat lower than the 77% rate when marginally significant results are included. This simple comparison shows that counting marginally significant results inflates the percentage of significant results.
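
In R, the conversion described here amounts to the following minimal sketch; the p-values below are placeholders rather than the article’s coded values:

```r
p <- c(.001, .02, .049, .08)        # placeholder two-sided p-values
z <- -qnorm(p / 2)                  # equivalently qnorm(1 - p / 2)
obs_power <- pnorm(z, mean = 1.96)  # chance of p < .05 if the observed z equaled the true (noncentral) mean
round(cbind(p, z, obs_power), 2)
```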

The inflation column keeps track of the consistency between the outcome of a significance test and the power estimate. When power is practically 1, a significant result is expected and inflation is zero. However, when power is only 60%, there is a 40% chance of a type-II error, and authors were lucky if they got a significant result. This can happen in a single test, but not in the long run. Average inflation is a measure of how lucky authors were if they got more significant results than the power of their studies allows. Using the authors’ 77% success rate and the estimated power of 69%, we have an inflation of 8%. This is a small bias, and we already saw that the interpretation of marginal results accounts for most of it.

The last column is the Replication Index (R-Index). It simply subtracts the inflation from the observed power estimate. The reason is that observed power is an inflated estimate of power when there are too many significant results. The R-Index is called an index because the formula is only an approximate correction for selection for significance; later I show the results of a better method. However, the index can clearly distinguish between junk science (R-Index below 50) and credible evidence. Based on the present results, the R-Index of 62 shows that the article reported some credible findings. Moreover, the R-Index here underestimates power because the rate of p-values below .05 is consistent with observed power; the inflation is just due to the interpretation of marginal results as significant. In short, the main conclusion from this simple analysis of the test statistics in a single article is that the authors conducted studies with an average power of about 70%. This is expected to produce type-II errors, sometimes with p-values close to .05 and sometimes with p-values well above .1. This could mean that nearly a quarter of the published results are type-II errors.
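
Using the rounded percentages reported above, the index calculation is simply:

```r
success_rate   <- .77  # proportion of results interpreted as significant (authors' rule)
mean_obs_power <- .69  # mean observed power across all 184 coded tests
inflation      <- success_rate - mean_obs_power  # .08
r_index        <- mean_obs_power - inflation     # .61 with these rounded inputs; the unrounded spreadsheet values give 62
c(inflation = inflation, r_index = r_index)
```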

But what about type-I errors?

Cohen was concerned about the problem that many underpowered studies fail to provide evidence for true hypotheses. However, the replication crisis shifted the focus from false negative results to false positive results. An influential article by Simmons et al. (2011) suggested that many, if not most, published results might be false positives. The same team also developed a statistical tool, called p-curve, that examines whether a set of significant results is entirely based on false positives. The next figure shows the output of the p-curve app for the 130 significant results (only significant results are considered because p-values greater than .05 cannot be false positives).

The graph shows that there are a lot more p-values below .01 (78%) than p-values between .04 and .05 (2%). This distribution of p-values is inconsistent with the hypothesis that all significant results are false positives. In addition, the program estimates that the average power of the 130 studies with significant results is 99%! Because false positives have only 5% power (alpha), a power estimate of 99% leaves essentially no room for false positives. It is noteworthy that the p-curve analysis did not spot the inflation of significant results produced by interpreting marginally significant results, because these results are omitted from the p-curve analysis. It is rather unlikely that the average power of these studies is 99%. In fact, simulation studies have shown that p-curve’s power estimates are often inflated when studies are heterogeneous (Brunner, 2018; Brunner & Schimmack, 2020). The p-curve authors are aware of this bug, but have done nothing to fix it (Datacolada, 2018).

A better statistical method for analyzing p-values is z-curve, which relies on the z-scores that were obtained from the p-values in the spreadsheet (the z-curve package for R can also read p-values directly). The next figure shows a histogram of all 184 (significant and non-significant) z-values up to a value of 6. Values above 6 are not shown and are treated as coming from studies with practically perfect power.
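
For readers who want to run such an analysis themselves, here is a minimal sketch using the zcurve package for R (assuming its zcurve() function accepts a vector of z-scores, as described in its documentation); the z-scores below are simulated placeholders, not the article’s coded values:

```r
# install.packages("zcurve")  # CRAN package implementing the z-curve method
library(zcurve)

set.seed(1)
z <- abs(rnorm(184, mean = 2, sd = 1.5))  # placeholder z-scores standing in for the coded tests

fit <- zcurve(z = z)  # assumed interface: a vector of (absolute) z-scores
summary(fit)          # reports the expected replication and discovery rates with 95% CIs
plot(fit)             # histogram of z-scores with the fitted z-curve
```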

The expected replication rate is the z-curve estimate that corresponds to the power estimate in p-curve, because both refer to the average power of the studies with significant results. It is notably lower than 99%, and the 95%CI excludes a value of 99%. This finding simply shows once again that p-curve estimates are inflated.

The observed discovery rate is simply the percentage of significant results that was computed on the spreadsheet using the strict p < .05 rule. The expected discovery rate is an estimate of the average power of all studies, including those with non-significant results, that is corrected for any potential inflation. It is 62%, which matches the R-Index in the spreadsheet.

The comparison of the observed discovery rate of 71% and the expected discovery rate of 62% suggests that there is some overreporting of significant results. However, the 95%CI around the EDR estimate ranges from 27% to 88%. Thus, sampling error alone may explain this discrepancy.

An EDR of 62% implies that only a small number of significant results can be false positives. The point estimate is just 2%, but the 95%CI allows for up to 14% false positives. Thus, the reported results are unlikely to be false positives, but effect sizes could be inflated because selection for significance with modest power inflates effect size estimates.

There is also notable evidence of heterogeneity. The distribution of z-scores is much flatter than the normal distribution with a standard deviation of 1 that would be expected if all studies had the same power. This means that some results might be more credible than others. Therefore, I conducted some moderator analyses.

One key hypothesis in the article was that shallow and deep conversations differ in important ways. Several studies tested this by comparing shallow and deep conversations. Fifty-four analyses included a contrast between shallow and deep conversations as a main effect or in an interaction. The expected replication rate is unchanged. The expected discovery rate is a bit higher, but surprisingly, the observed discovery rate is lower. Visual inspection of the z-curve plot shows an unusually high number of marginally significant results. This is further evidence to distrust marginally significant results. However, overall these results suggest that shallow and deep conversations differ.

Several analyses tested mediation, which can require large samples to have adequate power. Not surprisingly, the 39 mediation tests have only a replication rate of 53%. There is also some suggestion of bias, with an observed discovery rate of 51% and an expected discovery rate of only 25%, but the 95%CI around the point estimate is wide and includes 51%. The low expected discovery rate implies that the false discovery risk is 16%, which is unacceptably high.

One solution to the high false discovery risk is to lower the criterion for significance. The next conventional level is alpha = .01. The next figure shows the results for this criterion (the solid red line has moved to z = 2.58).

Now the observed discovery rate is in line with the expected discovery rate (28% vs. 27%) and the false discovery risk has been lowered to 3%. However, the expected replication rate (for alpha = .01) is only 36%. Thus, follow-up studies need to increase sample sizes to replicate these mediation effects.

Conclusion

A post-hoc power-analysis of this recent article shows that psychologists still have not learned Cohen’s lesson that he shared in 1990 (more than 30 years ago). Conducting many significance tests with modest statistical power produces a confusing pattern of significant and non-significant results that is strongly influenced by sampling error. Rather than reporting results of individual studies, the authors should have reported meta-analytic results for tests of the same hypothesis. However, to end on a positive note, the studies are not p-hacked and the risk of false positives is low. Thus, the results provide some credible findings that can be used to conduct confirmatory tests of the hypothesis that deeper conversations are more awkward, but also more rewarding. I hope these analyses show that a deep dive into the statistical results reported in an article can also be rewarding.

Citation Watch

Good science requires not only open and objective reporting of new data; it also requires unbiased review of the literature. However, there are no rules and regulations regarding citations, and many authors cherry-pick citations that are consistent with their claims. Even when studies have failed to replicate, original studies are cited without citing the replication failures. In some cases, authors even cite original articles that have been retracted. Fortunately, it is easy to spot these acts of unscientific behavior. Here I am starting a project to list examples of bad scientific behaviors. Hopefully, more scientists will take the time to hold their colleagues accountable for ethical behavior in citations. They can even do so by posting anonymously on the PubPeer comment site.

Table of Incorrect Citations
Authors: Scott W. Phillips; Dae-Young Kim
Year: 2021
Citation: Johnson et al. (2019) found no evidence for disparity in the shooting deaths of Black or Hispanic people. Rather, their data indicated an anti-White disparity in OIS deaths.
DOI: https://journals.sagepub.com/doi/10.1177/0093854821997529
Correction: Retraction (https://www.pnas.org/content/117/30/18130)
Authors: Richard Stansfield, Ethan Aaronson, Adam Okulicz-Kozaryn
Year: 2021
Citation: While recent studies increasingly control for officer and incident characteristics (e.g., Fridell & Lim, 2016; Johnson et al., 2019; Ridgeway et al., 2020)
DOI: https://doi.org/10.1016/j.jcrimjus.2021.101828
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
Authors: P. A. Hancock; John D. Lee; John W. Senders
Citation: Misattributions involved in such processes of assessment can, as we have seen, lead to adverse consequences (e.g., Johnson et al., 2019).
DOI: 10.1177/00187208211036323
Correction: Retraction (https://www.pnas.org/content/117/30/18130)
Authors: Desmond Ang
Citation: While empirical evidence of racial bias is mixed (Nix et al. 2017; Fryer 2019; Johnson et al. 2019; Knox, Lowe, and Mummolo 2020; Knox and Mummolo 2020)
DOI: doi:10.1093/qje/qjaa027
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
Authors: Lara Vomfell; Neil Stewart
Year: 2021
Citation: Some studies have argued that the general population in an area is not the appropriate comparison: instead one should compare rates of use of force to how often Black and White people come into contact with police [59–61]
DOI: https://www.nature.com/articles/s41562-020-01029-w
Correction: [60] Johnson et al. Retracted (https://www.pnas.org/content/117/30/18130)
Authors: Jordan R. Riddell; John L. Worrall
Year: 2021
Citation: Recent years have also seen improvements in benchmarking-related research, that is, in formulating methods to more accurately analyze whether bias (implicit or explicit) or racial disparities exist in both UoF and OIS. Recent examples include Cesario, Johnson, and Terrill (2019), Johnson, Tress, Burkel, Taylor, and Cesario (2019), Shjarback and Nix (2020), and Tregle, Nix, and Alpert (2019).
DOI: https://doi.org/10.1016/j.jcrimjus.2020.101775
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
Authors: Dean Knox, Will Lowe, Jonathan Mummolo
Year: 2021
Citation: A related study, Johnson et al. (2019), attempts to estimate racial bias in police shootings. Examining only positive cases in which fatal shootings occurred, they find that the majority of shooting victims are white and conclude from this that no antiminority bias exists
DOI: https://doi.org/10.1017/S0003055420000039
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
Authors: Ming-Hui Li, Pei-Wei Li ,Li-Lin Rao
Year: 2021
Citation: The IAT has been utilized in diverse areas and has proven to have good construct validity and reliability (Gawronski et al., 2020).
DOI: https://doi.org/10.1016/j.paid.2021.111107
Correction: does not cite critique of the construct validity of IATs (https://doi.org/10.1177/1745691619863798)

Authors: Chew Wei Ong, Kenichi Ito
Year: 2021
Citation: This penalty treatment of error trials has been shown to improve the correlations between the IAT and explicit measures, indicating a greater construct validity of the IAT.
DOI: 10.1111/bjso.12503
Correction: higher correlations do not imply higher construct validity of IATs as measures of implicit attitudes (https://doi.org/10.1177/1745691619863798)

A one hour introduction to bias detection

This introduction to bias detection builds on my introductions to test statistics and statistical power (Intro to Statistics, Intro to Power).

It is well known that many psychology articles report too many significant results because researchers selectively publish results that support their predictions (Francis, 2014; Sterling, 1959; Sterling et al., 1995; Schimmack, 2021). This often leads to replication failures (Open Science Collaboration, 2015).

One way to examine whether a set of studies reported too many significant results is to compare the success rate (i.e., the percentage of significant results) with the mean observed power of the studies (Schimmack, 2012). In this video, I illustrate this bias detection method using Vohs et al.’s (2006) Science article “The Psychological Consequences of Money.”

I use this article for training purposes because it reports 9 studies, and a reasonably large number of studies is needed to have good power to detect selection bias. Also, the article is short and the results are straightforward. Thus, students have no problem filling out the coding sheet that is needed to compute observed power (Coding Sheet).

The results show clear evidence of selection bias that undermines the credibility of the reported results (see also TIVA). Although bias tests are available, few researchers use them to protect themselves from junk science, and articles like this one continue to be cited at high rates (683 total, 67 in 2019). A simple way to protect yourself from junk science is to adjust the alpha level to .005 because many questionable practices produce p-values that are just below .05. For example, the lowest p-value in these 9 studies was p = .006. Thus, not a single study was statistically significant with alpha = .005.
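For readers who want to try the logic themselves, here is a minimal sketch in Python (my own illustration, not the coding sheet used in the video; the p-values are made up) that compares the success rate to the mean observed power for a set of two-sided p-values:

from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = .05

# hypothetical two-sided p-values for a set of focal tests (not the actual Vohs et al. values)
p_values = [0.006, 0.012, 0.021, 0.029, 0.034, 0.038, 0.041, 0.044, 0.048]

# observed power: probability of a significant result given the observed z-score
obs_z = [norm.ppf(1 - p / 2) for p in p_values]
obs_power = [1 - norm.cdf(z_crit - z) + norm.cdf(-z_crit - z) for z in obs_z]

success_rate = sum(p < alpha for p in p_values) / len(p_values)
mean_obs_power = sum(obs_power) / len(obs_power)

# a success rate that clearly exceeds mean observed power indicates selection bias
print(f"success rate: {success_rate:.2f}, mean observed power: {mean_obs_power:.2f}")

With a set of just-significant p-values like these, the success rate (1.00) is much higher than the mean observed power (around .6), which is the signature of selection for significance.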

Intro to Statistical Power in One Hour

Last week I posted a video that provided an introduction to the basic concepts of statistics, namely effect sizes and sampling error. A test statistic, like a t-value, is simply the ratio of the effect size over sampling error. This ratio is also known as a signal to noise ratio. The bigger the signal (effect size), the more likely it is that we will notice it in our study. Similarly, the less noise we have (sampling error), the easier it is to observe even small signals.

In this video, I use the basic concepts of effect sizes and sampling error to introduce the concept of statistical power. Statistical power is the probability that a study produces a statistically significant result. When alpha is set to .05, it is the expected percentage of p-values below .05.

Statistical power is important to avoid type-II errors; that is, there is a meaningful effect, but the study fails to provide evidence for it. While researchers cannot control the magnitude of effects, they can increase power by lowering sampling error. Thus, researchers should carefully think about the magnitude of the expected effect to plan how large their sample has to be to have a good chance to obtain a significant result. Cohen proposed that a study should have at least 80% power. The planning of sample sizes using power calculation is known as a priori power analysis.
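To make this concrete, here is a small Python sketch (my own example, not from the video, and based on the normal approximation rather than the exact noncentral t distribution) that performs an a priori power analysis for a two-group comparison:

from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    # normal-approximation sample size for a two-sided, two-sample comparison
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2

# a medium effect of d = .5 requires roughly 63 participants per group
# (exact calculations based on the t distribution give about 64)
print(round(n_per_group(0.5)))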

The problem with a priori power analysis is that researchers may fool themselves about effect sizes and conduct studies with insufficient sample sizes. In this case, power will be less than 80%. It is therefore useful to estimate the actual power of studies that are being published. In this video, I show that actual power could, in principle, be estimated by simply computing the percentage of significant results. However, in reality this approach would be misleading because psychology journals discriminate against non-significant results. This is known as publication bias. Empirical studies show that the percentage of significant results for theoretically important tests is over 90% (Sterling, 1959). This does not mean that the mean power of psychological studies is over 90%. It merely suggests that publication bias is present. In a follow-up video, I will show how it is possible to estimate power when publication bias is present. This video is important for understanding what statistical power is.

Intro to Statistics In One Hour

Each year, I am working with undergraduate students on the coding of research articles to examine the replicability and credibility of psychological science (ROP2020). Before students code test statistics from t-tests or F-tests in results sections, I provide a crash course on inferential statistics (null-hypothesis significance testing). Although some students have taken a basic stats course, these courses often fail to teach a conceptual understanding of statistics and distract students with complex formulas that are treated like a black box that converts data into p-values (or, worse, stars that indicate whether p < .05*, p < .01**, or p < .001***).

In this one-hour lecture, I introduce the basic principles of null-hypothesis significance testing using the example of the t-test for independent samples.

An Introduction to T-Tests | Definitions, Formula and Examples

I explain that a t-value is conceptually made up of three components, namely the effect size (D = x1 – x2), a measure of the natural variation of the dependent variable (the standard deviation, s), and a measure of the amount of sampling error (in simplified form, se = 2/sqrt(n1 + n2)).

Moreover, dividing the effect size D by the standard deviation provides the familiar standardized effect size, Cohen’s d = D/s. This means that a t-value corresponds to the ratio of the standardized effect size (d) over the amount of sampling error (se), t = d/se.
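This decomposition is easy to verify with a short simulation (my own Python sketch, not part of the lecture). With equal group sizes, the simplified se = 2/sqrt(n1 + n2) reproduces the exact t-value; with unequal group sizes it is only an approximation.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=1)
n1 = n2 = 50
x1 = rng.normal(loc=0.5, scale=1.0, size=n1)  # group 1
x2 = rng.normal(loc=0.0, scale=1.0, size=n2)  # group 2

D = x1.mean() - x2.mean()                            # unstandardized effect size
s = np.sqrt((x1.var(ddof=1) + x2.var(ddof=1)) / 2)   # pooled standard deviation (equal n)
d = D / s                                            # Cohen's d
se = 2 / np.sqrt(n1 + n2)                            # simplified sampling error of d

print("t reconstructed as d/se:", d / se)
print("t from scipy's t-test:  ", ttest_ind(x1, x2).statistic)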

It follows that a t-value is influenced by two quantities. T-values increase as the standardized (unit-free) effect sizes increase and as the sampling error decreases. The two quantities are sometimes called signal (effect size) and noise (sampling error). Accordingly, the t-value is a signal to noise ratio. I compare the signal and noise to a scenario in which somebody throws rocks into a lake and somebody else has to tell whether a rock was thrown based on the observation of a splash. A study with a small effect and a lot of noise is like trying to detect the splash of a small pebble on a very windy, stormy day, when waves create a lot of splashes that make it hard to see the small splash made by the pebble. However, if you throw a big rock into the lake, you can see the big splash from the rock even when the wind creates a lot of splashing. If you want to see the splash of a pebble, you need to wait for a calm day without wind. These conditions correspond to a study with a large sample and very little sampling error.

Have a listen and let me know how I am doing. Feel free to ask questions that help me understand how I can make the introduction to statistics even easier. Too many statistics books and lecturers intimidate students with complex formulas and Greek symbols that make statistics look hard, but in reality it is very simple. Data always have two components: the signal you are looking for and the noise that makes it hard to see the signal. The bigger the signal to noise ratio is, the more likely it is that you saw a true signal. Of course, it can be hard to quantify signals and noise, and statisticians work hard to get good estimates of noise, but that does not have to concern users of statistics. As users of statistics, we can trust that statisticians have good (the best available) estimates that tell us how good our data are.

Rejection Watch: Censorship at JEP-General

Articles published in peer-reviewed journals are only the tip of the scientific iceberg. Professional organizations want you to believe that these published articles are carefully selected to be the most important and scientifically credible articles. In reality, peer review is unreliable and invalid, and editorial decisions are based on personal preferences. For this reason, the censoring mechanism is often hidden. Part of the movement towards open science is to make the censoring process transparent.

I therefore post the decision letter and the reviews from JEP:General. I sent my ms “z-curve: an even better p-curve” to this journal because it published two highly cited articles on the p-curve method. The key point of my ms is that the p-curve app produces a “power” estimate of 97% for hand-coded articles by Leif Nelson, while z-curve produces an estimate of 52%. If you are a quantitative scientist, you will agree that this is a non-trivial difference, and you are right to ask which of these estimates is more credible. The answer is provided by simulation studies that compare p-curve and z-curve and show that p-curve can dramatically overestimate “power” when the data are heterogeneous (Brunner & Schimmack, 2020). In short, the p-curve app sucks. Let the record show that JEP-General is happy to get more citations for a flawed method. The reason might be that z-curve is able to show publication bias in the original articles published in JEP-General (Replicability Rankings). Maybe Timothy J. Pleskac is afraid that somebody looks at his z-curve, which shows a few too many p-values that are just significant (ODR = 73% vs. EDR = 45%).

Unfortunately for psychologists, statistics is an objective science whose claims can be evaluated using mathematical proofs (Brunner & Schimmack, 2020) and simulation studies (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). It is just hard for psychologists to follow the science if the science doesn’t agree with their positive illusions and inflated egos.

==========================================================

XGE-2021-3638
Z-curve 2.0: An Even Better P-Curve
Journal of Experimental Psychology: General

Dear Dr. Schimmack,

I have received reviews of the manuscript entitled Z-curve 2.0: An Even Better P-Curve (XGE-2021-3638) that you recently submitted to Journal of Experimental Psychology: General. Upon receiving the paper I read the paper. I agree that Simonsohn, Nelson, & Simmons’ (2014) P-Curve paper has been quite impactful. As I read over the manuscript you submitted, I saw there was some potential issues raised that might help help advance our understanding of how to evaluate scientific work. Thus, I asked two experts to read and comment on the paper. The experts are very knowledgeable and highly respected experts in the topical area you are investigating.

Before reading their reviews, I reread the manuscript, and then again with the reviews in hand. In the end, both reviewers expressed some concerns that prevented them from recommending publication in Journal of Experimental Psychology: General. Unfortunately, I share many of these concerns. Perhaps the largest issue is that both reviewers identified a number formal issues that need more development before claims can be made about the z-curve such as the normality assumptions in the paper. I agree with Reviewer 2 that more thought and work is needed here to establish the validity of these assumptions and where and how these assumptions break down. I also agree with Reviewer 1 that more care is needed when defining and working with the idea of unconditional power. It would help to have the code, but that wouldn’t be sufficient as one should be able to read the description of the concept in the paper and be able to implement it computationally. I haven’t been able to do this. Finally, I also agree with Reviewer 1 that any use of the p-curve should have a p-curve disclosure table. I would also suggest ways to be more constructive in this critique. In many places, the writing and approach comes across as attacking people. That may not be the intention. But, that is how it reads.

Given these concerns, I regret to report that that I am declining this paper for publication in Journal of Experimental Psychology: General. As you probably know, we can accept only small fraction of the papers that are submitted each year. Accordingly, we must make decisions based not only on the scientific merit of the work but also with an eye to the potential level of impact for the findings for our broad and diverse readership. If you decide to pursue publication in another journal at some point (which I hope you will consider), I hope that the suggestions and comments offered in these reviews will be helpful.

Thank you for submitting your work to the Journal. I wish you the best in your continued research, and please try us again in the future if you think you have a manuscript that is a good fit for Journal of Experimental Psychology: General.

Sincerely,

Timothy J. Pleskac, Ph.D.
Associate Editor
Journal of Experimental Psychology: General

Reviewers’ comments:

Reviewer #1: 1. This commentary submitted to JEPG begins presenting a p-curve analysis of early work by Leif Nelson.
Because it does not provide a p-curve disclosure table, this part of the paper cannot be evaluated.
The first p-curve paper (Simonsohn et al, 2014) reads: “P-curve disclosure table makes p-curvers accountable for decisions involved in creating a reported p-curve and facilitates discussion of such decisions. We strongly urge journals publishing p-curve analyses to require the inclusion of a p-curve disclosure table.” (p.540). As a reviewer I am aligning with these recommendation and am *requiring* a p-curve disclosure table, as in, I will not evaluate that portion of the paper, and moreover I will recommend the paper be rejected unless that analysis is removed, or a p-curve disclosure table is included, and is then evaluated as correctly conducted by the review team in an subsequent round of evaluation. The p-curve disclosure table for the Russ et al p-curve, even if not originally conducted by these authors, should be included as well, with a statement that the authors of this paper have examined the earlier p-curve disclosure table and deemed it correct. If an error exists in the literature we have to fix it, not duplicate it (I don’t know if there is an error, my point is, neither do the authors who are using it as evidence).

2. The commentary then makes arguments about estimating conditional vs unconditional power. While not exactly defined in the article, the authors come pretty close to defining conditional power, I think they mean by it the average power conditional on being included in p-curve (ironically, if I am wrong about the definition, the point is reinforced). I am less sure about what they mean by unconditional power. I think they mean that they include in the population parameter of interest not only the power of the studies included in p-curve, but also the power of studies excluded from it, so ALL studies. OK, this is an old argument, dating back to at least 2015, it is not new to this commentary, so I have a lot to say about it.

First, when described abstractly, there is some undeniable ‘system 1’ appeal to the notion of unconditional power. Why should we restrict our estimation to the studies we see? Isn’t the whole point to correct for publication bias and thus make inferences about ALL studies, whether we see them or not? That’s compelling. At least in the abstract. It’s only when one continues thinking about it that it becomes less appealing. More concretely, what does this set include exactly? Does ‘unconditional power’ include all studies ever attempted by the researcher, does it include those that could have been run but for practical purposes weren’t? does it include studies run on projects that were never published, does it include studies run, found to be significant, but eventually dropped because they were flawed? Does it include studies for which only pilots were run but not with the intention of conducting confirmatory analysis? Does it include studies which were dropped because the authors lost interest in the hypothesis? Does it include studies that were run but not published because upon seeing the results the authors came up with a modification of the research question for which the previous study was no longer relevant? Etc etc). The unconditional set of studies is not a defined set, without a definition of the population of studies we cannot define a population parameter for it, and we can hardly estimate a non-existing parameter. Now. I don’t want to trivialize this point. This issue of the population parameter we are estimating is an interesting issue, and reasonable people can disagree with the arguments I have outlined above (many have), but it is important to present the disagreement in a way that readers understand what it actually entails. An argument about changing the population parameter we estimate with p-curve is not about a “better p-curve”, it is about a non-p-curve. A non-p-curve which is better for the subset of people who are interested in the unconditional power, but a WORSE p-curve for those who want the conditional power (for example, it is worse for the goals of the original p-curve paper). For example, the first paper using p-curve for power estimation reads “Here is an intuitive way to think of p-curve’s estimate: It is the average effect size one expects to get if one were to rerun all studies included in p-curve”. So a tool which does not estimate that value, but a different value, it is not better, it is different. The standard deviation is neither better nor worse than the mean. They are different. It would be silly to say “Standard Deviation, a better Mean (because it captures dispersion and the mean does not)”. The standard deviation is better for someone interested in dispersion, and the standard deviation is worse for someone interested in the central tendency. Exactly the same holds for conditional vs unconditional power. (well, the same if z-curve indeed estimated unconditional power, i don’t know if that is true or not. Am skeptical but open minded).

Second, as mentioned above, this distinction of estimating the parameter of the subset of studies included in p-curve vs the parameter of “all studies” is old. I think that argument is seen as the core contribution of this commentary, and that contribution is not close to novel. As the quote above shows, it is a distinction made already in the original p-curve paper for estimating power. And, it is also not new to see it as a shortcoming of p-curve analysis. Multiple papers by Van Assen and colleagues, and by McShane and colleagues, have made this argument. They have all critiqued p-curve on those same grounds.

I therefore think this discussion should improve in the following ways: (i) give credit, and give voice, to earlier discussions of this issue (how is the argument put forward here different from the argument put forward in about a handful of previous papers making it, some already 5 years ago), (ii) properly define the universe of studies one is attempting to estimate power for (i.e., what counts in the set of unconditional power), and (iii) convey more transparently that this is a debate about what is the research question of interest, not of which tool provides the better answer to the same question. Deciding whether one wants to estimate the average power of one or another set of studies is completely fair game of an issue to discuss, and if indeed most readers don’t think they care about conditional power, and those readers use p-curve not realizing that’s what they are estimating, it is valuable to disabuse them of their confusion. But it is not accurate, and therefore productive, to describe this as a statistical discussion, it is a conceptual discussion.

3. In various places the paper reports results from calculations, but the authors have not shared neither the code nor data for those calculations, so these results cannot be adequately evaluated in peer-review, and that is the very purpose of peer-review. This shortcoming is particularly salient when the paper relies so heavily on code and data shared in earlier published work.

Finally, it should be clearer what is new in this paper. What is said here that is not said in the already published z-curve paper and p-curve critique papers?

Reviewer #2:
The paper reports a comparison between p-curve and z-curve procedures proposed in the literature. I found the paper to be unsatisfactory, and therefore cannot recommend publication in JEP:G. It reads more like a cropped section from the author’s recent piece in meta-psychology than a standalone piece that elaborates on the different procedures in detail. Because a lot is completely left out, it is very difficult to evaluate the results. For example, let us consider a couple of issues (this is not an exhaustive list):

– The z-curve procedure assumes that z-transformed p-values under the null hypothesis follow a standard Normal distribution. This follows from the general idea that the distribution of p-values under the null-hypothesis is uniform. However, this general idea is not necessarily true when p-values are computed for discrete distributions and/or composite hypotheses are involved. This seems like a point worth thinking about more carefully, when proposing a procedure that is intended to be applied to indiscriminate bodies of p-values. But nothing is said about this, which strikes me as odd. Perhaps I am missing something here.

– The z-curve procedure also assumes that the distribution of z-transformed p-values follows a Normal distribution or a mixture of homoskedastic Normals (distributions that can be truncated depending on the data being considered/omitted). But how reasonable is this parametric assumption? In their recently published paper, the authors state that this is as **a fact**, but provide no formal proof or reference to one. Perhaps I am missing something here. If anything, a quick look at classic papers on the matter, such as Hung et al. (1997, Biometrics), show that the cumulative distributions of p-values under different alternatives cross-over, which speaks against the equal-variance assumption. I don’t think that these questions about parametric assumptions are of secondary importance, given that they will play a major in the parameter estimates obtained with the mixture model.

Also, when comparing the different procedures, it is unclear whether the reported disagreements are mostly due to pedestrian technical choices when setting up an “app” rather than irreconcilable theoretical commitments. For example, there is nothing stopping one from conducting a p-curve analysis on a more fine-grained scale. The same can be said about engaging in mixture modeling. Who is/are the culprit/s here?

Finally, I found that the writing and overall tone could be much improved.
 

Personality Over Time: A Historic Review

The hallmark of a science is progress. To demonstrate that psychology is a science therefore requires evidence that current data, research methods, and theories are better than those of the past. Historical reviews are also needed because it is impossible to make progress without looking back once in a while.

Research on the stability or consistency of personality has a long history that started with the first empirical investigations in the 1930s, but a historical review of this literature is lacking. Few young psychologists interested in personality development are likely to be familiar with Kelly, his work, or his American Psychologist article on the “Consistency of the Adult Personality” (Kelly, 1955). Kelly starts his article with some personal observations about stability and change in traits that he observed in colleagues over the years.

Today, traits that are neither physical characteristics nor cognitive abilities are called personality traits and are commonly represented in the Big Five model. What have we learned about the stability of personality traits in adulthood from nearly a century of research?

Kelly (1955) reported some preliminary results from his own longitudinal study of personality that he started in the 1930s with engaged couples. Twenty years later, the participants completed follow-up questionnaires. Figure 6 reported the results for the Allport-Vernon value scales. I focus on these results because they make it possible to compare the 20-year retest correlations with retest correlations over a one-year period.

Figure 6 shows that personality, or at least values, are not perfectly stable. This is easily seen by comparing the one-year retest correlations with the 20-year retest correlations. The 20-year retest correlations are always lower than the one-year retest correlations. Individual differences in values change over time. Some individuals become more religious and others become less religious, for example. The important question is how much individuals change over time. To quantify change and stability, it is important to specify a time interval because change implies lower retest correlations over longer retest intervals. Although the interval is arbitrary, a period of 1 year or 10 years can be used to quantify and compare the stability and change of different personality traits. To do so, we need a model of change over time. A simple model is Heise’s (1969) autoregressive model that assumes a constant rate of change.

Take religious values as an example. Here we have two observed retest correlations, r(y1) = .75 and r(y20) = .60. Both correlations are attenuated by random measurement error. To correct for unreliability, we need to solve two equations with two unknowns, the annual rate of change (rate) and the reliability of the measure (rel):
.75 = rate^1 * rel
.60 = rate^20 * rel
With some rusty high-school math, I was able to solve these equations for rate:
rate = (.60/.75)^(1/(20-1)) = .988
The implied 10-year stability is .988^10 = .89.
The estimated reliability is .75 / .988 = .759.
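The same algebra can be checked in a few lines of Python (my own sketch of the calculation above):

r1, r20 = 0.75, 0.60  # 1-year and 20-year retest correlations for religious values

rate = (r20 / r1) ** (1 / (20 - 1))  # annual rate of change
rel = r1 / rate                      # reliability of the value measure

print(round(rate, 3))        # 0.988
print(round(rel, 3))         # 0.759
print(round(rate ** 10, 3))  # 0.889, the implied 10-year stability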

Table 1 shows the results for all six values.

Value         1-Year r   20-Year r   Reliability   1-Year Rate   10-Year Rate
Theoretical   0.71       0.51        0.72          0.983         0.840
Economic      0.72       0.50        0.73          0.981         0.825
Aesthetic     0.71       0.52        0.72          0.984         0.849
Social        0.68       0.32        0.71          0.961         0.673
Political     0.75       0.49        0.77          0.978         0.799
Religious     0.75       0.60        0.76          0.988         0.889
Table 1. Stability and Change of Allport-Vernon Values

The results show that the 1-year retest correlations are very similar to the reliability estimates of the value measures. After correcting for unreliability, the 1-year stability is extremely high, with estimates ranging from .96 for social values to .99 for religious values. The small differences in 1-year stabilities only become notable over longer time periods. The estimated 10-year stabilities range from .67 for social values to .89 for religious values.

Kelly reported results for two personality constructs that were measured with the Bernreuter personality questionnaire, namely self-confidence and sociability.

The implied stability of these personality traits is similar to the stability of values.

Personality       1-Year r   20-Year r   Reliability   1-Year Rate   10-Year Rate
Self-Confidence   0.86       0.61        0.88          0.982         0.835
Sociability       0.78       0.45        0.80          0.971         0.749

Kelly’s results published in 1955 are based on a selective sample during a specific period of time that included the Second World War. It is therefore possible that studies with other populations during other time periods produce different results. However, results from different studies are more consistent than different.

The first article with retest correlations for different time intervals of reasonable length was published in 1941 by Mason N. Crook. The longest retest interval was six and a half years. Figure 1a in the article plotted the retest correlations as a function of the retest interval.

Table 2 shows the retest correlations and reveals that some of them are based on extremely small sample sizes. The 5-month retest is based on only 30 participants, whereas the 8-month retest is based on 200 participants. Using the 8-month retest correlation as the estimate of short-term stability, it is possible to estimate the 1-year and 10-year rates using the formula given above.

Sample Size   Months   Retest r   Reliability   1-Year Rate   10-Year Rate
140           20       0.698      0.72          0.990         0.908
80            22       0.670      0.73          0.977         0.789
18            27       0.431      0.83          0.861         0.223
80            32       0.602      0.75          0.958         0.650
40            39       0.646      0.73          0.979         0.812
70            44       0.577      0.74          0.962         0.677
50            54       0.477      0.76          0.942         0.549
60            66       0.342      0.78          0.914         0.409
60            78       0.565      0.73          0.976         0.785
Weighted Average                  0.75          0.958         0.651

The 1-year stability estimates are all above .9, except for the retest correlation that is based on only N = 18 participants. Given the small sample sizes, variability in the estimates is mostly random noise. I computed a weighted average that takes both sample size and retest interval into account because longer time intervals provide better information about the actual rate of change. The estimated 1-year stability is r = .96, which implies a 10-year stability of .65. This is a bit lower than Kelly’s estimates, but this might just be sampling error. It is also possible that Crook’s results underestimate long-term stability because the model assumes a constant rate of change. It is possible that this assumption is false, as we will see later.

Crook also provided a meta-analysis that included other studies and suggested a hierarchy of consistency.

Accordingly, personality traits like neuroticism are less stable than cognitive abilities, but more stable than attitudes. As the Figure shows, empirical support for this hierarchy was limited, especially for estimates of the stability of attitudes.

Several decades later, Conley (1984) reexamined this hierarchy of consistency with more data. He was also the first to provide quantitative stability estimates that correct for unreliability. The meta-analysis included more studies and, more importantly, studies with long retest intervals. The longest retest interval was 45 years (Conley, 1983). After correcting for unreliability, the one-year stability was estimated to be r = .98, which implies a stability of r = .81 over a period of 10 years and r = .36 over 50 years.

Using the published retest correlations for studies with sample sizes greater than 100, I obtained a one-year stability estimate of r = .969 for neuroticism and r = .986 for extraversion. These differences may reflect differences in stability or could just be sampling error. The average reproduces Conley’s (1984) estimate of r = .98 (r = .978).

Sample Size   Years   Retest r   Reliability   1-Year Rate   10-Year Rate
239           2       0.41       0.97          0.734         0.046
636           4       0.54       0.78          0.918         0.426
460           6       0.27       0.85          0.842         0.178
917           9       0.65       0.73          0.983         0.841
211           10      0.48       0.75          0.955         0.632
460           12      0.71       0.72          0.994         0.945
446           19      0.62       0.72          0.989         0.898
383           45      0.33       0.73          0.982         0.831
Weighted Average                 0.74          0.969         0.730
Neuroticism

Sample Size   Years   Retest r   Reliability   1-Year Rate   10-Year Rate
239           2
636           4       0.56       0.77          0.926         0.466
460           6       0.84       0.70          1.017         1.182
917           9       0.76       0.72          1.000         1.000
211           10
460           12      0.75       0.72          0.999         0.989
446           19      0.65       0.72          0.992         0.921
383           45      0.26       0.73          0.976         0.788
Weighted Average                 0.73          0.986         0.868
Extraversion

To summarize, decades of research had produced largely consistent findings that the short-term (1-year) stability of personality traits is well above r = .9 and that it takes long time-periods to observe substantial changes in personality.

The next milestone in the history of research on personality stability and change was Roberts and DelVecchio’s (2000) influential meta-analysis that is featured in many textbooks and review articles (e.g., Caspi, Roberts, & Shiner, 2005; McAdams & Olson, 2010).

Roberts and DelVecchio’s literature review mentions Conley’s (1984) key findings. “When dissattenuated, measures of extraversion were quite consistent, averaging .98 over a 1-year period, approximately .70 over a 10-year period, and approximately .50 over a 40-year period” (p. 7).

The key finding of Roberts and DelVecchio’s meta-analysis was that age moderates the stability of personality. As shown in Figure 1, stability increases with age. The main limitation of Figure 1 is that it shows average retest correlations that are not tied to a specific time interval and are not corrected for measurement error. Thus, the finding that retest correlations in early and middle adulthood (22-49) average around .6 provides no information about the stability of personality in this age group.

Most readers of Roberts and DelVecchio (2000) fail to notice a short section that examines the influence of the time interval on retest correlations.

“On the basis of the present data, the average trait consistency over a 1-year period would be .55; at 5 years, it would be .52; at 10 years, it would be .49; at 20 years, it would be .41; and at 40 years, it would be .25” (Roberts & DelVecchio, 2000, p. 16).

Using the aforementioned formula to correct for measurement error shows that Roberts and DelVecchio’s meta-analysis replicates Conley’s results, 1-year r = .983.

Years   Retest r   Reliability   1-Year Rate   10-Year Rate
5       0.52       0.72          0.989         0.894
10      0.49       0.72          0.989         0.891
20      0.41       0.73          0.985         0.863
40      0.25       0.73          0.980         0.821
Weighted Average   0.73          0.983         0.842

Unfortunately, review articles often mistake these observed retest correlations for estimates of stability. For example, McAdams and Olson write “Roberts & DelVecchio (2000) determined that stability coefficients for dispositional traits were lowest in studies of children (averaging 0.41), rose to higher levels among young adults (around 0.55), and then reached a plateau for adults between the ages of 50 and 70 (averaging 0.70)” (p. 521) and fail to mention that these stability coefficients are not corrected for measurement error, which is a common mistake (Schmidt, 1996).

Roberts and DelVecchio’s (2000) article has shaped contemporary views that personality is much more malleable than the data suggest. A twitter poll showed that only 11% of respondents guessed the right answer that the one-year stability is above .9, whereas 43% assumed the upper limit is r = .7. With r = .7 over a 1-year period, the stability over a 10-year period would only be r = .03. Thus, these respondents essentially assumed that personality has no stability over a 10-year period. More likely, respondents simply failed to take into account how high short-term stability has to be to allow for moderately high long-term stability.

The misinformation about personality stability is likely due to vague, verbal statements and the use of effect sizes that ignore the length of the retest interval. For example, Atherton, Grijalva, Roberts, and Robins (2021) published an article with a retest interval of 18 years. The abstract describes the results as “moderately-to-high stability over a 20-year period” (p. 841). Table 1 reports the observed correlations that control for random measurement error using a latent variable model with item-parcels as indicators.

The next table shows the results for the 4-year retest interval in adolescence and the 20-year retest interval in adulthood along with the implied 1-year rates. Consistent with Roberts and DelVecchio’s meta-analysis, the 1-year stability in adolescence is lower, r = .908, than in adulthood, r = .976.

Trait               Years   Retest r   1-Year Rate   Years   Retest r   1-Year Rate
Extraversion        4       0.69       0.911         20      0.66       0.979
Agreeableness       4       0.70       0.915         20      0.61       0.976
Conscientiousness   4       0.68       0.908         20      0.57       0.972
Neuroticism         4       0.57       0.869         20      0.46       0.962
Openness            4       0.77       0.937         20      0.81       0.990
Average             4       0.68       0.908         20      0.62       0.976

However, even in adolescence the 1-year stability is high. Most importantly, the 1-year rate for adults is consistent with the estimates in Conley’s (1984) meta-analysis, the first study by Crook in 1941, and even Roberts and DelVecchio’s meta-analysis when measurement error is taken into account. However, Atherton et al. (2021) fail to cite these historic articles and fail to mention that their results replicate nearly a century of research on personality stability in adulthood.

Stable Variance in Personality

So far, I have used a model that assumes a fixed rate of change. The model also assumes that there are no stable influences on personality. That is, all causes of variation in personality can change and given enough time will change. This model implies that retest correlations eventually approach zero. The only reason why this may not happen is that human lives are too short to observe retest correlations of zero. For example, with r = .98 over a 1-year period, the 100-year retest correlation is still r = .13, but the 200-year retest correlation is r = .02.

With more than two retest intervals, it is possible to see that this model may not fit the data. If there is no measurement error, the correlation from t1 to t3 should equal the product of the two lags from t1 to t2 and from t2 to t3. If the t1-t3 correlation is larger than this model predicts, the data suggest the presence of some stable causes that do not change over time (Anusic & Schimmack, 2016; Kenny & Zautra, 1995).

Take the data from Atherton et al. (2021) as an example. The average retest correlation from t1 (beginning of college) to t3 (age 40) was r = .55. The correlation from beginning to end of college was r = .68, and the correlation from end of college to age 40 was r = .62. We see that .55 > .68 * .62 = .42.

Anusic and Schimmack (2016)

Anusic and Schimmack (2016) estimated the amount of stable variance in personality traits to be over 50%. This estimate may be revised in the future when better data become available. However, models with and without stable causes differ mainly in predictions over long-time intervals where few data are currently available. The modeling has little influence on estimates of stability over time periods of less than 10-years.

Conclusion

This historic review of research on personality change and stability demonstrated that nearly a century of research has produced consistent findings. Unfortunately, many textbooks misrepresent this literature and cite evidence that does not correct for measurement error.

In their misleading, but influential meta-analysis, Roberts and DelVecchio concluded that “the average trait consistency over a 1-year period would be .55; at 5 years, it would be .52; at 10 years, it would be .49; at 20 years, it would be .41; and at 40 years, it would be .25” (p. 16).

The corrected (for measurement error) estimates are much higher. The present results suggest that consistency over a 1-year period would be .98, at 5 years it would be .90, at 10 years it would be .82, at 20 years it would be .67, and at 40 years it would be .45. Long-term stability might even be higher if stable causes contribute substantially to variance in personality (Anusic & Schimmack, 2016).

The evidence of high stability in personality (yes, I think r = .8 over 10 years warrants the label high) has important practical and theoretical implications. First of all, the stability of personality in adulthood is one of the few facts that students at the beginning of adulthood may find surprising. It may stimulate self-discovery and encourage them to take personality into account in major life decisions. Stability also means that personality psychologists need to focus on the factors that cause stability in personality, but psychologists have traditionally focused on change because statistical tools are designed to detect differences and deviations rather than invariances. The natural sciences do not ignore constants such as the shape of the Earth or the speed of light simply because they do not change; it is time for personality psychologists to give invariance the same attention. The results also have a (sobering) message for researchers interested in personality change. Real change takes time. Even a decade is a relatively short period to observe the notable changes that are needed to find predictors of change. This may explain why there are currently no replicable findings of predictors of personality change.

So, what is the stability of personality over a one-year period in adulthood after taking measurement error into account? The correct answer is that it is greater than .9. You probably didn’t know this before reading this blog post. Of course, this does not mean that we are still the same person after one year or ten years. However, the broader dispositions that are measured with the Big Five are unlikely to change in the near future for you, your spouse, or your co-workers. Whether this is good or bad news depends on you.

Fact Checking Personality Development Research

Many models of science postulate a feedback loop between theories and data. Theories stimulate research that tests theoretical models. When the data contradict the theory and nobody can find flaws with the data, theories are revised to accommodate the new evidence. In reality, many sciences do not follow this idealistic model. Instead of testing theories, researchers try to accumulate evidence that supports their theories. In addition, evidence that contradicts the theory is ignored. As a result, theories never develop. These degenerative theories have been called paradigms. Psychology is filled with paradigms. One paradigm is the personality development paradigm. Accordingly, personality changes throughout adulthood towards the personality of a mature adult (emotionally stable, agreeable, and conscientious; Caspi, Roberts, & Shiner, 2005).

Many findings contradict this paradigm, but these findings are often ignored by personality development researchers. For example, a recent article on personality development (Zimmermann et al., 2021) claims that there is broad evidence for substantial rank-order and mean-level changes, citing outdated references from 2000 (Roberts & DelVecchio, 2000) and 2006 (Roberts et al., 2006). It is not difficult to find more recent studies that challenge these claims based on newer evidence and better statistical analyses (Anusic & Schimmack, 2016; Costa et al., 2019). It is symptomatic of a paradigm that findings that do not fit the personality development paradigm are ignored.

Another symptom of paradigmatic research is that interpretations of research findings do not fit the data. Zimmermann et al. (2021) conducted an impressive study of N = 3,070 students’ personality over the course of a semester. Some of these students stayed at their university and others went abroad. The focus of the article was to examine the potential influence of spending time abroad on personality. The findings are summarized in Table 1.

The key prediction of the personality development paradigm is that neuroticism decreases with age and that agreeableness and conscientiousness increase with age. This trend might be accelerated by spending time abroad, but it is also predicted for students who stay at their university (Robins et al., 2001).

The data do not support this prediction. In the two control groups, neither conscientiousness (d = -.11, -.02) nor agreeableness (d = -.02, .00) increased, and neuroticism increased slightly (d = .08, .02). The group of students who were waiting to go abroad but stayed during the study period also showed no increase in conscientiousness (d = -.22, -.02) or agreeableness (d = -.16, .00), but showed a small decrease in neuroticism (d = -.08, -.01). The group that went abroad showed small increases in conscientiousness (d = .03, .09) and agreeableness (d = .14, .00), and a small decrease in neuroticism (d = -.14, .00). All of these effect sizes are very small, which may be due to the short time period. A semester is simply too short to see notable changes in personality.

These results are then interpreted as being fully consistent with the personality development paradigm.

A more accurate interpretation of these findings is that the effects of spending a semester abroad on personality are very small (d ~ .1) and that a semester is too short to discover changes in personality traits. The small effect sizes in this study are not surprising given the finding that even changes over a decade are no larger than d = .1 (Graham et al., 2020; also not cited by Zimmermann et al., 2021).

In short, the personality development paradigm is based on the assumption that personality changes substantially. However, empirical studies of stability show much stronger evidence of stability, but this evidence is often not cited by prisoners of the personality development paradigm. It is therefore necessary to fact check articles on personality development because the abstracts and discussion section often do not match the data.

Dan Ariely and the Credibility of (Social) Psychological Science

It was relatively quiet on academic twitter when most academics were enjoying the last weeks of summer before the start of a new, new-normal semester. This changed on August 17, when the datacolada crew published a new blog post that revealed fraud in a study of dishonesty (http://datacolada.org/98). Suddenly, the integrity of social psychology was once again discussed on twitter, in several newspaper articles, and an article in Science magazine (O’Grady, 2021). The discovery of fraud in one dataset raises questions about other studies in articles published by the same researcher as well as in social psychology in general (“some researchers are calling Ariely’s large body of work into question”; O’Grady, 2021).

The brouhaha about the discovery of fraud is understandable because fraud is widely considered an unethical behavior that violates standards of academic integrity and may end a career (e.g., Stapel). However, there are many other reasons to be suspicious of the credibility of Dan Ariely’s published results and those of many other social psychologists. Over the past decade, strong scientific evidence has accumulated that social psychologists’ research practices were inadequate and often failed to produce solid empirical findings that can inform theories of human behavior, including dishonest behavior.

Arguably, the most damaging finding for social psychology was the finding that only 25% of published results could be replicated in a direct attempt to reproduce original findings (Open Science Collaboration, 2015). With such a low base-rate of successful replications, all published results in social psychology journals are likely to fail to replicate. The rational response to this discovery is to not trust anything that is published in social psychology journals unless there is evidence that a finding is replicable. Based on this logic, the discovery of fraud in a study published in 2012 is of little significance. Even without fraud, many findings are questionable.

Questionable Research Practices

The idealistic model of a scientist assumes that scientists test predictions by collecting data and then let the data decide whether the prediction was true or false. Articles are written to follow this script with an introduction that makes predictions, a results section that tests these predictions, and a conclusion that takes the results into account. This format makes articles look like they follow the ideal model of science, but it only covers up the fact that actual science is produced in a very different way; at least in social psychology before 2012. Either predictions are made after the results are known (Kerr, 1998) or the results are selected to fit the predictions (Simmons, Nelson, & Simonsohn, 2011).

This explains why most articles in social psychology support authors’ predictions (Sterling, 1959; Sterling et al., 1995; Motyl et al., 2017). This high success rate is not the result of brilliant scientists and deep insights into human behavior. Instead, it is explained by selection for (statistical) significance. That is, when a study produces a statistically significant result that can be used to claim support for a prediction, researchers write a manuscript and submit it for publication. However, when the result is not significant, they do not write a manuscript. In addition, researchers will analyze their data in multiple ways. If they find one way that supports their predictions, they will report this analysis and not mention that other ways failed to show the effect. Selection for significance has many names, such as publication bias, questionable research practices, or p-hacking. Excessive use of these practices makes it easy to provide evidence for false predictions (Simmons, Nelson, & Simonsohn, 2011). Thus, the end result of using questionable practices and fraud can be the same; published results are falsely used to support claims as scientifically proven or validated, when they actually have not been subjected to a real empirical test.

Although questionable practices and fraud have the same effect, scientists make a hard distinction between fraud and QRPs. While fraud is generally considered to be dishonest and punished with retractions of articles or even job losses, QRPs are tolerated. This leads to the false impression that articles that have not been retracted provide credible evidence and can be used to make scientific arguments (studies show ….). However, QRPs are much more prevalent than outright fraud and account for the majority of replication failures, but do not result in retractions (John, Loewenstein, & Prelec, 2012; Schimmack, 2021).

The good news is that the use of QRPs is detectable even when original data are not available, whereas fraud typically requires access to the original data to reveal unusual patterns. Over the past decade, my collaborators and I have worked on developing statistical tools that can reveal selection for significance (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). I used the most advanced version of these methods, z-curve 2.0, to examine the credibility of results published in Dan Ariely’s articles.

Data

To examine the credibility of results published in Dan Ariely’s articles I followed the same approach that I used for other social psychologists (Replicability Audits). I selected articles based on authors’ H-Index in WebOfKnowledge. At the time of coding, Dan Ariely had an H-Index of 47; that is, he published 47 articles that were cited at least 47 times. I also included the 48th article that was cited 47 times. I focus on the highly cited articles because dishonest reporting of results is more harmful, if the work is highly cited. Just like a falling tree may not make a sound if nobody is around, untrustworthy results in an article that is not cited have no real effect.

For all empirical articles, I picked the most important statistical test per study. The coding of focal results is important because authors may publish non-significant results when they made no prediction. They may also publish a non-significant result when they predict no effect. However, most claims are based on demonstrating a statistically significant result. The focus on a single result per study is needed to ensure statistical independence, which is an assumption made by the statistical model. When multiple focal tests are available, I pick the first one unless another one is theoretically more important (e.g., featured in the abstract). Although this coding is subjective, other researchers, including Dan Ariely, can do their own coding and verify my results.

Thirty-one of the 48 articles reported at least one empirical study. As some articles reported more than one study, the total number of studies was k = 97. Most of the results were reported with test-statistics like t, F, or chi-square values. These values were first converted into two-sided p-values and then into absolute z-scores. 92 of these z-scores were statistically significant and used for a z-curve analysis.
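For transparency, here is a minimal Python sketch (my own illustration, not the actual coding script) of the conversion from a reported test statistic to a two-sided p-value and then to an absolute z-score:

from scipy.stats import t, f, chi2, norm

def z_from_t(t_value, df):
    p = 2 * t.sf(abs(t_value), df)  # two-sided p-value
    return norm.ppf(1 - p / 2)      # absolute z-score

def z_from_f(f_value, df1, df2):
    p = f.sf(f_value, df1, df2)
    return norm.ppf(1 - p / 2)

def z_from_chi2(chi2_value, df):
    p = chi2.sf(chi2_value, df)
    return norm.ppf(1 - p / 2)

# e.g., a hypothetical t(38) = 2.40 corresponds to an absolute z-score of roughly 2.3
print(round(z_from_t(2.40, 38), 2))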

Z-Curve Results

The key results of the z-curve analysis are captured in Figure 1.

Figure 1

Visual inspection of the z-curve plot shows clear evidence of selection for significance. While a large number of z-scores are just statistically significant (z > 1.96 equals p < .05), there are very few z-scores that are just shy of significance (z < 1.96). Moreover, the few z-scores that do not meet the standard of significance were all interpreted as sufficient evidence for a prediction. Thus, Dan Ariely’s observed success rate is 100% or 95% if only p-values below .05 are counted. As pointed out in the introduction, this is not a unique feature of Dan Ariely’s articles, but a general finding in social psychology.

A formal test of selection for significance compares the observed discovery rate (95% of the focal tests are significant, z > 1.96) to the expected discovery rate that is predicted by the statistical model. The prediction of the z-curve model is illustrated by the blue curve. Based on the distribution of significant z-scores, the model expected a lot more non-significant results. The estimated expected discovery rate is only 15%. Even though this is just an estimate, the 95% confidence interval around this estimate ranges from 5% to only 31%. Thus, the observed discovery rate is clearly much higher than the expected discovery rate. In short, we have strong evidence that Dan Ariely and his co-authors used questionable practices to report more successes than their actual studies produced.

Although these results cast a shadow over Dan Ariely’s articles, there is a silver lining. It is unlikely that the large pile of just significant results was obtained by outright fraud; not impossible, but unlikely. The reason is that QRPs are bound to produce just significant results, whereas fraud can produce extremely high z-scores. The fraudulent study that was flagged by datacolada has a z-score of 11, which is virtually impossible to produce with QRPs (Simmons et al., 2011). Thus, while we can disregard many of the results in Ariely’s articles, he does not have to fear losing his job (unless more fraud is uncovered by data detectives). Ariely is also in good company. The expected discovery rate for John A. Bargh is 15% (Bargh Audit) and the one for Roy F. Baumeister is 11% (Baumeister Audit).

The z-curve plot also shows some z-scores greater than 3 or even greater than 4. These z-scores are more likely to reflect true findings (unless they were obtained with fraud) because (a) it gets harder to produce high z-scores with QRPs and (b) replication studies show higher success rates for original studies with strong evidence (Schimmack, 2021). The problem is to find a reasonable criterion to distinguish between questionable results and credible results.

Z-curve makes it possible to do so because the EDR estimate can be used to estimate the false discovery risk (Schimmack & Bartos, 2021). As shown in Figure 1, with an EDR of 15% and a significance criterion of alpha = .05, the false discovery risk is 30%. That is, up to 30% of results with p-values below .05 could be false positives. The false discovery risk can be reduced by lowering alpha. Figure 2 shows the results for alpha = .01. The estimated false discovery risk is now below 5%. This large reduction in the false discovery risk is achieved by treating the pile of just-significant results as no longer significant (i.e., they now fall on the left side of the vertical red line that marks significance at alpha = .01, z = 2.58).
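The conversion from EDR to false discovery risk can be illustrated with Soric's (1989) upper bound, which only requires the EDR and alpha; the sketch below reproduces the 30% figure for an EDR of 15% with alpha = .05 and shows why lowering alpha shrinks the risk (in the full z-curve analysis, the EDR is additionally re-estimated for the new alpha).

def max_false_discovery_risk(edr, alpha):
    """Soric's upper bound on the false discovery rate implied by a given EDR and alpha."""
    return (1 / edr - 1) * (alpha / (1 - alpha))

print(round(max_false_discovery_risk(.15, .05), 2))   # 0.3, the 30% risk shown in Figure 1
# Lowering alpha shrinks the bound directly; the z-curve model also re-estimates the EDR at alpha = .01.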

With the new significance criterion, only 51 of the 97 tests are significant (53%). Thus, it is not necessary to throw away all of Ariely's published results; about half of them might have produced some real evidence. Of course, this assumes that z-scores greater than 2.58 are based on real data. Any investigation should therefore focus on results with p-values below .01.

The final piece of information provided by a z-curve analysis is the probability that a replication study with the same sample size would produce a statistically significant result. This probability is called the expected replication rate (ERR). Figure 1 shows an ERR of 52% with alpha = .05, but this estimate includes all of the just-significant results. Figure 2 excludes these studies but uses alpha = .01. Figure 3 estimates the ERR only for studies with a p-value below .01, while using alpha = .05 to evaluate the outcome of a replication study.

Figure 3

In Figure 3, only z-scores greater than 2.58 (p = .01; on the right side of the dotted blue line) are used to fit the model, while alpha = .05 (the red vertical line at z = 1.96) serves as the criterion for significance. The estimated replication rate is 85%. Thus, we would predict mostly successful replication outcomes with alpha = .05, if these original studies were replicated and if they were based on real data.

Conclusion

The discovery of a fraudulent dataset in a study on dishonesty has raised new questions about the credibility of social psychology. Meanwhile, the much bigger problem of selection for significance is neglected. Rather than treating studies as credible unless they are retracted, it is time to distrust studies unless there is evidence to trust them. Z-curve provides one way to assure readers that findings can be trusted by keeping the false discovery risk at a reasonably low level, say below 5%. Applying this method to Ariely's most cited articles shows that nearly half of his published results can be discarded because they carry a high false positive risk. The same is true for many other findings in social psychology, but social psychologists tend to pretend that the use of questionable research practices was harmless and can be ignored. Instead, undergraduate students, readers of popular psychology books, and policy makers may be better off ignoring social psychology until social psychologists report all of their results honestly and subject their theories to real empirical tests that may fail. That is, if social psychology wants to be a science, social psychologists have to act like scientists.