Beyond Hedonism: A Cross-Cultural Study of Subjective Life-Evaluations

Abstract (summary)

In a previous blog post (Schimmack, 2022), I estimated that affective balance (pleasure vs. pain) accounts for about 50% of the variance in subjective life-evaluations (life-satisfaction judgments). This suggests that respondents also use other information to evaluate their lives, but it is currently unclear what additional information respondents use to make life-satisfaction judgments. In this blog post, I analyzed data from Diener’s Second International Student Survey and found two additional predictors of life-satisfaction judgments, namely a general satisfaction factor (a disposition to report higher levels of satisfaction) and a weighted average of satisfaction with several life domains (financial satisfaction, relationship satisfaction, etc.). This key finding was robust across seven world regions. Another notable finding was that East Asians score much lower and Latin Americans score much higher on the general satisfaction factor than students from other world regions. Future research needs to uncover the causes of individual and cultural variation in general satisfaction.

Introduction

Philosophers have tried to define happiness for thousands of years (Sumner, 1996). These theories of the good life were objective theories that aimed to find universal criteria that make lives good. Despite some influential theories, this enterprise has failed to produce a consensual theory of the good life. One possible explanation for this disappointing outcome is that there is no universal and objective way to evaluate lives, especially in modern, pluralistic societies.

It may not be a coincidence that social scientists in the United States in the 1960s looked for alternative ways to study the good life. Rather than imposing a questionable objective definition of the good life on survey participants, they left it to their participants to define for themselves what their ideal life would look like. The first widely used subjective measure of well-being asked participants to rate their lives on a scale from 0 = worst possible life to 10 = best possible life. This measure is still in use today; the Gallup World Poll uses it to rank countries in terms of citizens’ average well-being.

Empirical research on subjective well-being might provide some useful information for philosophical attempts to define the good life (Kesebir & Diener, 2008). For example, hedonistic theories of well-being would predict that life-evaluations are largely determined by the amount of pleasure and pain that individuals experience in their daily lives (Kahneman, 1999). In contrast, eudaimonic theories would receive some support from evidence that individuals’ subjective life-evaluations are based on doing good even if these good deeds do not increase pleasure. Of course, empirical data do not provide a simple answer to this difficult and maybe unsolvable philosophical question, but it is equally implausible that a valid theory of well-being is unrelated to people’s evaluations of their lives (Sumner, 1996).

Although philosophers could benefit from empirical data and social scientists could benefit from the conceptual clarity of philosophy, attempts to relate the two are rare (Kesebir & Diener, 2008). This is not the place to examine the reasons for this lack of collaboration. Rather, I want to contribute to this important question by examining the predictors of life-satisfaction judgments. In a previous blog post, I reviewed 60 years of research to examine how much of the variance in subjective life-evaluations is explained by positive affect (PA) and negative affect (NA), the modern terms for the hedonic tone (good vs. bad) of everyday experiences (Schimmack, 2022). After taking measurement error into account, I found a correlation of r = .7 between affective balance (Positive Affect – Negative Affect) and subjective life-evaluations. By conventional standards in the social sciences, this is a strong correlation, suggesting that a good life is a happy life (Kesebir & Diener, 2008). However, a correlation of r = .7 implies that feelings explain only about half of the variance (we have to square .7 to get the amount of explained variance) in life-evaluations. This suggests that there is more to a good life than just feeling good. However, it is unclear what additional aspects of human lives contribute to subjective life-evaluations.

To examine this question, I analyzed data from Diener’s Second International Student Survey (see, e.g., Kuppens, Realo, & Diener, 2008). Over 9,000 students from 48 different nations contributed to this study. Subjective life-evaluations were measured with Diener et al.’s (1985) Satisfaction with Life Scale. I only used the first three items because the last two items have lower validity, especially in cross-cultural comparisons (Oishi, 2006). Positive Affect was measured with two items (feeling happy, feeling cheerful). Negative Affect was measured with three items (angry, sad, and worried). The main additional predictors that might explain variance in life-satisfaction judgments were 18 questions about domain satisfaction. Domains ranged from satisfaction with the self to satisfaction with textbooks. The main empirical question is whether domain satisfaction predicts life-satisfaction only because it increases affective balance. For example, good social relationships may increase PA and decrease NA. In this case, effects of social relationships on life-satisfaction would be explained by higher PA and lower NA, and satisfaction with social relationships would not make a unique contribution to the prediction of life-satisfaction. However, satisfaction with grades might be different. Students might be satisfied with their lives if they get good grades, even if getting good grades does not increase PA or may even increase NA because studying and working hard is not always pleasurable.

The Structure of Domain Satisfaction

A common observation in studies of domain satisfaction is that satisfaction judgments in one domain tend to be positively correlated with satisfaction judgments in other domains. There are two explanations for this finding. One explanation is that personality factors influence satisfaction (Heller et al., 2004; Payne & Schimmack, 2021; Schneider & Schimmack, 2010). Individuals high in neuroticism or negative affectivity tend to be less satisfied with most life domains, especially those who are prone to depression (rather than anxiety). On the flip side, individuals who are prone to positive illusions tend to be more satisfied, presumably because they have overly positive perceptions of their lives (Schimmack & Kim, 2020). However, another factor that contributes to positive correlations among domain satisfaction ratings is response styles. Two individuals with the same level of satisfaction may use different numbers on the response scale. Separating personality effects from response styles is difficult and requires measures of response styles or personality, which were not available in this dataset. Thus, I was only able to identify a factor that reflects a general tendency to provide higher or lower satisfaction ratings without being able to identify the nature of this factor.

A simple way to identify a general satisfaction factor is to fit a bi-factor model to the data. I constrained the unstandardized loadings for all 18 domains to be equal. This model had good fit and only one modification index for financial satisfaction suggested a change to the model. Freeing this parameter showed a weaker loading for financial satisfaction. However, the general satisfaction factor was clearly identified. The remaining variances in the 18 domains still showed a complex pattern of correlations. The pattern of these correlations, however, is not particularly relevant for the present topic because the key question is how much of this remaining variance in domain satisfaction judgments contributes to subjective life-evaluations.

To examine this question, I used a formative measurement model. A formative measurement model is merely a weighted average of domains. The weights are empirically derived to maximize prediction of subjective life-evaluations. Thus, the 18 domain satisfaction judgments are used to create two predictors of subjective life-evaluations. One predictor is a general satisfaction factor that reflects a general tendency to report higher levels of satisfaction. The other predictor is a weighted average of satisfaction with specific life domains after removing the influence of the general satisfaction factor.
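To make the logic concrete, here is a minimal sketch of how the two predictors could be approximated with observed scores. The actual analysis used a bi-factor structural equation model with latent variables; in this simplified version the general satisfaction factor is proxied by the mean across the 18 domain ratings, and the column names (dom1..dom18, ls) are hypothetical.

```python
# Rough approximation of the two domain-satisfaction predictors described above.
# The actual analysis used a bi-factor structural equation model; here the general
# satisfaction factor is proxied by the mean across the 18 domain ratings, and the
# formative composite is a regression-weighted average of the domain residuals.
# Column names (dom1..dom18, ls) are hypothetical.
import pandas as pd
import statsmodels.api as sm

def build_domain_predictors(df: pd.DataFrame, domains: list, ls_col: str = "ls"):
    # Proxy for the general satisfaction factor: mean across all 18 domains
    general = df[domains].mean(axis=1)

    # Residualize each domain on the general-factor proxy
    residuals = pd.DataFrame(index=df.index)
    for d in domains:
        fit = sm.OLS(df[d], sm.add_constant(general)).fit()
        residuals[d] = fit.resid

    # Formative composite: weights chosen to maximize prediction of life-satisfaction
    weights_fit = sm.OLS(df[ls_col], sm.add_constant(residuals)).fit()
    composite = residuals.mul(weights_fit.params[domains], axis=1).sum(axis=1)

    return general, composite
```

In the structural equation model, the general factor and the domain-specific components are latent variables estimated simultaneously, but the weighted-average logic is the same.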

Predicting Subjective Life-Evaluations

To examine whether the two domain satisfaction predictors add to the prediction of subjective life-evaluations, above and beyond PA and NA, I regressed LS on affective balance, general satisfaction, and domain satisfaction. I allowed for different coefficients across 7 world regions (Northern Europe/Anglo, Southern Europe, Eastern Europe, East Asia, South Asia, Latin America, & Africa). Table 1 shows the results.
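For readers who want to see what such an analysis looks like in code, here is a simplified sketch that runs a separate standardized regression within each world region. It is only an approximation of the multi-group structural equation model reported in Table 1, and the column names (region, ls, affect_balance, general_sat, domain_sat) are hypothetical.

```python
# Simplified version of the Table 1 analysis: a separate standardized regression of
# life-satisfaction on the three predictors within each world region. The reported
# analysis was a multi-group structural equation model with latent variables; the
# column names used here are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

PREDICTORS = ["affect_balance", "general_sat", "domain_sat"]

def regress_by_region(df: pd.DataFrame) -> pd.DataFrame:
    rows = {}
    for region, grp in df.groupby("region"):
        # Standardize within region so coefficients are comparable across predictors
        z = grp[["ls"] + PREDICTORS].apply(lambda s: (s - s.mean()) / s.std())
        fit = smf.ols("ls ~ affect_balance + general_sat + domain_sat", data=z).fit()
        rows[region] = fit.params[PREDICTORS]
    return pd.DataFrame(rows).T  # one row per region, one column per predictor
```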

The first finding is that all three predictors explain unique variance in subjective life-evaluations. This shows that the two domain satisfaction factors contribute to life-satisfaction judgments above and beyond affective balance. The second observation is that the general satisfaction factor is a stronger predictor than affective balance, and the difference is significant in several regions (i.e., the 95% confidence intervals do not overlap, p < .01). Thus, it is important to study this powerful predictor of subjective life-evaluations in future research. Does it reveal personality effects or is it a mere response style? Third, the weighted average of domain satisfaction is also a stronger predictor than affective balance except for Africa. This suggests that bottom-up effects of life domains contribute to life-evaluations. An important question for future research is to understand how life domains can be satisfying even if they do not produce high levels of pleasure or low levels of pain. Finally, there is considerable unexplained variance. Thus, future studies need to examine additional predictors of life-satisfaction judgments that produce this variation.

Table 2 shows the relationship of the general satisfaction factor with PA, NA, and affective balance. The key finding is that the general satisfaction factor was positively related to PA, negatively related to NA, and positively related to affective balance. This finding shows that the general satisfaction factor not only predicts unique variance in life-satisfaction judgments, but also predicts variance that is shared with affective balance. Thus, even well-being researchers who focus only on the shared variance between affective balance and life-satisfaction have to take the general satisfaction factor into account. The general satisfaction factor also contributes to the correlation between PA and NA. For example, for Anglo nations, the correlations of r = .50 with PA and r = -.55 with NA imply a negative correlation of r = -.28 between PA and NA. An important question is how much of this relationship reflects real personality effects versus simple response styles.
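For readers who want to retrace this number: if the general satisfaction factor is the only shared source of variance, the implied PA-NA correlation is simply the product of the two correlations with the factor:

$$ r_{PA,NA} \approx \lambda_{PA} \times \lambda_{NA} = .50 \times (-.55) \approx -.28 $$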

Table 3 shows the results for the weighted average of domain satisfaction after removing the variance due to the general satisfaction factor. The pattern is similar, but the effect sizes are weaker, indicating that the general factor is more strongly related to affective balance than specific life domains.

In conclusion, domain satisfaction judgments can be divided into two components. One component represents a general disposition to provide higher satisfaction ratings. The other component represents satisfaction with specific life domains. Both components predict affective balance. In addition, both components predict subjective life-evaluations above and beyond affective balance. However, there remains substantial unexplained variance in life-satisfaction judgments that is unrelated to affective balance and satisfaction with life domains.

The Contribution of Life Domains to the Weighted Average of Domain Satisfaction

Table 4 shows the domains that made a statistically significant contribution to the prediction of subjective life evaluations.

Strong effects (r > .3) are highlighted in green, whereas non-significant results are highlighted in red. The first observation is that subjective life-evaluations are influenced by many life domains with a small influence rather than a few life domains with a strong influence. This finding suggests that subjective life-evaluations take a broad picture of life into account rather than being influenced by a few, easily accessible life domains. The only exception was Africa, where only two domains dominated the prediction of subjective life-evaluations. Whether this is a true cultural difference or a method problem remains to be examined in future research.

The second observation is that financial satisfaction and satisfaction with social relationships were the strongest and most consistent predictors of life-satisfaction judgments across world regions. These effects are consistent with evidence that changes in social relationships or income predict changes in life-satisfaction judgments (Diener, Lucas, & Scollon, 2006).

It is also important to remember that the difference between a statistically significant and a non-significant result is not itself statistically significant. Many of the confidence intervals are wide and overlap. Overall, the results suggest more similarity than differences across students from different world regions. Future research needs to examine whether some of the cultural differences are replicable. For example, academic abilities seem to be more important in both East and South Asia than in Latin America.

Regional Differences in Predictors of Subjective Well-Being

Table 5 shows the differences between world regions in the components that contribute to subjective life-evaluations. In this table, the values for general satisfaction are means, whereas the other values are intercepts: for PA and NA, the intercepts remove the influence of differences in general satisfaction and domain-specific satisfaction; for life-satisfaction, the intercepts remove the influence of all predictors.

Red highlights show differences that imply lower well-being in comparison to the reference region Northern Europe/Anglo. The results are consistent with overall lower well-being in the other regions, which is in line with nationally representative surveys by Gallup.

Probably the most interesting finding is that East Asia has a very large negative difference on the general satisfaction factor. The complementary finding is Latin America’s high score on the general satisfaction factor. These findings are consistent with evidence that East Asian nations have lower well-being and Latin American nations have higher well-being than objective indicators of well-being like income predict. Thus, general satisfaction is likely to be a unique predictor of well-being above and beyond income and objective living conditions. The important question is whether this is merely a method artifact, as some have argued, or whether it reflects real personality differences between cultures.

Homo Hedonimus: Is there more to life than maximizing pleasure and minimizing pain?

Summary

Social scientists started measuring subjective life-evaluations as well as positive and negative affective experiences in the 1960s. Sixty years of research have established that life-satisfaction judgments and the balance of PA and NA are strongly correlated in Western countries. The choice of affect items has a relatively small effect on the magnitude of the correlation. In contrast, systematic measurement error plays a stronger role. Systematic measurement error can inflate or attenuate the observed correlation. The existing results suggest that two sources of systematic measurement error have opposite effects. Evaluative bias inflates the observed correlation, whereas rater-specific measurement error attenuates it. The latter effect is stronger. As a result, multi-method studies produce stronger correlations. At present, I would interpret the data as evidence that the true correlation is around r = .7 +/- .2 (i.e., .5 to .9). This implies that affective balance explains about half of the variance in life-evaluations. Cross-cultural studies suggest that the true correlation might be lower in Asian cultures, but the difference is relatively small (.6 vs. .5, without controlling for systematic measurement error).

The finding that affective balance explains only some of the variance in life-satisfaction judgments raises an interesting new question that has not received much attention. What leads to positive life-evaluations in addition to pleasure and pain? An exploration of this question requires the measurement of LS, PA, and NA, and the specification of a causal model with affective balance as a predictor of life-satisfaction. The few studies that have examined this question have found that domain satisfaction (Schimmack et al., 2002), intentional living (Busseri, 2015), and environmental mastery (Payne & Schimmack, 2021) are substantial unique predictors of subjective life-evaluations. These results are preliminary. Existing datasets and new studies can reveal additional predictors. Evidence of cultural variation in the importance of affective experiences needs to be replicated, and additional moderators should be explored. Identifying a reliable set of predictors of life-satisfaction judgments can provide insights into individuals’ implicit definitions of the good life. This information may be useful to evaluate objective theories of well-being and to evaluate the validity of life-satisfaction judgments as measures of subjective well-being. The present results are inconsistent with a view of humans as homo hedonimus, who only cares about affective experiences, but they do suggest that pleasure and pain cannot be ignored in a theory of human well-being.

Literature Review

Positive Affect (PA) and Negative Affect (NA) are scientific constructs. People have expressed their feelings for thousands of years. Across many cultures, some emotion terms have similar meanings and are related to similar antecedents and consequences. However, I am not aware of any everyday expressions of feelings that use the terms Positive Affect or Negative Affect. Yet, the scientific concepts of PA and NA were created to make scientific claims about everyday experiences like happiness, sadness, fear, satisfaction, or frustration. The distinction between PA and NA implies that a major distinction between affects is that some affects are positive and others are negative. Yet, psychologists do not have a consensual definition of Positive Affect and Negative Affect.

While the terms PA and NA were used occasionally in the scientific literature, they became popular after Bradburn developed the first PA and NA measures and reported the results of empirical studies with these scales. The first report did not even use the term affect and referred to the scales as measures of positive and negative feelings (Bradburn & Caplovitz, 1965). The terms positive affect and negative affect were introduced in the follow-up report (Bradburn, 1969).

To understand Bradburn’s concepts of PA and NA, it is useful to examine the social and historical context that led to the development of the first PA and NA scales. The scales were developed to “provide periodic inventories of the psychological well-being of the nations’ [USA] psychological well-being” (p. 1). However, the introduction also mentions the goal to “better understand the patterning of psychological adjustment” (p. 2) and “to determine the nature of mental health, as well as to determine the causes of mental illness” (p. 2). This sweeping agenda creates conceptual confusion because it is no longer clear how PA and NA are related to well-being and mental health. Although it is likely that PA and NA are related to some extent to well-being and mental health, it is unlikely that well-being or mental health can be defined in terms of PA and NA. Even if this were possible, it would only clarify the meaning of well-being and mental health, but not the meaning of PA and NA.

More helpful is Bradburn’s stated objective for developing his PA and NA scales. The goal was to “measure a wide range of pleasurable and unpleasurable experiences apt to be common in a heterogeneous population” (Bradburn & Caplovitz, 1965; p. 16). This statement of the objective makes it clear that Bradburn used the term positive affect to refer to pleasurable experiences and the term negative affect to refer to unpleasant experiences. Bradburn (1969) is even more explicit. His assumption for the validity of the self-report measure was that “people tend to code their experiences in terms of (among other things) their affective tone – positive, neutral, or negative. For our purposes, the particular content of the experience is not important. We are concerned with the pleasurable or unpleasurable character associated with the experience” (p. 54). Other passages also make it clear that Bradburn’s goal was to measure the hedonic tone of everyday experiences. In short, the distinction between PA and NA is based on the hedonic tone of the affective experiences. PA feels good and NA feels bad.

Bradburn’s (1969) final chapter provides the most important information about his sometimes implicit assumptions underlying his approach to the study of psychological well-being, mental health, or happiness. “We are implicitly stating our belief that the modern concept of mental health is really a concerns about the subjective sense of well-being, or what the Greeks called eudaimonia” (p. 225). It is also noteworthy that Bradburn did not reduce happiness to the balance of PA and NA. “By naming our forest “psychological well-being,” we have not meant to imply that concepts such as self-actualization, self-esteem, ego-strength, or autonomy, …., are irrelevant to our study… While we have said relatively little about these particular trees, we do not doubt that they are an integral and important part of the whole” (p. 224). Accordingly, Bradburn rejects the hedonistic idea that well-being can be reduced to the balance of pleasure and pain, but he assumed that PA and NA are important to the conception of a good life.

However, defining well-being in terms of PA, NA, and other good things in life is not a satisfactory definition of well-being. A complete theory of well-being would have to list the additional ingredients and justify their inclusion in a definition of well-being. Philosophers and some psychologists have tried to defend different conceptions of the good life (Sumner, 1996). The main limitation of these proposals is that it is difficult to defend one conception of the good life as superior to another. The key problem is that it is difficult to find a universal, objective criterion that can be used to evaluate individuals’ lives (Sumner, 1996).

One solution to this problem is to take a subjective perspective. Accordingly, individuals can choose their own ideals and evaluate their lives accordingly. In the 1960s, social scientists developed subjective measures of well-being. One of the first measures was Cantril’s ladder, which asked respondents to place their actual lives on a ladder from 0 = worst possible life to 10 = best possible life. This measure does not impose any criteria on the life-evaluations and continues to be used to this day. The measure is a subjective measure of well-being because respondents can use any information that they consider to be important to rate their lives. In theory, they could rely exclusively on the hedonic tone of their everyday experiences. In this case, we would expect a strong correlation between affective balance and life-evaluations. However, it is also possible that individuals follow other goals that do not aim to maximize pleasure and to minimize pain. In this case, the correlation between affective balance and life-evaluations would be attenuated. It is therefore interesting to examine empirically how much of the variance in life-evaluations or life-satisfaction judgments is explained by the hedonic tone of everyday experiences. Subsequently, I review the relevant studies that have examined this question over the past 50 years.

Bradburn (1969) simply states that “the difference between the numbers of positive and negative feelings is a good predictor of a person’s overall ratings of his own happiness” (p. 225), but he did not provide quantitative information about the amount of explained versus unexplained variance.

The next milestone in well-being research was Andrews and Withey’s (1976) examination of the validity of well-being measures. They included Bradburn’s items, but modified the response format from a dichotomous yes/no format to a frequency format. They assumed that this might produce negative correlations between the PA and NA scales, but this expectation was not confirmed. More interesting is how much the balance of PA and NA correlated with subjective well-being ratings. The key finding was that affect balance scores correlated only r = .43 with a 7-point life-satisfaction rating and r = .47 with a 7-point happiness scale, while the two global ratings correlated r = .63 with each other. Corrected for unreliability, this suggests that affective balance is strongly correlated with global life-evaluations, ((.43 + .47)/2)/sqrt(.63) = .57. Nevertheless, a substantial portion of the variance in global life-satisfaction judgments remains unexplained, 1 - .57^2 = 68%. This finding undermines theories of well-being that define well-being exclusively in terms of the amount of PA and NA (Bentham; Kahneman, 1999). However, the evidence is by no means conclusive. Systematic measurement error in the PA and NA scales might severely attenuate the true influence of PA and NA on global life-evaluations, given the low convergent validity between self-ratings and informant ratings of affective experiences (Schneider & Schimmack, 2009).
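Spelled out, the reliability correction used above treats the correlation between the two global ratings as an estimate of their reliability and divides the average convergent correlation by its square root:

$$ \hat{r}_{AB,LS} = \frac{(.43 + .47)/2}{\sqrt{.63}} \approx \frac{.45}{.79} \approx .57, \qquad 1 - .57^2 \approx .68 $$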

Nearly a decade later, Diener (1984) published a highly influential review article on the field of subjective well-being research. In this article, he coined the term subjective well-being (SWB) for research on global life-satisfaction judgments and affective balance. SWB was defined as high life-satisfaction, high PA and low NA. Diener noted that the relationship among the three components of his SWB construct is an empirical question. He also pointed out that the relationship between PA and NA had received a lot of attention, whereas the relationship between affective balance and life-satisfaction “has not been as thoroughly researched” (p. 547). Surprisingly, this statement still rings true nearly 40 years later, despite a few attempts by Diener and his students, including myself, to study this relationship.

For the next twenty years, the relationship between PA and NA became the focus of attention and fueled a heated debate among proponents of independence (Watson, Clark, & Tellegen, 1988), bipolarity (Russell, 1980), and models of separate, yet negatively correlated dimensions (Diener, Smith, & Fujita, 1995). There is general agreement that time frame, response format, and item selection influence the correlations among PA and NA measures (Watson, 1988). This raises a question about the validity of different PA and NA scales. If different scales produce different correlations between PA and NA, different scales may also produce different correlations between life-evaluations and affective balance. However, this question has not been systematically examined to this day.

To make matters worse, the debate about the structure of affect also led to confusion about the meaning of the terms PA and NA. Starting in the 1980s, Watson and colleagues began to use the terms as labels for the first two VARIMAX-rotated factors in exploratory factor analyses of correlations among affect ratings (Watson & Tellegen, 1985). They also used these labels for their Positive Affect and Negative Affect scales that were designed to measure these two factors (Watson, Clark, & Tellegen, 1988). They defined Positive Affect as a state of high energy, full concentration, and pleasurable engagement and Negative Affect as a state of subjective distress and unpleasurable engagement. An alternative model based on the unrotated factors, however, identifies a first factor that distinguishes affects based on their hedonic tone. Watson et al. (1988) refer to this factor as a pleasantness-unpleasantness factor. Thus, PA is no longer equivalent to pleasant affect, and NA is no longer equivalent to unpleasant affect.

To avoid conceptual confusion, different labels have been proposed for measures that focus on hedonic tone and measures that focus on the PANAS dimensions. Some researchers have suggested using pleasant affect and unpleasant affect for measures of hedonic tone. Others have proposed labeling Watson and Tellegen’s factors Positive Activation and Negative Activation. In the broader context of research on well-being, PA and NA are often used in Bradburn’s tradition to refer to the hedonic tone of affective experiences, and I will follow this tradition. I will refer to the PANAS scales as measures of Positive Activation and Negative Activation.

While it is self-evident that the PANAS scales are different from measures of hedonic tone, it is still possible that the difference between Positive Activation and Negative Activation is a good measure of affective balance. That is, individuals who often experience positive activation and rarely experience negative activation are in a pleasant affective state most of the time. In contrast, individuals who experience a lot of Negative Activation and rarely experience Positive Activation are expected to feel bad most of the time. Whether the PANAS scales are capable of measuring hedonic tone as well as other measures is an empirical question that has not been examined.

The next important article was published by Lucas, Diener, and Suh (1996). The authors aimed to examine the relationship between the cognitive component of SWB (i.e., life-satisfaction) and the affective component of SWB (i.e., PA and NA) using a multi-trait-multi-method approach (Campbell & Fiske, 1959). Study 1 used self-ratings and informant ratings of the Satisfaction with Life Scale and the PANAS scales to examine this question. The key finding was that same-construct correlations were higher (i.e., LS r = .48, PA r = .43, NA r = .26) than different-construct correlations (i.e., LS-PA rs = .28, .31; LS-NA rs = -.16, -.21; PA-NA rs = -.02, -.14). This finding was interpreted as evidence that “life satisfaction is discriminable from positive and negative affect” (p. 616). The main problem with this conclusion is that the results do not directly examine the discriminant validity of life-satisfaction and affective balance. As affective balance is made up of two distinct components, PA and NA, it is self-evident that LS cannot be reduced to PA or NA alone. However, it is possible that life-satisfaction is strongly related to the balance of PA and NA. To examine this question, it would have been necessary to compute an affective balance score or to use a latent variable model to regress life-satisfaction onto PA and NA. The latter approach can be applied to the published correlation matrix. I conducted a multiverse analysis with five different models that make different assumptions about the validity of self-ratings and informant ratings. The results were very similar and suggested that affective balance explains about half of the variance in life-satisfaction judgments, rs = .68 to .75. The higher amount of explained variance compared to earlier studies is partially explained by the lower validity of Bradburn’s scales used in those studies (Watson, 1988) and partially due to the use of a multi-method approach, as mono-method relationships were only r = .6 for self-ratings at Time 1 and r = .5 for self-ratings at Time 2 (Lucas et al., 1996). In conclusion, Lucas et al.’s study provided evidence that life-satisfaction judgments are not redundant with affective balance when affective balance is measured with the PANAS scales. However, it is possible that other measures of PA and NA might be more valid and explain more variance in life-evaluations.
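The general recipe for the reanalysis described above can be written in a few lines: given the correlations of PA and NA with each other and with LS, the standardized regression weights and the implied multiple correlation follow directly. The numbers in the example are illustrative placeholders, not the latent correlations from Lucas et al. (1996).

```python
# Given a correlation matrix among LS, PA, and NA, compute the standardized
# regression of LS on PA and NA and the implied multiple correlation.
# The numeric values below are hypothetical placeholders.
import numpy as np

def multiple_r(r_predictors: np.ndarray, r_with_ls: np.ndarray) -> float:
    betas = np.linalg.solve(r_predictors, r_with_ls)  # standardized regression weights
    return float(np.sqrt(r_with_ls @ betas))          # multiple correlation of LS with PA and NA

r_pa_na = -0.30  # hypothetical PA-NA correlation
print(round(multiple_r(np.array([[1.0, r_pa_na], [r_pa_na, 1.0]]),
                       np.array([0.55, -0.45])), 2))  # hypothetical LS-PA and LS-NA correlations
```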

A couple of years later, Diener and colleagues presented the first article that focused on the influence of affective balance on life-satisfaction judgments (Suh, Diener, Oishi, & Triandis, 1998). The main focus of the article was cultural variation in the relationship between life-satisfaction and affective balance. Study 1 examined correlations in the World Value Survey that used Bradburn’s scales. Correlations with a single-item life-satisfaction judgment ranged from a maximum of r = .57 in West Germany to a minimum of r = .20 in Nigeria. The correlation for the US sample was r = .48, which closely replicates Andrews and Withey’s results. Study 2 used the more reliable Satisfaction with Life Scale and hedonic items with an amount-of-time response format. This produced stronger correlations. The correlation for the US sample was r = .64. This is consistent with Lucas et al.’s (1996) mono-method results. This article suggested that affect contributes to subjective well-being, but does not determine it, and that culture moderates the use of affect in life-evaluations.

Diener and colleagues followed up on this finding by suggesting that the influence of neuroticism and extraversion on subjective well-being is mediated by affective balance (Schimmack, Diener, & Oishi, 2002). The article also explored whether domain satisfaction might explain additional variance in life-satisfaction judgments. The key finding was that affective balance made a unique contribution to life-satisfaction judgments (b = .45), but two life domains also made unique contributions (i.e., academic satisfaction, b = .27; romantic satisfaction, b = .23). Affective balance mediated the effects of extraversion and neuroticism. Schimmack et al. (2002) followed up on these findings by examining the mediating role of affective balance across cultures. They replicated Suh et al.’s (1998) finding that culture moderates the relationship between affective balance and life-satisfaction and found a strong relationship in the two Western cultures (US, Germany) in a structural equation model that controlled for random measurement error, r = .76. The stronger relationship might be due to the use of affect items that focus on hedonic tone.

The next big development in well-being research was the creation of Positive Psychology, the study of all things positive. Positive psychology promoted eudaimonic conceptions of well-being that are rooted in objective theories of well-being (Sumner, 1996). These theories clash with subjective theories of well-being that leave it to individuals to choose how they want to live their lives. An influential article by Keyes, Shmotkin, and Ryff (2002) pitted these two conceptions of well-being against each other, using the Midlife in the U.S. (MIDUS) sample (N = 3,032). The life-satisfaction item was Cantril’s ladder. The PA and NA items were ad-hoc items with an amount-of-time response format. This explains why the MIDUS PA and NA scales are strongly negatively correlated, r = -.62. PA and NA were also strongly correlated with LS, PA r = .52, NA r = -.46. The article did not examine the relationship between life-satisfaction and affective balance because the authors treated LS, PA, and NA as indicators of a latent variable. According to this model, neither life-satisfaction nor affective balance measure well-being. Instead, well-being is an unobserved construct that is reflected in the shared variance among LS, PA, and NA. Using the published correlations and assuming a reliability of .7 for the single-item life-satisfaction measure (Busseri, 2015), I obtained a correlation of r = .66 between life-satisfaction and affective balance. This correlation is stronger than the correlation with the PANAS scales in Lucas et al.’s (1996) study, suggesting that hedonic PA and NA scales are more valid measures of the hedonic tone of everyday experiences and produce correlations around r = .7 with life-satisfaction judgments in the United States.
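One way to approximate this estimate from the published correlations, assuming standardized PA and NA scores and correcting only the single-item life-satisfaction measure for its assumed reliability of .7, is:

$$ r_{LS,\,PA-NA} = \frac{r_{LS,PA} - r_{LS,NA}}{\sqrt{2 - 2\,r_{PA,NA}}} = \frac{.52 - (-.46)}{\sqrt{2 - 2(-.62)}} = \frac{.98}{1.80} \approx .54, \qquad \frac{.54}{\sqrt{.70}} \approx .65 $$

which is close to the reported r = .66; the small difference presumably reflects rounding.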

In the 21st century, psychologists’ interest in the determinants of life-satisfaction judgments decreased for a number of reasons. Positive psychologists were more interested in exploring eudaimonic conceptions of well-being. They also treated life-satisfaction judgments and affective measures as interchangeable indicators of hedonic well-being. Another blow to research on life-satisfaction was Kahneman’s claim that life-satisfaction judgments are unreliable and invalid (Kahneman, 1999; Schwarz & Strack, 1999) and his proposal to focus on affective balance as the only criterion for well-being. Kahneman et al. (2006) reported that income predicted life-satisfaction judgments, but not measures of affective balance. However, this finding was not interpreted as a discovery that income influences well-being independently of affect, but rather as evidence that life-satisfaction judgments are invalid measures of well-being.

In contrast, sociologists continued to focus on subjective well-being and used life-satisfaction judgments as key indicators of well-being in important panel studies such as the General Social Survey, the German Socio-Economic Panel (SOEP), and the World Value Survey. Economists rediscovered happiness, but relied on life-satisfaction judgments to make policy recommendations (Diener, Lucas, Schimmack, & Helliwell, 2008). Although Gallup measures all three components of SWB, it relies exclusively on life-satisfaction judgments to rank nations in terms of happiness (World Happiness Reports, https://worldhappiness.report).

In 2008, I used data from a pilot study for the SOEP to replicate the finding that affective balance mediated the effects of extraversion and neuroticism (Schimmack, Schupp, & Wagner, 2008). The study also controlled for evaluative biases in self-ratings. In addition, unemployment and regional differences between former East and West Germany were unique predictors of life-satisfaction judgments. The unique effect of affective balance on life-satisfaction was r = .50. One reason for the weaker relationship is that the model controlled for shared method variance among life-satisfaction and affect ratings.

Kuppens, Realo, and Diener (2008) followed up on Suh et al.’s (1998) finding that culture moderates the relationship between affective balance and life-satisfaction. While they replicated that culture moderates the relationship, the use of a multi-level model with unstandardized scores made it difficult to assess the magnitude of these moderator effects. Furthermore, the authors examined moderation for the effects of PA and NA separately rather than evaluating cultural variation in the relationship between affective balance and life-satisfaction. Finally, the use of PA and NA scales makes it impossible to evaluate measurement equivalence across nations. Using the same data, I examined the relationship between affective balance and life-satisfaction using a multi-group structural equation model with a largely equivalent measurement model across 7 world regions (Northern Europe/Anglo, Southern Europe, Eastern Europe, East Asia, South Asia, Latin America, and Africa). I replicated that the correlation in Western countries is around r = .6 (Northern Europe/Anglo, r = .64; Southern Europe, r = .59). The weakest relationships were found in East Asia (r = .52) and South Asia (r = .51). While this difference was statistically significant, the effect size is rather small and suggests that affective balance contributes to life-satisfaction judgments in all cultures. A main limitation of this study is that it is unclear how much cultural differences in response styles contribute to the moderator effect. A comparison of the intercept of life-satisfaction (i.e., mean difference after controlling for mean differences in PA and NA) showed that all regions had lower life-satisfaction intercepts than the Northern Europe/Anglo comparison group. This shows that factors unrelated to PA and NA (e.g., income, Kahneman et al., 2006) produce cultural variation in life-satisfaction judgments.

Zou, Schimmack, and Gere (2013) published a replication study of Lucas et al.’s sole multi-method study. The study was not a direct replication. Instead, it addressed several limitations in Lucas et al.’s study. Most importantly, it directly examined the relationship between life-satisfaction and affective balance. It also ensured that correlations are not attenuated by biases in life-satisfaction judgments by adding averaged domain satisfaction judgments as a predictor. The study also used hedonic indicators to measure PA and NA rather than assuming that the rotated Positive Activation and Negative Activation factors fully capture hedonic tone. Finally, the sample size was five times larger than in Lucas et al.’s study and included students and middle aged individuals (i.e., their parents). The results showed convergent and discriminant validity for life evaluations (global & averaged domain satisfaction), PA, and NA. Most important, the correlation between the life-evaluation factor and the affective balance factor was r = .90. While this correlation still leaves 20% unexplained variance in life-evaluations, it does suggest that the hedonic tone of life experiences strongly influences subjective life-evaluations. However, there are reasonable concerns that this correlation overestimates the importance of hedonic experiences. One problem is that judgments of hedonic tone over an extended period of time may be biased by life-evaluations. To address this concern it would be necessary to demonstrate that affect ratings are based on actual affective experiences rather than being inferred from life-evaluations.

Following a critical discussion of Diener’s SWB concept (Busseri & Sadava, 2011), Busseri tackled the issue empirically using the MIDUS data. To do so, Busseri (2015) examined how LS, PA, and NA are related to predictors of SWB. He explicitly examined which predictors may have a unique influence on life-satisfaction judgments above and beyond the influence of PA and NA. The main problem was that the chosen predictors had weak relationships with the well-being components. The main exception was the Intentional Living scale, that is, an average of ratings of how much effort respondents invest into work, finances, relationships, health, and life overall. This scale had a strong unique relationship with life-evaluations, b = .44, that was as strong as the unique effect of PA, b = .42, and stronger than the unique effect of NA, b = -.16. The study also replicated Kahneman et al.’s (2006) finding that income is a unique predictor of LS and unrelated to PA and NA, but even the effect of income is small, b = .05. Using the published correlation matrix and correcting LS for unreliability, I found a correlation of r = .58 for LS and affective balance. The unique relationship after controlling for other predictors was r = .52, suggesting that most of the relationship between affective balance and life-satisfaction is direct and not spurious due to third variables that influence both affective balance and life-satisfaction.

Payne and Schimmack (2022) followed up on Zou et al.’s (2013) study with a multiverse analysis. PA and NA were measured with different sets of items ranging from pure hedonic items (good, bad), happiness and sadness items, to models of PA and NA as higher order factors of several positive (joy, love, gratitude) and negative (anger, fear, sadness) affects (Diener et al., 1995). They also compared results for mono-method (only self-ratings) and multi-method (ratings by all three family members) measurement models. Finally, results were analyzed separately for students, mothers, and fathers as targets. The key finding was that item selection had a very small influence, whereas the comparison of mono-method and multi-method studies made a bigger difference. The mono-method results ranged from r = .64 (95% CI = .58 to .71) to r = .69 (95% CI = .63 to .75). The multi-method results ranged from r = .71 (95% CI = .62 to .81) to r = .86 (95% CI = .80 to .92). These estimates are somewhat lower than Zou et al.’s (2013) results and suggest that the true relationship is less than r = .9.

In Study 2, Payne and Schimmack (2022) conducted the first direct comparison of PANAS items with hedonic tone items using an online sample. They found that PANAS NA was virtually identical with other NA measures. This refutes the interpretation of PANAS NA as a measure of negative activation that is distinct from hedonic tone. However, PANAS PA was distinct from other PA measures and was a weaker predictor of life-evaluations. A latent variable model with the PANAS items produced a correlation of r = .78 (95% CI = .73 to .82). An alternative measure that focuses on hedonic tone, the Scale of Positive and Negative Experiences (SPANE; Diener & Biswas-Diener, 2009), yielded a slightly stronger correlation, r = .83 (95% CI = .79 to .86). In a combined model, the SPANE PA factor was a stronger predictor than the PANAS PA factor. Thus, PANAS scales are likely to underestimate the contribution of affect to life-evaluations, but the difference is small. The correlations might be stronger than in other studies due to the use of an online sample.

To summarize, correlations between affective balance and life-evaluations range from r = .5 to r = .9. Several methodological factors contribute to this variation, and studies that use more valid PA and NA scales and control for measurement error produce stronger correlations. In addition, culture can moderate this relationship but it is not clear whether culture influences response styles or actual differences in the contribution of affect to life-evaluations. A reasonable estimate of the true correlation is r = .7 (+/- .2), which suggests that about 50% of the variance in life-evaluations is accounted for by variation in the hedonic tone of everyday experiences. An important direction of future research is to identify the unique predictors of life-evaluations that explain the remaining variance in life-evaluations. Hopefully, it will not take another 60 years to get a better understanding of the determinants of individuals’ life-evaluations. A better understanding of life-satisfaction judgments is crucial for the construct validation of life-satisfaction judgments before they can be used to make claims about nations’ well-being and to make public policy recommendations.

Democracy and Citizens’ Happiness

For 30 years, I have been interested in cultural differences. I maintained a database of variables that vary across cultures, starting with Hofstede’s seminal rankings of 40 nations. Finding interesting variables was difficult and time consuming. The world has changed. Today it is easy to find interesting data on happiness, income, or type of government. Statistical software is also free (project R). This has changed the social sciences. Nowadays, the new problem is that data can be analyzed in many ways and that results can be inconclusive. As a result, social scientists can disagree even when they analyze the same data. Here I focus on predictors of national differences in happiness.

Happiness has been defined in many ways and any conclusion about national differences in happiness depends on the definition of happiness. The most widely used definition of happiness in the social sciences is subjective well-being. Accordingly, individuals define for themselves what they consider to be a good life and evaluate how close their actual lives are to their ideal lives. The advantage of this concept of well-being is that it does not impose values on the concept of happiness. Individuals in democratic countries could evaluate their lives based on different criteria than individuals in non-democratic countries. Thus, subjective well-being is not biased in favor of democracy, even though subjective conceptions of happiness emerged along with democracy in Western countries.

The most widely used measure of subjective well-being is Cantril’s ladder. Participants rate their lives on a scale from 0 = worst possible life to 10 = best possible life. This measure leaves it to participants to define what the worst or best possible life is. The best possible life in Denmark could be a very different life than the best possible life in Zimbabwe. Ratings on Cantril’s ladder are imperfect measures of subjective well-being and could distort comparisons of countries, but these ratings are currently used to compare the happiness of over 100 countries (WHR).

The Economist Intelligence Unit (EIU) has created ratings of countries’ forms of government that provide a measure of democracy (the Democracy Index). Correlating the 2020 happiness means of countries with the democracy index produces a strong (linear) correlation of r = .68 (rank correlation r = .71).

This finding has been used to argue that democracies are better societies because they provide more happiness for their citizens (Williamson, 2022).

So the eastward expansion of democracy isn’t some US-led conspiracy to threaten Russia; it reflects the fact that, when given the choice, citizens tend to choose democracy and hope over autocracy and fear. They know instinctively that it brings a greater chance for happiness.

Although I am more than sympathetic to this argument, I am more doubtful that democracy alone is sufficient to produce more happiness. A strong correlation between democracy and happiness is insufficient to make this argument. It is well known that many predictors of nations’ happiness scores are strongly correlated with each other. One well known predictor is nations’ wealth or purchasing power. Money does buy essential goods. The best predictor of happiness is the median income per person, which reflects the spending power of average citizens and is not distorted by international trade or rich elites.

While it is known that purchasing power is a predictor of well-being, it is often ignored how strong the relationship is. The linear correlation across nations is r = .79 (rank r = .82). It is often argued that the relationship between income and happiness is not linear and that money is more important in poorer countries. However, the correlation with log income is only slightly higher, r = .83.

This might suggest that purchasing power and democracy are both important for happiness. However, purchasing power and democracy are also strongly correlated (linear r = .72, rank r = .75). Multiple regression analysis can be used to see whether both variables independently contribute to the prediction of happiness.

Of course, dollars cannot be directly compared to ratings on a democracy index. To make the results comparable, I rescaled both variables so that 0 corresponds to the lowest observed score and 1 to the highest observed score. For purchasing power, this variable ranged from Madagascar ($398) to Luxembourg ($26,321). For democracy, this variable ranged from Myanmar (1.02) to Norway (9.75).
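A minimal sketch of this analysis, assuming a merged country-level dataset with hypothetical column names (ladder, median_income, democracy), could look like this:

```python
# Sketch of the cross-national regression described above: rescale both predictors to
# a 0-1 range (relative to their observed minimum and maximum) and regress the Cantril
# ladder means on them. Column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

def rescale_01(s: pd.Series) -> pd.Series:
    return (s - s.min()) / (s.max() - s.min())

def fit_happiness_model(countries: pd.DataFrame):
    df = countries.copy()
    df["income_01"] = rescale_01(df["median_income"])
    df["democracy_01"] = rescale_01(df["democracy"])
    # Intercept = predicted ladder score at the minimum of both predictors;
    # each slope = predicted gain from moving that predictor from its minimum to its maximum.
    return smf.ols("ladder ~ income_01 + democracy_01", data=df).fit()
```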

The results show that purchasing power is a much stronger predictor of happiness than democracy.

The model predicts that a country with the lowest standing on purchasing power and democracy has a score of 3.63 on Cantril’s happiness measure. Increasing wealth to the maximum level without changing democracy would increase happiness to 3.63 + 3.13 = 6.76. In contrast, keeping purchasing power at the lowest level and increasing democracy to the highest level would increase happiness only to 3.63 + 0.48 = 4.11. One problem with statistical analyses across nations is that the sample size is limited by the number of nations. As a result, the positive relationship with democracy is not statistically significant and it is possible that the true effect is zero. In contrast, the effect of purchasing power is highly significant and it is unlikely (less than 5%) that the increase is less than 2.5 points.

Do these results imply that democracy is not very important for citizens’ happiness? Not necessarily. A regression analysis ignores the correlation between the predictor variables. It is possible that the correlation between purchasing power and democracy reflects at least in part a causal effect of democracy on wealth. For example, democratic governments may invest more in education and innovation and achieve higher economic growth. Democracies may also produce better working conditions and policies that benefit the working class rather than wealthy elites.

I will not repeat the mistake of many other social scientists of ending with a strong conclusion that fits their world views based on weak and inconclusive data. The main aim of this blog post is to warn readers that social science is much more complicated than the natural sciences. “Follow the science” makes a lot of sense when large clinical trials show strong effectiveness of drugs or vaccines. The social sciences can provide valuable information, but they do not provide simple rules that can be followed to increase human well-being. This does not mean that social science is irrelevant. Ideally, social scientists would provide factual information and leave the interpretation to educated consumers.

Interpreting discrepancies between Self-Perceptions and IAT scores: Who is defensive?

In 1998, Anthony G. Greenwald and colleagues introduced the Implicit Association Test. Since then, Implicit Association Tests have been used in thousands of studies with millions of participants to study stereotypes and attitudes. The most prominent and controversial application is the race IAT, which has been used to argue that many White Americans have more negative attitudes towards African Americans than they admit to others or even to themselves.

The popularity of IATs can be attributed to the Project Implicit website, which provides visitors with the opportunity to take an IAT and to receive feedback about their performance. Over 1 million visitors have received feedback about their performance on the race IAT (Howell, Gaither, & Ratliff, 2015).

Providing participants with performance feedback can be valuable and educational. Coaches provide feedback to athletes so that they can improve their performance, and professors provide feedback about performance during midterms so that students can improve their performance on finals. However, the value of feedback depends on the accuracy of the feedback. As psychological researchers know, providing participants with false feedback is unethical and requires extensive debriefing to justify its use in research. It is therefore crucial to examine the accuracy of performance feedback on the race IAT.

At face value, IAT feedback is objective and reflects participants’ responses to the stimuli that were presented during an IAT. However, this performance feedback should come with a warning that performance could vary across repeated administrations of the test. For example, the retest reliability of performance on the race IAT has been estimated to be between r = .2 and r = .5. Even using a value of r = .5 implies that there is only a 75% probability that somebody with a score above average receives a score above average again on a second test (Rosenthal & Rubin, 1982).
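The 75% figure follows from Rosenthal and Rubin’s (1982) binomial effect size display, which converts a correlation into the probability of landing on the same side of the average on a second test:

$$ P(\text{above average at retest} \mid \text{above average at first test}) \approx .5 + \frac{r}{2} = .5 + \frac{.5}{2} = .75 $$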

However, the Project Implicit website gives the false impression that performance on IATs is rather consistent, while avoiding quantitative information about reliability.

[Screenshot: Project Implicit FAQ #5]

Unreliability is not the only reason why performance feedback on the Project Implicit website could be misleading. Another problem is that visitors may be given the impression that performance on the race IAT reveals something about themselves that goes beyond performance on this specific task. One possible interpretation of race IAT scores is that they reveal implicit attitudes or evaluations of Black and White Americans. These implicit attitudes can be different from the attitudes that individuals think they have, which are called explicit attitudes. In fact, Greenwald et al. (1998) introduced IATs as a method that can detect implicit attitudes that differ from explicit attitudes, and this dual-attitude model has fueled interest in IATs.

The Project Implicit website does not provide a clear explanation of what Implicit Association Tests test. Regarding the race IAT, visitors are told that it is not a measure of prejudice, but that it does measure their biases, even if these biases are not endorsed or contradict conscious beliefs.

[Screenshot of Project Implicit FAQ #11]

However, other frequently asked questions imply that IATs measure implicit stereotypes and attitudes. One question asks how IATs measure implicit attitudes, implying that the test can measure implicit attitudes (and that implicit attitudes exist).

[Screenshot of Project Implicit FAQ #2]

Another one implies that performance on the race IAT reveals implicit attitudes that reflect cultural biases.

In short, while Project Implicit does not provide a clear explanation of what an Implicit Association Test measures, the website strongly implies that test performance reveals something about participants’ racial biases that may contradict their self-perceptions.

An article by Howell, Gaither, and Ratliff (2015) makes this assumption explicit. This article examines how visitors of the Project Implicit website respond to performance feedback on the race IAT. The key claim of this article is that “people are generally defensive in response to feedback indicating that their implicit attitudes differ from their explicit attitudes” (p. 373). This statement rests on two assumptions. First, it makes the assumption of dual-attitude models that there are explicit and implicit attitudes, as suggested by Greenwald et al. (1998). Second, it implies that performance on a single race IAT provides highly valid information about implicit attitudes. These assumptions place researchers in the position of experts who know individuals’ implicit attitudes, just like a psychoanalyst claims a superior position to understand the true meaning of a dream. If test takers reject the feedback, they are considered defensive because they are unwilling to accept the truth.

To measure defensiveness, Howell et al. (2015) used answers to three questions after visitors of the Project Implicit website received performance feedback on the race IAT, namely
(a) the IAT does not reflect anything about my thoughts or feelings unconscious or otherwise,
(b) whether I like my IAT score or not, it captures something important about me (reversed)
(c) the IAT reflects something about my automatic thoughts and feelings concerning this topic (reversed). Responses were made on a scale from 1 = strongly disagree to 4 = strongly agree. On this scale, a score of 2.5 implies neither agreement nor disagreement with the aforementioned statements (a short scoring sketch follows below).
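A minimal R sketch of how such a defensiveness score could be computed is shown below. The item names and example values are hypothetical; only the 1–4 response scale and the reverse-scoring of items (b) and (c) follow the description above.

```r
# Hypothetical scoring of the three defensiveness items on a 1-4 scale;
# items b and c are reverse-scored before averaging, as described above.
reverse_4pt <- function(x) 5 - x                     # maps 1<->4 and 2<->3
defensiveness_score <- function(a, b, c) {
  rowMeans(cbind(a, reverse_4pt(b), reverse_4pt(c)))
}
# example with two hypothetical respondents
defensiveness_score(a = c(4, 1), b = c(1, 4), c = c(1, 4))   # 4.0 and 1.0
```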

There was hardly any difference in defensiveness scores between White (M = 2.31, SD = 0.68), Black (M = 2.38, SD = 0.74), and biracial (M = 2.33, SD = 0.73) participants. For White participants, a larger pro-White discrepancy was correlated with higher defensiveness scores, partial r = .16. The same result was found for Black participants, partial r = .13, and a similar trend emerged for biracial participants. While these correlations are weak, they suggest that all three racial groups were less likely to believe the feedback when their IAT scores showed a stronger pro-White bias than their self-ratings implied.
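A partial correlation of this kind can be thought of as a correlation between residuals. The base R sketch below shows one common way to compute it; the variable names are hypothetical, and this is not necessarily the exact procedure Howell et al. (2015) used.

```r
# Partial correlation between the IAT-explicit discrepancy and defensiveness,
# controlling for a third variable z. Variable names are illustrative only.
partial_r <- function(x, y, z) {
  cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
}
# usage: partial_r(discrepancy, defensiveness, covariate)
```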

Howell et al. (2015) interpret these results as evidence of defensiveness. Accordingly, “White individuals want to avoid appearing racist (O’Brien et al., 2010) and Black individuals value pro-Black bias (Sniderman & Piazza, 2002)” (p. 378). However, this interpretation of the results rests on the assumption that the race IAT is an unbiased measure of racial attitudes. Howell et al. (2015) ignore a plausible alternative explanation of their results. The alternative explanation is that performance feedback on the race IAT is biased in favor of pro-White attitudes. One source of this bias could be the scoring of IATs which relies on the assumption that neutral attitudes correspond to a zero score. This assumption has been challenged in numerous articles (e.g., Blanton, Jaccard, Strauts, Mitchell, & Tetlock, 2015). It is also noteworthy that other implicit measures of racial attitudes show different results than the race IAT (Judd et al., 1995; Schimmack & Howard, 2021). Another problem is that there is little empirical support for dual-attitude models (Schimmack, 2021). Thus, it is impossible for IAT scores to provide truthful information that is discrepant from individuals’ self-knowledge (Schimmack, 2021).

Of course, people are defensive when they are confronted with unpleasant information and inconvenient truths. A prime example of defensiveness is the response of the researchers behind Project Implicit to valid scientific criticism of their interpretation of IAT scores.

[Screenshot of the Project Implicit “About Us” page]

Despite several inquiries about questionable or even misleading statements on the frequently asked questions page, Project Implicit visitors are not informed that the wider scientific community has challenged the interpretation of performance feedback on the race IAT as valid information about individuals’ implicit attitudes. The simple fact that a single IAT score provides insufficient information to make valid claims about an individual’s attitudes or behavioral tendencies is missing. Visitors should be informed that the most plausible and benign reason for a discrepancy between their test scores and their beliefs is that the test scores could be biased. However, Project Implicit is unlikely to provide visitors with this information because the website is used for research purposes, and willingness to participate in research might decrease if participants were told the truth about the mediocre validity of IATs.

Proponents of IATs often argue that taking an IAT can be educational. However, Howell et al. (2015) point out that even this alleged benefit is elusive because individuals are more likely to believe themselves than the race IAT feedback. Thus, rejection of IAT feedback, whether it is based on defensiveness or valid concerns about the validity of the test results, might undermine educational programs that aim to reduce actual racial biases. It is therefore problematic to use the race IAT in education and intervention programs.

2021 Replicability Report for the Psychology Department at the University of Amsterdam

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles by researchers in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in the success rates of actual replication studies (Schimmack, 2022).

University of Amsterdam

The University of Amsterdam is the highest ranked European psychology department (QS Rankings). I used the department website to find core members of the psychology department. I found 48 senior faculty members. Not all researchers conduct quantitative research and report test statistics in their result sections. Therefore, the analysis is limited to 25 faculty members that had at least 100 test statistics.

A search of the database retrieved 13,529 test statistics. This is the highest number of statistical tests of all departments examined so far (Department Rankings). This high research output partially explains the high ranking of the University of Amsterdam in rankings of prestige.

Figure 1 shows the z-curve plot for these results. I use the Figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 2,034 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the computation of the meta-statistics.

2. Visual inspection of the histogram shows a drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 35% of the total area under the grey curve. This is called the expected discovery rate (EDR) because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, 70% of the published results are statistically significant (including z > 6). This percentage is called the observed discovery rate (ODR), which is the rate of significant results in published journal articles. The difference between a 70% ODR and a 35% EDR provides an estimate of the extent of selection for significance. The difference of ~35 percentage points is large in absolute terms, but relatively small in comparison to other psychology departments. The upper limit of the 95% confidence interval for the EDR is 46%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (70% vs. 72%) is similar, but the EDR is higher (35% vs. 28%), suggesting less severe selection for significance by the faculty members at the University of Amsterdam who are included in this analysis.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 66% suggests a fairly high replication rate. The problem is that actual replication rates are lower than the ERR predictions (about 40%; Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance will lead to regression to the mean when replication studies are not exact. Thus, the ERR represents a best-case scenario that is unrealistic. In contrast, the EDR represents the worst-case scenario in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 35% is below the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For the University of Amsterdam, the ARP is (66 + 35)/2 ≈ 51%. This is somewhat higher than the currently best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, research from the University of Amsterdam is expected to replicate at a higher rate than psychological research in general.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 35% implies that no more than 10% of the significant results are false positives, but the lower limit of the 95% CI of the EDR, 23%, allows for 18% false positive results. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 2% with an upper limit of the 95% confidence interval of 4%. Thus, without any further information, readers could use this criterion to interpret results published in articles by psychology researchers at the University of Amsterdam.
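Because the same five meta-statistics are used in all of the departmental reports that follow, a minimal R sketch of the calculations is given below. It assumes the values reported above for the University of Amsterdam (ODR = 70%, EDR = 35%, ERR = 66%); the z-curve model itself has to be estimated with dedicated z-curve software, so only the quantities derived from its output are reproduced here.

```r
# 1. Convert a two-tailed p-value or a t-value into an absolute z-score.
p_to_z <- function(p) qnorm(1 - p / 2)
t_to_z <- function(t, df) p_to_z(2 * pt(-abs(t), df))
p_to_z(.05)                       # 1.96, the solid red line in the plot

# 3. The gap between the observed and expected discovery rate indexes
#    selection for significance.
odr <- .70; edr <- .35; err <- .66
odr - edr                         # ~.35, i.e., ~35 percentage points

# 4. The actual replication prediction (ARP) is the average of EDR and ERR.
(edr + err) / 2                   # ~.51

# 5. Soric's (1989) maximum false discovery rate for a given EDR and alpha.
soric_fdr <- function(edr, alpha = .05) (1 / edr - 1) * alpha / (1 - alpha)
soric_fdr(.35)                    # ~.10, point estimate
soric_fdr(.23)                    # ~.18, at the lower limit of the EDR's 95% CI
```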

Some researchers have changed research practices in response to the replication crisis. It is therefore interesting to examine whether the replicability of newer research has improved. It is particularly interesting to examine changes at the University of Amsterdam because Eric-Jan Wagenmakers, a faculty member in the Methodology department, is a prominent advocate of methodological reforms. To examine this question, I performed a z-curve analysis for articles published in the past five years (2016-2021).

The results are disappointing. There is no evidence that research practices have changed in response to concerns about replication failures. The EDR estimate dropped from 35% to 25%, although this is not a statistically significant change. The ERR also decreased slightly from 72% to 69%. Therefore, the predicted success rate for actual replication studies decreased from 51% to 47%. This means that the University of Amsterdam decreased in rankings that focus on the past five years because some other departments have improved.

The replication crisis has been most severe in social psychology (Open Science Collaboration, 2015) and was in part triggered by concerns about social psychological research in the Netherlands. I therefore also conducted a z-curve analysis for the 10 faculty members in social psychology. The EDR is lower (24% vs. 35%) than for the whole department, which also implies a lower actual replication rate and a higher false positive risk.

There is variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all faculty members who provided results for the departmental z-curve. You can see the z-curve for each individual faculty member by clicking on their name.

Rank   Name   ARP   ERR   EDR   FDR
1 Jaap M. J. Murre 77 81 74 2
2 Hilde M. Geurts 73 76 69 2
3 Timo Stein 73 76 70 2
4 Hilde M. Huizenga 68 75 61 3
5 Maurits W. van der Molen 65 72 57 4
6 Astrid C. Homan 62 69 55 4
7 Wouter van den Bos 60 74 47 6
8 Frenk van Harreveld 54 64 44 7
9 Gerben A. van Kleef 53 70 37 9
10 K. Richard Ridderinkhof 53 69 36 9
11 Bruno Verschuere 52 72 32 11
12 Maartje E. J. Raijmakers 51 74 28 13
13 Merel Kindt 48 62 35 10
14 Mark Rotteveel 47 59 34 10
15 Sanne de Wit 47 74 20 22
16 Susan M. Bogels 44 63 26 15
17 Matthijs Baas 44 62 25 16
18 Arnoud R. Arntz 43 68 17 26
19 Filip van Opstal 43 65 20 20
20 Suzanne Oosterwijk 42 56 29 13
21 Edwin A. J. van Hooft 40 65 15 30
22 E. J. B. Doosje 38 61 15 31
23 Nils B. Jostmann 37 48 26 15
24 Barbara Nevicka 37 59 14 33
25 Reinout W. Wiers 36 47 25 15

2021 Replicability Report for the Psychology Department at Western University

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles by researchers in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in the success rates of actual replication studies (Schimmack, 2022).

Western University

I used the department website to find core members of the psychology department. I found 35 faculty members at the associate (9) or full professor (26) level. Not all researchers conduct quantitative research and report test statistics in their result sections. Therefore, the analysis is limited to 14 faculty members that had at least 100 test statistics.

Figure 1 shows the z-curve for all 6,080 test statistics. I use the Figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 865 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the computation of the meta-statistics.

2. Visual inspection of the histogram shows a drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 38% of the total area under the grey curve. This is called the expected discovery rate (EDR) because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, 70% of the published results are statistically significant (including z > 6). This percentage is called the observed discovery rate (ODR), which is the rate of significant results in published journal articles. The difference between a 70% ODR and a 38% EDR provides an estimate of the extent of selection for significance. The difference of ~30 percentage points is large in absolute terms, but relatively small in comparison to other psychology departments. The upper limit of the 95% confidence interval for the EDR is 56%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (70% vs. 72%) is similar, but the EDR is higher (38% vs. 28%), suggesting less severe selection for significance for research published by the faculty members at Western University who are included in this analysis.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 70% suggests a fairly high replication rate. The problem is that actual replication rates are lower than the ERR predictions (about 40%; Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance will lead to regression to the mean when replication studies are not exact. Thus, the ERR represents a best-case scenario that is unrealistic. In contrast, the EDR represents the worst-case scenario in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 38% is closer to the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For Western University, the ARP is (70 + 38)/2 = 54%. This is close to the currently best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, research from Western University is expected to replicate at the average rate of actual replication studies.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 38% implies that no more than 9% of the significant results are false positives, but the lower limit of the 95%CI of the EDR, 23%, allows for 18% false positive results. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 1% with an upper limit of the 95% confidence interval of 3%. Thus, without any further information readers could use this criterion to interpret results published in articles by psychology researchers at Western University.

Some researchers have changed research practices in response to the replication crisis. It is therefore interesting to examine whether the replicability of newer research has improved. To examine this question, I performed a z-curve analysis for articles published in the past five years (2016-2021).

The results are amazing. The EDR increased to 70% with a relatively tight confidence interval ranging from 61% to 78%. The confidence interval does not overlap with the confidence interval for all-time z-scores. This makes Western only the second department that shows a statistically significant improvement in response to the replication crisis. Moreover, the ARP of 74% is the highest observed so far and much higher than that of some other departments.

There is variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all faculty members who provided results for the departmental z-curve. You can see the z-curve for each individual faculty member by clicking on their name.

Rank   Name   ARP   ERR   EDR   FDR
1 Rachel M. Calogero 80 88 71 2
2 Melvyn A. Goodale 74 78 70 2
3 Daniel Ansari 67 77 58 4
4 John Paul Minda 64 75 54 5
5 Stefan Kohler 63 66 60 3
6 Ryan A. Stevenson 57 65 50 5
7 Debra J. Jared 54 76 33 11
8 Erin A. Heerey 52 64 41 8
9 Stephen J. Lupker 51 69 33 11
10 Ken McRae 51 69 32 11
11 Ingrid S. Johnsrude 49 75 22 18
12 Lorne Campbell 45 62 29 13
13 Marc F. Joanisse 45 67 24 17
14 Victoria M. Esses 39 52 26 15

2021 Replicability Report for the Psychology Department at McGill University

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles by researchers in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in the success rates of actual replication studies (Schimmack, 2022).

McGill University

I used the department website to find core members of the psychology department. I found (only) 20 faculty members at the associate (5) or full professor (15) level. The reason is that McGill is going through a phase of renewal and currently has a large number of assistant professors (14) who are not yet tenured and are therefore not included in these analyses. It will be interesting to see the replicability of research at McGill in five years, when these assistant professors have been promoted to the rank of associate professor.

Not all researchers conduct quantitative research and report test statistics in their result sections. Therefore, the analysis is limited to 10 faculty members that had at least 100 significant test statistics. Thus, the results are by no means representative of the whole department with 34 faculty members, but I had to follow the same criteria that I used for other departments.

Figure 1 shows the z-curve for all 3,000 test statistics. This is a relatively small number of z-scores. Larger departments and departments with more prolific researchers can have over 10,000 test statistics. I use the Figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 412 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the computation of the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 21% of the total area under the grey curve. This is called the expected discovery rate (EDR) because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, 78% of the published results are statistically significant (including z > 6). This percentage is called the observed discovery rate (ODR), which is the rate of significant results in published journal articles. The difference between a 78% ODR and a 21% EDR provides an estimate of the extent of selection for significance. The difference of nearly 60 percentage points is the largest difference observed for any department analyzed so far (k = 11). The upper limit of the 95% confidence interval for the EDR is 34%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (78% vs. 72%) is higher and the EDR (21% vs. 28%) is lower, suggesting more severe selection for significance for research published by the McGill faculty members included in this analysis.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 66% suggests a fairly high replication rate. The problem is that actual replication rates are lower than the ERR predictions (about 40%; Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance will lead to regression to the mean when replication studies are not exact. Thus, the ERR represents a best-case scenario that is unrealistic. In contrast, the EDR represents the worst-case scenario in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 21% is lower than the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For McGill University, the ARP is (66 + 21)/2 ≈ 44%. This is close to the currently best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, research from McGill University is expected to replicate at the average rate of actual replication studies.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 21% implies that no more than 20% of the significant results are false positives, but the lower limit of the 95% CI of the EDR, 12%, allows for 39% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 3% with an upper limit of the 95% confidence interval of 9%. Thus, without any further information, readers could use this criterion to interpret results published in articles by psychology researchers at McGill University. Of course, this criterion will be inappropriate for some researchers, but the present results show that the traditional alpha criterion of .05 is also inappropriate to maintain a reasonably low probability of false positive results.

Some researchers have changed research practices in response to the replication crisis. It is therefore interesting to examine whether the replicability of newer research has improved. To examine this question, I performed a z-curve analysis for articles published in the past five years (2016-2021).

The results are disappointing. The point estimate of the EDR, 18%, is even lower than the estimate for all years, 21%, although the difference could just be sampling error. Mostly, these results suggest that the psychology department at McGill University has not responded to the replication crisis in psychology, despite a low replication rate that leaves much room for improvement. It will be interesting to see whether the large cohort of assistant professors has adopted better research practices and will boost McGill’s standing in the replicability rankings of psychology departments.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all faculty members who provided results for the departmental z-curve. You can see the z-curve for each individual faculty member by clicking on their name.

Rank   Name   ARP   ERR   EDR   FDR
1 Caroline Palmer 71 75 68 2
2 Kristine H. Onishi 60 69 51 5
3 Melanie A. Dirks 54 78 31 12
4 Richard Koestner 50 67 32 11
5 Yitzchak M. Binik 49 73 25 16
6 John E. Lydon 44 61 28 14
7 Jelena Ristic 44 69 18 24
8 Mark W. Baldwin 43 54 33 11
9 Jennifer A. Bartz 39 58 19 22
10 Blaine Ditto 17 27 7 67

2021 Replicability Report for the Psychology Department at Princeton University

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles by researchers in other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in the success rates of actual replication studies (Schimmack, 2022).

Princeton University

I used the department website to find core members of the psychology department. I found 24 professors and 4 associate professors. I used Web of Science to download references related to the authors’ names and initials. An R script searched for related publications in the database of publications in 120 psychology journals.

Not all researchers conduct quantitative research and report test statistics in their result sections. Therefore, the analysis is limited to 17 faculty members that had at least 100 test statistics. This criterion eliminated many faculty members who publish predominantly in neuroscience journals.

Figure 1 shows the z-curve for all 6,199 test statistics. I use the Figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 710 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the computation of the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 40% of the total area under the grey curve. This is called the expected discovery rate (EDR) because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, 70% of the published results are statistically significant (including z > 6). This percentage is called the observed discovery rate (ODR), which is the rate of significant results in published journal articles. The difference between a 70% ODR and a 40% EDR provides an estimate of the extent of selection for significance. The difference of ~30 percentage points is large, but it is one of the smallest differences observed in these investigations of psychology departments. The upper limit of the 95% confidence interval for the EDR is 51%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (70% vs. 72%) is similar, but the EDR is higher (40% vs. 28%). Although this difference is not statistically significant, it suggests that the typical study at Princeton has slightly more power than studies in psychology in general.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 65% suggests a fairly high replication rate. The problem is that actual replication rates are lower than the ERR predictions (about 40%; Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance will lead to regression to the mean when replication studies are not exact. Thus, the ERR represents a best-case scenario that is unrealistic. In contrast, the EDR represents the worst-case scenario in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For Princeton, the ARP is (65 + 40)/2 = 52.5%. This is somewhat higher than the currently best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, research from Princeton University is expected to replicate at a slightly higher rate than studies in psychology in general.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 40% implies that no more than 8% of the significant results are false positives, but the lower limit of the 95% CI of the EDR, 31%, allows for 12% false positive results. To lower the risk of a false positive result, it is possible to reduce the significance threshold to alpha = .005 (Benjamin et al., 2017). Figure 2 shows the implications of this new criterion (z = 2.8). The false positive risk is now 2% and even the upper limit of the 95% confidence interval is only 3%. Thus, without any further information, readers could use this criterion to interpret results published in articles by psychology researchers at Princeton.
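For reference, the z-value that corresponds to this stricter criterion can be computed directly in R; the two lines below simply convert two-tailed alpha levels into z thresholds.

```r
# z-value for a two-tailed alpha of .005 (Benjamin et al., 2017),
# compared with the conventional alpha = .05 threshold.
qnorm(1 - .005 / 2)   # ~2.81, the value referred to as z = 2.8 above
qnorm(1 - .05 / 2)    # ~1.96
```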

Some researchers have changed research practices in response to the replication crisis. It is therefore interesting to examine whether the replicability of newer research has improved. To examine this question, I performed a z-curve analysis for articles published in the past five years (2016-2021).

The point estimate of the EDR increased from 40% to 61%, but due to the relatively small number of observations this change is not statistically significant. It is also problematic that the z-curve plot shows a higher frequency of z-scores between 2.2 and 2.4 than between 2.0 and 2.2. While there are several possible reasons for this pattern, one explanation could be that some researchers use a new criterion value for selection. Rather than publishing any p-value below .05, they may only publish p-values below .02, for example. This practice would bias the z-curve estimates, which assume no further selection effects once a p-value is below .05.

The next figure shows the results of an analysis that excludes z-scores between 2 and 2.2. The main finding is that the EDR estimate drops from 61% to 25%. As a result, the FDR estimate increases from 3% to 16%. Thus, it is too early to conclude that Princeton’s research has become notably more replicable, and I would personally continue to use alpha = .005 to reject null-hypotheses.
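A rough sketch of this kind of sensitivity analysis is shown below. The z-scores are simulated stand-ins, the zcurve() call assumes the CRAN zcurve package, and the settings of the actual re-analysis (such as moving the selection threshold to 2.2) are not reproduced here; the sketch only illustrates the filtering step.

```r
# Illustrative only: simulate a stand-in vector of absolute z-scores and
# drop the just-significant values between 1.96 and 2.2 before refitting.
library(zcurve)                              # assumes the CRAN 'zcurve' package
set.seed(1)
z <- abs(rnorm(2000, mean = 2.5, sd = 1))    # stand-in for observed z-scores
z <- z[z > 1.96]                             # keep the significant results
z_sub <- z[z >= 2.2]                         # exclude the 1.96-2.2 band
fit_sub <- zcurve(z_sub)                     # a full analysis would also move the
                                             # selection threshold from 1.96 to 2.2
summary(fit_sub)                             # inspect how EDR (and implied FDR) change
```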

10 of the 17 faculty members with useful data were classified as social psychologists. The following analysis is limited to the z-scores of these 10 faculty members to examine whether social psychological research is less replicable (Open Science Collaboration, 2015).

The EDR is slightly, though not significantly, lower than for the department as a whole, but it is still higher than the EDRs of other departments. Thus, there is no evidence that social psychology at Princeton is less replicable than research in other areas. Other areas did not have enough test statistics for a meaningful analysis.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all faculty members who provided results for the departmental z-curve. You can see the z-curve for each individual faculty member by clicking on their name.

Rank   Name   ARP   ERR   EDR   FDR
1 Uri Hasson 82 83 81 1
2 Kenneth A. Norman 73 75 71 2
3 Jordan A. Taylor 68 77 60 4
4 Elke U. Weber 66 70 63 3
5 Tania Lombrozo 65 73 58 4
6 Diana I. Tamir 59 69 50 5
7 Yael Niv 58 68 47 6
8 Emily Pronin 51 68 34 10
9 Jonathan D. Cohen 50 76 23 17
10 Alin Coman 49 57 40 8
11 Molly J. Crockett 49 75 23 17
12 J. Nicole Shelton 47 55 39 8
13 Susan T. Fiske 46 70 22 19
14 Stacey Sinclair 40 49 31 12
15 Eldar Shafir 35 55 14 31
16 Deborah A. Prentice 33 55 11 44
17 Joel Cooper 28 44 12 39

Are there maladaptive personality traits?

The concept of personality disorders has its roots in psychiatry. Wikipedia provides a clear definition of personality disorders.

Personality disorders (PD) are a class of mental disorders characterized by enduring maladaptive patterns of behavior, cognition, and inner experience, exhibited across many contexts and deviating from those accepted by the individual’s culture.

The definition of personality disorders shares many features with definitions of normal personality traits. Personality traits are enduring dispositions that produce cross-situational and cross-temporal consistency in behaviors, cognitions, and emotions. The key difference is that disordered personality traits are assumed to be maladaptive, unhealthy, or deviant from societal and cultural norms.

The history of psychiatry and psychology shows how problematic it can be when a profession is allowed to define mental disorders, especially when they are defined as deviance from social norms. The first Diagnostic and Statistical Manual of the American Psychiatric Association included homosexuality as a mental disorder (Wikipedia). This is now recognized as a mistake, and social progress aims to be more inclusive towards individuals who deviate from traditional cultural norms.

Progressive forces are also trying to change social norms regarding body types, skin color, and many other attributes that vary across individuals. However, some psychologists are working towards a comprehensive system of personality disorders that may create new stigmas for individuals with deviant personality traits. This is a dangerous trend that has not received enough attention. To be maladaptive, a personality trait should have clear negative effects on individuals’ health and well-being. This requires extensive validation research and careful examination of measures that could be used to diagnose personality disorders. For example, the CAT-PD project identified 33 distinct personality disorders (Simms et al., 2011), ranging from Anhedonia to Withdrawn personality disorder.

To study personality disorders, researchers developed questionnaires that can be used to diagnose them. Studies with these measures showed that responses to items on personality disorder questionnaires are often strongly correlated with responses to items on measures of normal personality traits (Wright & Simms, 2014). This raises concerns about the discriminant validity of measures that are designed to assess personality disorders. For example, measures of normal personality assess individual differences in trust. Some individuals are more trusting than others, and trust or distrust can be advantageous in different contexts. However, the CAT-PD includes a measure of mistrust as one of its 33 personality disorder scales. The challenge for theories of personality disorders is to demonstrate that the mistrust scale does not merely measure normal variation in trust, but identifies maladaptive forms or levels of low trust. This leads to two statistical criteria that a valid measure of personality disorders should fulfill. First, variation in the personality disorder measure should be distinct from variation in normal personality. Second, the unique variance in the measure of personality disorders should predict symptoms of adaptation failures. As the key criterion for a mental disorder is suffering, maladaptive personality traits should predict lower well-being (e.g., internalizing or externalizing symptoms, lower life-satisfaction).
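These two criteria translate into a simple analysis strategy, sketched below in base R for the mistrust example. All variable names and the simulated data are hypothetical, and no claim is made that this is how the CAT-PD authors validated their scales.

```r
# Simulated stand-in data (purely illustrative; not the CAT-PD data).
set.seed(1)
n <- 500
trust_facet       <- rnorm(n)
mistrust_pd       <- -0.6 * trust_facet + rnorm(n, sd = 0.8)
life_satisfaction <-  0.3 * trust_facet - 0.2 * mistrust_pd + rnorm(n)
dat <- data.frame(trust_facet, mistrust_pd, life_satisfaction)

# Criterion 1: how much variance does the disorder scale share with the
# normal-range trust facet? (R-squared = shared variance)
summary(lm(mistrust_pd ~ trust_facet, data = dat))

# Criterion 2: does the unique variance of the disorder scale predict an
# adaptation outcome (here, life satisfaction) beyond the normal trait?
fit_trait <- lm(life_satisfaction ~ trust_facet, data = dat)
fit_both  <- lm(life_satisfaction ~ trust_facet + mistrust_pd, data = dat)
anova(fit_trait, fit_both)   # incremental validity test for the disorder scale
```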

Another threat to the validity of personality disorder measures is that self-ratings are often influenced by the desirability of items. This response bias has been called halo bias, socially desirable responding, other-deception, faking, or self-enhancement. Ample evidence shows that self-ratings are influenced by halo bias. The strongest evidence for the presence of halo bias comes from multi-rater studies and studies that compare self-ratings to objective measures (Anusic et al., 2009). For example, whereas intelligence and attractiveness are practically uncorrelated, self-ratings of the two are positively correlated because some individuals exaggerate both their attractiveness and their intelligence. Halo bias also influences self-ratings of normal personality traits and well-being. As most personality disorder items are highly evaluative, it is likely that self-ratings of personality disorders are also contaminated by halo bias. Studies with self-ratings and informant ratings also suggest that self-ratings of personality disorders are distorted by response styles (Quilty, Cosentino, & Bagby, 2018). This could mean that honest responders are misdiagnosed as having a personality disorder, whereas self-enhancers are falsely diagnosed as not having one.

To explore the validity of the CAT-PD scales as measures of personality disorders, I reanalyzed the data from the Wright and Simms (2014) article. The dataset consists of ratings on the 30 facets of Costa and McCrae’s model of normal personality, the 33 scales of the CAT-PD, and another measure of personality disorders, the PID-5. The scale scores were subjected to an exploratory factor analysis with five factors. The factors were labeled Antagonism (Manipulativeness .71, Straightforwardness -.70), Negative Affectivity (Anger .71, Trust -.52), Disinhibition (Irresponsibility .75, Self-Discipline -.87), Detachment (Emotional Detachment .65, Positive Emotions -.59), and Psychoticism (Unusual Beliefs .76, no strong negative loading). This model suggests that the higher-order factors of normal personality and personality disorders overlap. However, this model should not be taken too seriously because it has relatively low fit. There are several reasons for this low fit. First, exploratory factor analysis (EFA) often confounds substantive factors and method factors. Confirmatory factor analysis is often needed to separate method variance from substantive variance. Second, EFA cannot represent hierarchical structures in data. This is a problem because the Big Five are higher-order factors of basic personality traits called facets. It is possible that the 33 personality disorder scales are related to normal personality at both of these levels, but factor analysis assumes that all correlations are produced by shared variance with the higher-order Big Five factors. Finally, EFA does not allow for residual correlations among Big Five facets or personality disorder scales. All of these problems explain why an EFA model fails to fit the observed correlation matrix.
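For readers who want to try a similar analysis, a five-factor EFA with an oblique rotation can be run in base R as sketched below. The data are a simulated stand-in for the 63 scale scores, so the output will not reproduce the loadings reported above.

```r
# Exploratory five-factor solution for 63 scale scores (simulated stand-in;
# a real analysis would use the actual facet and CAT-PD scale scores).
set.seed(1)
g <- matrix(rnorm(500 * 5), 500, 5)                       # 5 latent factors
scales <- as.data.frame(g[, rep(1:5, length.out = 63)] +
                        matrix(rnorm(500 * 63), 500, 63)) # noisy indicators
efa5 <- factanal(scales, factors = 5, rotation = "promax")
print(efa5$loadings, cutoff = .30)                        # show loadings above .30
```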

To provide a better test of the validity of the CAT-PD scales as measures of personality disorders, I performed a confirmatory factor analysis (CFA). CFA is a statistical tool whose name suggests that it can only be used for confirmatory analyses, but this is not true. CFA can also be used to explore models and then confirm these models in new datasets. In contrast, EFA is not well suited for this purpose because its limitations make it impossible to find a well-fitting model that could be subjected to a confirmatory test. As I have only one dataset, the present results are exploratory and require confirmation with new data.

Exploratory Factor Analysis

Wright and Simms (2014) did not report standard fit statistics for their EFA model. I also limited the analysis to the (normal) personality scales and the CAT-PD scales. Thus, I ran an EFA with five factors to obtain fit indices that can be compared to those of the CFA model, chi2 (1,648) = 3993.93, CFI = .786, RMSEA = .048, SRMR = .071, AIC = 71,534.36, BIC = 73,449.10. While the RMSEA is below the standard criterion value of .06, the CFI is well below the criterion value of .95. However, these criterion values are only suggestive. More important is the comparison of alternative models. A better model should fit the data better, especially according to fit indices that reward parsimony (CFI, RMSEA, AIC, BIC).

Confirmatory Factor Analysis

The final model was constructed in several steps, starting with the model for normal personality. The theoretical model assumes six independent factors, the Big Five and a halo factor. Primary loadings of the 30 facets on the Big Five were specified according to Costa and McCrae's theory and freely estimated. Loadings on the halo factor were initially constrained to 1, but freed in the final model. Secondary loadings were added if modification indices were greater than 20 and if they were interpretable. Finally, residual correlations were added if modification indices were above 20. The final Big Five plus Halo model (B5+H) had good fit, chi2 (314) = 556.89, CFI = .951, RMSEA = .045, SRMR = .059, AIC = 54,763.01, BIC = 55,354.43. This finding already reveals a problem with the 5-factor EFA solution: the EFA model failed to identify a separate Openness factor, even though Openness is clearly present in the data.

I then explored how the 33 CAT-PD scales are related to the 36 personality predictors; that is, the 30 facets, the Big Five, and the halo factor. I always allowed for halo effects, even if they were not significant, but I only used significant (p < .01) personality predictors. After these exploratory analyses, I fitted a model with all 33 CAT-PD scales. This required modeling relationships among CAT-PD scales that are not explained by the personality predictors (residual relationships). This led me to specify a CAT-PD method factor that resembles the fifth factor in the EFA model. The final model did not have any modification indices greater than 20.

The final model had about the same complexity as the EFA model (1,651 vs. 1,648 degrees of freedom), but the CFA model had better fit, chi2 (1,651) = 3,145.88, CFI = .866, RMSEA = .038, SRMR = .069, AIC = 70,514.85, BIC = 72,416.26.
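For readers who want to check how fit indices like these compare across models, here is a minimal sketch of the standard RMSEA and CFI formulas. This is not the code used for the analyses; the sample size and the baseline (independence) model chi-square are not reported in this post, so the values below are placeholders for illustration only.

from math import sqrt

def rmsea(chi2, df, n):
    # Root mean square error of approximation; this is the common N - 1 version,
    # some programs divide by N instead.
    return sqrt(max(chi2 - df, 0) / (df * (n - 1)))

def cfi(chi2, df, chi2_base, df_base):
    # Comparative fit index relative to the baseline (independence) model,
    # whose chi-square and df would have to be taken from the software output.
    d_model = max(chi2 - df, 0)
    d_base = max(chi2_base - df_base, d_model)
    return 1 - d_model / d_base

n = 600  # placeholder sample size (not reported in this post)
print(round(rmsea(3993.93, 1648, n), 3))  # EFA model, roughly .049
print(round(rmsea(3145.88, 1651, n), 3))  # CFA model, roughly .039

With a sample size in the ballpark of 600, these formulas come close to the reported RMSEA values of .048 and .038; the exact numbers depend on the actual sample size.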

The key results are summarized in Table 1, which shows the loadings of the 33 CAT-PD scales on the seven factors. The CAT-PD scales are sorted in order of their primary loadings. The Excel file can be downloaded here (results).

Eight scales had their primary loading on the Halo (evaluative) factor. Some of these loadings were very strong. For example, halo explained nearly 70% of the variance in callousness scores (-.83^2 = 69%). In contrast, relationships to the five factors of normal personality were relatively weak. Only five loadings exceeded a value of .3 (9% explained variance). The strongest relationship was the loading of domineering on neuroticism (.40^2 = 16%). These scales also showed no notable loadings on the CAT factor that reflects shared variance among CAT scales. Some scales had additional relationships with specific facets of normal variation in personality. Most notably, domineering was predicted by the unique variance in assertiveness (.38^2 = 14%).

It is difficult to reconcile these findings with the conceptualization of these CAT-PD scales as measures of maladaptive personality variation. Their primary loadings on the halo factor will produce correlations with measures of adaptation that also rely on halo-biased self-ratings (e.g., life satisfaction), but it is doubtful that these scales could predict maladaptive outcomes that are assessed with independent measures. Scales that are related to neuroticism are expected to show negative relationships with measures of well-being, but it is not clear that they would predict unique variance after controlling for the known effects of neuroticism, especially the depressiveness facet of neuroticism.

My predictions about the relationship of these CAT scales and measures of well-being need to be tested with actual data, but the present findings show that studies that rely exclusively on self-ratings can be biased if the influence of halo bias on self-ratings is ignored.
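As a quick sketch of the squared-loading arithmetic used in these paragraphs (the square of a standardized factor loading gives the proportion of variance the factor explains in a scale), consider the following; the function name is my own and not part of the original analysis:

def explained_variance(loading):
    # Proportion of variance explained by a factor with a given standardized loading.
    return loading ** 2

print(f"{explained_variance(-0.83):.0%}")  # callousness on halo: 69%
print(f"{explained_variance(0.40):.0%}")   # domineering on neuroticism: 16%
print(f"{explained_variance(0.38):.0%}")   # domineering on assertiveness (unique variance): 14%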

The next 12 scales have their primary loading on the neuroticism factor of variation in normal personality. Some of these loadings are very high. For example, neuroticism explains .80^2 = 64% of the variance in scores on the Affective Lability scale. It is well-known that high levels of neuroticism are a risk factor for mood disorders, especially during times of high stress. It can be debated whether this makes high neuroticism a personality disorder. Alternatively, the diathesis-stress model would argue that neuroticism is a disposition that only becomes maladaptive in combination with environmental factors. However, even if neuroticism were considered a personality disorder, it is not clear whether the 12 CAT-PD scales with primary loadings on neuroticism add to our understanding of mental health problems and mood disorders. An alternative perspective is that these CAT-PD scales merely reflect different manifestations of mood disorders. This could be tested by examining how scores on these scales respond to treatment of mood disorders. It is also noteworthy that the CAT-PD failed to identify personality disorders that are related to maladaptively low levels of negative affect. In particular, the absence of negative emotions that inhibit problematic behaviors, such as guilt or anxiety, could be maladaptive.

Five CAT-PD scales had their primary loading on the extraversion factor of normal personality. Exhibitionism was the only scale with a positive loading. The loading was high and suggested that Extraversion explained 59% of the variance. Studies of extraversion and well-being typically show that higher levels of extraversion predict higher well-being. This suggests that any maladaptive effects of exhibitionism are related to the remaining unexplained variance. Thus, it is questionable that high levels of extraversion should be considered a personality disorder.

Anhedonia and Social Withdrawal are related to low Extraversion. As with the CAT-PD scales related to neuroticism, it is not clear that these scales add anything to our understanding of personality and mental health. Introversion itself may not be considered maladaptive. Rather, introversion may be a disposition that is only maladaptive under specific circumstances. Furthermore, causality may be reversed: treatment of mood disorders with medication increases extraversion scores.

Extraversion explains only 10% of the variance in romantic disinterest. Thus, this scale is hardly related to variation in normal personality. Moreover, it is not clear why opposite tendencies such as hypersexuality are missing.

None of the 33 CAT-PD scales have a primary loading on Openness. This is surprising because high Openness is sometimes associated with both being a genius and being detached from reality. In any case, there is no evidence to suggest that normal variation in Openness is maladaptive.

Even more surprising is the finding that none of the 33 CAT-PD scales have their primary loadings on agreeableness. The main reason is that disagreeableness scales are strongly influenced by socially desirable responding. Thus, more effort needs to be made to measure these constructs without halo bias. Informant ratings might be useful to obtain better measures of these constructs.

Four CAT-PD scales have their primary loadings on conscientiousness. Two had positive loadings, namely Perfectionism (.56^2 = 31% explained variance) and Workaholism (.51^2 = 26% explained variance). It is not clear that this justifies considering high conscientiousness a personality disorder. Rather, high conscientiousness could be a risk factor that is only maladaptive in specific environments, or it could reflect disorders that express themselves in different ways for individuals with different personality traits. For example, workaholism might be a specific maladaptive way to cope with negative affect that is not different from alcoholism or other addictions.

The last four CAT-PD scales have their primary loadings on the CAT-scale specific factor. Thus, they do not show strong overlap with normal personality. The high loading for unusual experiences might tap into cognitive symptoms related to psychoticism. Self-harm shows weak loadings on all factors and it is not clear that it measures a personality trait rather than a symptom of distress.

In conclusion, these results provide little support for the hypothesis that there is a large number of personality disorders. The main link between models of normal variation in personality and personality disorder scales is neuroticism. It is well-known that high levels of neuroticism predict lower well-being and are risk-factors for mood disorders. However, the remaining variation in personality is not consistently related to proposed measures of personality disorders.

Comparing the Results to the EFA Results

The better fitting 7-factor CFA model differs notably from the 5-factor EFA model in the original publication. The EFA model identified a single factor called Antagonism. This factor blends the halo factor and the Agreeableness factor of normal personality. As shown in the CFA model, CAT-PD scales load on the halo factor, whereas less evaluative scales of normal personality reflect Agreeableness. The reason EFA missed this distinction is that there are only six agreeableness measures and some of them have rather low loadings on agreeableness. As EFA is strongly influenced by the number of indicators for a factor, EFA failed to identify the agreeableness factor and the halo factor was misinterpreted as antagonism because antagonistic traits are highly undesirable.

The second factor was called Negative Affectivity and corresponds to the Neuroticism factor in the CFA model.

The third factor was called Disinhibition and corresponds more or less to the Conscientiousness factor in the CFA model.

The fourth factor was called Detachment and corresponds to Extraversion.

The fifth factor was called Psychoticism, which can be confusing because this is also a term used by Eysenck for variation in normal personality. This factor is similar to the CAT-specific factor in the CFA. Thus, it does not represent a dimension of normal personality.

Finally, the EFA model failed to represent Openness as a distinct factor for the same reason it failed to show a distinct agreeableness factor. As Openness is not related to PD-scales, the six Openness facets were simply not enough to form a distinct factor in a model limited to five factors.

In sum, the main problem of the EFA model is that it failed to identify agreeableness and openness as dimensions of normal personality. It therefore does not represent the well-known five factor structure of normal personality. This makes it impossible to examine how variation in normal personality is related to scales that are intended to measure personality disorders. Another problem is that EFA fails to separate method variance and content variance. In short, EFA is an inferior tool to study the relationship between measures of normal personality and allegedly maladaptive traits. The CFA model proposed here can serve as a basis for future exploration of this question, ideally with multi-rater data to separate method variance from content variance.

References

Simms, L. J., Goldberg, L. R., Roberts, J. E., Watson, D., Welte, J., & Rotterman, J. H. (2011). Computerized adaptive assessment of personality disorder: Introducing the CAT–PD Project. Journal of Personality Assessment, 93, 380–389. doi:10.1080/00223891.2011.577475

 

Replicability Rankings of Psychology Departments

Introduction

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments' websites to identify researchers who belong to the psychology department. This eliminates articles that are from other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

Department Rankings

The main results of the replicability analysis are included in this table. Detailed analyses of departments and faculty members can be found by clicking on the hyperlink of a university.

The table is sorted by the all-time actual replication prediction (ARP). It can easily be re-sorted by the other meta-statistics.

The ERR is the expected replication rate that is estimated based on the average power of studies with significant results (p < .05).

The EDR is the expected discovery rate that is estimated based on the average power of studies before selection for significance. It is estimated using the distribution of significant p-values converted into z-scores.

Bias is the discrepancy between the observed discovery rate (i.e., the percentage of significant results in publications) and the expected discovery rate. Bias reflects the selective reporting of significant results.

The FDR is the false discovery risk. It is estimated using Soric's formula, which converts the expected discovery rate into an estimate of the maximum percentage of false positive results under the assumption that true hypotheses are tested with 100% power.
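To make these definitions concrete, here is a minimal sketch of the arithmetic behind the ARP and Soric's false discovery risk. The alpha = .05 default reflects the conventional significance criterion, and the example values are taken from the University of Michigan row of the table below; the function names are my own.

def soric_fdr(edr, alpha=0.05):
    # Maximum false discovery risk implied by the expected discovery rate (Soric, 1989).
    return (1 / edr - 1) * (alpha / (1 - alpha))

def arp(edr, err):
    # Actual replication prediction: the average of the EDR and the ERR.
    return (edr + err) / 2

# Bias would be the observed discovery rate minus the EDR (the ODR is not listed in the table).
print(f"ARP = {arp(0.41, 0.69):.0%}")   # University of Michigan: 55%
print(f"FDR <= {soric_fdr(0.41):.0%}")  # University of Michigan: about 8%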

For more information about these statistics, please look for tutorials or articles on z-curve on this blog.

University ARP-All ERR-All EDR-All Bias-All FDR-All ARP-5Y ERR-5Y EDR-5Y Bias-5Y FDR-5Y
University of Michigan 55 69 41 31 8 58.5 72 45 27 6
Western University 54.5 70 39 29 8 73.5 77 70 1 2
University of Toronto 54 67 41 28 8 56 69 43 24 7
Princeton University 52.5 65 40 30 8 67.5 74 61 8 3
Harvard University 48 69 27 40 14 55 68 42 22 7
Yale University 48 65 31 38 12 55 70 40 31 8
University of Texas – Austin 46.5 66 27 44 14 55.5 70 41 24 8
University of British Columbia 44 67 21 47 20 47 65 29 34 13
McGill University 43.5 66 21 57 20 43.5 69 18 57 23
Columbia University 41.5 62 21 49 19 39 61 17 50 26
New York University 41 62 20 50 20 48 70 26 43 15
Stanford University 41 60 22 45 18 58 66 50 15 5

2021 Replicability Report for the Psychology Department at Columbia University

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments' websites to identify researchers who belong to the psychology department. This eliminates articles that are from other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

Columbia University

A research assistant, Dellania Segreti, used the department website to find core members of the psychology department. She found 11 professors and 2 associate professors, making Columbia one of the smaller psychology departments. She used Web of Science to download references matching the authors' names and initials. An R script then searched for related publications in the database of publications in 120 psychology journals.

Not all researchers conduct quantitative research and report test statistics in their result sections. Therefore, the analysis is limited to 10 faculty members that had at least 100 significant test statistics. This criterion eliminated many faculty members who publish predominantly in neuroscience journals.

Figure 1 shows the z-curve for all 7,776 test statistics. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 934 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the computation of the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 21% of the total area under the grey curve. This is called the expected discovery rate because the results provide an estimate of the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, 70% of the published results are statistically significant (including z > 6). This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 70% ODR and a 21% EDR provides an estimate of the extent of selection for significance. The difference of ~50 percentage points is large, and among the largest differences of the psychology departments analyzed so far. The upper limit of the 95% confidence interval for the EDR is 31%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (70% vs. 72%) is similar, but the EDR (21% vs. 28%) is lower, although the difference is not statistically significant and could just be sampling error.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 62% suggests a fairly high replication rate. The problem is that actual replication rates are lower than the ERR predictions (about 40% Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance will lead to a regression to the mean when replication studies are not exact. Thus, the ERR represents the best case scenario that is unrealistic. In contrast, the EDR represents the worst case scenario in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 21% is lower than the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For Columbia, the ARP is (62 + 21)/2 = 42%. This is close to the currently best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, research from Columbia University is expected to replicate at the average rate of actual replication studies.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 21% implies that no more than 19% of the significant results are false positives, but the lower limit of the 95%CI of the EDR, 13%, allows for 38% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 4% with an upper limit of the 95% confidence interval of 9%. Thus, without any further information readers could use this criterion to interpret results published in articles by psychology researchers at Columbia. Of course, this criterion will be inappropriate for some researchers, but the present results show that the traditional alpha criterion of .05 is also inappropriate to maintain a reasonably low probability of false positive results.
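As a side note for readers who want to reproduce the conversion described in point 1, here is a minimal sketch. This is not the author's actual code; I am assuming the standard approach of turning each test statistic's two-tailed p-value into a z-score via the standard normal quantile function, and the helper names are hypothetical.

from scipy import stats

def z_from_p(p_two_tailed):
    # Convert a two-tailed p-value into an absolute z-score (strength of evidence).
    return stats.norm.ppf(1 - p_two_tailed / 2)

def z_from_t(t_value, df):
    # Convert a t-test result into an absolute z-score via its two-tailed p-value.
    p = 2 * stats.t.sf(abs(t_value), df)
    return z_from_p(p)

# Example: t(30) = 2.5 has a two-tailed p of about .018 and converts to z of about 2.36,
# just above the 1.96 cutoff that marks p < .05 (two-tailed).
print(round(z_from_t(2.5, 30), 2))

Values just above 1.96 pile up in z-curve plots when results are selected for significance, which produces the steep drop described in point 2.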

Some researchers have changed research practices in response to the replication crisis. It is therefore interesting to examine whether the replicability of newer research has improved. To examine this question, I performed a z-curve analysis for articles published in the past five years (2016-2021).

The results are disappointing. The point estimate is even lower than for all years combined, although the difference could just be sampling error. Mostly, these results suggest that the psychology department at Columbia University has not responded to the replication crisis in psychology, despite a low replicability estimate that leaves ample room for improvement. The ARP of 39% for research published since 2016 places Columbia University at the bottom of the universities analyzed so far.

Only one area had enough researchers to conduct an area-specific analysis. The social area had 6 members with useable data. The z-curve shows a slightly lower EDR than the z-curve for all 10 faculty members, although the difference is not statistically significant. The low EDR for the department is partially due to the high percentage of social faculty members with useable data.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all faculty members who provided results for the departmental z-curve. You can see the z-curve for an individual faculty member by clicking on their name.

Rank   Name ARP ERR EDR FDR
1 Jonathan B. Freeman 84 86 82 1
2 Janet Metcalfe 77 80 73 2
3 Kevin N. Ochsner 50 66 33 11
4 Lila Davachi 48 75 21 20
5 Dima Amso 44 63 25 16
6 Niall Bolger 43 57 30 12
7 Geraldine A. Downey 37 58 17 26
8 E. Tory Higgins 34 53 15 29
9 Nim Tottenham 34 50 17 25
10 Valerie Purdie 34 40 27 14

2021 Replicability Report for the Psychology Department at U Texas – Austin

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments' websites to identify researchers who belong to the psychology department. This eliminates articles that are from other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

University of Texas – Austin

I used the department website to find core members of the psychology department. I counted 35 professors and 6 associate professors. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 20 professors and 3 associate professors who had at least 100 significant test statistics. As noted above, this eliminated many faculty members who publish predominantly in neuroscience journals.

Figure 1 shows the z-curve for all 10,679 test statistics in articles published by 23 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,559 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 27% of the total area under the grey curve. This is called the expected discovery rate because the results provide an estimate of the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, 71% of the published results are statistically significant (including z > 6). This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 71% ODR and a 27% EDR provides an estimate of the extent of selection for significance. The difference of ~45 percentage points is fairly large. The upper limit of the 95% confidence interval for the EDR is 39%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (71% vs. 72%) and the EDR (27% vs. 28%) are very similar to the average for 120 psychology journals.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 66% suggests a fairly high replication rate. The problem is that actual replication rates are lower than the ERR predictions (about 40% Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance will lead to a regression to the mean when replication studies are not exact. Thus, the ERR represents the best case scenario that is unrealistic. In contrast, the EDR represents the worst case scenario in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 27% is lower than the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For UT Austin, the ARP is (66 + 27)/2 = 47%. This is just a bit above the currently best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, UT Austin results are expected to replicate at the average rate of psychological research.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric's (1989) formula for the maximum false discovery rate. An EDR of 27% implies that no more than 14% of the significant results are false positives, but the lower limit of the 95%CI of the EDR allows for a considerably higher percentage of false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 2% with an upper limit of the 95% confidence interval of 5%. Thus, without any further information readers could use this criterion to interpret results published in articles by psychology researchers at UT Austin. Of course, this criterion will be inappropriate for some researchers, but the present results show that the traditional alpha criterion of .05 is also inappropriate to maintain a reasonably low probability of false positive results.

Some researchers have changed research practices in response to the replication crisis. It is therefore interesting to examine whether the replicability of newer research has improved. To examine this question, I performed a z-curve analysis for articles published in the past five years (2016-2021).

The results show an improvement. The EDR increased from 27% to 41%, but the confidence intervals are too wide to infer that this is a systematic change. The false discovery risk dropped to 8%, but due to the smaller sample size the upper limit of the 95% confidence interval is still 19%. Thus, it would be premature to relax the significance criterion at this point or to speak of a notable improvement. A muted response to the replication crisis is by no means an exception. Rather, Stanford University is currently the exception, with the only significant increase in the EDR so far.

Only one area had enough researchers to conduct an area-specific analysis. The social area had 8 members with useable data. The z-curve is similar to the overall z-curve. Thus, there is no evidence that social psychology at UT Austin has lower replicability than other areas.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all faculty members who provided results for the departmental z-curve. You can see the z-curve for an individual faculty member by clicking on their name.

Rank   Name ARP ERR EDR FDR
1 Chen Yu 74 77 71 2
2 Yvon Delville 71 76 66 3
3 K. Paige Harden 62 69 54 4
4 Cristine H. Legare 58 73 42 7
5 James W. Pennebaker 55 63 47 6
6 William B. Swann 53 75 32 11
7 Bertram Gawronski 52 74 29 13
8 Jessica A. Church 51 77 24 17
9 David M. Buss 48 76 20 21
10 Jasper A. J. Smits 45 57 33 11
11 Michael J. Telch 45 67 23 17
12 Hongjoo J. Lee 44 70 18 24
13 Cindy M. Meston 44 55 34 10
14 Jacqueline D. Woolley 42 66 18 24
15 Christopher G. Beevers 41 62 20 21
16 Marie H. Monfils 41 67 16 27
17 Samuel D. Gosling 38 53 22 18
18 Arthur B. Markman 38 59 18 24
19 David S. Yeager 38 48 28 14
20 Robert A. Josephs 37 46 28 14
21 Jennifer S. Beer 36 51 20 21
22 Frances A. Champagne 34 55 13 34
23 Marlone D. Henderson 27 34 19 22

2021 Replicability Report for the Psychology Department at Stanford 

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments' websites to identify researchers who belong to the psychology department. This eliminates articles that are from other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

Stanford University

I used the department website to find core members of the psychology department. I counted 19 professors and 6 associate professors. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 13 professors and 3 associate professors who had at least 100 significant test statistics.

Figure 1 shows the z-curve for all 13,147 test statistics in articles published by these 16 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,344 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 22% of the total area under the grey curve. This is called the expected discovery rate because the results provide an estimate of the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, 67% of the published results are statistically significant (including z > 6). This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 67% ODR and a 22% EDR provides an estimate of the extent of selection for significance. The difference of ~45 percentage points is fairly large. The upper limit of the 95% confidence interval for the EDR is 30%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (67% vs. 72%) and the EDR (22% vs. 28%) are somewhat lower, suggesting that statistical power is lower in studies from Stanford.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 60% suggests a fairly high replication rate. The problem is that actual replication rates are lower than the ERR predictions (about 40% Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance will lead to a regression to the mean when replication studies are not exact. Thus, the ERR represents the best case scenario that is unrealistic. In contrast, the EDR represents the worst case scenario in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 22% is lower than the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For Stanford, the ARP is (60 + 22)/2 = 41%. This is close to the currently best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, Stanford results are expected to replicate at the average rate of psychological research.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric's (1989) formula for the maximum false discovery rate. An EDR of 22% implies that no more than 18% of the significant results are false positives, but the lower limit of the 95%CI of the EDR, 18%, allows for 31% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 4% with an upper limit of the 95% confidence interval of 8%. Thus, without any further information readers could use this criterion to interpret results published in articles by researchers in the psychology department of Stanford University.

Comparisons of research areas typically show lower replicability for social psychology (OSC, 2015), and Stanford has a large group of social psychologists (k = 10). However, the results for social psychologists at Stanford are comparable to the results for the entire faculty. Thus, the relatively low replicability of research from Stanford compared to other departments cannot be attributed to the large contingent of social psychologists.

Some researchers have changed research practices in response to the replication crisis. It is therefore interesting to examine whether the replicability of newer research has improved. To examine this question, I performed a z-curve analysis for articles published in the past five years. The results show a marked improvement. The expected discovery rate more than doubled from 22% to 50%, and this increase is statistically significant. (So far, I have analyzed only 7 departments, and this is the only one with a significant increase.) The high EDR reduces the false positive risk to a point estimate of 5% and an upper limit of the 95% confidence interval of 9%. Thus, for newer research, most of the results that are statistically significant with the conventional significance criterion of .05 are likely to be true effects. However, effect sizes are still going to be inflated because selection for significance with modest power results in regression to the mean. Nevertheless, these results provide the first evidence of positive change at the level of departments. It would be interesting to examine whether these changes are due to individual efforts of researchers or reflect systemic changes that have been instituted at Stanford.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics. The table below shows the meta-statistics of all 16 faculty members who provided results for the departmental z-curve. You can see the z-curve for an individual faculty member by clicking on their name.

Rank   Name ARP ERR EDR FDR
1 Jamil Zaki 69 73 65 3
2 James J. Gross 63 69 56 4
3 Jennifer L. Eberhardt 54 66 41 7
4 Jeanne L. Tsai 51 66 36 9
5 Hyowon Gweon 50 62 39 8
6 Michael C. Frank 47 70 23 17
7 Hazel Rose Markus 47 65 29 13
8 Noah D. Goodman 46 72 20 22
9 Ian H. Gotlib 45 65 26 15
10 Ellen M. Markman 43 62 25 16
11 Carol S. Dweck 41 58 24 17
12 Claude M. Steele 37 52 21 20
13 Laura L. Carstensen 35 57 13 37
14 Benoit Monin 33 53 13 36
15 Geoffrey L. Cohen 29 46 13 37
16 Gregory M. Walton 29 45 14 33

2021 Replicability Report for the Psychology Department at UBC (Vancouver) 

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments' websites to identify researchers who belong to the psychology department. This eliminates articles that are from other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

University of British Columbia (Vancouver)

I used the department website to find core members of the psychology department. I counted 34 professors and 7 associate professors. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 22 professors and 4 associate professors who had at least 100 significant test statistics.

Figure 1 shows the z-curve for all 13,147 test statistics in articles published by these 26 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,531 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 21% of the total area under the grey curve. This is called the expected discovery rate because the results provide an estimate of the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, 68% of the published results are statistically significant (including z > 6). This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 68% ODR and a 21% EDR provides an estimate of the extent of selection for significance. The difference of ~47 percentage points is large. The upper limit of the 95% confidence interval for the EDR is 29%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (68% vs. 72%) is similar, but the EDR is lower (21% vs. 28%). This suggests that the research produced by UBC faculty members is somewhat less replicable than research in general.

4. The z-curve model also estimates the average power of the subset of studies with significant results (p < .05, two-tailed). This estimate is called the expected replication rate (ERR) because it predicts the percentage of significant results that are expected if the same analyses were repeated in exact replication studies with the same sample sizes. The ERR of 67% suggests a fairly high replication rate. The problem is that actual replication rates are lower (about 40% Open Science Collaboration, 2015). The main reason is that it is impossible to conduct exact replication studies and that selection for significance will lead to a regression to the mean when replication studies are not exact. Thus, the ERR represents the best case scenario that is unrealistic. In contrast, the EDR represents the worst case scenario in which selection for significance does not select more powerful studies and the success rate of replication studies is not different from the success rate of original studies. The EDR of 21% is lower than the actual replication success rate of 40%. To predict the success rate of actual replication studies, I am using the average of the EDR and ERR, which is called the actual replication prediction (ARP). For UBC research, the ARP is 44%. This is close to the currently best estimate of the success rate for actual replication studies based on the Open Science Collaboration project (~40%). Thus, UBC results are expected to replicate at the average rate of psychological research.

5. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 21% implies that no more than 20% of the significant results are false positives, but the lower limit of the 95%CI of the EDR, 13%, allows for 34% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 3% with an upper limit of the 95% confidence interval of 7%. Thus, without any further information readers could use this criterion to interpret results published in articles by researchers in the psychology department of UBC.

The next analyses examine area as a potential moderator. Actual replication studies suggest that social psychology has a lower replication rate than cognitive psychology, whereas the replicability of other areas is currently unknown (OSC, 2015). UBC has a large group of social psychologists with enough data to conduct a z-curve analysis (k = 9). Figure 3 shows the z-curve for the pooled data. The results show no notable difference from the z-curve for the department in general.

The only other area with at least five members that provided data to the overall z-curve was developmental psychology. The results are similar, although the EDR is a bit higher.

The last analysis examined whether research practices changed in response to the credibility crisis and evidence of low replication rates (OSC, 2015). For this purpose, I limited the analysis to articles published in the past 5 years. The EDR increased, but only slightly (29% vs. 21%) and not significantly. This suggests that research practices have not changed notably.

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics per researcher. The table below shows the meta-statistics for all 26 faculty members who contributed results to the departmental z-curve. You can see the z-curve for an individual faculty member by clicking on their name.

EDR = expected discovery rate (mean power before selection for significance)
ERR = expected replication rate (mean power after selection for significance)
FDR = false discovery risk (maximum false discovery rate; Soric, 1989)
ARP = actual replication prediction (mean of EDR and ERR)

Rank   Name   ARP   ERR   EDR   FDR   (all values in %)
1 Darko Odic 67 73 62 3
2 Steven J. Heine 64 80 47 6
3 J. Kiley Hamlin 60 62 58 4
4 Lynn E. Alden 57 71 43 7
5 Azim F. Shariff 55 69 40 8
6 Andrew Scott Baron 54 72 35 10
7 James T. Enns 53 75 31 12
8 Catharine A. Winstanley 53 53 53 5
9 D. Geoffrey Hall 49 76 22 19
10 Elizabeth W. Dunn 48 58 37 9
11 Alan Kingstone 48 74 23 18
12 Jessica L. Tracy 47 67 28 14
13 Sheila R. Woody 46 63 30 13
14 Jeremy C. Biesanz 45 61 28 13
15 Kristin Laurin 43 59 26 15
16 Luke Clark 42 63 21 20
17 Frances S. Chen 41 64 17 25
18 Mark Schaller 41 57 24 16
19 Kalina Christoff 40 63 16 27
20 E. David Klonsky 39 50 27 14
21 Ara Norenzayan 38 62 15 30
22 Toni Schmader 37 56 18 25
23 Liisa A. M. Galea 36 59 13 36
24 Janet F. Werker 35 49 20 21
25 Todd C. Handy 30 47 12 39
26 Stan B. Floresco 26 43 9 55

2021 Replicability Report for the Psychology Department at Yale

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible, with little regard for the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examines the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticism of these methods, I have since improved the process for selecting the articles used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of their publications in specialized journals. Currently, I cannot evaluate neuroscience research, so the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments' websites to identify researchers who belong to the psychology department. This eliminates articles by authors from other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively affect their chances of getting tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included,
(b) only results published in 120 journals are included (see list of journals),
(c) published significant results (p < .05) may not be a representative sample of all significant results, and
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in the success rates of actual replication studies (Schimmack, 2022).

Yale

I used the department website to find core members of the psychology department. I counted 13 professors and 4 associate professors, which makes it one of the smaller departments in North America. Not all researchers conduct quantitative research and report test statistics in their results sections. I limited the analysis to 12 professors and 1 associate professor who had at least 100 significant test statistics.

Figure 1 shows the z-curve for all 13,147 test statistics in articles published by these 13 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,178 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 as the criterion for a discovery). Although they are not shown in the plot, they are included in the computation of the meta-statistics. A minimal computational sketch of this conversion follows after this list.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (solid red line), which corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This drop shows that published results are selected for significance. The dashed red line shows the criterion for marginal significance, p < .10. Thus, there are more results presented as (marginally) significant than the .05 criterion alone suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The statistically significant results (including z > 6) make up 31% of the total area under the grey curve. This is called the expected discovery rate (EDR) because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 69% of the published results. This percentage is called the observed discovery rate (ODR), which is the rate of significant results in published journal articles. The difference between a 69% ODR and a 31% EDR provides an estimate of the extent of selection for significance. The difference of nearly 40 percentage points is fairly large. The upper limit of the 95% confidence interval for the EDR is 42%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR (69% vs. 72%) and the EDR (31% vs. 28%) are similar. This suggests that the research produced by Yale faculty members is neither more nor less replicable than research produced at other universities.

4. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric's (1989) formula for the maximum false discovery rate. An EDR of 31% implies that no more than 12% of the significant results are false positives, but the lower limit of the 95% CI of the EDR, 18%, allows for 24% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 3%, with an upper limit of the 95% confidence interval of 5%. Thus, without any further information, readers could use this criterion to interpret results published in articles by researchers in the psychology department of Yale University.
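As a minimal sketch of the conversion described in point 1, a reported test statistic can be mapped to an absolute z-score via its two-sided p-value. The t value and degrees of freedom below are hypothetical, and the actual extraction of test statistics from published articles is not shown.

```python
# Sketch: convert a reported t statistic into an absolute z-score via its
# two-sided p-value, giving different test statistics a common metric.
from scipy import stats

def t_to_z(t_value: float, df: int) -> float:
    """Absolute z-score with the same two-sided p-value as the t statistic."""
    p = 2 * stats.t.sf(abs(t_value), df)   # two-tailed p-value of the t test
    return stats.norm.isf(p / 2)           # z-score with the same p-value

print(round(t_to_z(2.10, df=28), 2))       # a just-significant result, roughly z = 2.0
```

The same logic applies to F, chi-square, and correlation statistics: each is first converted to a p-value and then to the z-score with that p-value.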

Given the small size of the department, it is not very meaningful to conduct separate analyses by area. However, I did conduct a z-curve analysis of articles published after 2014 to examine whether research at Yale has changed in response to calls for improvements in research practices. The results show an increase in the expected discovery rate from 31% to 43%, although the confidence intervals still overlap. Thus, it is not possible to conclude at this moment that this is a real improvement (i.e., it could just be sampling error). The expected replication rate also increased slightly, from 65% to 72%. Thus, there are some positive trends, but there is still evidence of selection for significance (ODR = 71% vs. EDR = 43%).

There is considerable variability across individual researchers, although confidence intervals are often wide due to the smaller number of test statistics per researcher. The table below shows the meta-statistics for all 13 faculty members who contributed results to the departmental z-curve. You can see the z-curve for an individual faculty member by clicking on their name.

Rank   Name   ARP   ERR   EDR   FDR   (all values in %)
1 Tyrone D. Cannon 69 73 64 3
2 Frank C. Keil 67 77 56 4
3 Yarrow Dunham 53 71 35 10
4 Woo-Kyoung Ahn 52 73 31 12
5 B. J. Casey 52 64 39 8
6 Nicholas B. Turk-Browne 50 66 34 10
7 Jutta Joorman 49 64 35 10
8 Brian J. Scholl 46 69 23 18
9 Laurie R. Santos 42 66 17 25
10 Melissa J. Ferguson 38 62 14 34
11 Jennifer A. Richeson 37 49 26 15
12 Peter Salovey 36 57 15 30
13 John A. Bargh 35 56 15 31