Implicit Bias ≠ Unconscious Bias


The journal Psychological Inquiry publishes theoretical articles that are accompanied by commentaries. In a recent issue, prominent implicit cognition researchers discussed the meaning of the term implicit. This blog post differs from the commentaries by researchers in the field by providing an outsider perspective and by focusing on the importance of communicating research findings clearly to the general public. This purpose of definitions was largely ignored by researchers who are more focused on communicating with each other than with the general public. I will show that this unique outsider perspective favors a definition of implicit bias in terms of the actual research that has been conducted under the umbrella of implicit social cognition research rather than a definition that renders 30 years of research useless with a simple stroke of a pen. If social cognition researchers want to communicate about implicit bias as empirical scientists, they have to define implicit bias as effects of automatically activated information (associations, stereotypes, attitudes) on behavior. This is what they have studied for 30 years. Defining implicit bias as unconscious bias is not helpful because 30 years of research have failed to provide any evidence that people can act in a biased way without awareness. Although unconscious biases may occur, there is currently no scientific evidence to inform the public about unconscious biases. While the existing research on automatically activated stereotypes and attitudes has problems, the topic remains important. As the term implicit bias has caught on, it can be used in communications with the public, but it should be made clear that implicit does not mean unconscious.


Psychologists are notoriously sloppy with language. This leads to misunderstandings and unnecessary conflicts among scientists. However, the bigger problem is a break-down in communication with the general public. This is particularly problematic in social psychology because research on social issues can influence public discourse and ultimately policy decisions.

One of the biggest case-studies of conceptual confusion that had serious real-world consequences is the research on implicit cognition that created the popular concept of implicit bias. Although the term implicit bias is widely used to talk about racism, the term lacks clear meaning.

The Stanford Encyclopedia of Philosophy defines implicit bias as a tendency to “act on the basis of prejudice and stereotypes without intending to do so.” However, lack of intention (not wanting to) is only one of several meanings of the term implicit. Another meaning of the word implicit is automatic activation of thoughts. For example, a Scientific American article describes implicit bias as a “tendency for stereotype-confirming thoughts to pass spontaneously through our minds.” Notably, this definition of implicit bias clearly implies that people are aware of the activated stereotype. The stereotype-confirming thought is in people’s mind and not activated in some other area of the brain that is not accessible to consciousness. This definition also does not imply that implicit bias results in biased behavior because awareness makes it possible to control the influence of activated stereotypes on behavior.

The Merriam-Webster Dictionary offers another definition of implicit bias as “a bias or prejudice that is present but not consciously held or recognized.” In contrast to the first two meanings of implicit bias, this definition suggests that implicit bias may occur without awareness; that is, implicit bias = unconscious bias.

The different definitions of implicit bias lead to very different explanations of biased behavior. One explanation assumes that implicit biases can be activated and guide behavior without awareness and individuals who act in a biased way may either fail to recognize their biases or make up some false explanation for their biased behaviors after the fact. This idea is akin to Freud’s notion of a powerful, autonomous unconscious (the Id) that can have subversive effects on behavior that contradict the values of a conscious, moral self (Super-Ego). Given the persistent influence of Freud on contemporary culture, this idea of implicit bias is popular and reinforced by the Project Implicit website that offers visitors tests to explore their hidden (hidden = unconscious) biases.

The alternative interpretation of implicit bias is less mysterious and more mundane. It means that our brain constantly retrieves information from memory that is related to the situation we are in. This process does not have a filter to retrieve only information that we want. As a result, we sometimes have unwanted thoughts. For example, even individuals who do not want to be prejudiced will sometimes have unwanted stereotypes and associated negative feelings pop into their mind (Scientific American). No psychoanalysis or implicit test is needed to notice that our memory has stored stereotypes. In safe contexts, we may even laugh about them (Family Guy). In theory, awareness that a stereotype was activated also makes it possible to make sure that it does not influence behavior. This may even be the main reason for our ability to notice what our brain is doing. Rather than acting in a reflexive way to a situation, awareness makes it possible to respond more flexibly to a situation. When implicit is defined as automatic activation of a thought, the distinction between implicit and explicit bias becomes minor and academic because the processes that retrieve information from memory are automatic. The only difference between implicit and explicit retrieval of information is that the process may be triggered spontaneously by something in our environment or by a deliberate search for information.

After more than 30 years of research on implicit cognitions (Fazio, Sanbonmatsu, Powell, & Kardes, 1986), implicit social cognition researchers increasingly recognize the need for clearer definitions of the term implicit (Gawronski, Ledgerwood, & Eastwick, 2022a), but there is little evidence that they can agree on a definition (Gawronski, Ledgerwood, & Eastwick, 2022b). Gawronski et al. (2022a, 2022b) propose to limit the meaning of implicit bias to unconscious biases; that is, individuals are unaware that their behavior was influenced by activation of negative stereotypes or affects/attitudes. In their words, “instances of bias can be described as implicit if respondents are unaware of the effect of social category cues on their behavioral response” (p. 140). I argue that this definition is problematic because there is no scientific evidence to support the hypothesis that prejudice is unconscious. Thus, the term cannot be used to communicate scientific results that have been obtained by implicit cognition researchers over the past three decades because these studies did not study unconscious bias.

Implicit Bias Is Not Unconscious Bias

Gawronski et al. note that their decision to limit the term implicit to mean unconscious is arbitrary. “A potential objection against our arguments might be that they are based on a particular interpretation of implicit in IB that treats the term as synonymous with unconscious” (p. 145). Gawronski et al. argue in favor of their definition because “unconscious biases have the potential to cause social harm in ways that are fundamentally different from conscious biases that are unintentional and hard-to-control” (p. 146). The key words in this argument are “have the potential,” which means that there is no scientific evidence that shows different effects of biases with and without awareness of bias. Thus, the distinction is merely a theoretical, academic one without actual real-world implications. Gawronski et al. agree with this assessment when they point out that existing implicit cognition research “provides no information about IB [implicit bias] if IB is understood as an unconscious effect of social category cues on behavioral responses.” It seems bizarre to define the term implicit bias in a way that makes all of the existing implicit cognition research irrelevant. A more reasonable approach would be to define implicit bias in a way that is more consistent with the work of implicit bias researchers. As several commentators pointed out, the most widely used meaning of implicit is automatic activation of information stored in memory about social groups. In fact, Gawronski himself used the term implicit in this sense and repeatedly pointed out that implicit does not mean unconscious (i.e., without awareness) (Appendix 1).

Defining the term implicit as automatic activation makes sense because the standard experimental procedure to study implicit cognition is based on presenting stimuli (words, faces, names) related to a specific group and examining how these stimuli influence behaviors such as the speed of pressing a button on a keyboard. The activation of stereotypic information is automatic because participants are not told to attend to these stimuli, or are even told to ignore them. Sometimes the stimuli are also presented in subtle ways to make it less likely that participants consciously attend to them. The question is always whether these stimuli activate stereotypes and attitudes stored in memory and how activation of this information influences behavior. If behavior is influenced by the stimuli, it suggests that stereotypic information was activated – with or without awareness. The evidence from studies like these provides the scientific basis for claims about implicit bias. Thus, implicit bias is basically operationally defined as systematic effects of automatically activated information about groups on behavior.

The aim of implicit bias research is to study real-world incidents of prejudice under controlled laboratory conditions. A recent incident of racism shows how activation of stereotypes can have harmful consequences for victims and perpetrators of racist behavior.

University of Kentucky student who repeatedly hurled racist slur at Black student permanently banned from campus

The question of consciousness is secondary. What is important is how individuals can prevent harmful consequences of prejudice. What can individuals do to avoid storing negative stereotypes and attitudes in the first place? What can individuals do to weaken stored memories and attitudes? What can individuals do to make it less likely that stereotypes are activated? What can individuals do to control the influence of attitudes when they are activated? All of these questions are important and are related to the concept of implicit as automatic activation of attitudes. The only reason to emphasize unconscious processes would be a scenario where individuals are unable to control the influence of information that influences behavior without awareness. However, given the lack of evidence that unconscious biases exist, it is currently unnecessary to focus on this scenario. Clearly, many instances of bias occur with awareness (“White teacher in Texas fired after telling students his race is ‘the superior one’”).

Unfortunately, it may be surprising for some readers to learn that implicit does not mean unconscious because the term implicit bias has been popularized in part to make a distinction between well-known forms of bias and prejudice and a new form of bias that can influence behavior even when individuals are consciously trying to be unbiased. These hidden biases occur against individuals’ best intentions because they exist in a blind spot of consciousness. This meaning of implicit bias was popularized by Banaji and Greenwald (2013), who also founded the Project Implicit website that provides individuals with feedback about their hidden biases; akin to psychoanalysts who can recover repressed memories.

Gawronski et al. (2022b) point out that Greenwald and Banaji’s theory of unconscious bias evolved independently of research by other implicit bias researchers who focused on automaticity and were less concerned about the distinction between conscious and unconscious biases. Gawronski’s definition of implicit bias as unconscious bias favors Banaji and Greenwald’s school of thought (hidden bias) over other research programs (automatically activated biases). The problem with this decision is that Greenwald and Banaji recently walked back their claims about unconscious biases and no longer maintain that the effects they studied were obtained without awareness (Implicit = Indirect & Indirect ≠ Unconscious, Greenwald & Banaji, 2017). The reversal of their theoretical position is evident in their statement that “even though the present authors find themselves occasionally lapsing to use implicit and explicit as if they had conceptual meaning [unconscious vs. conscious], they strongly endorse the empirical understanding of the implicit–explicit distinction” (p. 892). It is puzzling to see Gawronski arguing for a definition that is based on a theory that the authors no longer endorse. Given the lack of scientific evidence that stereotypes regularly lead to biases without awareness, this might be the time to agree on a definition that matches the actual research by implicit cognition researchers, and the most fitting definition would be automatic activation of stereotypes and attitudes, not unconscious causes of behavior.

Gawronski et al. (2022a) also falsely imply that implicit cognition researchers have ignored the distinction between conscious and unconscious biases. In reality, numerous studies have tried to demonstrate that implicit biases can occur without awareness. To study unconscious biases, social cognition researchers have relied heavily on an experimental procedure known as subliminal priming. In a subliminal priming study, a stimulus (prime) is presented very briefly, outside of the focus of attention, and/or with a masking stimulus. If a manipulation check shows that individuals have no awareness of the prime and the prime influences behavior, the effect appears to occur without awareness. Several studies suggested that racial primes can influence behavior without awareness (Bargh et al., 1996; Davis, 1989).

However, the credibility of these results has been demolished by the replication crisis in social psychology (Open Science Collaboration, 2015; Schimmack, 2020). Priming research has been singled out as the field with the biggest replication problems (Kahneman, 2012). When asked to replicate their own findings, leading priming researchers like Bargh refused to do so. Thus, while subliminal priming studies started the implicit revolution (Greenwald & Banaji, 2017), the revolution imploded over the past decade when doubts about the credibility of the original findings increased.

Unfortunately, researchers within the field of implicit bias research often ignore the replication crisis and cite questionable evidence as if it provided solid evidence for unconscious biases. For example, Gawronski et al. (2022b) suggest that unconscious biases may contribute to racial disparities in use-of-force errors such as the high-profile killing of Philando Castile. To make this case, they use a single study of 58 White undergraduate students (Correll, Wittenbrink, Crawford, & Sadler, 2015, Study 3). The study asked participants to make shoot vs. no-shoot decisions in a computer task (game) that presented pictures of White or Black men holding a gun or another object. Participants were instructed to make one quick decision within 630 milliseconds and another decision without time restriction. Gawronski et al. suggest that a failure to correct an impulsive error given ample time to do so constitutes evidence of unconscious bias. They summarized the results as evidence that “unconscious effects on basic perceptual processes play a major role in tasks that more closely resemble real-world settings” (p. 226).

Fact checking reveals that this characterization of the study and its results is at least misleading, if not outright false. First, it is important to realize that the critical picture was presented for only 175ms and immediately replaced by another picture to wipe out visual memory. Although this is not a strictly subliminal presentation of stimuli, it is clearly a suboptimal presentation of stimuli. As a result, participants sometimes had to guess what the object was. They also had no other information to know whether their initial perception was correct or incorrect. The fact that participants’ performance improved without time pressure may be due to response errors under time pressure and this improvement was evident independent of the race of the men in the picture.

Without time pressure, participants shot 85% of armed Black men and 83% of armed White men. For unarmed men, participants shot 28% of Black men and 25% of White men. The statistical comparison of these differences showed only weak evidence of a systematic bias. The comparison for unarmed men produced a p-value that was just significant by the standard criterion of alpha = .05, F(1,53) = 6.65, p = .013, but not by the more stringent criterion of alpha = .005 that is used to predict a high chance of replication. The same is true for the comparison of responses to pictures of unarmed men, F(1,53) = 4.96, p = .031. To my knowledge, this study has not been replicated and Gawronski et al.’s claim rests entirely on this single study.
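The reported p-values can be checked from the F statistics alone. The following is a minimal sketch, not part of the original analysis. It assumes that an F test with one numerator degree of freedom is equivalent to a squared t statistic, and it integrates the t density numerically so that no statistics library is needed. The function names `t_pdf` and `p_value_from_f` are hypothetical helpers introduced for illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def p_value_from_f(f_stat, df_error, steps=2000):
    """Two-sided p-value for F(1, df_error), using the identity F = t^2.

    The area under the t density between 0 and sqrt(F) is computed with
    Simpson's rule (steps must be even); the p-value is the remaining
    area in both tails.
    """
    t = math.sqrt(f_stat)
    h = t / steps
    total = t_pdf(0.0, df_error) + t_pdf(t, df_error)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2  # Simpson weights: 4 for odd, 2 for even
        total += weight * t_pdf(i * h, df_error)
    central_area = total * h / 3       # P(0 < T < t)
    return 1.0 - 2.0 * central_area    # area in both tails

# Compare each reported test against the two alpha criteria from the text.
for f_stat in (6.65, 4.96):
    p = p_value_from_f(f_stat, 53)
    print(f"F(1,53) = {f_stat}: p = {p:.3f}, "
          f"below .05: {p < .05}, below .005: {p < .005}")
```

Under these assumptions, both p-values should land close to the reported values, below .05 but well above .005.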

Even if these effects could be replicated in the laboratory, they do not provide any information about unconscious biases in the real world because the study lacks ecological validity. To make claims about the real world, it is necessary to study police officers in simulations of real-world scenarios (Andersen, Di Nota, Boychuk, Schimmack, & Collins, 2021). This research is rare, difficult, and has not yet produced conclusive results. Andersen et al. (2021) found a small racial bias, but the sample was too small to provide meaningful information about the amount of racial bias in the real world. Most important, however, real-world scenarios provide ample information to see whether a suspect is Black or White and is armed or not. The real decision is often whether use of force is warranted or not. Racial biases in these shooting errors are important, but they are not unconscious biases.

Contrary to Gawronski et al., I do not believe that social cognition researchers’ focus on automatic biases rather than unconscious biases was a mistake. The real mistake was the focus on reaction times in artificial computer tasks rather than studying racial biases in the real world. As a result, thirty years of research on automatic biases has produced little insight into racial biases in the real world. To move the field towards the study of unconscious biases would be a mistake. Instead, social cognition researchers need to focus on outcome variables that matter.


The term implicit bias can have different meanings. Gawronski et al. (2022a) proposed to limit the meaning of the term to unconscious bias. I argue that this definition of implicit bias is not useful because most studies of implicit cognition are studies in which racial stereotypes and attitudes toward stigmatized groups are automatically activated. In contrast, priming studies that tried to distinguish between conscious and unconscious activation of this information have been discredited during the replication crisis and there exists no credible empirical evidence to suggest that unconscious biases exist or contribute to real-world behavior. Thus, funding a new research agenda focusing on unconscious biases may waste resources that are better spent on real-world studies of racial biases. Evidently, this conclusion diverges from the conclusion of implicit cognition researchers who are interested in continuing their laboratory studies, but they have failed to demonstrate that their work makes a meaningful contribution to society. To make research on automatic biases more meaningful, implicit bias research needs to move from artificial outcomes like reaction times on computer tasks to actual behaviors.

Appendix 1

Implicit Cognition Research Focusses on Automatic (Not Unconscious) Processes

Gawronski & Bodenhausen (2006), WOS/11/22 1,537

“If eras of psychological research can be characterized in terms of general ideas, a major theme of the current era is probably the notion of automaticity” (p. 692)

This perspective is also dominant in contemporary research on attitudes, in which deliberate, “explicit” attitudes are often contrasted with automatic, “implicit” attitudes (Greenwald & Banaji, 1995; Petty, Fazio, & Briñol, in press; Wilson, Lindsey, & Schooler, 2000; Wittenbrink & Schwarz, in press).

“We assume that people generally do have some degree of conscious access to their automatic affective reactions and that they tend to rely on these affective reactions in making evaluative judgments (Gawronski, Hofmann, & Wilbur, in press; Schimmack & Crites, 2005)” (p. 696).

Conrey, Sherman, Gawronski, Hugenberg, & Groom (2005), WOS/11/22

“The distinction between automatic and controlled processes now occupies a central role in many areas of social psychology and is reflected in contemporary dual-process theories of prejudice and stereotyping (e.g., Devine, 1989)” (p. 469)

“Specifically, we argued that performance on implicit measures is influenced by at least four different processes: the automatic activation of an association (association activation), the ability to determine a correct response (discriminability), the success at overcoming automatically activated associations (overcoming bias), and the influence of response biases that may influence responses in the absence of other available guides to response (guessing)” (p. 482)

Gawronski & De Houwer (2014), WOS 11/22 240

“other researchers assume that the two kinds of measures tap into distinct memory representations, such that explicit measures tap into conscious representations whereas implicit measures tap into unconscious representations (e.g., Greenwald & Banaji, 1995). Although the conceptualizations are relatively common in the literature on implicit measures, we believe that it is conceptually more appropriate to classify different measures in terms of whether the to-be-measured psychological attribute influences participants’ responses on the task in an automatic fashion (De Houwer, Teige-Mocigemba, Spruyt, & Moors, 2009).” (p. 283)

Hofmann, Gawronski, Le, & Schmitt, PSPB, 2005, WoS/11/22

“These [implicit] measures—most of them based on reaction times in response compatibility tasks (cf. De Houwer, 2003)—are intended to assess relatively automatic mental associations that are difficult to gauge with explicit self-report measures”. (p. 1369)

Gawronski, Hofmann, & Wilbur (2006), WoS/11/22 200

“A common explanation for these findings is that the spontaneous behavior assessed in these studies is difficult to control, and thus more likely to be influenced by automatic evaluations, such as they are reflected in indirect attitude measures” (p. 492)

“there is no empirical evidence that people lack conscious awareness of indirectly assessed attitudes per se” (p. 496)

Gawronski, LeBel, & Peters, PoPS (2007) WOS/11/22 187

“The central assumption in this model is that indirect measures provide a proxy for the activation of associations in memory” (p. 187)

Gawronski & LeBel, JESP (2008) WOS/11/22

“We argue that implicit measures provide a proxy for automatic associations in memory, which may or may not influence verbal judgments reflected in self-report measures” (p. 1356)

Deutsch, Gawronski, & Strack, JPSP (2006), WOS/11/22 122

“Phenomena such as stereotype and attitude activation can be readily reconstructed as instance-based automaticity. For example, perceiving a person of a stereotyped group or an attitude object may be sufficient to activate well-practiced stereotypic or evaluative associations in memory” (p. 386)

Implicit measures are important even if they do not assess unconscious processes.

Hofmann, Gawronski, Le, & Schmitt, PSPB, 2005, WoS/11/22

“Arguably one of the most important contributions in social cognition research within the last decade was the development of implicit measures of attitudes, stereotypes, self-concept, and self-esteem (e.g., Fazio, Jackson, Dunton, & Williams, 1995; Greenwald, McGhee, & Schwartz, 1998; Nosek & Banaji, 2001; Wittenbrink, Judd, & Park, 1997).” (p. 1369)

Gawronski & De Houwer (2014), WOS 11/22 240

“For the decade to come, we believe that the field would benefit from a stronger focus on underlying mechanisms with regard to the measures themselves as well as their capability to predict behavior (see also Nosek, Hawkins, & Frazier, 2011).” (p. 303)

Lost in Latent Variable Space

Post-war American Psychology is rooted in behaviorism. The key assumption of behaviorism is that psychology (i.e., the science of the mind) should only study phenomena that are directly observable. As a result, the science of the mind became the science of behavior. While behaviorism is long dead (see the 1990 funeral here), its (harmful) effect on psychology is still noticeable today. One lasting effect is psychologists’ aversion to making causal attributions to the mind (cognitive processes). While cognitive processes cannot be directly observed with the human senses (we cannot see, touch, smell, or hear what goes on in somebody’s mind), we can indirectly observe these processes on the basis of observable behaviors. A whole different discipline called psychometrics has developed elaborate theories and statistical models to relate observed behaviors to unobserved processes in the mind. Unfortunately, psychometrics is often not covered in the education of psychologists. As a result, psychologists often make simple mistakes when they apply psychometric tools to psychological questions.

In the language of psychometrics, observed behaviors are observed variables and unobserved mental processes are unobserved variables that are also often called latent variables (latent: of a quality or state, existing but not yet developed or manifest; hidden or concealed). The goal of psychometrics is to find systematic relationships between observed and latent variables that make it possible to study mental processes. We can compare this process to the task of early astronomers to make sense of the lights in the night sky. Bright stars are like observable indicators and the task of astronomers is to explain the behavior of these observable variables with unobserved forces. Astronomy has come a long way from seeing astrological signs in the sky, but psychology is pretty much at this early stage of science, where most of the cognitive processes that cause observable behaviors are unknown. In fact, some psychologists still resist the idea that observable behavior can be explained by latent variables (Borsboom et al., 2021). Others have used psychometric tools, but fail to understand the basic properties of psychometric models (e.g., Digman, 1997; DeYoung & Peterson, 2002; Musek, 2007). Here, I give a simple introduction to the basic logic of psychometric models and illustrate how applied psychologists can get lost in latent variable space.

Figure 2 shows the most basic psychometric model that relates an observed variable to an unobserved cause. I am using a widely used measure of life-satisfaction as an example. Please rate your life on a scale from 0 = worst possible life to 10 = best possible life. Thousands of studies with millions of respondents have used this question to study “the secret of happiness.” Behaviorists would treat this item as a stimulus and participants’ responses on the 11-point rating scale as behaviors. One problem for behaviorists is that participants will respond differently to the same question. Responses vary from 0 (very rarely) all the way to 10 (more often, but still rare). The modal response in affluent Western countries is 7. Behaviorism has no answer to the question why participants respond differently to the same situation (i.e., question). Some researchers have tried to provide a behavioristic answer by demonstrating that responses can be manipulated (e.g., responses are different in a good or bad mood; Schwarz & Strack, 1999; Kahneman, 2011). However, these effects are small and do not explain why responses are highly stable over time and across different situations (Schimmack & Oishi, 2005). To explain why some people report higher levels of life-satisfaction than others, we have to invoke unobserved causes within respondents’ minds. Just like the forces that created the universe, these causes are not directly observable, but we know they must exist because we observe variation in responses that cannot be explained by variation in the situation (i.e., same situation and different behaviors imply internal causes).

Psychologists have tried to understand the mental processes that produce variation in Cantril ladder scores for nearly 100 years (Andrews & Withey, 1976; Cantril, 1965; Diener, 1984; Hartmann, 1936). In the 1980s, focus shifted from thoughts about one’s life (e.g., I hate my work, I love my spouse, etc.) to the influence of personality traits (Costa & McCrae, 1980). Just like life-satisfaction, personality is a latent variable that can only be measured indirectly by observing differences in behaviors in the same situation. The most widely used observed variables to do so are self-ratings of personality.

The key problem for the measurement of unobserved mental processes is that variation in observed scores can be caused by many different mental processes. To go beyond the level of observed variation in behaviors, it is necessary to separate the different causes that contribute to the variance in observed scores. The first step is to separate causes that produce measurement error. The most widely used approach to do so is to ask the same or similar questions repeatedly and to consider variability in responses as measurement error. The next figure shows a model for responses to two similar items.

When two or more observed variables are available, it is possible to examine the correlation between two variables. If two observed variables share a common cause, they are going to be correlated. The strength of the correlation depends on the relative strength of the shared mental process and the unique mental processes. Psychometrics works in reverse and makes inferences about the unobserved causes by examining the observed correlations. To do so, it is necessary to make some assumptions, and this is where things can go wrong when researchers do not understand these assumptions.
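The link between a shared cause and an observed correlation can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: each observed score is a weighted sum of one shared cause and one unique cause, all causes are independent standard normal variables, and both items use the same weights. The names `simulate_two_items` and `expected_r` are hypothetical and introduced only for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_two_items(shared_weight, unique_weight, n=20000, seed=1):
    """Simulate two items that share one cause and return their correlation."""
    rng = random.Random(seed)
    item1, item2 = [], []
    for _ in range(n):
        shared = rng.gauss(0, 1)  # the unobserved cause common to both items
        # Each item also has its own unobserved unique cause.
        item1.append(shared_weight * shared + unique_weight * rng.gauss(0, 1))
        item2.append(shared_weight * shared + unique_weight * rng.gauss(0, 1))
    return pearson_r(item1, item2)

def expected_r(shared_weight, unique_weight):
    """Correlation implied by the model: shared variance / total variance."""
    s2, u2 = shared_weight ** 2, unique_weight ** 2
    return s2 / (s2 + u2)

# Stronger shared causes (relative to unique causes) give higher correlations.
for s, u in ((1, 1), (2, 1), (1, 2)):
    print(s, u, round(expected_r(s, u), 2), round(simulate_two_items(s, u), 2))
```

The reverse inference is the hard part. An observed correlation tells us how much variance is shared only if the model's assumptions hold, for example that the unique causes are independent of each other.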

A common assumption is that the shared causal processes are important and meaningful, whereas the unique mental processes are unimportant, irrelevant, and error variance. Based on this assumption, the model is often drawn differently. Sometimes, the shared unobserved variable is drawn on top, and the unshared unobserved variables are drawn at the bottom (top = important, bottom = unimportant).

Sometimes, the unique mental processes are drawn smaller and without a name.

And sometimes, they are simply omitted because they are considered unimportant and irrelevant.

The omission of the unshared causes makes sense when psychometricians communicate with each other because they are trained in understanding psychometric models and use figures merely as a shorthand. However, when psychometricians communicate with psychologists, things can go horribly wrong because psychologists may not realize that the omission of residuals is based on assumptions that can be right or wrong. They may simply assume that the unique variances are never important and can always be omitted. However, this is a big mistake with undesirable consequences. To demonstrate this, I am always going to show the unique causes of all variables in the following models.

When psychologists ask similar questions repeatedly, they are assuming that the unique causes of the responses are measurement error. In the present example, individuals may interpret the words “worry” and “nervous” somewhat differently and this may elicit different mental processes that result in slightly different responses. However, the two terms are sufficiently similar that they also elicit similar cognitive processes that produce a correlation between responses to the two items. Under this assumption, the common causes reflect the causes that are of interest and the unique causes produce error variance. Under the assumption that unique causes produce error variance, it is possible to average responses to similar items. These averages are called scales. Averaging amplifies the variance that is produced by shared causes.

This is illustrated in the next figure where the average is fully determined by the two observed variables “I often worry” and “I am often nervous.” To make this a measurement model, we have to relate the average scores to the unobserved variables. Now we see that the shared mental process variable has two ways to influence the average scores, whereas each of the unique causes has only one way to contribute to the average. As the number of variables increases, the ratio (2:1) becomes even bigger for the shared variable (3 variables, 3:1). This implies that the shared mental processes more and more determine the average scores. This is the only part of measurement theory that psychologists are taught and understand, as reflected in the common practice of reporting Cronbach’s alpha (a measure of the shared variance in the average scores) as evidence that a measure is a good measure (Flake & Fried, 2020). However, the real measurement problems are not addressed by averaging across similarly-worded items. This is revealed in the next figure.
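How quickly the shared cause comes to dominate an average can be worked out directly. The sketch below assumes parallel items (each item is the shared cause plus an independent unique cause of equal variance), which is a simplifying assumption rather than a property of any real scale. Under that assumption the result coincides with the Spearman-Brown formula, which is also what standardized alpha reduces to.

```python
def shared_share_of_average(k, shared_var=1.0, unique_var=1.0):
    """Proportion of variance in a k-item average that is due to the shared cause.

    Each item is modeled as shared cause + its own unique cause, so the average
    equals the shared cause plus the mean of k independent unique causes.
    """
    return shared_var / (shared_var + unique_var / k)

def spearman_brown(k, r):
    """Spearman-Brown formula for k parallel items with inter-item correlation r."""
    return k * r / (1 + (k - 1) * r)

if __name__ == "__main__":
    r = 1.0 / (1.0 + 1.0)  # inter-item correlation implied by the default variances
    for k in (1, 2, 3, 10):
        print(k, round(shared_share_of_average(k), 3), round(spearman_brown(k, r), 3))
```

A high value only says that the items share something. It does not say what the shared cause is.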

To use the average of responses to similarly worded items as an observed measure of an unobserved personality trait, we have to assume that the shared mental processes that produce most of the variance in the average scores are caused by the personality trait that we are trying to measure. In the present example, personality psychologists use items like “worry” and “nervous” to measure a trait called Neuroticism. Despite 100 years of research, it is still not clear what Neuroticism is and some psychologists still doubt that Neuroticism even exists. Those who do believe in Neuroticism assume it reflects a general disposition to have more negative thoughts (e.g., low self-esteem, pessimism) and feelings (anxiety, anger, sadness, guilt). The main problem in current personality research is that item-averages are often treated as if they are perfect observed indicators of an unobserved personality trait (see next figure).

Ample research suggests that average scores of neuroticism items are also influenced by other factors such as socially desirable responding. Thus, it is a simplification to assume that item-averages are identical or isomorphous to the personality trait that they are designed to measure. Nevertheless, it is common for personality psychologists to study the influence of unobserved causes like Neuroticism by means of item averages. As we see later, even when psychologists use latent variable models, Neuroticism is just a label for an item-average. The problem with this practice is that it gives the illusion that we can study the causal effects of unobservable personality traits by examining the correlations of observable item-averages.

In this way, measurement problems are treated as unimportant, just like behaviorists considered mental processes as unimportant and relegated them to a black box that should not be examined. The same attitude prevails today with regard to personality measurement, when boxes (observed variables) are given names without checking that the labels actually match the content of the box (i.e., the unobserved causes that a measure is supposed to reflect). Often psychological constructs are merely labels for item-averages. Accordingly, neuroticism is ‘operationalized’ with an item-average and neuroticism can be defined as “whatever a neuroticism scale measures.”

When Things Go from Bad to Worse

In the 1980s, personality psychologists came to a broad consensus that the diversity of human traits (e.g., anxious, bold, curious, determined, energetic, frank, gentle, helpful, etc.) can be organized into a taxonomy with five broad traits, known as the Big Five. The basic idea is illustrated in the next figure with Neuroticism. According to Big Five theory, Neuroticism is a general disposition to experience more anxiety, anger, and sadness. However, each emotion also has its own disposition. Thus, variation in scales that measure anxiety, anger, and sadness is influenced by both Neuroticism (i.e., the general disposition) and specific causes. In addition, scales can also be influenced by general and specific measurement errors. The figure makes it clear that the scores in the item-averages can reflect many different causes aside from the intended broader personality trait called Neuroticism. This makes it risky to rely on these item averages to draw inferences about the unobserved variable Neuroticism.

A true science of personality would try to separate these different causes and to examine how they relate to other variables. However, personality psychologists often hide the complexity of personality measurement by treating personality scales as if they directly reflect a single cause. While this is bad enough, things get even worse when personality psychologists speculate about even broader personality traits.

The General Personality Factor (Musek, 2007)

The Big Five were considered to be roughly independent from each other. In fact, they were found with a method that looked for independent factors (another name for unobserved variables) more commonly used in personality research. However, when Digman (1997) examined correlations among item-averages, he found some systematic patterns in these correlations. This led him to postulate even broader factors than the Big Five that might explain these patterns. The problem with these theories is that they are no longer trying to relate observed variables to unobserved variables. Rather, Digman started to speculate about causal relationships among unobserved variables on the basis of imperfect indicators of the Big Five.

The first problem with Digman’s attempt to explain correlations among unobserved variables was that he lacked expertise in the use of psychometric models. As a result, he made some mistakes and his results could not be replicated (Anusic et al., 2009). A few years later, a study that controlled for some of the measurement problems by using self-ratings and informant ratings suggested that the Big Five are more or less independent and that correlations reflect measurement error (Biesanz & West, 2004; see also Anusic et al., 2009). However, other studies suggested that higher-order factors exist and may have powerful effects on people’s lives, including their well-being. Subsequently, I am going to show that these claims are based on a simple misunderstanding of measurement models that treat unique variance in the Big Five scales as error variance.

Musek (2007) proposed that correlations among Big Five scales can be explained with a single higher-order factor. This model is illustrated in his Figure 1.

First, it is notable that the unique mental processes that contribute to each of the Big Five scales are called e1 to e5 and the legend of the figure explains that e stands for error variances. This terminology can be justified if we treat Big Five scales only as observed variables that help us to observe the unobserved variable GFP. As GFP is not directly observable, we have to infer its presence from the correlations among the observed variables, namely the Big Five scales. However, labeling the unique causes that produce variation in Neuroticism scores error variance is dangerous because we may think that the unique variance in Neuroticism is no longer important; just error. Of course, this variance is not error variance in some absolute sense. After all, Neuroticism scales exist only because personality psychologists assume that Neuroticism is a real personality trait that is related to even more specific traits like anxiety, anger, and sadness. Thus, all of the variance in a neuroticism scale is assumed to be important and it would be wrong to assume that only the variance shared with other Big Five scales is important. To avoid this misinterpretation, it would be better to keep the unique causes in the model.

Another problem of this model is that the model itself provides no information about the actual causes of the correlations among the Big Five scales. This is different when items are written for the explicit purpose of measuring something that they have in common. In contrast, the correlations among the Big Five traits are an empirical phenomenon that requires further investigation to understand the nature of the causal processes that produce correlations. In other words, GFP is just a name for “shared cognitive processes;” it does not tell us what these shared cognitive processes are. To examine this question, it is necessary to see how the GFP is related to other things. This is where things go horribly wrong. Rather than relating the unobserved variable in Figure 1 to other measures, Musek (2007) averages all Big Five items to create an item average that is supposed to represent the unobserved variable. He then uses correlations of the GFP scale to make inferences about the GFP factor. The problems of this approach are illustrated with the next figure.

The figure illustrates that the general personality scale is not a good indicator of the general personality factor. The main problem is that the scale scores are also influenced by the unique causes that contribute to variation in the Big Five scales (on top of measurement error that is not shown in the picture to avoid clutter, but should not be forgotten). The problem is hidden when the unique causes are represented as errors, but unique variance in neuroticism is not error variance. It reflects a disposition to have more negative thoughts and this disposition could have a negative influence on life-satisfaction. This contribution of unique causes is hidden when Big Five scale scores are averaged and labeled General Personality.
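How much of a general personality scale reflects the general factor rather than the unique causes can be sketched with a simple variance decomposition. The loadings below are hypothetical and chosen only for illustration; they are not Musek’s (2007) estimates. The sketch assumes standardized Big Five scales, independent unique causes, and an unweighted average.

```python
def general_factor_share(loadings):
    """Share of variance in an unweighted scale average due to the general factor.

    Assumes standardized scales, each equal to loading * general factor plus an
    independent unique part with variance 1 - loading**2.
    """
    general = sum(loadings) ** 2
    unique = sum(1 - loading ** 2 for loading in loadings)
    return general / (general + unique)

if __name__ == "__main__":
    hypothetical_loadings = [0.4, 0.4, 0.4, 0.4, 0.4]  # illustrative values only
    share = general_factor_share(hypothetical_loadings)
    print(round(share, 3), round(1 - share, 3))
```

With modest loadings like these, roughly half of the variance in the average comes from the unique causes, so any correlation of the scale with life-satisfaction can be carried by either source.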

Musek (2007) reports a correlation of r = .5 (Study 1) between the general personality scale and a life-satisfaction scale. Musek claims that this high correlation must reveal a true relationship between the general factor of personality and life-satisfaction and cannot reflect a method artifact like socially desirable responding. It is unclear why Musek (2007) relied on an average of Big Five scale scores to examine the relationship of the general factor with life-satisfaction. Latent variable modeling makes it possible to examine the relationship of the general factor directly without the need for scale scores. Fortunately, it is possible to conduct this analysis post-hoc based on the reported correlations in Table 1.

The first model created a general personality scale and used the scale as a predictor of life-satisfaction. The only difference to a simple correlation is that the model also includes the implied measurement model. This makes the model testable because it imposes restrictions on the correlations of the Big Five scales with the life-satisfaction scale. The fit of the model was acceptable, but not great, suggesting that alternative models might produce even better fit, RMSEA = .078, CFI = .958.

In this model, it is possible to trace the paths from the unobserved variables to life-satisfaction. The strongest relationship was the path from the general personality factor (h) to life-satisfaction, b = .42, se = .04, but the model also implied that unique variances of the Big Five scales contribute to life-satisfaction. These effects are hidden when the general personality scale is interpreted as if it is a pure measure of the general personality factor.

A direct test of the assumption that the general factor is the only predictor of life-satisfaction requires a simple modification of the model that links life-satisfaction directly to the general factor (h). This model actually fits the data better, RMSEA = .048, CFI = .984. This might suggest that the unique causes of variation in the Big Five are unrelated to life-satisfaction.

However, good fit is not sufficient to accept a model. It is also important to rule out plausible alternative models. An alternative model assumes that the Big Five factors are necessary and sufficient to explain variation in life-satisfaction. There is no reason to create a general scale and use it as a predictor. Instead, life-satisfaction can simply be regressed onto the Big Five scales as indicators of the Big Five factors. In fact, it is always possible to get good fit for a model that uses indicators as predictors of outcomes because the model does not impose any restrictions (i.e., the model is just identified). The only reason why this model fits worse than the other model is that fit indices like RMSEA and CFI reward parsimony and this model uses 5 predictors of life-satisfaction whereas the previous model had only one predictor. However, parsimony cannot be used to falsify a model.
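What “just identified” means can be illustrated by counting degrees of freedom. The sketch below assumes textbook specifications of the two models with six observed variables (five Big Five scales plus life-satisfaction). The parameter counts are an assumption about how such models are typically set up, because the exact specifications and degrees of freedom are not reported here.

```python
def observed_moments(n_variables):
    """Number of distinct variances and covariances among the observed variables."""
    return n_variables * (n_variables + 1) // 2

def degrees_of_freedom(n_variables, n_free_parameters):
    """Model degrees of freedom: observed moments minus freely estimated parameters."""
    return observed_moments(n_variables) - n_free_parameters

if __name__ == "__main__":
    n_vars = 6  # five Big Five scales plus life-satisfaction

    # Regression on all five scales: 15 predictor variances and covariances,
    # 5 regression paths, and 1 residual variance.
    regression_params = 15 + 5 + 1

    # One general factor with a single path to life-satisfaction (assumed
    # specification, factor variance fixed to 1): 5 loadings, 5 unique
    # variances, 1 path, and 1 residual variance.
    general_factor_params = 5 + 5 + 1 + 1

    print(degrees_of_freedom(n_vars, regression_params))
    print(degrees_of_freedom(n_vars, general_factor_params))
```

A model with zero degrees of freedom reproduces the observed correlations exactly and therefore cannot be rejected by the data, whereas a model with positive degrees of freedom imposes restrictions that can fail.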

In fact, it is possible to find an even better fitting model because only two of the five Big Five scales, neuroticism and extraversion, were significant predictors of life-satisfaction. This finding is consistent with many previous studies showing that these two Big Five traits are the strongest predictors of life-satisfaction. If the model is limited to these two predictors, it fits the data better than the model with a direct path from the general factor, CFI = .987, RMSEA = .045. Musek (2007) was unable to realize that the unique variances in neuroticism and extraversion make a unique contribution to life-satisfaction because the general personality scale does not separate shared and unique causes of variation in the Big Five scales.

The Correlated Big Two

In contrast to Musek (2007), DeYoung and Peterson favor a model with two correlated higher-order factors (DeYoung, Peterson, & Higgins, 2002; see Schimmack, 2022, for a detailed discussion).

Like Musek (2007), they treat the unique causes of variation in Big Five traits as error (e1-e5) and assume that relationships of the higher-order factors with criterion variables are direct rather than being mediated by the Big Five factors. Here, I fitted this model to Musek’s (2007) data. Fit was excellent, CFI = .996, RMSEA = .030.

Based on this model, life-satisfaction would be mostly predicted by stability rather than neuroticism and extraversion or a general factor. However, just because this model has excellent fit doesn’t mean it is the best model. The model simply masks the presence of a general factor by modeling the shared variance between Plasticity and Stability as a correlated residual. It is also possible to model it with a general factor. In this model, Stability and Plasticity would be an additional level in a hierarchy between the Big Five and the General Factor. This model does not impose any additional restrictions and fits the data as well as the previous model, CFI = .996, RMSEA = .030. Thus, even though Stability and Plasticity can be identified, it does not mean that this distinction is important for the prediction of life-satisfaction. The general factor could still be the key predictor of life-satisfaction.

However, both models make the assumption that the unique causes of variation in Big Five scales are unrelated to life-satisfaction, and we already saw that this assumption is false. As a result, the model that relates life-satisfaction to neuroticism and extraversion fits the data, CFI = .994, RMSEA = .035, and the paths from extraversion and neuroticism to life-satisfaction were significant.

Musek (2007) and DeYoung et al. (2006) ignored the possibility that unique causes of variation in the Big Five contribute to the prediction of other variables because they made the mistake of equating unique variances with error variances. This interpretation is based on the basic examples that are used to illustrate latent variable models for beginners. However, the interpretation of all aspects of a latent variable model, including the residual or unique variances, has to be guided by theory. To avoid these mistakes, psychometricians need to stop presenting their models as if they can be used without substantive theory and substantive researchers need to get better training in the use of psychometric tools.


Compared to other sciences like physics, astronomy, chemistry, or biology, psychology has made little progress over the past 40 years. While there are many reasons for this lack of progress, one problem is the legacy of behaviorism to focus on observable behaviors and to rely on experimentation as the only scientific approach to test causal theories. Another problem is an ideological bias against personality as a causal force that produces variation between individuals (Mischel, 1968). To make progress, personality science has to adopt a new scientific approach that uses observed behaviors to test causal theories of unobservable forces like personality. While personality scales can be used to predict behaviors and life-outcomes, they cannot explain behaviors and life-outcomes. Latent variable modeling provides a powerful tool to test causal theories. The biggest advantage of latent variable modeling is that model fit can be used to reject models. A cynic might think that this is the main reason why it is not used more by psychologists: it is more fun to build a theory and confirm it than to find out that it was false. But fun doesn’t equal scientific progress.

P.S. What about Network Models?

Of course, it is also possible to reject the idea of unobserved variables altogether and draw pictures of the correlations (or partial correlations) among all the observed variables. The advantage of this approach is that it always produces results that can be used to tell an interesting story about the data. The disadvantage is that it always produces a result and therefore doesn’t test any theory. Thus, these models cannot be used to advance personality psychology towards a science that progresses by testing and rejecting false theories.

Princeton Talk About Z-Curve

Awards, Ivy League universities, or prestigious journals are suboptimal heuristics to evaluate people’s work, but in a world of information overflow, they influence the popularity of ideas. Therefore, I am cashing in on Jason Geller’s invitation to present z-curve in the Advanced Research Methods seminar at Princeton.

The talk was recorded and Jason and Princeton University generously shared the recording with me (Video). The talk builds on previous talks, but incorporates the latest z-curve findings that demonstrate the power of z-curve to predict replication failures and to justify the use of alpha = .005 as a reasonable criterion for significance tests to keep the risk of false positive results in psychological journals at a reasonably low level.

You can find many other z-curve related articles and studies on my blog. Here I want to mention only the two peer-reviewed articles that introduced the method and provide more detailed information about the method.

Estimating Population Mean Power Under Conditions of Heterogeneity and Selection for Significance

Z-Curve 2.0: Estimating Replication Rates and Discovery Rates

To conduct your own z-curve analysis, you can use the z-curve package in R.

Racism and Traffic Stops

Recently, a team of German sociologists combined data about racial biases in police stops in the United States (Stanford Open Policing Project; Pierson et al., 2020) and data about county-level average levels of racial biases collected by Project Implicit (Xu et al., 2022). The key finding was that various measures of racial bias were correlated with racial bias in traffic stops by police (published in the Supplement Table 2).

The authors missed an opportunity to examine the validity of different measures of racial attitudes under the assumption that all measures, implicit and explicit, reflect a common attitude rather than distinct attitudes (Schimmack, 2021). If implicit measures tapped some distinct form of unconscious bias, they should show incremental predictive validity. To examine this question, I used the correlations in Table 2 and fitted a structural equation model to the data. I found that a model with a single racial bias factor fitted the data reasonably well, chi2 (df = 9, N ~ 300) = 34.52, CFI = .975, RMSEA = .097. The effect size of b = .369 for bias implies that for every increase in bias by one standard deviation, there is a .369 increase in racial bias in traffic stops. This is considered a moderate effect size in comparison to other effect sizes in the social sciences.

The more interesting result is that the race IAT and simple self-report measures of racial bias are equally valid measures of counties’ average level of racial bias. The effect sizes are .797 for the feeling thermometer, .784 for a simple preference rating, and .834 for the race Implicit Association Test; a computerized task that is less susceptible to socially desirable responding. The high validity coefficients of these measures can be explained by the aggregation of individuals’ scores. Aggregation reduces random measurement error as well as systematic biases that are unique to individuals. Thus, the present results show that race IAT scores are valid measures of racial biases at the aggregated level. The results also show that self-ratings provide as much valid information. This undermines claims by Greenwald, who developed the IAT, that the race IAT is a more valid measure of racial biases than self-ratings (see also Schimmack, 2021, for studies at the individual level).
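The reported coefficients can be turned into model-implied correlations with the tracing rule for a single-factor model. The sketch below assumes that the reported values (.797, .784, .834, and .369) are standardized estimates and that the three measures are related only through the common bias factor. The resulting numbers are what the model implies, and they need not match the observed correlations in the article’s table.

```python
from itertools import combinations

# Estimates reported in the text, assumed here to be standardized.
LOADINGS = {
    "feeling thermometer": 0.797,
    "preference rating": 0.784,
    "race IAT": 0.834,
}
BIAS_TO_STOPS = 0.369

def implied_measure_correlations(loadings):
    """Model-implied correlation of each pair of measures: product of their loadings."""
    return {(a, b): loadings[a] * loadings[b] for a, b in combinations(loadings, 2)}

def implied_outcome_correlations(loadings, path):
    """Model-implied correlation of each measure with the outcome: loading * path."""
    return {name: loading * path for name, loading in loadings.items()}

if __name__ == "__main__":
    for pair, r in implied_measure_correlations(LOADINGS).items():
        print(pair, round(r, 3))
    for name, r in implied_outcome_correlations(LOADINGS, BIAS_TO_STOPS).items():
        print(name, round(r, 3))
```

Because the three loadings are so similar, the model implies nearly identical correlations with traffic stops for the implicit and the explicit measures.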

The figure also shows an additional relationship between the race IAT and the weapons IAT. This relationship reveals that IAT tasks reflect some information that is not captured by self-reports. However, it is not clear whether this variance is method variance or valid variance of unconscious bias. In the latter case, the unique variance in the race IAT could predict police stops in addition to the bias factor (incremental predictive validity).

Adding this path did not improve model fit and the effect size estimate was not significantly different from zero, b = -.045, 95%CI = -.305 to .214. These results are consistent with many other results that the incremental predictive validity of the race IAT is elusive and even if it is not zero, it is likely to be negligible (Kurdi et al., 2019).

In short, the article could have made a nice contribution to the literature by demonstrating that implicit and explicit measures of racial bias show high convergent validity when they are aggregated to measure racial bias of US counties, and by demonstrating that racial bias predicts an important behavior, namely police officers’ decision to conduct a traffic stop.

However, the discussion of the results in the article is problematic and may reveal a sociological bias or the lack of lived experience of German researchers. The authors interpret the results as evidence that situational factors explain the results.

“The observed relationships between regional-level bias and police traffic stops underscore the role of the context in which police officers operate. Our findings are consistent with theorizing by Payne et al. (2017), who argued that some contexts expose individuals more regularly to stereotypes and/or prejudice, increasing mental accessibility of biased thoughts and feelings, in turn influencing individual behavior. Consequently, behavioral expressions of prejudice and stereotypes often reflect properties of contexts rather than stable dispositions of people (but see Connor & Evers, 2020).”

The plausible alternative explanation is relegated to a “but see.” As a German who has lived in the United States and is constantly exposed to US media while living in Canada, I think the “but” deserves more attention and is actually a more plausible explanation of these findings. After all, police officers are not Robo-Cops or United Nations soldiers. They are typically born and raised in or near the county where they work (Flint Town). As a result, their own racial biases are likely to be similar to the racial biases measured in the Project Implicit data (see Andersen et al., 2021, for race IAT scores of police officers). Thus, it is entirely possible that racial biases of police officers, rather than some mysterious unidentified social context, contribute to the racial biases in police stops. This does not mean that social factors are not at play. The fact that racial bias is not some involuntary, unconscious bias means that better training and incentives can be used to reduce bias in police officers’ behaviors without changing their attitudes and feelings. Traffic stops are clearly deliberate actions that are not made in a split second. Thus, officers can be trained in avoiding biases in their actions without the need to change their implicit or explicit attitudes. Although attitude change would be desirable, it is difficult and will take time. For now, Black citizens are likely to settle for equal treatment rather than waiting for changes in implicit attitudes that are difficult to measure and have no known effects on behavior.

In conclusion, it is well known that racism is a problem among US police officers. Often these officers are known and remain on the force. This study shows that these racial attitudes have clear consequences that sometimes lead to the death of innocent Black civilians. To attribute these incidents to some abstract contextual factors ignores the lived experiences of thousands of African Americans. The data are fully consistent with the common assumption of African Americans that racist cops are more likely to pull them over. The present study showed that this fear is more justified in counties with higher levels of racism.

Personality Misch-Masch-urement

Lew Goldberg has made important contributions to personality psychology. He contributed to the development of the Big Five model that is currently the most widely accepted model of the higher-order factors of personality that describe the relationship among the basic trait words used in everyday language.

He also pioneered open science when he made a large pool of personality items available to all researchers and created open and free measures that mimic proprietary measures like Costa and McCrae’s NEO scales. Because these measures were designed to measure the original scales as closely as possible, the validity of the scales is defined in terms of correlations with the existing scales. The goal of the IPIP project was not to examine validity or to improve on existing measures. As Lew pointed out in a personal correspondence to me, users of IPIP measures could have created new measures based on the initial 300 items. The fact that users of these items have failed to do so shows a lack of interest in construct validation. Thanks to Lew Goldberg, we have open items and open data to develop better measures of personality.

The extended 300-item IPIP measure has been used to provide thousands of internet users free feedback about their personality, and Johnson made his data from these surveys openly available (OSF-data).

The present critical examination of the psychometric properties of the IPIP scales would not be possible without these contributions. My main criticism of personality measurement is that personality psychologists have not used the statistical tools that are needed to validate a personality measure. A common and false belief among personality psychologists is that these tools are not suitable for personality measures. A misleading article by McCrae, Costa and colleagues in the esteemed Journal of Personality and Social Psychology did not help. The authors were unable to fit a Big Five model to their data. Rather than questioning the model, they decided that the method is wrong because “we know that the Big Five model is right”. This absurd conclusion has been ridiculed by psychometricians (Borsboom, 2006), but led only to a defensive response by personality psychologists (Clark, 2006). For the most part, personality psychologists today continue to create scales or use scales that lack proper validation. The IPIP-300 is no exception. This blog post just illustrates with a simple example how bad measurement can derail science.

The IPIP-300 aims to measure 30 personality traits that are called facets. Facets are constructs that are more specific than the Big Five and closer to everyday trait concepts. Each facet is measured with 10 items. The 10 items are summed or averaged to give individuals a score for one of the 30 facets. Each facet has a name. There are two ways to interpret these names. One interpretation is that the name is just a short-hand for a scientific construct. For example, the term Depression is just a name for the sum-scores of 10 items from the IPIP. To know what this sum score actually measures, one might need to examine the item content, learn about the correlations of this sum-score with other sum-scores, or understand the scientific theory that led to the creation of the 10 items. Accordingly, the Depression scale measures whatever it is supposed to measure, and whatever that is gets called Depression. In this case, we could change the name of the scale without changing anything in our understanding of the scale. We could call it the D-scale or just facet number 3 of Neuroticism. Depression is just a name. The alternative view assumes that the 10 items were selected to measure a construct that is at least somewhat related to what we mean by depression in our everyday language. For example, we would be surprised to see the item “I like ginger” or “I often break the rules” in a list of items that are supposed to measure depression. The use of everyday trait words as labels for scales does usually imply that researchers are aiming to measure a construct that is at least similar to the everyday meaning of the label. Unfortunately, this is often not the case and interpreting scales based on their labels can lead to misunderstandings.

To illustrate the problem of misch-masch-urement, I am using two facet scales from the IPIP-300 that are labeled Depression and Modesty. I used the first 10,000 observations in Johnson’s dataset and selected only US respondents with complete data (N = 6,786). The correlation between Depression and Modesty was r = .35, SE = .01. I replicated this finding with the next 10,000 observations, again selecting only US respondents with complete data (N = 5,864), r = .39, SE = .01. The results clearly show a moderate positive relationship between the two scale scores. A correlation of r = .35 implies that a respondent who is above average in Depression has about a 67.5% probability of also being above average in Modesty. We could now start speculating about the causal mechanism that produces this correlation. Maybe bragging (not being modest) reduces the risk of depression. Maybe being depressed lowers the probability of bragging. Maybe it is both and maybe there are third variables at play. However, before we even start down this path, we have to consider the possibility that the sum score labels are misleading and we are not even seeing the correlation between the constructs that we have in mind when we talk about depression and modesty. This question is examined by fitting a measurement model to the items that were used to create the sum scores.
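The translation of r = .35 into a probability of 67.5% corresponds to the binomial effect size display, which simply computes .5 + r/2. The sketch below reproduces that number and, for comparison, shows the conditional probability that would follow under the additional assumption of bivariate normal scores, which is not an assumption stated in the text. The function names are illustrative.

```python
import math

def besd_probability(r):
    """Binomial effect size display: 0.5 + r / 2."""
    return 0.5 + r / 2

def bivariate_normal_probability(r):
    """P(Y above its mean | X above its mean) for bivariate normal scores."""
    return 0.5 + math.asin(r) / math.pi

if __name__ == "__main__":
    for r in (0.35, 0.39):
        print(r, round(besd_probability(r), 3), round(bivariate_normal_probability(r), 3))
```

Under bivariate normality the probability comes out a few percentage points lower than the display value, so the 67.5% is best read as a rough guide to the size of the relationship.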

Of course, the two scales were chosen because a simple measurement model does not fit the data. This is shown with a simplified figure of a measurement model that assumes the 10 items of a scale all reflect a common construct and some random measurement error. The items are summed to reduce the random measurement error so that the sum score mostly reflects the common construct. The main finding is that this simple model does not meet standard criteria of acceptable fit such as a Comparative Fit Index (CFI) greater than .95 or a Root Mean Square Error of Approximation (RMSEA) below .06. Another finding is that the correlation between the factors (i.e., unobserved variables that are assumed to cause the shared variance among items) is even stronger, r = .69, than the correlation among the scales. This would be interpreted as evidence that measurement error reduces the correlation with scales and the correlation among the factors shows the true correlation. However, the model does not fit and the correlation should not be interpreted.
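For readers who have not seen how these fit indices are computed, here is a minimal sketch based on the usual textbook formulas. The chi-square values are hypothetical and were not taken from the analysis; only the cutoffs (CFI > .95, RMSEA < .06) come from the text. Note that some programs use N rather than N - 1 in the RMSEA denominator.

```python
import math


def cfi(chi2_model: float, df_model: int, chi2_baseline: float, df_baseline: int) -> float:
    """Comparative Fit Index: improvement in misfit relative to a baseline model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model)
    return 1.0 if d_baseline == 0 else 1.0 - d_model / d_baseline


def rmsea(chi2_model: float, df_model: int, n: int) -> float:
    """Root Mean Square Error of Approximation (N - 1 version)."""
    return math.sqrt(max(chi2_model - df_model, 0.0) / (df_model * (n - 1)))


def meets_criteria(cfi_value: float, rmsea_value: float) -> bool:
    """Cutoffs named in the text: CFI > .95 and RMSEA < .06."""
    return cfi_value > 0.95 and rmsea_value < 0.06


# Hypothetical chi-square values for a two-factor model of 20 items.
example_cfi = cfi(chi2_model=5000, df_model=169, chi2_baseline=30000, df_baseline=190)
example_rmsea = rmsea(chi2_model=5000, df_model=169, n=6786)
print(f"CFI = {example_cfi:.3f}, RMSEA = {example_rmsea:.3f}, "
      f"acceptable = {meets_criteria(example_cfi, example_rmsea)}")
```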

Inspection of the items suggests some reasons why the simple model may not fit and why the positive correlation is at least inflated, if not totally an artifact. For example, the item “Have a low opinion of myself” is used as an item to measure Depression, while the item “Have a high opinion of myself” is reversed and used to measure Modesty (reverse scoring means that low ratings on this item are scored as high modesty). Just looking at the items, we might suspect that they are both measures of low and high self-esteem, respectively. While it is plausible that Depression and Modesty are linked to low self-esteem, it is a problem to use self-esteem items to measure both. This will produce an artificial positive correlation between the scales and lead to the false impression that Depression and Modesty are positively correlated when they are actually unrelated or even negatively related. This is what I call the misch-masch problem of personality measurement. Scales are foremost averages of items and it is not clear what these scales measure if the scales are not properly evaluated with a measurement model.
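A small simulation can illustrate how shared item content produces a correlation between scales. This is a sketch under strong, made-up assumptions: the three constructs (depression, modesty, low self-esteem) are simulated as completely independent, all loadings are 0.7, and a few items in each scale are self-esteem items. None of these numbers are estimates from the IPIP data. Reverse scoring is done by negating the continuous item score, which plays the same role as recoding 1 to 5 ratings as 6 minus the rating.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=5000, n_shared=3, seed=1):
    """Correlation of two 10-item scale scores whose constructs are unrelated."""
    rng = random.Random(seed)
    dep_scores, mod_scores = [], []
    for _ in range(n):
        # Three independent person-level variables (hypothetical).
        dep, mod, low_se = (rng.gauss(0, 1) for _ in range(3))
        dep_items = [0.7 * dep + rng.gauss(0, 0.7) for _ in range(10 - n_shared)]
        # "Have a low opinion of myself" type items in the Depression scale.
        dep_items += [0.7 * low_se + rng.gauss(0, 0.7) for _ in range(n_shared)]
        mod_items = [0.7 * mod + rng.gauss(0, 0.7) for _ in range(10 - n_shared)]
        # "Have a high opinion of myself" type items: raw ratings are lower
        # for people with low self-esteem, then the items are reverse scored.
        raw_high_opinion = [-0.7 * low_se + rng.gauss(0, 0.7) for _ in range(n_shared)]
        mod_items += [-rating for rating in raw_high_opinion]
        dep_scores.append(sum(dep_items) / 10)
        mod_scores.append(sum(mod_items) / 10)
    return pearson(dep_scores, mod_scores)


print("scale correlation with shared self-esteem items:", round(simulate(n_shared=3), 2))
print("scale correlation without shared items:", round(simulate(n_shared=0), 2))
```

With shared self-esteem items the scale scores correlate positively even though the simulated Depression and Modesty constructs are unrelated. Without shared items the correlation is close to zero.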

As items are closer to the level of everyday conversations about personality, it is not difficult to notice other similarities between items. For example, “often feel blue” and “rarely feel blue” are simply oppositely worded questions about the same feeling. These items should correlate more strongly (negatively) with each other than the item “rarely feel blue” and “feel comfortable with myself”. However, our interpretation of items may differ from the interpretation of the average survey respondent. Thus, we need to examine empirically the pattern of correlations. One reason why personality researchers do not do this is another confusion caused by a bad label. The best statistical tool to explore the pattern of correlations among items is called Confirmatory Factor Analysis. The label “Confirmatory” has led to the false impression that this method can only be used to confirm a theoretical model. But when a model like the simple model in Figure 1 does not fit, we do not have a theory to suggest a more complex model. We could of course explore the data, but the term confirmatory implies that this would be wrong or an abuse of a method that should not be used for exploration. This is pure nonsense. We can use CFA to explore the data, find a plausible model that fits the data, and then confirm this model with a new dataset. We can then also use this model to make new predictions, test them, and if the predictions fail, further revise the model. This is called science and is fully in line with Cronbach and Meehl’s (1955) approach to construct validation. Why do I make such a big deal about this? Because my suggestion to use CFA to explore personality data has been met with a lot of resistance by veteran personality psychologists.

In response to a related blog post, William Revelle wrote me an email.

Inspired by your blog on how one needs to use CFA to do hierarchical models (which is in fact, incorrect), I prepared the enclosed slides.
I try to point out that EFA approaches can a) give goodness of fit tests and  b) do hierarchical models.
In a previous post you suggested that those of us in personality should know some psychometrics and not use simple sum scores.  I think you are correct with respect to the first part of your argument, but you might find my paper with Keith Widaman a useful response suggesting that sum scores are not as bad as you think.
Your comment about some people (i.e., our Dutch friend) refusing to understand the silliness of a general factor of personality was most accurate. 

Bill is right that EFA can sometimes produce the right results, but this is not a good argument to use an inferior method. The key problem of EFA is that it does not require any theory and as a result also does not test a theory. If a model does not fit, researchers cannot change the model because the model is based on a stringent set of mathematical principles that are not based on any substantive theory. In contrast, CFA requires that researchers think about their data and why the model does not fit.

In response to my CFA analysis of Costa and McCrae’s NEO-PI-R, Robert McCrae wrote this response:

I just read your blog on “what lurks beneath”. I must say that I find the blog format disconcerting, both for its informality and its lack of editing and references. But here are a few responses.
1. We certainly agree that people ought to measure facets as well as domains; that personality is not simple structured; that there is some degree of evaluative bias in any single source of data.
2. What we argued in the 1996 paper was that CFA “as it has typically been applied in investigating personality structure, is systematically flawed” (p. 552, italics added). I should think you would agree with that position; you have criticized others for failing to acknowledge secondary loadings and evaluative biases in their CFAs.
3. Why in the world do you think that “CFA is the only method that can be used to test structural theories”? If that were true, I would agree with your position. But the major point of our paper was to offer an alternative confirmatory approach using targeted rotation. There are a number of instances where this method has led to falsification of hypotheses—John’s study of personality in dogs and cats showed that the FFM doesn’t fit even after targeted rotation.
4. I would have liked a comparison with Marsh’s ESEM, which was developed in part in response to our 1996 paper.
5. “The evaluation of model fit was still evolving”. That, I would say, is an understatement. In my experience, most fit indices in SEM and other statistical approaches are essentially as arbitrary as p < .05. There are virtually no empirical tests of the utility of fit indices. And most are treated as dichotomies: A model fits or not. That is like deciding that coefficient alpha should be .70, and throwing out a scale because its alpha is only .69. I recall a paper on national levels of traits in which the authors were told by reviewers not to report the observed means because they could not demonstrate measurement invariance. This is statistically-mandated data suppression.
6. I am not quite convinced by your analysis of evaluative bias in the NEO data. It is really difficult to separate substance from style in mono-method data. One could argue that the factor you call EVB is really N, and vice-versa. I have attached a chapter in which we reported joint factor analyses of self-reports and observer ratings and included bias factors (pp. 280-283).

I was fortunate to take a CFA (SEM) course offered by Ralf Schwarzer at the Free University Berlin in the early 1990s. I have been using LISREL, EQS, and now MPLUS for 30 years. I thought the older professors were just too old to learn this method, and that attitudes would change. However, Borsboom wrote his attack on bad practices in personality research in 2006, and measurement is still considered a secondary topic in graduate education. This attitude towards measurement has been called a measurement-schmeasurement attitude (Flake & Fried, 2020). It is time to end this embarrassing status quo and to take measurement seriously.

After exploring the data and trying many different models, I settled on a model that fits the data. I then cross-validated this model in the second dataset. However, given the large sample sizes, the structure is very robust and the model had nearly identical fit in the second dataset. The model fit of the cross-validated model also met standard fit criteria, CFI = .983, RMSEA = .035. This does not mean that it is the best model. As the data are open, other researchers could try to find better models. Importantly, minor differences between models are not important, as long as the main results are consistent. The model also does not automatically tell us what the 10-item scales measure. This question can only be answered with additional data that relate the factors in the model to other variables. However, we can at least see how items are related to the factors that the scales aimed to measure.

Figure 2 shows that it is possible to describe the correlations among items from the same scale with three factors that are simply labeled Dep1, Dep2, and Dep3 for Depression and Mod1, Mod2, and Mod3 for Modesty. Dep1 is mainly related to feeling blue and depressed. Dep2 is related to low self-esteem. Dep3 is related to two items that might be interpreted as pessimism. Mod1 is related to low self-esteem, Mod2 is about bragging, and Mod3 is about avoiding being the center of attention. As predicted by the similar wording, two self-esteem items of Mod2 are also related to the Dep2 factor. In addition, the Modesty factor is also related to Dep2, presumably because modest participants do not rate themselves lower on self-esteem items. However, there is no relationship to Dep1, the feeling blue factor. Thus, Modesty is not related to feeling depressed, as implied by the Depression label of the scale. In fact, the correlation between the Depression and Modesty factors is now close to zero. Thus, the strong correlation in the poorly fitting model and the moderate correlation based on scale scores misrepresent the relationship between Depression and Modesty.

Simple models of two facets are just a building block along the way to testing more complex models of personality. I hope you realize that this is an important step before personality scales can be used for research and before people are given feedback about their personality online. You might be surprised that not all personality psychologists agree. Some personality researchers would rather publish pretty pictures of the models in their heads without checking that they actually fit real data. For example, Colin DeYoung has published this picture to illustrate his ideas about the structure of personality.

This model implies that there should be a negative correlation between the Depression facet of Neuroticism and the Modesty facet of Agreeableness because Stability has a negative relationship with Neuroticism and a positive relationship with Agreeableness (minus times plus = minus). I shared my initial results that showed a positive correlation which contradicts his model (see also our published results by Anusic et al., 2009, that showed problems with the Stability factor).
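The “minus times plus = minus” argument follows from path tracing in a standardized model: the correlation implied by a single chain of paths is the product of the coefficients along the chain. The sketch below uses made-up loadings, because the figure does not report estimates, and it assumes the picture is read as a complete model with no other connections between the two facets. Only the sign of the result matters here.

```python
import math

# Hypothetical standardized coefficients; the signs follow the text,
# the magnitudes are invented for illustration.
paths = {
    "Stability -> Neuroticism": -0.6,
    "Neuroticism -> Depression": 0.7,
    "Stability -> Agreeableness": 0.5,
    "Agreeableness -> Modesty": 0.6,
}

# Depression and Modesty are connected only through Stability, so the
# implied correlation is the product of all four coefficients.
implied_r = math.prod(paths.values())
print(f"implied correlation between Depression and Modesty: {implied_r:.3f}")
```

Whatever positive magnitudes are chosen for the other three paths, a single negative path makes the implied correlation negative, which is why a positive observed correlation is a problem for this reading of the figure.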

His final response was:

Uli, I think the problem is that the actual structure is too complex to make it easily represented in a single CFA model. The point of the pictures is to show only some important aspects of the actual structure. As long as one acknowledges it’s only part of the structure, I don’t see that as a problem.

To my knowledge he has never attempted to specify his model in more detail to accommodate findings that are inconsistent with this simple model. He also does not seem very eager to explore this question using CFA.

I suppose I could try to create a more complete CFA model, starting from the 10 aspects, which would allow correlations between Enthusiasm and Compassion and between Politeness and Assertiveness, and also would include additional paths from Plasticity and Stability to certain aspects, but even then I’d be wary of claiming it was the complete structure. Whatever might be left out could still easily lead to misfit. It would take a lot of chutzpah to claim that one was confident in understanding all details of the covariance structure of personality traits.

To me this sounds like an excuse for bad fit. The picture gets it right, even if the model does not fit. This is the same argument that was ridiculed by Borsboom’s critique of Costa and McCrae. If models are immune to empirical tests, they are merely figments of researchers’ imagination. To make scientific claims one first needs to pass the first test: show that a model fits the data, and if a simple model does not fit the data, we need to reject the simple model and find a better one. As Revelle pointed out, nowadays EFA software can also show fit indices. What he doesn’t say is that the typical EFA models have bad fit and that there is not much EFA users can do when this is the case. In contrast, CFA can be used to explore the data, find plausible models with good fit, like the one in Figure 2, and then test these models with new data. Call me crazy, but I have the chutzpah and confidence that I can find a well-fitting model for the structure of personality. In fact, I have already done so (Schimmack, 2019), and now I am working on doing the same for the IPIP-300. Stay tuned for the complete results. I hope this post made it clear why it is important to examine this question even for measures that have been used for decades in hundreds of studies.

Post-Script: When a figure says less than zero words

In a further email exchange Colin DeYoung asked me to add the following clarification.


Uli, please add the following quote to your blog post. You are misrepresenting me inasmuch as you are claiming that my theoretical position requires that your model of modesty and depression should show a negative correlation between modesty and depression. This is not true. I would absolutely never predict that, and I think quoting the passage here makes it clear why that is:

“A final note on the hierarchy shown in Fig. 1: It is necessarily an oversimplification at the levels below the Big Five, because personality does not have simple structure (Costa & McCrae, 1992; Hofstee, de Raad, & Goldberg, 1992). Some facets and aspects have associations, not depicted in the figure, with factors in other domains. This is true even between some traits located under different metatraits, which could not be related if the diagram in Fig. 1 were complete. For example, Compassion is positively related to Enthusiasm, and Politeness is negatively related to Assertiveness (DeYoung et al., 2007, 2013).”


Happy to add this to the blog post, but I do have to ask.  Is there any finding that you would take seriously to revise your model or is this model basically unfalsifiable? 

After all, I also fitted a model without higher-order factors and aspects to the 30 facets.   It would be really interesting to do a multi-method study with facet-factors as starting point, but I don’t know a study that did that or any data to do it. 


Thanks, Uli. Please do add that text to your blog post as my explanation of the figure.

As that text points out, what you’re calling my “model” is in fact just a summary of various empirical results. It is not, and has never been, intended as a formal CFA model.

[Explanation: The figure uses the symbolic language of causal modeling that links factors (circles) to other factors (circles) with arrows pointing from one factor to another (implying a causal effect or at least a representation of shared variance among factors that are related to a common higher-order factor). It is not clear what this figure could tell readers unless we believe that factors are real and at some point explain a pattern of observed correlations. To say that the model is not a CFA model is to say that the model makes no empirical predictions and that factors like Stability or Plasticity only exist as constructs in Colin’s imagination. Not sure why we should print such imaginary models in a scientific article.]

Free Expert Dating Advice

Psychologists have studied dating (also sometimes called mating by evolutionary psychologists) for 100 years (more or less). We are therefore able to give young, inexperienced novices expert advice. This advice is particularly important for young men because the human mating ritual in many cultures still puts them in the position of the actor who has to initiate a complex mating ritual (Elaine on Seinfeld: “We mostly play defense”). The leading experts from elite universities like Harvard are willing to share their knowledge, but these personalized courses are not yet available, and probably not free.

Fortunately, I am able to provide free expert advice with a brief instructional video that illustrates all the things you should NOT do on a first date. Just do the opposite and you will be fine. Please add further suggestions in the comment section. Advice from classy women is especially welcome.

Clip from Girlfriends streaming on Netflix

Attacking Unhelpful Psychometricians

Introductory Rant

The ideal model of science is that scientists are well-paid with job security to work collaboratively towards progress in understanding the world. In reality, scientists operate like monarchs in the old days or company CEOs in the modern world. They try to expand their influence as much as possible. This capitalistic model of science could work if there was a market that rewards CEOs of good companies for producing better products at a cheaper price. However, even in the real world, markets are never perfect. In science, there is no market and success is driven by many factors that have nothing to do with the quality of the product.

The products of empirical scientists often contain a valuable novel contribution, even if the overall product is of low quality. The reason is that empirical psychologists often collect new data. Even this contribution can be useless when the data are not trustworthy, as the replication crisis in social psychology has shown. However, often data are interesting and when shared can benefit other researchers. Scientists who work in non-empirical fields (e.g., mathematicians, philosophers, statisticians) do not have that advantage. Their products are entirely based on their cognitive abilities. Evidently, it is much easier to find some new data than to come up with a novel idea. This creates a problem for non-empirical scientists because it is a lot harder to come up with an empire-expanding novel idea. This can be seen by the fact that the most famous philosophers are still Plato and Aristotle and not some modern philosopher. It can also be seen by the fact that it is hard for psychometricians to compete with empirical researchers for attention and jobs. Many psychology departments have stopped hiring psychometricians because empirical researchers often add more to the university rankings. Case in point, my own university, including all three campuses, is one of the largest departments in the world and does not have a formally trained psychometrician. Thus, my criticism of psychometricians should not be seen as a personal attack. Their unhelpful behaviors can be attributed to a reward structure that rewards unhelpful behaviors, just like Skinner would have predicted on the basis of their reward schedule.

Measurement Models without Substance

A key problem for psychometricians is that they are not rewarded for helping empirical psychologists who work on substantive questions. Rather, they have to make contributions to the field of psychometrics. To have a big impact, it is therefore advantageous to develop methods that can be used by many researchers who work on different research questions. This is like ready-to-wear clothing. The empirical researcher just needs to pick a model and plug the data into the model and the truth comes out at the other end. Many readers will realize that ready-to-wear clothing has its problems. Mainly, it may not fit your body. Similarly, a ready-to-use statistical model may not fit a research question, but users of statistical models who are not trained in statistics may not realize this and psychometricians have no interest in telling them that their model is not appropriate. As a result, we see many articles that uncritically use statistical models that are applied to the wrong data. To avoid this problem, psychometricians would have to work with empirical researchers like tailors who create custom-fitted clothing. This would produce high-quality work, but not the market influence and rewards that ready-to-wear companies can make.

Don’t take my word for it. The most successful contemporary psychometrician said so himself.

“The founding fathers of the Psychometric Society—scholars such as Thurstone, Thorndike, Guilford, and Kelley—were substantive psychologists as much as they were psychometricians. Contemporary psychometricians do not always display a comparable interest with respect to the substantive field that lends them their credibility. It is perhaps worthwhile to emphasize that, even though psychometrics has benefited greatly from the input of mathematicians, psychometrics is not a pure mathematical discipline but an applied one. If one strips the application from an applied science one is not left with very much that is interesting; and psychometrics without the “psycho” is not, in my view, an overly exciting discipline. It is therefore essential that a psychometrician keeps up to date with the developments in one or more subdisciplines of psychology.” (Borsboom, 2006)

Borsboom has carefully avoided his own advice and became a rock-star for his claims that the founding people of psychometrics were all delusional because they actually believed in substances that could be measured (traits) and developed methods to measure intelligence, personality, or attitudes. Borsboom declared that personality does not exist and the tools that are used to claim they exist like factor analysis are false, and the way researchers present evidence for the existence of psychological substances outlined by two more founding psychometricians (Cronbach & Meehl, 1955) was false. Few psychometricians who gave him an award realized that his Attack of the Psychometricians (Borsboom, 2006) was really an attack of one ego-maniac psychometrician on the entire field. Despite Borsboom’s fame as measured by citations, his attack is largely ignored by substantive researchers who couldn’t care less about somebody who claims their topic of study is just a figment of imagination without any understanding of the substantive area that is being attacked.

A greater problem is psychometricians who market statistical tools that applied researchers actually use without understanding them. And that is what this blog post is really about. So, end of ranting and on to showing how psychometrics without substance can lead to horribly wrong results.

Michael Eid’s Truth Factor

Psychometrics is about measurement and psychological measurement is not different from measurement in other disciplines. First, researchers assume that the world we live in (reality) can be described and understood with models of the world. For example, we assume that there is something real that makes us sometimes sweat, sometimes wear just a t-shirt, and sometimes wear a thick coat. We call this something temperature. Then we set out to develop instruments to measure variation in this construct. We call these instruments thermometers. The challenging step in the development of thermometers is to demonstrate that they measure temperature and that they are good measures of temperature. This step is called validation of a measure. A valid measure measures what it is supposed to measure and nothing else. The natural sciences have made great progress by developing better and better measures of constructs we all take for granted in everyday life like temperature, length, weight, time, etc. (Cronbach & Meehl, 1955). To make progress, psychology would also need to develop better and better measures of psychological constructs such as cognitive abilities, emotions, personality traits, attitudes, and so on.

The basic statistical tool that psychometricians developed to examine validity of psychological measures is factor analysis. Although factor analysis has developed and has become increasingly easy and cheap with the advent of powerful personal computers, the basic idea of factor analysis has remained the same. Factor analysis relates observed measures to unobserved variables that are called factors and estimates the strength of the relationship between the observed variable and the unobserved variable to provide information about the variance in a measure that is explained by a factor. Variance explained by the factor is valid variance if the factor represents the construct that a researcher wanted to measure. Variance that is not explained by a factor represents measurement error. The key problem for substantive researchers is that a factor may not correspond to the construct that they were trying to measure. As a result, even if a factor explains a lot of the variance in a measure, the measure could be a poor measure of a construct. As a result, the key problem for validation research is to justify the claim that a factor measures what it is assumed to measure.

Welcome to Michael Eid’s genius short-cut to the most fundamental challenge in psychometrics. Rather than conducting substantive research to justify the interpretation of a factor, researchers simply declare one measure as a valid measure of a construct. You may think, surely, I am pulling wool over your eyes and nobody could argue that we can validate measures by declaring them to be valid. So, let me provide evidence for my claim. I start with Eid, Geiser, Koch, and Heene’s (2017) article that is built on the empire-expanding claim that all previous applications of another empire-expanding model, called the bi-factor model, are false and that researchers need to use the authors’ model. This article is flagged as highly cited in Web of Science, showing that this claim has struck fear in applied researchers who were using the bi-factor model.

One problem for applied researchers is that psychometricians are trained in mathematics and use mathematical language in their articles which makes it impossible for applied researchers to understand what they are saying. For example, it would take me a long time to understand what this formula in Eid et al.’s article tries to say.

Fortunately, psychometricians have also developed a simpler language to communicate about their models that uses figures with just four elements that are easy to understand. Boxes represent measured variables where we have actual scores of people in a sample. Circles are unobserved variables where we do not have scores of individuals. Straight and directed arrows imply a causal effect. The key goal of a measurement model is to estimate parameters that show how strong these causal effects are. Finally, there are also curved and undirected paths that reflect a correlation between two variables without assuming causality. This simple language makes it possible for applied researchers to think about the statistical model that they are using to examine validity of their measures. Eid et al.’s Figure 1 shows the bi-factor model they criticize with an example of several cognitive tasks that were developed to measure general intelligence. In this model, general intelligence is an unobserved variable (g). Nothing in the bi-factor model tells us whether this factor really measures intelligence. So, we can ignore this hot-button issue and focus on the question that the bi-factor model actually can answer: Are the tasks that were developed to measure the g-factor good measures of the g-factor? To be a good measure, a measure has to be strongly related to the g-factor. Thus, the key information that applied researchers care about are the parameter estimates for the directed paths from the g-factor to the 9 observed variables. Annoyingly, psychometricians use Greek letters to refer to these parameters. An English term is factor loadings and we could just use L for loading to refer to these parameters, but psychometricians feel more like scientists when they use the Greek letter lambda.

But how can we estimate the strength of the effect of an unobserved variable on an observed variable? This sounds like magic or witchcraft and some people have argued that factor analysis is fundamentally flawed and produces illusory causal effects of imaginary substances. In reality, factor analysis is based on the simple fact that causal processes produce correlations. If there are really people who are better at cognitive tasks, they will do better on different tasks, just like athletic people are likely to do better in several different sports. Thus, a common cause will produce correlations between two effects. You may remember this from PSY100 where this is introduced as the third-variable problem. The correlation between height and hair-length (churches and murder rates, etc.) does not reveal a causal effect of height on hair-length or vice versa. Rather, it is produced by a common cause. In this case, gender explains most of the correlation between height and hair-length because men tend to be taller and tend to have shorter hair, producing a negative correlation. Measurement models use the relationship between correlation and causation to infer the strength of common causes on the basis of the strength of correlations among the observed variables. To do so, they assume that there are no direct causal effects of one measure on another. That is, the fact that we measured your temperature under your armpits before we measured it in your ear and mouth does not produce correlations among the three measures of temperature. This assumption is represented in the Figure by the fact that there are no direct relationships among the observed variables. The correlations merely reflect common causes and when three measures of temperature are strongly correlated, it suggests that they are all measuring the same common cause.
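The height and hair-length example is easy to simulate. The sketch below uses invented means and standard deviations (in centimeters) and lets gender be the only thing that height and hair length have in common; nothing here is real data. The correlation is clearly negative in the full sample and close to zero within each gender, which is the signature of a common cause.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_people(n=4000, seed=7):
    """Gender influences height and hair length; they do not influence each other."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        male = rng.random() < 0.5
        height = rng.gauss(178 if male else 165, 7)  # hypothetical values in cm
        hair = rng.gauss(8 if male else 30, 8)       # hypothetical values in cm
        people.append((male, height, hair))
    return people


people = simulate_people()
overall_r = pearson([p[1] for p in people], [p[2] for p in people])
men = [p for p in people if p[0]]
within_men_r = pearson([p[1] for p in men], [p[2] for p in men])
print(f"overall r = {overall_r:.2f}, r among men only = {within_men_r:.2f}")
```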

A simple model of g might assume that performance on a cognitive measure is influenced by only two causes. One is the general ability (g) that is represented by the directed arrow from g to the variable that represents variation in a specific task and another due to factors that are unique to this measure (e.g., some people are better at verbal tasks than others). This variance that is unique to a variable is often omitted from figures, but is part of the model in Figure 1.

The problem with this model is that it often does not fit the data. Cognitive performance does not have a simple structure. This means that some measures are more strongly correlated than a model with a single g-factor predicts. Bi-factor models model these additional relationships among measures with additional factors. They are called S1, S2, and S3 (thank god, they didn’t call them sigma or some other Greek name) and S stands for specific. So, the model implies that participants’ scores on a specific measure are caused by three factors: the general factor (g), one of the three specific factors (S1, S2, or S3), and a factor that is unique to a specific measure. The model in Figure 1 is simplistic and may still not fit the data. For example, it is possible that some measures that are mainly influenced by S2 are also influenced a bit by S1 and S3. However, these modifications are not relevant for our discussion, and we can simply assume that the model in Figure 1 fits the data reasonably well.
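To make the logic of the bi-factor model concrete, here is a minimal sketch of the correlations it implies. It assumes standardized variables, factors that are uncorrelated with each other, and made-up loadings (.6 on g and .4 on the specific factor for all nine measures); these are not estimates from Eid et al.’s example. Two measures that share only g correlate at the product of their g loadings, and two measures that also share a specific factor correlate more strongly.

```python
# Hypothetical standardized loadings for 9 measures, 3 per specific factor.
g_loadings = [0.6] * 9
s_loadings = [0.4] * 9
s_factor = [1, 1, 1, 2, 2, 2, 3, 3, 3]  # which specific factor each measure loads on


def implied_correlation(i, j):
    """Model-implied correlation between measures i and j (orthogonal factors)."""
    if i == j:
        return 1.0
    r = g_loadings[i] * g_loadings[j]        # shared variance due to g
    if s_factor[i] == s_factor[j]:
        r += s_loadings[i] * s_loadings[j]   # extra shared variance due to S
    return r


def unique_variance(i):
    """Variance of measure i that is explained by neither g nor its specific factor."""
    return 1.0 - g_loadings[i] ** 2 - s_loadings[i] ** 2


print(f"same specific factor:      r = {implied_correlation(0, 1):.2f}")
print(f"different specific factor: r = {implied_correlation(0, 3):.2f}")
print(f"unique variance of measure 1: {unique_variance(0):.2f}")
```

A model with only a g-factor would predict the same correlation for every pair of measures with these loadings, which is exactly why it fails when some measures are more strongly correlated than others.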

From a substantive perspective, it seems plausible that two cognitive measures could be influenced by a general factor (e.g., some students do better in all classes than others) and some specific factors (e.g., some students do better in science subjects). So, while the bi-factor model is not automatically the correct model, it would seem strange to reject it a priori as a plausible model. Yet, this is exactly what Eid et al. (2017) are doing based on some statistical discussion that I honestly cannot follow. All I can say is that, from a substantive point of view, a bi-factor model is a reasonable specification of the assumption that cognitive performance can be influenced by general and specific factors, and that this model predicts stronger correlations among measures that tap the same specific abilities than among measures that share only the general factor as a common cause.

After Eid et al. convinced themselves, reviewers, and an editor at a prestigious journal that their statistical reasoning was sound, they proposed a new way of modeling correlations among cognitive performance measures. They call it the Bifactor-(S-1) model. The key difference between this model and the bi-factor model is that the authors remove one of the specific factors from the model; hence, S – 1.

You might say, but what if there is specific variance that contributes to performance on these tasks? If these specific factors exist, they would produce stronger correlations between measures that are influenced by these specific factors, and a model without this factor would not fit the data (as well as the model that includes a specific factor that actually exists). Evidently, we cannot simply remove factors willy-nilly without misrepresenting the data. To solve this problem, the bi-factor (S-1) model introduces new parameters that help the model to fit the data as well as or better than the bi-factor model.

Figure 4 in Eid et al.’s article makes it possible for readers who are not statisticians to see the difference between the models. First, we see that the S1 factor has been removed. Second, we see that the meaningful factor names (g = general and s = specific) have been replaced by obscure Greek letters where it is not clear what these factors are supposed to represent. The Greek letter tau (I had to look this up) stands for T = true score. Now true score is not a substantive entity. It is just a misleading name for a statistical construct that was created for a measurement theory that is called classic, meaning outdated. So, the bi-factor (S-1) model no longer claims to measure anything in the real world. There is no g-factor that is based on the assumption that some people will perform better on all cognitive tasks that were developed to measure this common factor. There are also no longer specific factors because specific factors are only defined when we first attribute performance to a general factor and see that other factors also have a common effect on subsets of measures. In short, the model is not a substantive model that aims to measure anything. It is like creating thermometers without assuming that temperature exists. When I discussed this with Michael Eid years ago, he defended this approach to measurement without constructs with a social-constructionist philosophy. The basic idea is that there is no reality and that constructs and measures are social creations that do not require validation. Accordingly, the true score factor measures what a researcher wants to measure. We can simply pick two or three correlated measures and the construct becomes whatever produces variation in these three measures. Other researchers can pick other measures and the factors that produce variation in these measures are the construct. This approach to measurement is called operationalism. Accordingly, constructs are defined by measures and intelligence is whatever some researcher chooses to measure and call intelligence. Operationalism was rejected by Cronbach and Meehl (1955), which led to the development of measurement models that can be used to examine whether a measure actually measures what it is intended to measure. The bifactor (S-1) model avoids this problem by letting researchers choose measures that define a construct without examining what produces variation in these measures.

“One way to define a G factor in a single-level random experiment is to take one domain as a reference domain. Without loss of generality, we may choose the first domain (k = 1) as reference domain and take the first indicator of this domain (i = 1) as a reference indicator. This choice of the reference domain and indicator depends on a researcher’s theory and goals” (Eid et al., 2017, p. 550).

While the authors are transparent about the arbitrary nature of true scores – what is true variance depends on researchers’ choice of which specific factors to remove – they fail to point out that this model cannot be used to test the validity of measures because there is no longer a claim that factors correspond to real-world objects. Now both the measures and the constructs are constructed and we are just playing around with numbers and models without testing any theoretical claims.

Assuming the bi-factor model fits the data, it is easy to explain what the factors in the bi-factor (S-1) model are and why it fits the data. Because the model removed the S1 factor, the true-score factor now represents the g-factor and the S1 factor. The g+S1 factor still predicts variance in the S2 and S3 measures because of the g-variance in the g+S1 factor. However, because the S1-variance in the g+S1 factor is not related to the S2 and S3 measures, the g+S1 factor explains less variance in the S2 and S3 measures than the g-factor in the bi-factor model. The specific factors in the bi-factor (S-1) model with the Greek symbol zeta (ζ) now predict more variance in the S2 and S3 measures because they not only represent the specific variance, but also some of the general factor variance that is not removed by using the contaminated g+S1 factor to account for shared variance among all measures. Finally, because the zeta factors now contain some g-variance that is shared between S2 and S3 measures, the two zeta factors are correlated. Thus, g-variance is split into g-variance in the g+S1 factor and g-variance that is common to the zeta factors.
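
The argument can be checked with a bit of arithmetic. The sketch below assumes that a standard bi-factor model is true, that all variables are standardized, and that every measure has the same hypothetical loadings on g (.7) and on its own specific factor (.5). These numbers and variable names are illustrative assumptions, not estimates from Eid et al.’s data.

```python
import math

# Hypothetical standardized bi-factor loadings (assumed equal for all measures).
g_load = 0.7   # loading of every measure on g
s_load = 0.5   # loading of every measure on its own specific factor

# The reference factor is what the S1 measures have in common: g plus S1.
# Dividing by this standard deviation gives the g+S1 factor unit variance.
ref_sd = math.sqrt(g_load**2 + s_load**2)

# Correlation between the g+S1 factor and the pure g-factor.
corr_ref_g = g_load / ref_sd

# Loading of an S2 or S3 measure on the g+S1 factor.
# Only the g part of the reference factor is shared with these measures.
load_on_ref = g_load * corr_ref_g

# Covariance between an S2 measure and an S3 measure under the bi-factor model.
cov_s2_s3 = g_load * g_load

# Covariance left over after the g+S1 factor has explained what it can.
# This leftover g-variance is what makes the two zeta factors correlate.
leftover = cov_s2_s3 - load_on_ref**2

print(f"loading on g: {g_load:.3f}, loading on g+S1: {load_on_ref:.3f}")
print(f"unexplained S2-S3 covariance: {leftover:.3f}")
```

Under these assumptions the loading on the reference factor is smaller than the loading on g, and the leftover covariance is positive, which is exactly the pattern described above.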

Eid et al. might object that I assume the g-factor is real and that this may not be the case. However, this is a substantive question and the choice between the bi-factor model and the bi-factor (S-1) model has to be based on broader theoretical considerations and eventually empirical tests of the competing models. To do so, Eid et al. would have to explain why the two zeta-factors are correlated, which implies an additional common cause for S2 and S3 measures. Thus, the empirical question is whether it is plausible to assume that in addition to a general factor that is common to all measures, S2 and S3 have another common cause that is not shared by S1 measures. The key problem is that Eid et al. are not even proposing a substantive alternative theory. Instead, they argue that there are no substantive questions and that researchers can pick any model they want if it serves their goals. “This choice of the reference domain and indicator depends on a researcher’s theory and goals” (p. 550).

If researchers can just pick and choose models, it is not clear why they could not just pick the standard bi-factor model. After all, the bi-factor (S-1) model is just an arbitrary choice to define the general factor in terms of items without a specific factor. What is wrong with choosing to allow all measures to be influenced by specific factors, as in the standard bi-factor model? Eid et al. (2017) claim that this model has several problems. The first claim is that the bi-factor model often produces anomalous results that are not consistent with the a priori theory. However, this is a feature of modeling, not a bug. What are the chances that a priori theories always fit the data? The whole point of science is to discover new things and new things often contradict our a priori notions. However, psychologists seem to be averse to discovery and have created the illusion that they are clairvoyant and never make mistakes. This narcissistic delusion has impeded progress in psychology. Rather than recognizing that anomalies reveal problems with the a priori theory, they blame the method for these results. This is a stupid criticism of models because it is always possible to modify a model and find a model that fits the data. The real challenge in modeling is that often several models fit the data. Bad fit is never a problem of the method. It is a problem of model misspecification. As I showed, proper exploration of data can produce well-fitting and meaningful models with a g-factor (Schimmack, 2022). This does not mean that the g-factor corresponds to anything real, nor does it mean that it should be called intelligence. However, it is silly to argue that we should prefer models without a general factor and simply pick some measures to create constructs that do not even aim to measure anything real.

Another criticism of standard bi-factor models is that the loadings (i.e., the effect sizes of the general factor on measures) are unstable. “That means, for example, that the G-factor of intelligence should stay the same (i.e., “general”) when one takes out four of 10 domains of intelligence” (p. 546). Eid et al. point out that this is not always the case.

“Reise (2012), however, found that the G factor loadings can change when domains are removed. This causes some conceptual problems, as it means that G factors as measured in the bifactor and related models are not generally invariant across different sets of domains used to measure them. This can cause problems, for example, in literature reviews or meta-analyses that summarize data from different studies or in so-called conceptual replications in which different domains were used to measure a given G factor, because the G factors may not be comparable across studies.” (p. 546).

This is nonsense. First of all, the problem that results are not comparable across studies is much greater when researchers just start arbitrarily selecting sets of measures as indicators of the general+S factor because the g+S1, g+S2, g+S3 factors are conceptually different. All real sciences have benefited from unification and standardization of measurement by selecting the best measures. In contrast, only psychologists think we are making progress by developing more and more measures. The use of bi-factor (S-1) models makes it impossible to compare measures because they are all valid measures of researchers’ pet constructs. Thus, use of this model will further impede progress in psychological measurement.

Eid et al. (2017) also exaggerate the extent to which results depend on the choice of measures in the bi-factor model. The more measures are highly correlated and reflect the full range of measures, the more results will be stable and comparable. Moreover, the only reason for notable changes in loadings would be mismeasurement of the general factor because some specific factors were not properly modeled. To support my claim, I used the data from Brunner et al. (2012) who fitted a bi-factor model to 14 measures of g. I randomly split the 14 measures into two sets of 7 and fitted a model with two g-factors and let the two factors correlate. The magnitude of this correlation shows how much inferences about g would depend on the arbitrary selection of measures. The correlation was r = .96 with a 95% CI ranging from .94 to .98. While number-nerds might get a hard-on because they can now claim that results are unstable, p < .05, applied researchers might shrug and think that this correlation is good enough to think they measured the same thing and it is ok to combine results in a meta-analysis.

In sum, the criticism of bi-factor models is all smoke and mirrors to advertise another way of modeling data and to grab market share from the popular bi-factor model that took away market share from hierarchical models. All of this is just a competition among psychometricians to get attention that doesn’t advance actual psychological research. The real psychometric advances are made by psychometricians who created statistical tools for applied researchers like Jöreskog, Bentler, and Muthén and Muthén. These tools and substantive theory are all that applied researchers need. The idea that statistical considerations can constrain the choice of models is misleading and often leads to suboptimal and wrong models.

Readers might be a bit skeptical that somebody who doesn’t know the Greek alphabet and doesn’t understand some of the statistical arguments is able to criticize trained psychometricians. After all, they are experts and surely must know better what they are doing. This argument ignores the systemic factors that make them do things that are not in the best interest of science. Making a truly novel and useful contribution to psychometrics is hard and many well-meaning attempts will fail. To make my point, I present Eid et al.’s illustration of their model with a study of emotions. Now, I may not be a master psychometrician, but nobody can say that I lack expertise in the substantive area of emotion research and in attempts to measure emotions. My dissertation in 1997 was about this topic. So, what did Eid et al. (2017) find when they created a bi-factor (S-1) measurement model of emotions?

Eid et al. (2017) examined the correlations among self-reports of 9 specific negative emotions. To fit their model to the data, they used the Anger domain as the reference domain. Not surprisingly, anger, fury and rage had high loadings on the true score factor (falsely called the g-factor) and the other negative emotions had low loadings on this factor. This result makes no sense and is inconsistent with all established models of negative affect. All we really learn from this model is that a factor that is mostly defined by anger also explains a small amount of variance in sadness and self-conscious negative emotions. Moreover, this result is arbitrary and any one of the other emotions could have been used to model the misnamed g-factor. As a result, there is nothing general about the g-factor. It is a specific factor by definition. “The G factor in this model represents anger intensity” (p. 553). But why would we call a specific emotion factor a general factor? This makes no theoretical sense. As a result, this model does not specify any meaningful theory of emotions.

A proper bi-factor or hierarchical model would test the substantive theory that some emotions covary because they share a common feature. The most basic feature of emotions is assumed to be valence. Based on this theory, emotions with the same valence are more likely to co-occur, which results in positive correlations among emotions of the same valence. Hundreds of studies have confirmed this prediction. In addition, emotions also share specific features such as appraisals and action tendencies. Emotions that also share these components are more likely to co-occur than emotions with different or even opposing appraisals. For example, pride and gratitude are based on opposing appraisals of attribution to self or others. A measurement model of emotions might represent these assumptions in a model with one or two general factors for valence (the dimensionality of valence is still debated) and several specific factors. In this model, the general factor has a clear meaning and represents the valence of an emotion. Fitting such a model to the data is useful to test the theory. Maybe the results confirm the model, maybe they don’t. Either way, we learn something about human emotions. But if we fit a model that does not include a factor that represents valence and misleadingly label an anger-factor a general factor, we learn nothing, except that we should not trust psychometricians who build models without substantive expertise. Sadly, Eid actually made good contributions to emotion research in the 1990s that identified broad general factors of affect, which he appears to have forgotten. Accordingly, he would have modeled affect along three general dimensions (Steyer, Schwenkmezger, Notz, & Eid, 1997).

Concluding Rant

In conclusion, the main point of this blog post is that psychometricians benefit from developing ready-to-use, plug-and-play models that applied researchers can use without thinking about the model. The problem is that measurement requires understanding of the object that is being measured. Thermometers do not measure time and clocks are not good measures of weight. As a result, good measurement requires substantive knowledge and custom models that are fitted to the measurement problem at hand. Moreover, measurement models have to be embedded in a broader model that specifies theoretical assumptions that can be empirically tested (i.e., Cronbach & Meehl’s, 1955, nomological network). The bi-factor (S-1) model is unhelpful because it avoids falsification by letting researchers define constructs in terms of an arbitrary set of items. This may be useful for scientists who want to publish in a culture that values confirmation (bias), but it is not useful for scientists who want to explore the human mind and need valid measures to do so. For these researchers, I recommend learning structural equation modeling from some of the greatest psychometricians who helped researchers like me to test substantive theories, such as Jöreskog, Bentler, and now Muthén and Muthén. They provide the tools; you need to provide the theory and the data and be willing to listen to the data when your model does not fit. I learned a lot.

A Tutorial on Hierarchical Factor Analysis


Psychology lacks solid foundations. Even basic methodological issues are contentious. In this tutorial, I revisit Brunner et al.’s (2012) tutorial on hierarchical factor analysis. The main difference between the two tutorials is the focus on confirmation versus exploration. I show how researchers can use hierarchical factor analysis to explore data. I show that exploratory HFA produces a plausible better fitting model than Brunner et al.’s confirmatory HFA. I also show that it is not possible to use statistical fit to compare hierarchical models to bi-factor models. To do so, I show that my hierarchical model fits the data better than their bi-factor model. Instead, the choice between hierarchical models and bi-factor models is a theoretical question, not a statistical question. I hope that this tutorial will help researchers to realize the potential of exploratory structural equation modeling to uncover patterns in their data that are not predicted a priori.


About a decade ago, Brunner, Nagy, and Wilhelm (2012) published an informative article about the use of Confirmatory Factor Analysis to analyze personality data, using a correlation table of performance scores on 14 cognitive ability tasks as an example.

They discussed four models, but my focus is on the modeling of these data with hierarchical CFA or hierarchical factor analysis. It is not necessary to include confirmatory in the name because only CFA can be used to model hierarchical factor structures. EFA by definition has only one layer of factors. Sometimes researchers conduct hierarchical analysis by using correlations among weighted sum scores (factors) as indicators of lower levels in a hierarchy. However, this is a suboptimal approach to test hierarchical structures. The term confirmatory has also been shown to be misleading because many researchers believe CFA can only be used to test a fully specified theoretical model and any post-hoc modifications are not allowed and akin to cheating. This has stifled the use of CFA because theories are often not sufficiently specified to predict the pattern of correlations well enough to achieve good fit. It is also not possible to use an EFA for exploration and CFA for confirmation because EFA cannot reveal hierarchical structures. So, if hierarchical structures are present, EFA will produce the wrong model and CFA will not fit. Maybe the best term would be hierarchical structural equation modeling, but hierarchical factor analysis is a reasonable term.

One of the advantages of CFA over EFA is that CFA makes it possible to create hierarchical models and to provide evidence for the fit of a model to data. CFA also makes it possible to modify models and to test alternative models, while EFA solutions are determined by some arbitrary mathematical rules. In short, any researcher interested in testing hierarchical structures in correlational data should use hierarchical factor analysis.

Hierarchical models are needed when a single layer of factors is unable to explain the pattern of correlations. It is easy to test the presence of hierarchies in a dataset by fitting a model with a single layer of independent factors to the data. Typically, these models do not fit the data. For example, Big Five researchers abandoned CFA because this model never fit actual personality data. Brunner et al. also show that a model with a single factor (i.e., Spearman’s general intelligence factor) did not fit the data, although all 14 variables show strong positive correlations with each other. This suggests that there is a general factor (suggests does not equal proves) and it suggests that there are additional relationships among some variables that are not explained by the general factor. For example, vocabulary and similarities are much more strongly correlated, r = .755, than vocabulary and digit span, r = .555. The aim of hierarchical factor analysis is to model the specific relationships among subsets of variables.

Brunner et al. present a single hierarchical model with one general factor and four specific factors.

Their Table 2 shows the fit of this model in comparison to three other, non-hierarchical models.

The results show that the hierarchical model fits the data better than a model with a single g-factor, RMSEA = .071 vs. .132 (lower values are better). The results also show that the model does not fit the data as well as the first-order factor model. The difference between these models is that the hierarchical model assumes that the correlations among the four specific (first-order) factors can be explained by a single higher-order factor, that is, the g-factor. The reduction in fit can be explained by the fact that a single factor is not sufficient to explain this pattern of correlations. The higher-order model uses four parameters (the ‘causal effects’ of the g-factor on the specific factors) to predict the six correlations among the four specific factors. This model is simpler, as one can see by the comparison of degrees of freedom (73 vs. 71). It would be wrong to conclude from this model comparison that there is no general factor and to reject the hierarchical model based on this statistical comparison. The reason is that it is possible to use the extra degrees of freedom to improve model fit. If two parameters are added to the hierarchical model, fit will be identical to the first-order factor model. There are many ways to improve fit and the choice should be driven by theoretical considerations. For example, one might not want to include negative relationships to fit a model of only positive correlations, although this is by no means a general rule. Modification indices suggested additional positive correlations between the PO and PS factor and the PO and WM factor. Adding these parameters reduced the degrees of freedom by two and produced the same model fit as the first-order factor model. Thus, it does not seem to be possible to reduce the six correlations among the four first-order factors to a model with fewer than six parameters. 
Omitting these additional relationships for the sake of a simple hierarchical structure is problematic because the model no longer reflects the pattern in the data. In short, it is always possible to fit the data with a hierarchical model that fits the data as well as a first-order model. Only the assumption that a single-factor accounts for the pattern of correlations among the first-order factors is restrictive and can be falsified. Just like the single factor model, the single-higher-order model will often not fit the data.
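
The degrees of freedom in this comparison can be reproduced by counting. The sketch below is my own bookkeeping, not output from any SEM program. It assumes that each of the 14 variables loads on exactly one first-order factor and that factors are scaled by fixing their variances (marker-variable scaling gives the same counts).

```python
def degrees_of_freedom(n_vars, n_free_params):
    """Unique variances and covariances minus freely estimated parameters."""
    n_moments = n_vars * (n_vars + 1) // 2
    return n_moments - n_free_params

n_vars = 14          # cognitive performance measures
n_first_order = 4    # specific (first-order) factors

loadings = n_vars    # one loading per measure
residuals = n_vars   # one unique variance per measure

# First-order model: all pairs of factors are allowed to correlate.
factor_correlations = n_first_order * (n_first_order - 1) // 2
df_first_order = degrees_of_freedom(
    n_vars, loadings + residuals + factor_correlations
)

# Hierarchical model: four loadings on g replace the six correlations.
df_hierarchical = degrees_of_freedom(
    n_vars, loadings + residuals + n_first_order
)

print(factor_correlations, df_first_order, df_hierarchical)
```

The two degrees of freedom that separate the models are the two parameters that can be added to the hierarchical model to match the fit of the first-order model.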

A more important comparison is the comparison of the hierarchical model with the nested factor model which is more often called the bi-factor model. Bi-factor models have become very popular among users of CFA and Brunner et al.’s tutorial may have contributed to this trend. The bi-factor model is shown in the next figure.

Theoretically, there is not much of a difference between hierarchical models and bi-factor models. Both models assume that variance in an observed variable can be separated into three components. One component reflects variance that is explained by the general factor and leads to correlations with all other variables. One component reflects variance that is explained by a specific factor that is only shared with a subset of variables. And the third component is unique variance that is not shared with any other variable, often called uniqueness, disturbance, residual variance, or erroneously error variance. The term error variance can be misleading because unique variance may be meaningful variance that is just not shared with other variables. Brunner et al.’s tutorial nicely shows the specific variance that is only shared among some items in the hierarchical model. This variance is often omitted from figures because it is the residual variance in first-order factors that is not explained by the higher-order factor. For example, in the figure of the hierarchical model, we see that variance in the specific factor VD is separated into variance explained by the g-factor and residual variance that is not explained by the g-factor. This residual variance is itself a factor (an unobserved variable) and this factor conceptually corresponds to the specific factors in the bi-factor model. Thus, conceptually the two models are very similar and may be considered interchangeable representations. Yet, the bi-factor model fits the data much better than the hierarchical model, RMSEA = .060 vs. .071. Moreover, the bi-factor model meets (barely) the standard criterion of acceptable model fit for the RMSEA criterion (.06). Thus, readers may be forgiven if they think the bi-factor model is a better model of the data.
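
The three variance components can be made concrete with a small calculation. This is a sketch under the assumption of standardized variables; the function name and the two loadings (.8 and .9) are hypothetical values chosen for illustration, not estimates from Brunner et al.’s data.

```python
def decompose(first_order_loading, g_on_specific):
    """Split the variance of one standardized measure in a hierarchical model."""
    # Variance that flows from g through the first-order factor to the measure.
    general = (first_order_loading * g_on_specific) ** 2
    # Variance due to the part of the first-order factor not explained by g.
    specific = first_order_loading ** 2 * (1 - g_on_specific ** 2)
    # Variance that is not shared with any other measure.
    unique = 1 - first_order_loading ** 2
    return general, specific, unique

# Hypothetical measure: loads .8 on its first-order factor,
# and that factor loads .9 on g.
general, specific, unique = decompose(0.8, 0.9)
print(round(general, 4), round(specific, 4), round(unique, 4))

# Loadings that a bi-factor model would need to imply the same components.
implied_g_loading = general ** 0.5
implied_s_loading = specific ** 0.5
print(round(implied_g_loading, 3), round(implied_s_loading, 3))
```

In the hierarchical model, the ratio of general to specific variance is the same for all measures of the same first-order factor, whereas the bi-factor model estimates both loadings freely for each measure. This is one way to see why the hierarchical model is the more parsimonious of the two.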

Moreover, Brunner et al.’s tutorial suggests that the higher-order model should only be chosen if it fits the data as well as the bi-factor model, which is not the case in this example. Thus, their recommendation leads to the assumption that the better fit of a bi-factor model can be used to decide between these two models. Here I want to show that this recommendation is false and that it is always possible to create a hierarchical model that fits as well as a bi-factor model. The reason is that the poorer fit of the hierarchical model is a cost of its parsimony. That is, it has 73 degrees of freedom compared to 64 degrees of freedom for the bi-factor model. This means that it is possible to add 9 parameters to the hierarchical model to improve fit within a hierarchical structure. Inspection of modification indices suggested that a key problem of the hierarchical model was that it underestimated the relationship of the arithmetic variable with the g-factor. In the hierarchical model, this relationship is mediated by (goes through) working memory and is therefore constrained by the relationship of working memory with the g-factor. We can relax this assumption by allowing for a direct relationship between the arithmetic variable and the g-factor. Just this single additional parameter improved model fit and made the hierarchical model fit the data better than the bi-factor model, chi2(70) = 403, CFI = .980, RMSEA = .059. Of course, in other datasets more modifications may be necessary, but the main point generalizes from this example to other datasets. It is not possible to use model fit to favor bi-factor models over hierarchical models. The choice of one or the other model needs to be based on other criteria.

Brunner et al. (2012) discuss another possible reason to favor bi-factor models. Namely, it is not possible in a hierarchical model to relate a criterion variable to the general factor and to all specific factors. The reason is that this model is not identified. That is, it is not possible to estimate the direct contribution of the general factor because the general factor is already related to the criterion by means of the indirect relationships through the specific factors. While Brunner et al. (2012) suggest that this is a problem, I argue that this is a feature of a hierarchical model and that the effect of the general factor is just mediated by the specific factors. Thus, we do not need an additional direct relationship from the general factor to a criterion. We can simply estimate the effect of the general factor by computing the total indirect effect that is implied by the (a) effect of the general factor on the specific factors and (b) the effect of the specific factors on the criterion. In contrast, the bi-factor model provides no explanation about the (implied) causal effect of the general factor on the criterion. For example, high school grades in English might be related to the g-factor because the g-factor influences verbal intelligence (VC) and verbal intelligence is the more immediate cause of better performance in a language class. Thus, a key advantage of a hierarchical model is that it proposes a causal theory in which the effects of broad personality traits on specific behaviors are mediated by specific traits (e.g. effect of extraversion on drug use is mediated by the sensation seeking facet of Extraversion and not the Assertiveness facet of extraversion).
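
Computing the total indirect effect is simple arithmetic. In the sketch below, the factor labels follow the four specific factors in Brunner et al.’s model, but all path coefficients are invented for illustration, and the criterion is imagined to be a standardized English grade.

```python
# Hypothetical standardized effects of g on the four specific factors.
g_to_specific = {"VC": 0.85, "PO": 0.90, "WM": 0.88, "PS": 0.70}

# Hypothetical standardized effects of the specific factors on the criterion.
specific_to_criterion = {"VC": 0.40, "PO": 0.05, "WM": 0.10, "PS": 0.00}

# Each indirect path is the product of its two coefficients.
indirect = {
    name: g_to_specific[name] * specific_to_criterion[name]
    for name in g_to_specific
}

# The total effect of g on the criterion is the sum over all indirect paths.
total_indirect = sum(indirect.values())

for name, value in indirect.items():
    print(f"g -> {name} -> criterion: {value:.3f}")
print(f"total indirect effect of g: {total_indirect:.3f}")
```

With these made-up numbers, most of the effect of g on the criterion runs through VC, which is the kind of causal story that the hierarchical model makes explicit.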

In contrast, a bi-factor model assumes that the general and the specific factors are independent. One plausible reason to use a bi-factor model would be the modeling of method factors like acquiescence or social desirability (Anusic et al., 2009). A method factor would influence responses to all (or most) items, but it would be false to assume that this effect is mediated by the specific factors that represent actual personality constructs. Instead, the advantage of CFA is that it is possible to separate method and construct (trait) variance to get an unbiased estimate of the effect size for the actual traits. Thus, the choice of a hierarchical model or a bi-factor model should be based on substantive theories about the nature of factors and cannot be made in a theoretical vacuum.

How to Build a Hierarchical Model

There are no tutorials about the building of SEM models because the terminology of confirmatory factor analysis has led to the belief that it is wrong to explore data with CFA. Most of the time, authors may explore the data and then present the final model as if it was created before looking at the data, a practice known as Hypothesizing After the Results are Known (HARKing; Kerr, 1998). Other times, authors will use a cookie-cutter model that they can justify because everybody uses the same model, even when it makes no theoretical sense. The infamous cross-lagged panel model is one example. Here I show how authors can create a plausible model that fits the data. There is nothing wrong with doing so and all mature sciences have benefitted from developing models, testing them, testing competing models against each other, and making progress in the process. The lack of progress in psychology can be explained by avoiding this process and pretending that a single article can move from posing a question to providing the final answer. Here I am going to propose a model for the 14 cognitive performance tests. It might be the wrong model, but at least it fits the data well.

The pattern of correlations in Table 1 shows many strong positive correlations. There is also 100 years of research on the positive manifold (positive correlations) among cognitive performance tasks. Thus, a good starting point is the one-factor model that simply assumes that a general factor contributes to scores on all 14 variables. As we already saw, this model does not fit the data. Modern software makes it possible to look for ways to improve the model by inspecting so-called Modification Indices (MI). As a first step, we want to look for big (juicy) MI that are unlikely to be chance results. MI are chi-square distributed with 1 df, and chi-square values with 1 df are just squared z-scores (scores on the standard normal). Thus, an MI of 25 corresponds to z = 5, which is the criterion used in particle physics to rule out false positives. Surely, this criterion is good enough for psychologists. It is therefore problematic to publish models with MI > 25 because such models ignore some reliable pattern in the data. Of course, MI are influenced by sample size and the effect size may be trivial. However, it is better to include these parameters in the model and to show that they are small and negligible than to present an ill-fitting model that may not fit because it omitted important parameters. Thus, my first recommendation is to inspect modification indices and to use them. This is not cheating. This is good science.
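The link between modification indices and z-scores can be checked with a few lines of arithmetic. The following is a minimal sketch in Python and not part of the original analysis; the function names and all example values other than MI = 25 are assumptions chosen for illustration.

```python
import math


def mi_to_z(mi: float) -> float:
    """Convert a 1-df modification index (a chi-square value) to an absolute z-score."""
    if mi < 0:
        raise ValueError("a chi-square value cannot be negative")
    return math.sqrt(mi)


def two_sided_p(z: float) -> float:
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Illustrative values: 3.84 is the conventional .05 cutoff, 25 is the threshold used in the text.
for mi in (3.84, 10.83, 25):
    z = mi_to_z(mi)
    print(f"MI = {mi:6.2f}  z = {z:.2f}  p = {two_sided_p(z):.2e}")
```

With these definitions, an MI of 25 maps to z = 5, while an MI of 3.84 maps to a z-score of about 1.96, the familiar .05 criterion.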

We want to first look for items that show strong residual correlations. A residual correlation is a correlation between the residual variances of two variables after we remove the variance that is shared with the general factor. It is ok to start with the highest MI. In this case, the highest MI is 273, z = 16.5, which is very, very unlikely to be a chance finding. The suggested effect size for this residual correlation is r = .48. So, this relationship is clearly substantial. To add this finding to a hierarchical model, we do not simply add a correlated residual. Instead, we model this residual correlation as a simple factor in a hierarchical model. Thus, we create the first simple factor in our hierarchical model. Because this simple factor is related to the g-factor (unlike the bi-factor model where it would be independent of the general factor), we can estimate the strength of the factor loadings freely and the model is just identified. Optionally, we could constrain the loadings of the two items to be equal, which can sometimes be helpful in the beginning to stabilize the model. The first specific factor in the model represents the shared variance between Letter-Number-Sequencing and Digit Span. A comparison with Brunner et al.’s model shows that this factor is related to the Working Memory factor in their model. The only difference is that their factor has an additional indicator, Arithmetic. So, we didn’t really do anything wrong by starting totally data-driven in our model.
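To make the idea of a residual correlation concrete, here is a small sketch of the arithmetic for a standardized one-factor model. It is a simplification under stated assumptions: the loadings and the observed correlation are hypothetical values, not estimates from the data discussed here, and SEM software derives its suggested parameter change from the full model rather than from this two-variable formula.

```python
import math


def residual_correlation(r_xy: float, loading_x: float, loading_y: float) -> float:
    """Correlation between two residuals after removing a single general factor.

    Assumes standardized variables, so the factor implies a correlation of
    loading_x * loading_y and leaves residual variances of 1 - loading**2.
    """
    implied = loading_x * loading_y           # correlation explained by the factor
    residual_cov = r_xy - implied             # covariance left unexplained
    residual_sd_x = math.sqrt(1 - loading_x ** 2)
    residual_sd_y = math.sqrt(1 - loading_y ** 2)
    return residual_cov / (residual_sd_x * residual_sd_y)


# Hypothetical example: two tests load .60 on g but correlate .65 with each other.
print(round(residual_correlation(0.65, 0.60, 0.60), 3))
```

If the observed correlation equals the product of the loadings, the residual correlation is zero and there is nothing left for a specific factor to explain.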

The first simple-factor in the model is basically a seed to explore whether other items may also be related to this simple factor. Now that we have a factor to represent the correlation between the Letter-Number-Sequencing and Digit Span variables, we could find MI that suggest loadings of other variables on this factor. If this is not the case, other items may still be highly correlated and can be used to add additional simple factors. We can run the revised model and examine the MI. This revealed a significant MI = 48 for Arithmetic. This is also consistent with Brunner’s model and I added it. However, Arithmetic still had a direct loading on the general factor, which is not consistent with their model. In addition, there was a

It is well known that MI can be misleading. For example, the MI for this model suggested a strong loading of Matrix Reasoning on the Working Memory factor. This doesn’t seem implausible. However, Brunner et al.’s model suggests that Matrix Reasoning is related to another specific factor that is not yet included in the model. Therefore, I first looked for other residual correlations that could be the seed for additional specific factors. Also, MI for additional residual correlations were larger. The largest MI was for the residual correlation between Matrix Reasoning and Information. This is inconsistent with Brunner et al.’s model, which suggested that these items belong to different specific factors. However, to follow the data-driven approach, I used this pair as a seed for a new specific factor. The next big correlated residual was found for Block Design and Comprehension. Once more, this did not correspond to a specific factor in Brunner et al.’s model, but I added it to the model. The next run showed a still sizeable MI (22) for Digit-Symbol-Coding on the Working Memory factor. So, I also added this parameter, but the effect size was small, b = .19. Thus, Brunner et al.’s model is not wrong, but omitting this parameter lowers overall fit. As there were no other notable MI for the Working Memory factor, I looked for the correlated residual with the highest MI as a seed for another specific factor. This was the correlated residual for Digit-Symbol-Coding and Symbol Search. This correlated residual is represented by the PS factor in Brunner et al.’s model. The next high modification index suggested a factor for Vocabulary and Comprehension. This is reflected in the VC factor in Brunner et al.’s model. The MIs of this model suggested a strong loading of Similarities on the VC factor (MI = 297), which is also consistent with Brunner et al.’s model. Another big MI (256) suggested that Information also loads on the VC factor, again consistent with Brunner et al.’s model.
The next inspection of MIs, however, suggested an additional moderate loading of Arithmetic on the VC factor that is not present in Brunner et al.’s model. I added it to the model and model fit improved (CFI = .976 vs. .973). Although the loading was only b = .29, not including it in a model lowers fit. The next round suggested another weak loading of Letter-Number-Sequencing on the PS factor. Adding this parameter further increased model fit (CFI = .978 vs. .976), although the effect size was only b = .23. The next big MI was for the correlated residual between Object-Assembly and Block-Design (MI = 82). This relationship is represented by the PO factor in Brunner et al.’s model.

At this point, the model already had good overall fit, CFI = .982, RMSEA = .055, although three variables were still unrelated to the specific factors, namely Picture Completion, Picture Arrangement, and Matrix Reasoning. It is possible that these variables are directly related to the general factor, but it is also possible to model the relationship as being mediated by specific factors. Theory would be needed to distinguish between these models. To examine possible mediating specific factors, I removed the direct relationship and examined the MI of this model with bad fit to find mediators. Consistent with Brunner et al.’s model, Picture-Completion, Picture-Arrangement, and Matrix-Reasoning showed the highest MI for the PO factor, although the MI for the g-factor was equally high or higher. Thus, I added these three variables as indicators to the PO factor.

At this point, the biggest MI (76) suggested a correlated residual between Object-Assembly and Block-Design. However, these two variables are already indicators of the PO factor. The reason for the residual correlation is that the PO factor also has to explain relationships with the other variables that are related to the PO factor. Apparently, Object-Assembly and Block-Design are more strongly related than the loadings on the PO factor predict. To allow for this relationship within a hierarchical model, it is possible to add another layer in the hierarchy and specify a sub-PO factor that accounts for the shared variance between Object-Assembly and Block-Design. This modification improved model fit, CFI = .981 vs. .977. The RMSEA was .045 and in the range of acceptable fit (.00 to .06).

At this point, most MI are below 25, which suggests that further modifications are riskier and would amount to only minor changes that do not have a substantial influence on the broader model. Some of the remaining MI suggested negative relationships. While it is possible that some abilities are also negatively related to others, I decided not to include them in the model. One MI (31) suggested a correlation between the VC and WM factors. A simple way to improve fit would be to add this correlation to the model. However, there are other ways to accommodate this suggestion. Rather than allowing for a correlation between factors, it is possible to add secondary loadings of VC variables on the WM factor and vice versa. The MI for the correlation is often bigger because it combines several specific modifications. Thus, it is a theoretical choice which approach should be taken. One simple rule would be to add the correlation if all of the secondary loadings show the same trend and to allow for secondary loadings if the pattern is inconsistent. In this case, all of the VC variables showed a significant loading on the WM factor and I used these secondary loadings to modify the model.

There were only a few MI greater than 25 left. Two were correlated residuals of Arithmetic with Information and with Matrix Reasoning. Arithmetic was already complex and related to several specific factors. I therefore considered the possibility that this variable reflects several specific forms of cognitive abilities. The finding of correlated residuals suggested that even some of the unique variance of other variables is related to the Arithmetic task. To model this, I treated Arithmetic as a lower order construct that could be influenced by several of the other 14 variables. Exploration identified four variables that seemed to contribute uniquely to variance in the Arithmetic variable. I therefore removed Arithmetic as an indicator of specific factors and added it as a variable that is predicted by these four variables, namely Matrix-Reasoning, b = .32, Information, b = .28, Letter-Number-Sequencing, b = .18, and Digit Span, b = .13.

There was only one remaining MI greater than 25. This was a correlation between the PO factor and the Information variable. Although the loading of Information on the PO factor did not meet the threshold, it would also improve model fit and be more interpretable. Thus, I added this parameter. The loading was weak, b = .12, and the value was not significant at the .01 level. The MI for the correlation between Information and the PO factor was now below the 25 threshold. As this parameter is not really relevant, I decided to remove it and use this model as the final model. The fit of this final model met standard criteria of model fit, CFI = .990, RMSEA = .042. This fit is better than the fit of Brunner et al.’s hierarchical model, CFI = .970, RMSEA = .071, and their favored bi-factor model, CFI = .981, RMSEA = .060. While model fit cannot show that a model is the right model, model comparison suggests that models with worse fit are not the right model. This does not mean that these models are entirely wrong. Well-fitting models are likely to share some common elements. Thus, model comparison should not be based only on a comparison of fit indices, but should also examine the actual differences between the models. To make this comparison easy, I show how my model is similar to and different from their model.

The Figure shows that key aspects of the two models are similar. One difference is a couple of weak secondary loadings of two VC variables on the WM factor. Including these parameters merely shows where the simpler model produces some misfit to the data. The second difference is the additional relationship between Object Assembly and Block Design. The models would look even more similar if this relationship were modeled by simply adding a correlated residual between these two variables. The interpretation is the same. There is some factor that influences performance on these two tasks, but not on the other 12 tasks. The biggest difference is the modeling of Arithmetic. In my model, Arithmetic shares variance with Digit Span and Letter-Number Sequencing, and this shared variance includes variance that is explained by the WM factor and variance that is unique to these variables. In Brunner et al.’s model, Arithmetic is only related to the variance that is shared by Digit Span and Letter-Number Sequencing, not the unique variance of these two variables. In addition, my model shows even stronger relationships of Arithmetic with Information and Matrix Reasoning. These differences may have implications for theories about this specific task, but have no practical implications for theories that try to explain the general pattern of correlations. In conclusion, an exploratory structural equation modeling approach produces a hierarchical model that is essentially the same as a published model. Thus, conventions that prevent researchers from exploring their data with SEM are hindering scientific progress. One could argue that Brunner et al.’s hierarchical model is equally atheoretical and that the similarity merely shows that both models are wrong. To make this a valid criticism, intelligence researchers would have to specify a theoretical model and then demonstrate that it fits the data at least as well as my model. Merely claiming expertise in a substantive area is not enough.
The strength of SEM is that it can be used to test different theories against each other. Eventually, more data will be needed to pit alternative models against each other. My model serves as a benchmark for other models even if it was created by looking at the data first. Developing theories of data is not wrong. In fact, it is not clear how one would develop a theory without having any data. And the worst practice in psychology is when researchers have theories without data and then look for data that confirm their theory and suppress contradictory evidence. This practice has led to the replication crisis in experimental social psychology.

Reliability and Measurement Error

Hierarchical factor analysis can be used to examine structures of naturally occurring objects (e.g., the structure of emotions, Shaver et al., 1987) or to examine the structure of man-made objects. CFA is often used for the latter purpose, when researchers examine the psychometric properties of measures that they created. In the present example, the 14 variables are not a set of randomly selected tasks. Rather, they were developed to measure intelligence. To be included as a measure of intelligence, a measure had to demonstrate that it correlates with other measures under the assumption that intelligence is a general factor that influences performance on a range of tasks. It is therefore not surprising that the 14 variables are strongly correlated and load on a common factor. The reason is that they were designed to be measures of this general factor. Moreover, intelligence researchers are not only interested in correlations among measures and factors. They also want to use these measures to assign scores to individuals. The assignment of scores to individuals is called assessment. Ideally, we would just look up individuals’ standing on the general factor, but this is not possible. A factor reflects shared variance among items, but we only know individuals’ standing on the observed variables, not the variance that is shared. To solve this problem, researchers average individuals’ scores on individual variables and use these sum-scores as a measure of individuals’ standing on the factor. The use of sum-scores creates measurement error because some of the variance in sum-scores reflects the variance in specific factors and the unique variances of the variables. Averaging across several variables can reduce measurement error because averaging reduces the influence of specific factors and residual variances. Brunner et al. (2012) discuss several ways to quantify the amount of construct variance (the g-factor) in sum-scores of the 14 items.

A simple way to explore this question is to add sum-scores to the model with a formative measurement model, where a new variable is regressed on the observed variables and the residual variance is fixed to zero. This new variable represents the variance in actual sum-scores and it is not necessary to actually create them and add them to the variable set. To examine the influence of the general factor, it is possible to use mediation analysis because the effect of the g-factor on the sum-score is fully mediated by the 14 variables. This model shows that the standardized effect of the g-factor on the sum-scores is r = .96, which implies that 92% of the variance is explained by the g-factor. It is also possible to examine the sources of the remaining 8% of variance that does not reflect the g-factor by examining mediated paths from the unique variances in the specific factors. The unique variances in the WM, PS, and PO factors explained no more than 1% each, but the unique VC variance explained 3% of the variance in sum-scores. The remaining 2% can be attributed to unique variances in the 14 variables. Overall, these results suggest that the sum-score is a good measure of the g-factor. This finding does not tell us what the g-factor is, but it does suggest that sum-scores are a highly valid measure of the g-factor.
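The logic of this variance decomposition can be illustrated outside of SEM software. The sketch below is a simplified stand-in for the mediation analysis described above, not a reproduction of it: it assumes standardized variables, a general factor, and specific factors that have been residualized so that they are independent of the general factor. The function name, group labels, and all loadings are hypothetical.

```python
from collections import defaultdict


def sum_score_shares(items):
    """Share of unit-weighted sum-score variance due to g, specific factors, and uniqueness.

    `items` is a list of (g_loading, group, specific_loading) tuples. The specific
    factors are assumed to be orthogonal to g and to each other.
    """
    g_total = 0.0
    specific_totals = defaultdict(float)
    unique_variance = 0.0
    for g_loading, group, specific_loading in items:
        g_total += g_loading
        specific_totals[group] += specific_loading
        unique_variance += 1.0 - g_loading ** 2 - specific_loading ** 2

    # Loadings on the same factor add up before squaring; unique variances simply add.
    parts = {"g": g_total ** 2}
    for group, total in specific_totals.items():
        parts[group] = total ** 2
    parts["unique"] = unique_variance

    total_variance = sum(parts.values())
    return {name: value / total_variance for name, value in parts.items()}


# Hypothetical four-item example with two specific factors, A and B.
example = [(0.7, "A", 0.3), (0.7, "A", 0.3), (0.7, "B", 0.3), (0.7, "B", 0.3)]
for source, share in sum_score_shares(example).items():
    print(f"{source}: {share:.3f}")

# The conversion used in the text: a standardized effect of .96 squared.
print(round(0.96 ** 2, 2))
```

The sketch shows why adding more variables helps: loadings on the general factor accumulate across all items before they are squared, whereas specific and unique variances accumulate over fewer items or not at all.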

The indirect path coefficients can be used to shorten a scale by eliminating variables that make a small contribution to the total effect. Following this approach, I removed four variables: Arithmetic, Information, Comprehension, and Letter-Number Sequencing. The effect of the g-factor on this sum-score was as high, b = .96, as for the total scale. It is of course possible that this result is unique to this dataset, but the main point is that HFA can be used to determine the contribution of factors to sum-scores in order to create measures and to examine their construct validity under the assumption that a factor corresponds to a construct (Cronbach & Meehl, 1955). To support the interpretation of factors, it is necessary to examine factors in relationship to other constructs, which also can be done using SEM. Moreover, evidence of validity is by no means limited to correlational evidence. Experimental manipulations can be added to an SEM model to demonstrate that a manipulation changes the intended construct and not some other factors. This cannot be done with sum-scores because sum-scores combine valid construct variance and measurement error. As a result, validation of measures requires specifying a measurement model in which constructs are represented as factors and demonstrating that factors are related to other variables as predicted.


This tutorial on hierarchical factor analysis was written in response to Brunner et al.’s (2012) tutorial on hierarchically structured constructs. There are some notable differences between the two tutorials. First, Brunner et al. (2012) presented a hierarchical model without detailed explanations of the model. The model assumes that the 14 measures are first of all related to four specific factors and that the four specific factors are related to one general factor. This simple model implicitly assumes no secondary loadings and no additional relationships among first-order factors. These restrictive assumptions are unlikely to be true and often lead to bad fit. While Brunner et al. (2012) claim that their model has adequate fit, the RMSEA value of .071 is high and higher than the RMSEA of the bi-factor model. Thus, this model should not be accepted without exploration of alternative models. The key difference between Brunner et al.’s tutorial and my tutorial is that Brunner et al. (2012) imply that their hierarchical model is the only possible hierarchical model that needs to be compared to models that do not imply a hierarchy like the bi-factor model. This restrictive view is shared with authors who claim simple CFA models should not include correlated residuals or secondary loadings. I reject this confirmatory straitjacket, which is unrealistic and leads to low fit. I illustrate how hierarchical models can be created that actually fit data. While this approach is exploratory, it has the advantage that it can produce theoretical models that actually fit data. These models can then be tested in future studies. Another difference between the two tutorials is that I present a detailed description of the process of building a hierarchical model. This process is often not explicitly described because CFA is presented as a tool that examines the fit between an a priori theory and empirical data. Looking at the data and making a model fit data is often considered cheating.
However, it is only cheating when authors fail to disclose the retrofitting of a model. Honest exploratory work is not cheating and is an integral part of science. After all, it is not clear how scientists could develop good theories in the absence of data that constrain theoretical predictions.

Another common criticism of factor models in general is that factor models imply causality while the data are only correlational. First of all, experimental manipulations can be added to models to validate factors, but sometimes this is not possible. For example, it is not clear how researchers could manipulate intelligence to validate an intelligence measure. Even if such manipulations were possible, the construct of intelligence would be represented by a latent variable. Some psychologists are reluctant to accept that factors correspond to causes in the real world. Even these skeptics have to acknowledge that test scores are real and something caused some individuals to provide a correct answer whereas others did not. This variance is real and it was caused by something. A factor first of all shows that the causes that produce good performance on one task are not independent of the causes of performance on another task. Without using the loaded term intelligence, the g-factor merely shows that all of the 14 measures have some causes in common. The presence of a g-factor does not tell us what this cause is, nor does it mean that there is a single cause. It is well-known that factor analysis does not reveal what a factor is and that factors may not measure what they were designed to measure. However, factors show which observed measures are influenced by shared causes when we can rule out the possibility of a direct relationship. That is, doing well on one task does not cause participants to do well on another task. In a hierarchical model, higher-order factors represent causes that are shared among a large number of observed measures. Lower-order factors represent causes that are shared by a smaller number of measures. Observed measures that are related to the same specific factor are more likely to have a stronger relationship because they share more causes.

In conclusion, I hope that this tutorial encourages more researchers to explore their data using hierarchical factor analysis and to look at their data with open eyes. Careful exploration of the data may reveal unexpected results that can stimulate thoughts and propose new theories. The development of new theories that actually fit data may help to overcome the theory crisis in psychology that is based on an unwillingness and inability to falsify old and false theories. The progress of civilizations is evident in the ruins of old ones. The lack of progress in psychology is evident in the absence of textbook examples of ancient theories that have been replaced by better ones.

When Dumb People Say Stupid Things about Personality and Intelligence

Personality psychologists have conducted hundreds of studies that relate various personality measures to each other. The good news about this research is that it is relatively easy to do and doesn’t cost very much. As a result, sample sizes are big enough to produce stable estimates of the correlations between these measures. Moreover, personality psychologists often study many correlations at the same time. Thus, statistical significance is not a problem because some correlations are bound to be significant.

The key problem with personality psychology is that many studies are mono-method studies. This often leads to spurious correlations that are caused by method factors (Campbell & Fiske, 1959). For example, self-report measures often correlate with each other because they are influenced by socially desirable responding. It is therefore interesting to find articles that used multiple methods, which makes it possible to separate method factors and personality factors.

One common finding from multi-method studies is that the Big Five personality traits often appear correlated when they are measured with self-reports, but not when they are measured with multiple methods (i.e., multiple raters) (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). Furthermore, the correlations among self-ratings of the Big Five are explained by an evaluative or desirability factor.

Despite this evidence, some personality psychologists argue that the Big Five are related to each other by substantive traits. One model assumes that there are two higher-order factors. One factor produces a positive correlation between Extraversion and Openness and another factor produces positive correlations between Emotional Stability (low Neuroticism), Agreeableness, and Conscientiousness. These two factors are supposed to be independent (DeYoung, 2006). Another model proposes a single higher-order factor that is called the General Factor of Personality (GFP). This factor was originally proposed by Musek (2007) and then championed by the late psychologist Rushton. Planck suggested that bad theories die after their champion dies, but in this case Dimitri van der Linden has taken it upon himself to keep the GFP alive. I met Dimitri at a conference many years ago and discussed the GFP with him, but evidently my arguments fell on deaf ears. My main point was that you need to study factors with factor analysis. A simple sum score of Big Five scales is not a proper way to examine the GFP because this sum score also contains variance of the specific Big Five factors. Apparently, he is too stupid or lazy to learn structural equation modeling to use CFA in studies of the GFP.

Instead, he computes weighted sum scores as indicators of factors and uses these sum scores to examine relationships of higher-order factors with intelligence.

The authors then find that the Plasticity scale is related to self-rated and objective measures of intelligence and interpret this as evidence that the Plasticity factor is related to intelligence. However, the Plasticity scale is just an average of Extraversion and Openness and it is possible that this correlation is driven by the unique variance in Openness rather than the shared variance between Openness and Extraversion that corresponds to the Plasticity factor. In other words, the authors fail to examine how higher-order factors are related to intelligence because they do not examine this relationship at the level of factors, which requires structural equation modeling. Fortunately, they provided the correlations among the measures in their two studies and I was able to conduct a proper test of the hypothesis that Plasticity is related to intelligence. I fitted a multiple-group model to the correlations among the Big Five scales (different measures were used in the two studies), the self-report of intelligence, and the scores on Cattell’s IQ test. Overall model fit was acceptable, CFI = .943, RMSEA = .050. Figure 1 shows the model. First of all, there is no evidence of Stability and Plasticity as higher-order factors, which would produce correlations between Extraversion (EE) and Openness (OO) and correlations between Neuroticism (NN), Agreeableness (AA), and Conscientiousness (CC). Instead, there was a small positive correlation between Neuroticism and Openness and between Agreeableness and Conscientiousness. There was evidence of a general factor that influenced self-ratings of the Big Five (N, E, O, A, C) and self-ratings of intelligence (sri), although the effect size for self-reported intelligence was surprisingly small. This might be due to the assessment of intelligence, which may have led to more honest reporting. Most important, the general factor (h) was unrelated to performance on Cattell’s test.
This shows that the factor is unique to the method of self-ratings and supports the interpretation of this factor as a method factor (Anusic et al., 2009). Finally, self-ratings and objective test scores reflect a common factor, which shows some valid variance in self-ratings. This has been reported before (Borkenau & Liebler, 1992). The intelligence factor was related to Openness, but not to Extraversion, which is also consistent with other studies that examined the relationship between personality and IQ scores. Evidently, intelligence is not related to Plasticity because Plasticity is the shared variance between Extraversion and Openness, and there is no evidence that this shared variance exists and no evidence that Extraversion is related to intelligence.
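The problem with using an average as a stand-in for a factor can be shown with simple arithmetic. The sketch below computes the correlation between the average of two standardized scales and a criterion from the three pairwise correlations. The function name and the correlations are hypothetical and are not taken from the studies discussed here.

```python
import math


def composite_correlation(r_a_crit: float, r_b_crit: float, r_ab: float) -> float:
    """Correlation of the average of two standardized scales with a criterion.

    Cov(mean, criterion) = (r_a_crit + r_b_crit) / 2 and
    Var(mean) = (2 + 2 * r_ab) / 4, so the factors of 2 cancel.
    """
    return (r_a_crit + r_b_crit) / math.sqrt(2 + 2 * r_ab)


# Hypothetical values: Extraversion is unrelated to IQ (r = .00), Openness is
# related to IQ (r = .30), and the two scales correlate .20 with each other.
r_plasticity_scale = composite_correlation(0.00, 0.30, 0.20)
print(round(r_plasticity_scale, 3))
```

In this hypothetical case the "Plasticity" scale correlates positively with the criterion even though only one of its two components is related to it, which is why a scale-level correlation cannot establish a relationship at the level of the higher-order factor.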

These results show that van der Linden and colleagues came to the wrong conclusion because they did not analyze their data properly. To make claims about higher-order factors, it is essential to use structural equation modeling. Structural equation modeling shows that the Plasticity and Stability higher-order factors are not present in these data (i.e., the pattern of correlations is not consistent with this model) and it shows that only Openness is related to intelligence, which can also be seen by just inspecting the correlation tables. Finally, the authors misinterpret the relationship between the general factor and self-rated intelligence. “First, their [high GFP individuals] intellectual self-confidence might be partly rooted in their actual cognitive ability as SAI and g shared some variance in explaining Plasticity and the GFP” (p. 4). This is pure nonsense. As is clearly visible in Figure 1, the general factor is not related to scores on Cattell’s test and as a result it cannot be related to the shared variance between test scores and self-rated intelligence that is reflected in the i factor in Figure 1. There is no path linking the i-factor with the general factor (h). Thus, individuals’ standing on the h-factor is independent of their actual intelligence. A much simpler interpretation of the results is that self-rated intelligence is influenced by two independent factors. One is rooted in accurate self-knowledge and correlates with objective test scores and the other is rooted in overly positive ratings on desirable traits and is related to the tendency to do so across all traits. Although this plausible interpretation of the results is based on a published theory of personality self-ratings (Anusic et al., 2009), the authors simply ignore it. This is bad science, especially in correlational research that requires testing of alternative models.

In conclusion, I was able to use the authors’ data to support an alternative theory that they deliberately ignored because it challenges their prior beliefs. There is no evidence for a General Factor of Personality that gives some people a desirable personality and others an undesirable one. Instead, some individuals exaggerate their positive attributes in self-reports. Even if this positive bias (self-enhancement) were beneficial, it is conceptually different from actually possessing these attributes. Being intelligent is not the same as thinking that one is intelligent, and thinking that one understands personality factors is different from actually understanding personality factors. I am not the first critic of personality psychologists’ lack of clear thinking about factors (Borsboom, 2006).

“In the case of PCA, the causal relation is moreover rather uninteresting; principal component scores are “caused” by their indicators in much the same way that sumscores are “caused” by item scores. Clearly, there is no conceivable way in which the Big Five could cause subtest scores on personality tests (or anything else, for that matter), unless they were in fact not principal components, but belonged to a more interesting species of theoretical entities; for instance, latent variables. Testing the hypothesis that the personality traits in question are causal determinants of personality test scores thus, at a minimum, requires the specification of a reflective latent variable model (Edwards & Bagozzi, 2000). A good example would be a Confirmatory Factor Analysis (CFA) model.”

In short, if you want to talk about personality factors, you need to use CFA and examine the properties of latent variables. It is really hard to understand why personality psychologists do not use this statistical tool when most of their theories are about factors as causes of behavior. Borsboom (2006) proposed that personality psychologists dislike CFA because it can disprove theories and psychologists seem to have an unhealthy addiction to confirmation bias. Doing research to find evidence for one’s beliefs may feel good and may even lead to success, but it is not science. Here I show that Plasticity and Stability do not exist in a dataset and that the authors do not notice this because they treat sumscores as if they were factors. Of course, we can average Extraversion and Openness and call this average Plasticity, but this average is not a factor. To study factors, it is necessary to specify a reflective measurement model, and there is a risk that a model may not fit the data. Rather than being avoided, this outcome should be celebrated because falsification is the root of scientific progress. Maybe the lack of theoretical progress in personality psychology can be attributed to a reluctance to disconfirm existing theories.

Psychopathic Measurement: The Elemental Psychopathy Scale Lacks Construct Validity


In this blog post (pre-print), I examine the construct validity of the Elemental Psychopathy Assessment Super-Short Form scale (EPA-SSF) with Rose et al.’s (2022) open data. I examine construct validity by means of structural equation modeling. I find that the proposed 3-factor structure does not fit the data and find support for a four-factor structure. I also find evidence for a fifth factor that reflects a tendency to endorse desirable traits more and undesirable traits less. I find that most of the reliable variance in the scale scores is predicted by this factor, whereas substantive traits play a small role. I also show that the general factor contributes to the prediction of self-reported criminal behaviors. I find no evidence to support the inclusion of Emotional Stability in the definition of psychoticism. Finally, I raise theoretical objections about the use of sum scores to measure multi-trait constructs. Based on these concerns, I argue that the EPA-SSF is not a valid measure of psychoticism and that results based on this measure do not add to the creation of a nomological net surrounding the construct of psychoticism.


Measurement combines invention and discovery. The invention of microscopes made it possible to see germs and to discover the causes of many diseases. Turning a telescope to the skies allowed Galileo to make new astronomical discoveries. In the 20th century, psychology emerged as a scientific discipline and the history of psychology is marked by the development of psychological measures. Nowadays, psychological measurement is called psychometrics. Unfortunately, psychometrics is not a basic, fundamental part of mainstream psychological science. Instead, psychometrics is mostly taught in education departments and used for applied purposes of educational testing. As a result, many psychologists who use measures in their research have very little understanding of psychological measurement.

For any measure to be able to discover new things, it has to be valid. That is, the numbers that are produced by a measure should reflect mostly variation in the actual objects that are being examined. Science progresses when new measures are invented that can produce more accurate, detailed, and valid information about the objects that are being studied. For example, developments in technology have created powerful microscopes and telescopes that can measure small objects in nanometers and galaxies billions of lightyears away. In contrast, psychological measures are more like kaleidoscopes. They show pretty images, but these images are not a reflection of actual objects in the real world. While this criticism may be harsh, it is easily supported by the simple fact that psychologists do not quantify validity of their measures and that there are often multiple measures that claim to measure the same construct even though they are only moderately correlated. For example, at least eight different measures claim to be measures of narcissism without a clear definition of narcissism and without validity information that makes it possible to pick the best measure of narcissism (Schimmack, 2022).

A fundamental problem in psychological science is the way scientific findings are produced. Typically, a researcher has an idea, conducts a study, and then publishes results if the results support their initial ideas. This bias is easily demonstrated by the fact that 95% of articles in psychology journals are supportive of researchers’ ideas, which is an unrealistically high success rate (Sterling, 1959; Sterling et al., 1995). Journals are also reluctant to publish work that is critical of previous articles, especially if these articles are highly cited, and the original authors are often asked to be expert reviewers of work that is critical of their own work. It would take superhuman strength to be impartial in these reviews, and these self-serving reviews are often the death of critical work. Thus, psychological science lacks the basic mechanism that drives scientific progress: falsification of bad theories or learning from errors. Evidence for the lack of self-correction that is a necessary element of science was produced during the past decade, in what has been called the replication crisis, when researchers dared to publish replication failures of well-known findings. However, while the replication crisis has focused on empirical tests of hypotheses, criticism of psychological measures has remained relatively muted (Flake & Fried, 2020). It is time to use the same critical attitude that fueled the replication crisis and apply it to psychological measurement. I predict that many of the existing measures lack sufficient construct validity or are redundant with other measures. As a result, progress in psychological measurement would be marked by a consolidation of measures that is based on a comparison of measures’ construct validity. As one of my favorite psychologists once observed in a different context, in science “less is more” (Cohen, 1990), and this is also true for measurement. While cuckoo clocks are fun, they are not used for scientific measurement of time.


A very recent article reviewed the literature on psychopathy (Patrick, 2022). The article describes psychopathy as a combination of three personality traits.

A conceptual framework that is helpful for assimilating different theoretical perspectives and integrating findings across studies using different measures of psychopathy is the triarchic model (Patrick et al. 2009, Patrick & Drislane 2015b, Sellbom 2018). This model characterizes psychopathy in terms of three trait constructs that correspond to distinct symptom features of psychopathy but relate more clearly to biobehavioral systems and processes. These are (a) boldness, which encompasses social dominance, venturesomeness, and emotional resilience and connects with the biobehavioral process of threat sensitivity; (b) meanness, which entails low empathy, callousness, and aggressive manipulation of others and relates to biobehavioral systems for affiliation (social connectedness and caring); and (c) disinhibition, which involves boredom proneness, lack of restraint, irritability, and irresponsibility and relates to the biobehavioral process of inhibitory control. (p. 389).

This definition of psychopathy raises several questions about the relationship between boldness, meanness, and disinhibition and psychopathy that are important for valid measurement of psychopathy. First, it is clear that psychopathy is a formative construct. That is, psychopathy is not a common cause of boldness, meanness, and disinhibition, and the definition imposes no restrictions on the correlation among the three traits. Boldness could be positively or negatively correlated with meanness or they could be independent. In fact, models of normal personality would predict that these three dimensions are relatively independent because boldness is related to extraversion, meanness is related to low agreeableness, and disinhibition is related to low conscientiousness, and these three broader traits are independent. As a result, the definition of psychopathy as a combination of three relatively independent traits implies that psychopaths are characterized by high levels on all three traits. This definition raises questions about how information about the three traits should be combined to produce a valid score that reflects psychopathy. However, in practice scores on these dimensions are often averaged without a clear rationale for this scoring method.

Patrick’s (2022) review also points out that multiple measures aim to measure psychopathy with self-reports: “multiple scale sets exist for operationalizing biobehavioral traits corresponding to boldness, disinhibition, and meanness in the modality of self-report (denoted in Figure 3 by squares labeled with subscript-numbered S’s)” (p. 405). It is symptomatic of the lack of measurement theories that Patrick uses the term operationalize instead of measurement because psychometricians rejected the notion of operational measurement more than 50 years ago (Cronbach & Meehl, 1955). The problem with operationalism is that every measure is by definition a valid measure of a construct because the construct is essentially defined by the measurement instrument. Accordingly, a psychopathy measure is a valid measure of psychopathy and if different measures produce different scores, they simply measure different forms of psychopathy. However, few researchers would be willing to accept that their measure is just an arbitrary collection of items without a claim to measure something that exists independent of the measurement instrument. Yet, they also fail to provide evidence that their measure is a valid measure of psychopathy.

Here, I examine the construct validity of one self-report measure of psychopathy using the open data shared by the authors who used this measure, namely the 18-item short form of the Elemental Psychopathy Assessment (EPA; Lynam, Gaughan, Miller, Miller, Mullins-Sweatt, & Widiger, 2011; Collison, Miller, Gaughan, Widiger, & Lynam, 2016). The data were provided by Rose, Crowe, Sharpe, Til, Lynam, and Miller (2022).

Rose et al.’s description of the EPA is brief.

The EPA-SSF (Collison et al., 2016) yields a total psychopathy score (alpha = .70/.77) as well as scores for each of three subscales: Antagonism (alpha = .61/.72), Emotional Stability (alpha = .66/.65), and Disinhibition (alpha = .68/.71).

The description suggests that the measure aims to measure psychopathy as a combination of three traits, although boldness (high Extraversion) is replaced with Emotional Stability (Low Neuroticism).

Based on their empirical findings, Rose et al. (2022) conclude that two of the three traits predict the negative outcomes that are typically associated with psychopathy. “It is the ATM Antagonism and Impulsivity [Disinhibition] domains that are most responsible for psychopathy, narcissism, and Machiavellianism’s more problematic correlates – antisocial behavior, substance use, aggression, and risk taking” (p. 10). In contrast, emotional stability/boldness are actually beneficial. “Conversely, the Emotional Stability and Agency factors are more responsible for the more adaptive aspects including self-reported political and interpersonal skill” (p. 11).

This observation might be used to modify the construct of psychopathy in an iterative process known as construct validation (Cronbach & Meehl, 1955). Accordingly, disconfirming evidence can be attributed to problems with a measure or problems with a construct. In the present case, the initial assumption appears to be that psychopaths have to be low in Neuroticism or bold to commit horrible crimes. Yet, the evidence suggests that there can also be neurotic psychopaths who are violent, and the cause of violence may be a combination of high neuroticism (especially anger) and low conscientiousness (lack of impulse control). We might therefore limit the construct of psychopathy to low agreeableness and low conscientiousness, which would be consistent with some older models of psychopathy (van Kampen, 2009). Even this definition of psychopathy can be critically examined given the independence of these two traits. If the actual personality factors underlying anti-social behaviors are independent, we might want to focus on these independent causes. The term psychopath would be akin to the word girl, which simply describes the combination of two independent attributes: young and female in one case, disagreeable and impulsive in the other. The term psychopath does not add anything to the theoretical understanding of anti-social behaviors because it is defined as nothing more than being mean and impulsive.

Does the EPA-SSF measure Antagonism, Emotional Stability, and Disinhibition?

The EPA was based on the assumption that Psychoticism is related to 18 specific personality traits and that these 18 traits are related to four of the Big Five dimensions. Empirical evidence supported this assumption. Five traits were related to low Neuroticism, namely Unconcerned, Self-Contentment, Self-Assurance, Impulsivity, and Invulnerability, and one was related to high Neuroticism (Anger). Evidently, a measure that combines items that reflect the high and low pole of a factor is not a good measure of the factor. Another problem is that several of these scales had notable secondary loadings on other Big Five factors. Anger loaded more strongly and negatively on Agreeableness than on Neuroticism, and Self-Assurance loaded more highly on Extraversion. Thus, it is a problem to refer to the Emotional Stability scale as a measure of Emotional Stability. If the theoretical model assumes that Emotional Stability is a component of Psychoticism, it would be sufficient to use a validated measure of Emotional Stability to measure this component. Presumably, the choice of different items was motivated by the hypothesis that the specific item content of the EPA scales adds to the measurement of psychoticism. In this case, however, it is misleading to ignore this content in the description of the measure and to focus on the shared variance among items.

Another six items loaded negatively on Agreeableness, namely Distrust, Manipulation, Self-Centeredness, Opposition, Arrogance, and Callousness. The results showed that these six items were good indicators of Agreeableness. A minor problem is to call this scale antagonism, which is a common term among personality disorder researchers. It is also a general understanding that Antagonism and Agreeableness are strongly negatively correlated without any evidence of discriminant validity. Thus, it may be confusing to label this factor by a different name, when this name merely refers to the low end of Agreeableness (Disagreeableness). Aside from this terminological confusion, it is a question whether the specific item content of the Antagonism scale adds to the definition of psychoticism. For example, the item “I could make a living as a con artist” may not just be a measure of agreeableness, but also measure specific aspects of psychoticism.

Another three constructs were clearly related to low conscientiousness, namely Disobliged, Impersistence, and Rashness. A problem occurs when these constructs are measured with a single item because exploratory factor analysis may fail to identify factors that have only three indicators, especially when factors are not independent. Once again, calling this factor Disinhibition can create confusion if it is not stated clearly that Disinhibition is merely a label for low Conscientiousness.

Most surprising is the finding that the last three constructs were unrelated to the three factors that are supposed to be captured with the EPA. Coldness was related to low Extraversion and low Agreeableness. Dominance was related to high Extraversion and low Agreeableness. Finally, Thrill-Seeking had low loadings on all Big Five factors. It is not clear why these items would be retained in a measure of psychoticism unless it is assumed that the specific content of these scales adds to the measurement and therefore the operational definition of psychoticism.

In conclusion, the EPA is based on a theory that psychoticism is a multi-dimensional construct that reflects the influence of 18 narrow personality traits. Although these narrow traits are not independent and are related to four of the Big Five factors, the EPA psychoticism scale is not identical to a measure that combines Emotional Stability, low agreeableness, and Low Conscientiousness.

Lynam et al. (2011) also examined how the 18 scales of the EPA are related to other measures of anti-social behaviors. Most notably, all of the low Neuroticism scales showed no relationship with anti-social behavior. The only Neuroticism-related scale that was a predictor was Anger, but Anger reflects not only high Neuroticism, but also low Agreeableness. These results raise questions about the inclusion of Emotional Stability in the definition of Psychoticism. Yet, the authors conclude “overall, the EPA appears to be a promising new instrument for assessing the smaller, basic units of personality that have proven to be important to the construct of psychopathy across a variety of epistemological approaches” (p. 122). It is unclear what evidence could have convinced the authors that their newly created measure is not a valid measure of psychoticism or that their initial speculation about the components of psychoticism was wrong. The use of an 18-item scale in 2022 shows that the authors have never found evidence to revise their theory of psychoticism or improved the measure of psychoticism. It is therefore important to critically examine the construct validity of the EPA from an independent perspective. I focus on the 18-item EPA-SSF because this scale was used by Rose et al. (2022) and I was able to use their open data.

Collison et al. (2016) conducted exploratory factor analyses with Promax rotation to examine the factor structure of the 18-item EPA-SSF. Although Lynam et al. (2011) demonstrated that items were related to four of the Big Five dimensions, they favored a three-factor solution. The problem with exploratory analyses is that they provide no evidence of the fit of a model to the data. Another problem is that factor solutions are atheoretical and influenced by item selection and arbitrary rotations. This might explain why the factor solution did not identify the expected factors. I conducted a replication of Collison et al.’s EFAs with Rose et al.’s (2022) data from Study 1 and Study 2. I conducted these analyses in Mplus, which provides fit indices that can be used to evaluate the fit of a model to the data. I used the Geomin rotation because this default method produces more fit indices, and the fit index that is available for both rotations (RMSEA) is the same. Evidently, this choice of a rotation method has no influence on the validity of the results because neither of these rotation methods is based on substantive theory about Psychoticism.

The results are consistent across the two datasets. RMSEA and CFI favor five factors, while the criterion that favors parsimony the most, BIC, favors four factors. A three-factor model does not have bad fit, but it does fail to capture some of the structure in the data.

To examine the actual factor structure, I first replicated Collison et al.’s EFA using a three-factor structure and Promax rotation. Factor loadings greater than .4 (16% explained variance) are highlighted. The results show that the disinhibition factor is clearly identified and all five items have notable (> .4) loadings on this factor. In contrast, only three items (Coldness, Callous, & Self-Centered) have consistent loadings on the Antagonism factor. The Emotional Stability factor is not identified in the first replication sample because factor 3 shows high loadings for the Extraversion items. The variability of factor loading patterns across datasets may be caused by the arbitrary rotation of factors.

It is unclear why the authors did not use Confirmatory Factor Analysis to test their a priori theory that the 18 items represent different facets of Big Five factors. Rather than relying on arbitrary statistical criteria, CFA makes it possible to examine whether the pattern of correlations is consistent with a substantive theory. Using Collison et al.’s correlations with the Big Five, I fitted a CFA model with four factors to the data. The loading pattern was specified based on Lynam et al.’s (2011) pattern of correlations with a Big Five measure. A loading was specified as a free parameter whenever the corresponding correlation was greater than .3.
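The rule for specifying the loading pattern can be illustrated with a short sketch. It assumes that the cutoff applies to the absolute value of a correlation. The table below is hypothetical: only the four values marked "reported" are mentioned in this post (they are cited further below); all other numbers are invented for illustration, and the function name is made up.

```python
# Sketch: derive a CFA loading pattern from item-by-Big-Five correlations.
# Assumption: a loading is freed when |r| exceeds the cutoff of .30.
BIG_FIVE = ("N", "E", "A", "C")

# Hypothetical correlation table; "reported" marks values cited in the post,
# everything else is invented for illustration only.
correlations = {
    "Anger":           {"N": 0.40,  "E": 0.00, "A": -0.50, "C": -0.10},
    "Self-Assurance":  {"N": -0.35, "E": 0.45, "A": 0.00,  "C": 0.10},   # A: reported
    "Impersistence":   {"N": 0.10,  "E": 0.29, "A": -0.05, "C": -0.50},  # E: reported
    "Invulnerability": {"N": -0.45, "E": 0.18, "A": 0.00,  "C": 0.05},   # E: reported
    "Opposition":      {"N": 0.18,  "E": 0.05, "A": -0.55, "C": -0.10},  # N: reported
}


def free_loadings(corr, cutoff=0.30):
    """Return, for each item, the factors whose loading is freely estimated."""
    return {
        item: [factor for factor in BIG_FIVE if abs(row[factor]) > cutoff]
        for item, row in corr.items()
    }


pattern = free_loadings(correlations)
for item, factors in pattern.items():
    print(f"{item}: {', '.join(factors)}")
```

With this rule, a correlation of .29 (Impersistence with Extraversion) falls just below the cutoff and the loading stays fixed at zero, which shows how much the initial model depends on an arbitrary threshold.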

Fit of this model did not meet standard criteria of acceptable model fit (CFI > .95, RMSEA < .06), but it was not terrible, CFI = .728, RMSEA = .088. 29 of the 33 free parameters were statistically significant at p < .05 and many were significant at p < .001. 20 of the coefficients were greater than .3. It is expected that effect sizes are a bit smaller because the indicators were single items and replication studies are expected to show some regression to the mean due to selection effects. Overall, these results show similarity between Lynam et al.’s (2011) results and the pattern of correlations in the replication study.
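For readers who are not familiar with these fit indices, the following sketch shows one common way to compute RMSEA and CFI from chi-square statistics and to compare them with the cutoffs used here. The formulas are the textbook versions (software packages differ in details such as using N or N - 1), the chi-square values in the example are hypothetical, and the function names are invented. Only the CFI and RMSEA values passed to `meets_criteria` are taken from this post.

```python
import math


def rmsea(chi2, df, n):
    """RMSEA from a model chi-square (one common formula, using N - 1)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """CFI: improvement of the model over the baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


def meets_criteria(cfi_value, rmsea_value, cfi_min=0.95, rmsea_max=0.06):
    """Check the two cutoffs used in this post (CFI > .95, RMSEA < .06)."""
    return {"CFI": cfi_value > cfi_min, "RMSEA": rmsea_value < rmsea_max}


# Hypothetical chi-square values, only to show the arithmetic.
example_rmsea = rmsea(chi2=400.0, df=100, n=501)
example_cfi = cfi(chi2_model=400.0, df_model=100, chi2_base=2100.0, df_base=120)

# Values reported in this post for the initial and the final model.
initial_model = meets_criteria(0.728, 0.088)
final_model = meets_criteria(0.887, 0.058)
print(initial_model, final_model)
```

Applied to the reported values, the initial model misses both cutoffs, whereas the final model reported below meets the RMSEA cutoff but not the CFI cutoff.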

The next step was to build a revised model to improve fit. The first step was to add a general evaluative factor to the model. Numerous studies of self-ratings of the Big Five and personality disorder instruments have demonstrated the presence of this factor. Adding a general evaluative factor to the model improved model fit, but it remained below standard criteria of acceptable model fit, CFI = .790, RMSEA = .078.

I then added additional parameters that were suggested by large modification indices. First, I added a loading for Impersistence on Extraversion. This loading was just below the arbitrary cut-off value of .30 in Lynam et al.’s study (r = .29). Another suggested parameter was a loading of Invulnerability on Extraversion (Lynam r = .18). A third parameter was a negative loading of Self-Assurance on Agreeableness. This loading was r = .00 in Lynam et al.’s (2011) study, but this could be due to the failure to control for evaluative bias that inflates ratings on Self-Assurance and Agreeableness items (Anusic et al., 2009). Another suggested parameter was a positive loading of Opposition on Neuroticism (Lynam r = .18). These modifications improved model fit, but were not sufficient to achieve an acceptable RMSEA value, CFI = .852, RMSEA = .066. I did not add additional parameters to avoid overfitting the model.

The next step was to fit these two models to Rose et al.’s second dataset. The direct replication of Lynam et al.’s (2011) structure did not fit the data well, CFI = .744, RMSEA = .090, whereas fit of the modified model with the general factor was even better than in Study 1, CFI = .886, RMSEA = .062, and RMSEA was close to the criterion for acceptable fit (.060). These results show that I did not overfit the data. I tried further improvements, but suggested parameters were not consistent across the two datasets.

In the final step, I deleted free parameters that were not significant in both datasets. Surprisingly, the Disobliged and Impersistence items did not load on Conscientiousness. This suggests some problems with these single-item indicators rather than a conceptual problem because these constructs have been related to Conscientiousness in many studies. Self-Contentment did not load on Conscientiousness either. Distrust was not related to Extraversion, and Disobliged was not related to Neuroticism. I then fitted this revised model to the combined dataset. This model had acceptable fit based on RMSEA, CFI = .887, RMSEA = .058.

This final model captures the main structure of the correlations among the 18 EPA-SSF items and is consistent with Lynam et al.’s (2011) investigation of the structure by correlating EPA scales with a Big Five measure. It is also consistent with measurement models that show a general evaluative factor in self-ratings. Thus, I am proposing this model as the first validated measurement model of the EPA-SSF. This does not mean that it is the best model, but critics have to present a plausible model that fits the data as well or better. It is not possible to criticize the use of CFA because CFA is the only method to evaluate measurement models. Exploratory factor analysis cannot confirm or disconfirm theoretical models because EFA relies on arbitrary statistical rules that are not rooted in substantive theories. As I showed, EFA led to the proposal of a three-factor model that has poor fit to the data. In contrast, CFA confirmed that the 18 EPA-SSF items are related to four of the Big Five scales. Thus, four – not three – factors are needed to describe the pattern of correlations among the 18 items. I also showed the presence of a general evaluative factor that is common to self-reports of personality. This factor is often ignored in EFA models that rotate factors.

After establishing a plausible measurement model for the EPA-SSF, it is possible to link the factors to the scale scores that are assumed to measure Psychoticism, using the model indirect function. The results showed that the general factor explained most of the variance in the scale scores, r = .82, r^2 = 67%. Agreeableness/Antagonism explained only 3% of the variance, r = -.17. This is a surprisingly low percentage given the general assumption that antagonism is a core personality predictor of anti-social behaviors. Conscientiousness/Disinhibition was a stronger predictor, but also explained less than 10% of the variance, r = -.30, r^2 = 9%. The contribution of Neuroticism and Extraversion was negligible. Thus, the remaining variance reflects random measurement error and unique item content. In short, these results raise concerns about the ability of the EPA-SSF to measure psychoticism rather than a general factor that is related to many personality disorders or may just reflect method variance in self-ratings.
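The variance shares follow directly from squaring the reported correlations. The sketch below repeats this arithmetic under the assumption that the factors are uncorrelated, so that the squared correlations can simply be added; the negligible contributions of Neuroticism and Extraversion are left out, so the residual is only approximate.

```python
# Correlations between factors and the EPA-SSF scale score, as reported above.
reported_r = {
    "general factor": 0.82,
    "Agreeableness/Antagonism": -0.17,
    "Conscientiousness/Disinhibition": -0.30,
}

# Assumption: the factors are uncorrelated, so squared correlations are additive.
shares = {name: r ** 2 for name, r in reported_r.items()}
explained = sum(shares.values())
residual = 1.0 - explained  # measurement error, unique item content, N and E

for name, share in shares.items():
    print(f"{name}: {share:.0%}")
print(f"explained by the three factors: {explained:.0%}, remaining: {residual:.0%}")
```

Under these assumptions, roughly one fifth of the variance in the scale score is left for random measurement error and unique item content.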

I next examined predictive validity by adding measures of non-violent and violent criminal behaviors. The first model used the EPA-SSF scale to predict the shared variance of non-violent and violent crime based on the assumption that psychopathy is related to both types of criminal behaviors. The fit of this model was slightly better than the fit of the model without the crime variables, CFI = .789 vs. .854, RMSEA = .059 vs. .064. In this model, the EPA-SSF scale was a strong predictor of the crime factor, r = .56, r^2 = 32%. I then fitted a model that used the factors as predictors of crime. This model had slightly better fit than the model that used the EPA-SSF scale as predictor of crime, CFI = .796 vs. .789, RMSEA = .059 vs. .059. Most importantly, neuroticism and extraversion were not significant predictors of crime, but the general factor was. I deleted the parameters for neuroticism and extraversion from the model. This further increased model fit, CFI = .802, RMSEA = .058. More importantly, the three factors explained more variance in the crime factor than the EPA-SSF scale, R = .70, R^2 = 49%. There were no major modification indices suggesting that unique variance of the items contributed to the prediction of crime. Nevertheless, I examined a model that only used the general factor as predictor and added items if they explained additional variance in the crime factor, akin to stepwise regression. This model selected three specific items and explained 44% of the variance. The items were Manipulativeness (“I could make a living as a con artist”), b = .22, Self-Centeredness (“I have more important things to worry about than other people’s feelings”), b = .25, and Thrill-Seeking (“I like doing things that are risky or dangerous”), b = .39.

Dimensional Models of Psychopathy

The EPA-SSF is a dimensional measure of psychopathy. Accordingly, higher scores on the EPA-SSF scale reflect more severe levels of psychopathy. Dimensional models have the advantage that they do not require validation of some threshold that distinguishes normal personality variation from pathological variation. However, this advantage comes with the disadvantage that there is no clear distinction between low agreeableness (normal & healthy) and psychopathy (abnormal & unhealthy). Another problem is that the multi-dimensional nature of psychopathy makes it difficult to assess psychopathy. To illustrate, I focus on the key components of psychopathy, namely antagonism (disagreeableness) and disinhibition (low conscientiousness). One possible way to define psychopathy in relationship to these two components would be to define psychopathy as being high on both dimensions. Another one would be to define it with an either/or rule, assuming that each dimension alone may be pathological. A third option is to create an average, but this definition has the problem that the average of two independent dimensions no longer captures all of the information about the components. As a result, the average will be a weaker predictor of actual behavior. This is a problem of sum score definitions such as socio-economic status that averages income and education and reduces the amount of variance that can be explained by income and education independently.
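The loss of information that comes with averaging can be made concrete with a small calculation. The sketch assumes two standardized, uncorrelated predictors (antagonism and disinhibition) with hypothetical standardized effects `b_a` and `b_d` on an outcome; none of the numbers come from the data analyzed here.

```python
def r2_separate(b_a, b_d):
    """Variance explained when both uncorrelated predictors enter separately."""
    return b_a ** 2 + b_d ** 2


def r2_average(b_a, b_d):
    """Variance explained by the average of the two predictors.

    The average has variance 1/2 and covariance (b_a + b_d) / 2 with the
    outcome, so its squared correlation is (b_a + b_d)^2 / 2.
    """
    return (b_a + b_d) ** 2 / 2


# Hypothetical effect sizes: one strong and one weak predictor.
b_a, b_d = 0.5, 0.1
separate = r2_separate(b_a, b_d)
averaged = r2_average(b_a, b_d)
print(f"separate predictors: {separate:.2f}, average score: {averaged:.2f}")
print(f"lost by averaging:   {separate - averaged:.2f}")
```

The difference equals (b_a - b_d)^2 / 2, so the average loses nothing only in the special case that both components have exactly the same effect on the outcome; the more the effects differ, the more predictive information the sum score throws away.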

One way to test the definition of psychopathy as being high in antagonism and disinhibition is to examine whether the two factors interact in the prediction of criminal behaviors. Accordingly, crimes are most likely to be committed by individuals who are both antagonistic and disinhibited, whereas each dimension alone is only a weak predictor of crime. I fitted a model with an interaction term as predictor of the crime factor. The interaction effect was not significant, b = .04, se = .14, p = .751. Thus, there is presently no justification to define psychopathy as a combination of antagonism and disinhibition. Instead, psychoticism appears to be better defined as being either antagonistic or disinhibited to such an extent that individuals engage in criminal or other harmful behaviors. Yet, this definition does not really add anything to our understanding of personality and criminal behavior. It is like the term infection, which may refer to a viral or bacterial infection.
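The significance test for the interaction can be approximated from the reported estimate and standard error. The sketch assumes a two-sided Wald z-test with a normal reference distribution, which is a common default for SEM parameters but is an assumption here. Because b and se are rounded to two decimals, the result is close to the reported p-value without matching it exactly.

```python
import math


def wald_test(b, se):
    """Two-sided Wald z-test for a single parameter estimate."""
    z = b / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Interaction effect as reported above (values rounded to two decimals).
z, p = wald_test(b=0.04, se=0.14)
print(f"z = {z:.2f}, p = {p:.2f}")
```

An estimate that is less than a third of its standard error is far from any conventional significance threshold, which is the basis for the conclusion that the data provide no support for an interaction.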

The Big Five Facets and Criminal Behavior

Investigation of the construct validity of the EPA-SSF showed that the 18 items reflect four of the Big Five dimensions and that two of the Big Five factors predicted criminal behavior. However, the 18 items are poor indicators of the Big Five factors. Fortunately, Rose et al. (2022) also included a 120-item Big Five measure that also measures 30 Big Five facets (4 items per scale). It is therefore possible to examine the personality predictors of criminal behaviors with a better instrument to measure personality. To do so, I first fitted a measurement model to the 30 facet scales. This model was informed by previous CFA analyses of the 30 facets. Most importantly, the model included a general evaluative factor that was independent of the Big Five factors. I then added the items about nonviolent and violent crime and created a factor for the shared variance. Finally, I added the three EPA-SSF items that appeared to predict variance in the crime factor. I also related these items to the facets that predicted variance in these items. The final model had acceptable fit according to the RMSEA criterion (< .06), RMSEA = .043, but not the CFI criterion (> .95), CFI = .874, and I was not able to find meaningful ways to improve model fit.

The personality predictors accounted for 61% of the variance in the crime factor. This is more variance than the EPA-SSF factors explained. The strongest predictor was the general evaluative or halo factor, b = -.49, r^2 = 24%. Surprisingly, the second strongest predictor was the Intellect facet of Openness and the relationship was positive, b = .41, r^2 = 17%. More expected was a significant contribution of the Compliance facet of Agreeableness, b = -.31, r^2 = 9%. Finally, the unique variance in the three EPA-SSF items (controlling for evaluative bias and variance explained by the 30 facets) added another 10% explained variance, r = .312.
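The percentages follow from squaring the standardized effects. Here is a minimal sketch of that arithmetic. It assumes that the predictors are independent of each other in the model, so that squared standardized effects can simply be added; the labels are illustrative and not taken from the original analysis.

```python
# Standardized effects on the crime factor as reported in the text.
effects = {
    "evaluative (halo) factor": -0.49,
    "Intellect facet": 0.41,
    "Compliance facet": -0.31,
    "unique EPA-SSF item variance": 0.312,
}

# Assumption: predictors are uncorrelated, so each squared standardized
# effect is that predictor's share of explained variance.
explained = {name: b ** 2 for name, b in effects.items()}

for name, share in explained.items():
    print(f"{name}: {share:.1%}")

total = sum(explained.values())
print(f"sum of the four listed predictors: {total:.1%}")
```

The four listed predictors add up to roughly 60%. Under the stated assumption, the small gap to the reported 61% would have to come from rounding and from weaker predictors that are not listed here.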

These results further confirm that Emotional Stability is not a predictor of crime, suggesting that it should not be included in the definition of psychopathy. These results also raise questions about the importance of disinhibition. Surprisingly, Conscientiousness was not a notable predictor of crime. It is also notable that Agreeableness is only indirectly related to crime. Only the Compliance facet was a significant predictor. This means that disagreeableness is only problematic in combination with other unidentified factors that make disagreeable people non-compliant. As a result, it is problematic to treat the broader agreeableness/antagonism factor as a disorder. Similarly, all murderers are human, but we would not consider being human a pathology.


Concerns about the validity of psychological measures led to the creation of a taskforce to establish scientific criteria of construct validity (Cronbach & Meehl, 1955). The key recommendation was to evaluate construct validity within a nomological net. A nomological net aims to explain a set of empirical findings related to a measure in terms of a theory that predicts these relationships. Psychometricians developed structural equation modeling (SEM), which makes it possible to test nomological nets. Here, I used structural equation modeling to examine the construct validity of the Elemental Psychopathy Assessment – Super Short Form scale.

My examination of the psychometric properties of this scale raises serious questions about its construct validity. The first problem is that the scale was developed without a clear definition of psychopathy. The measure is based on the hypothesis that psychopathy is related to 18 distinct, maladaptive personality traits (Lynam et al., 2011). This initial assumption could have led to a program of validation research that could have suggested revisions to this theory. Maybe some traits were missing or unnecessary. However, the measure and its short form have not been revised. This could mean that Lynam et al. (2011) discovered the nature of psychopathy in a stroke of genius or that Lynam et al. failed to test the construct validity of the EPA. My analyses suggest the latter. Most importantly, I showed that there is no evidence to include Emotional Stability in the definition and measurement of psychopathy.

I am not the first to point out this problem of the EPA. Collins et al. (2016) discuss the inclusion of Emotional Stability in a definition of psychopathy at length.

“it may seem counter-intuitive that Emotional Stability would be included as a factor of the super-short form (and other EPA forms). We do not believe its inclusion is inconsistent with our previous positions as the EPA was developed, in part, from clinical expert ratings of personality traits of the prototypical psychopath. Given that many of these ratings came from proponents of the idea that FD is a central component of psychopathy, it is natural that traits resembling FD, or emotional stability, would be present in the obtained profiles. While we present Emotional Stability as a factor of the EPA measure, however, we do not claim Emotional Stability to be a central feature of psychopathy. Its relatively weak relations to other measures of psychopathy and external criteria traditionally related with psychopathy support this argument (Gartner, Douglas, & Hart, 2016; Vize, Lynam, Lamkin, Miller, & Pardini, 2016).” (p. 2016).

Yet, Rose et al. (2022) treat the EPA-SSF as if it is a valid measure of psychopathy and make numerous theoretical claims that rely on this assumption. It is of course possible to define psychopathy in terms of low neuroticism, but it should be made clear that this definition is stipulative and cannot be empirically tested. The construct that is being measured is an artifact that is created by the researchers. While neuroticism is a construct that describes something in the real world (some people are more anxious than others), psychopathy is merely a list of traits. Some people may want to include Emotional Stability on the list and others may not. A problem arises only when the term psychopathy is used for different lists. The EPA scale is best understood as a measure of 18 traits. We may call this Psychopathy-18 to distinguish it from other constructs and measures of psychopathy.

The list definition of psychological constructs creates serious problems for the measurement of these constructs because list theories imply that a construct can be defined in terms of its necessary and sufficient components. Accordingly, a psychopath could be somebody who is high in Emotional Stability, low in Agreeableness, and low in Conscientiousness, or somebody who possesses all of the specific traits included in the definition of Psychopathy-18. However, traits are continuous constructs and it is not clear how individual profiles should be related to quantitative variation in psychopathy. Lynam et al. sidestepped this problem by simply averaging across the profile scores and treating this sum score as a measure of psychopathy. However, averaging results in a loss of information and the sum score depends on the correlations among the traits. This is not a problem when the intended construct is the factor that produces the correlations, but it is a problem when the construct is the profile of trait scores. As I showed, the sum score of the 18 EPA-SSF items mainly reflects the variance that is shared among all 18 items, which reflects a general evaluative factor. This general factor is not mentioned by Lynam et al. and is clearly not the construct that the EPA-SSF was intended to measure. Thus, even if psychopathy were defined in terms of 18 specific traits, the EPA-SSF sum score does not provide information about psychopathy because the information that the items were supposed to measure is destroyed by averaging them.
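The reason a sum score is dominated by shared variance can be shown with a small calculation. The sketch below is not the model I fitted to the EPA-SSF data. It assumes standardized items that each load on one general factor and on one trait factor, with all factors independent, and the loadings and grouping are hypothetical values chosen only for illustration.

```python
def sum_score_shares(n_groups, items_per_group, general_loading, group_loading):
    """Split the variance of an unweighted sum score into its sources.

    Assumes standardized items that each load on one general factor and on
    one group (trait) factor, with all factors mutually independent.
    """
    n_items = n_groups * items_per_group
    residual = 1 - general_loading ** 2 - group_loading ** 2
    if residual < 0:
        raise ValueError("loadings imply more than 100% item variance")

    # Shared general variance accumulates across all items.
    general_var = (n_items * general_loading) ** 2
    # Trait variance accumulates only within each group of items.
    group_var = n_groups * (items_per_group * group_loading) ** 2
    # Residual variance does not accumulate across items.
    residual_var = n_items * residual

    total = general_var + group_var + residual_var
    return {
        "general": general_var / total,
        "traits": group_var / total,
        "residual": residual_var / total,
    }


# Hypothetical values: 18 items in 6 groups of 3, general loading .4,
# trait loading .5 (not estimates from the EPA-SSF data).
shares = sum_score_shares(n_groups=6, items_per_group=3,
                          general_loading=0.4, group_loading=0.5)
for source, share in shares.items():
    print(f"{source}: {share:.0%}")
```

In this hypothetical case the general factor explains less variance in each single item than the trait factor does, yet it accounts for most of the variance in the sum score, because it is the only source of variance that all 18 items have in common.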


In conclusion, I am not an expert on personality disorders or psychopathy. I don’t know what psychopathy is. However, I am an expert on psychological measurement and I am able to evaluate construct validity based on the evidence that authors of psychological measures provide. My examination of the construct validity of the EPA-SSF using the authors’ own data makes it clear that the EPA-SSF lacks construct validity. Even if we follow the authors’ proposal that psychopathy can be defined in terms of 18 specific traits, the EPA-SSF sum score does not capture the theoretical construct. If you took this test and got a high score, it would not mean that you are a psychopath. More importantly, research findings based on this measure do not help us to explore the nomological network of psychopathy.

Hierarchical Factor Analysis

One important scientific activity is to find common elements among objects. Well-known scientific examples are the color wheel in physics, the periodic table in chemistry, and the Linnaean Taxonomy in biology. A key feature of these systems is the assumption that objects are more or less similar along some fundamental features. For example, similar animals in the Linnaean Taxonomy share prototypical features because they have a more recent common ancestor.

The prominence of classification systems in mature sciences suggests that psychology could also benefit from a classification of psychological objects. A key goal of psychological science is to understand human experiences and behaviors. At a very abstract level, the causes of experiences and behaviors can be separated into situational and personality factors (Lewin, 1935). The influence of personality factors can be observed when individuals act differently in the same situation. The influence of situations is visible when the same person acts differently in different situations.

Personality psychologists have worked on a classification system of personality factors for nearly a century, starting with Allport and Odbert’s (1936) catalogue of trait words and Thurstone’s (1934) invention of factor analysis. Factor analysis has evolved and there are many different options to conduct a factor analysis. The most important development was the invention of confirmatory factor analysis (Joreskog, 1969). Confirmatory factor analysis has several advantages over traditional factor analytic models, which are called exploratory factor analyses to distinguish them from confirmatory analyses. The most important advantage is the ability to test hierarchical models of personality traits (Marsh & Myers, 1986). The specification of hierarchical models with CFA is called hierarchical factor analysis. Despite the popularity of hierarchical trait models, personality researchers continue to rely on exploratory factor analysis as the method of choice. This methodological choice impedes progress in the search for a structural model of personality traits.

Metaphorical Science

A key difference between EFA and CFA is that EFA is atheoretical. The main goal is to capture the most variance in observed variables with a minimum of factors. This purely data-driven criterion implies that the number and the nature of factors are arbitrary. In contrast, CFA models aim to fit the data, and it is possible to compare models with different numbers of factors. For example, EFA would have no problem showing a single first factor, even if feminine and masculine traits were independent (Marsh & Myers, 1986). In CFA, however, such a model might show bad fit, and model comparison could show that a model with two factors fits the data better. The lack of model fit in traditional EFA applications may explain Goldberg’s attempt to explore hierarchical structures with a series of EFA models that specify different numbers of factors, starting with a single factor and adding one more factor at each step. For all solutions, factors are rotated based on some arbitrary criterion. Goldberg prefers Varimax rotation. As a consequence, factors within the same model are uncorrelated. His Figure 2 shows the results when this approach was used for a large number of personality items.

To reinforce the impression that this method reveals a hierarchical structure, factors at different levels are connected with arrows that point from the higher levels to the lower levels. Furthermore, correlations between factor scores are used to show how strongly factors at different levels are related. Readers may falsely interpret the image as evidence for a hierarchical model with a general factor on top. Goldberg openly admits that his method does not test hierarchical causal models and that none of the levels may correspond to actual personality factors.

“To many factor theorists, the structural representations included in this article are not truly “hierarchical,” in the sense that this term is most often used in the methodological literature (e.g., Yung, Thissen, & McLeod, 1999). For those who define hierarchies in conventional ways, one might think of the present procedure in a metaphorical sense” (p. 356).

The difference between a conventional and an unconventional hierarchical model is best explained by the meaning of a directed arrow in a hierarchical model. In a conventional model, an arrow implies a causal effect, and causal effects of a common cause produce a correlation between the variables that share the common cause (PSY100). For example, in Figure 1, the general factor correlates r = .79 with the first factor at level 2 and r = .62 with the second factor at level 2. The causal interpretation of these path coefficients would imply that the correlation between the two level-2 factors is .79 x .62 = .49. Yet, it is clear that this prediction is false because factors at the same level were specified to be independent. It therefore makes no sense to draw the arrows in this direction. Goldberg realizes this, but does it anyway.
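The contradiction can be written down in a few lines. This is only a sketch of the tracing rule for a single common cause, using the two path coefficients from the example; the function name is my own.

```python
def implied_correlation(loading_a, loading_b):
    """Correlation between two variables that share one common cause,
    given their standardized loadings on it and no other connection."""
    return loading_a * loading_b


# Path coefficients from the general factor to the two level-2 factors.
implied = implied_correlation(0.79, 0.62)

# Varimax rotation forces factors at the same level to be uncorrelated.
specified = 0.0

print(f"implied by a causal reading: {implied:.2f}")
print(f"specified by the rotation:   {specified:.2f}")
```

A causal reading of the arrows predicts a correlation of about .49, whereas the method fixes it at zero. Both cannot be true at the same time.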

“While the author has found it useful to speak of the correlations between factor scores at different levels as “path coefficients,” strictly speaking they are akin to part-whole correlations, but again the non-traditional usage can be construed metaphorically”.

It would have been better to draw the arrows in the opposite direction because we can interpret the reversed path coefficients as information about the loss of information when the number of factors is reduced by one. For example, the correlation of r = .79 between the first level-2 factor and the general factor implies that the general factor still captures .79^2 = 62% of the variance of the first level-2 factor and 38% of the variance is lost in the one-factor model. Goldberg fittingly called his approach ass-backwards, which means that the arrows need to be interpreted in the reverse direction.

The key advantages of Goldberg’s approach are that researchers did not need to buy additional software before R made CFA free of charge, did not have to learn structural equation modeling, and did not have to worry about model fit. A hierarchical structure with a general factor could always be found, even if the first factor is unrelated to some of the lower factors (see Figure 3 in Goldberg).

There is also no need to demonstrate consistency across datasets. The factors in the two models show different relations to the five factors at the lowest level. This is the beauty of metaphorical science. Every analysis provides a new metaphor that reflects personality without any ambition to reveal fundamental factors that influence human behavior.

Metaphorical Pathological Traits

It would be unnecessary to mention Goldberg’s metaphorical hierarchical models if personality researchers had ignored his approach and used CFA to test hierarchical models. In fact, there have been no notable applications of Goldberg’s approach in mainstream personality psychology. However, the method has gained popularity among clinical psychologists interested in personality disorders. A highly cited article by Kotov et al. (2017) claims that Goldberg’s method “supported the presence of a p factor, but also suggested that multiple meaningful structures of different generality exist between the six spectra and a p factor” (p. 463). I do not doubt that meaningful metaphors can be found to describe maladaptive traits, but it is problematic that interpretability is the sole criterion to justify the claim of a hierarchical structure of personality factors that may cause intrapersonal and interpersonal problems. Although Kotov et al. (2017) mention confirmatory factor analysis as a potential research tool, they do not mention that Goldberg’s method is fundamentally different from hierarchical CFA.

The most highly cited application of Goldberg’s method is published in an article by Wright, Thomas, Hopwood, Markon, Pincus, and Krueger (2012). The data are undergraduate (N = 2,461) self-ratings on the 220 items of the Personality Inventory for DSM-5. The 220 items are scored to provide information about 25 maladaptive traits that are correlated with each other. Wright et al. show that the correlations among the 25 scales can be represented with five correlated factors, but they do not provide fit indices of the five-factor solution. Correlations among the five factors ranged from r = .043 to .437.

Figure 1 in Wright et al. (2012) shows Goldberg’s hierarchical structure.

A naive interpretation of the structure and the path coefficients seems to suggest the presence of a strong general factor that contributes to personality pathology. This general factor appears to explain a large amount of variance in internalizing and externalizing personality problems. Internalizing and externalizing factors explain considerable variance in four of the five primary factors, but psychoticism appears to be rather weakly related to the other traits and the p-factor. However, this interpretation of the results is only metaphorical.

A proper interpretation of the hierarchy focuses on the variance that is lost when five factors are reduced to fewer factors. For example, combining the internalizing and externalizing factors into a single p-factor implies that 72% of the variance in internalizing traits is retained and 28% is lost. For externalizing traits, only 32% of the variance is retained and 68% is lost. Combined, the reduction of two factors to one factor leads to a loss of 96% of the variance (out of 200% total). This implies that the two factors are nearly orthogonal, because reducing two independent factors to one leads to a loss of 50% of the variance in each, or 100% of the 200% total. Thus, rather than supporting the presence of a strong p-factor, Figure 1 actually suggests that there is no strong general factor. This is not surprising when we look at the correlations among the factors. Negative affect (internalizing) correlated weakly with the externalizing factors antagonism, r = .04, and disinhibition, r = .09. These correlations suggest that internalizing and externalizing traits are independent, rather than sharing a common influence of a general pathology factor.
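The loss of variance can be tallied as follows. This is a minimal sketch that uses the retained percentages from the example. The benchmark assumes an equally weighted composite of two orthogonal factors with unit variance, which is a simplification of what a one-factor EFA solution does.

```python
import math

# Share of variance retained when two factors are reduced to one
# (values from the text).
retained = {"internalizing": 0.72, "externalizing": 0.32}
lost = {name: 1 - share for name, share in retained.items()}
total_lost = sum(lost.values())

# Benchmark (assumption): an equally weighted composite of two orthogonal,
# unit-variance factors correlates 1/sqrt(2) with each of them.
r_orthogonal = 1 / math.sqrt(2)
benchmark_lost = 2 * (1 - r_orthogonal ** 2)

print(f"total lost:           {total_lost:.0%} of 200%")
print(f"orthogonal benchmark: {benchmark_lost:.0%} of 200%")
```

The observed loss is close to the loss that would be expected if the two factors had nothing in common.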

Hierarchical Confirmatory Factor Analysis

I used the correlation matrix in Wright et al.’s (2012) Table 1 to build a hierarchical model with CFA. The first model imposed a hierarchical structure with 4 levels and a correlation between the top two factors. It would have been possible to specify a general factor, but the loadings of the two factors on this general factor are not determined. The model had good fit, chi2 (2) = 2.51, CFI = 1.000, RMSEA = .010.
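The reported RMSEA can be checked from the chi-square value, the degrees of freedom, and the sample size. The sketch below uses the common point-estimate formula with N - 1 in the denominator and assumes that the correlation matrix is based on the full sample of N = 2,461; some programs use N instead of N - 1, which makes no difference at three decimals here.

```python
import math


def rmsea(chi2, df, n):
    """Point estimate of RMSEA using the common (n - 1) scaling."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# chi2 and df from the fitted model; n is the sample size reported
# for Wright et al. (2012).
value = rmsea(chi2=2.51, df=2, n=2461)
print(f"RMSEA = {value:.3f}")
```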

The first observation is that the top two factors are only weakly correlated, r = .11. This supports the conclusion that there is no evidence for a general factor of personality pathology that contributes substantially to correlations among specific PID-5 scales. The second observation is that many factors at higher levels are identical to lower level traits. Thus, the observation that there are factors at all levels is illusory. The NA factor at the highest level is practically identical with the NA factor at the lowest level. The duplication of factors at various levels is unnecessary and confusing. Therefore I built a truly hierarchical CFA model that does not specify the number of levels in the hierarchy a priori. This model also had good fit, chi2(df = 2) = 2.51, CFI = 1.000, RMSEA = .010.

The model shows that detachment and negative affect are related to each other by a shared factor (F1-1) that could be interpreted as internalizing. Similarly, Antagonism and Disinhibition share a common factor (F1-2) that could be labeled externalizing. At a higher level, a general factor relates these two factors as well as psychoticism. The loadings on the general factor are high, suggesting that scores on the PID-5 scales are correlated with each other because they share a single common factor. The low correlations between Negative Affect and externalizing are attributed to a negative relationship of the externalizing factor (F1-2) and Negative Affect.

The good fit of these models does not imply that they capture the true nature of the relationships among PID-5 scales. It is also not clear whether the p-factor is a substantive factor or reflects response styles. However, unlike Goldberg’s method, HCFA can be used to test hierarchical models of personality traits. Thus, researchers who are speculating about hierarchical structures need to subject their models to empirical tests with HCFA. Goldberg’s method is metaphorical, unsuitable, and unscientific. It creates the illusion that it reveals hierarchical structures, but it merely shows which variances are lost in models with fewer factors. In contrast, HCFA can be used to test models that aim to explain variance rather than throwing it away.

Personality Disorder Research: A Pathological Paradigm

Every scientist who has read Kuhn’s influential book “The structure of scientific revolutions” might wonder about the long-term impact of their work. According to Kuhn, scientific progress is marked by periods of normal growth and periods of revolutionary change (Stanford Encyclopedia of Philosophy, 2004). During times of calm, scientific research is guided by paradigms. Paradigms are defined as “the key theories, instruments, values and metaphysical assumptions that guide research and are shared among researchers within a field.” Paradigm shifts occur when one or more of these fundamental assumptions are challenged and shown to be false.

Revolutionary paradigm shifts can have existential consequences for scientists who are invested in a paradigm. Just like revolutionary technologies may threaten incumbent technologies (e.g., electric vehicles), scientific research may lose its value after a paradigm shift. For example, while the general principles of operant conditioning hold, many of the specific studies with Skinner boxes lost their significance after the demise of behaviorism. Similarly, the replicability revolution invalidated many social psychological experiments on priming after it became apparent that selective publication of significant results with small samples produces results that cannot be replicated, even though replicability is a hallmark feature of science.

Personality research has seen a surprisingly long period of paradigmatic growth over the past 40 years. In the 1980s, a consensus emerged that many personality traits can be related to five higher-order factors that came to be known as the Big Five. Paradigmatic research on the Big Five has produced thousands of results that show how the Big Five are related to other trait measures, genetic and environmental causes, and life outcomes. The key paradigmatic assumption of the Big Five paradigm is that self-reports of personality are accurate measures of the Big Five traits. Aside from this basic assumption, the Big Five paradigm is surprisingly vague about other common features of paradigms. For example, there does not exist a dominant theory about the nature of the Big Five traits (i.e., what are these dimensions?). It is also unclear why there would be five rather than four or six higher-order traits. Moreover, there is no agreement about the relationship among the Big Five (are they independent or correlated?), or the relationship of specific traits to the Big Five (e.g., is trust related to neuroticism, agreeableness, or both?). These questions can be considered paradigmatic questions that researchers aim to answer by conducting studies with self-report measures of the Big Five.

Research on personality disorders has an even longer history with its roots in psychiatry and psychodynamic theories. However, the diagnosis of personality disorders also witnessed a scientific revolution when clinical psychologists started to examine disorders from the perspective of Big Five theories of normal personality (Millon & Frances, 1987). Psychological research on personality disorders was explicitly framed as developing an alternative model of personality disorders that would replace the model of personality disorders developed by psychiatrists (Widiger & Simonsen, 2005). Currently, the scientific revolution is ongoing and the Diagnostic and Statistical Manual of Mental Disorders lists several approaches to the diagnosis of personality disorders.

The common assumptions of the Personality Disorder Paradigm (PDP) in Clinical Psychology are that (a) there is no clear distinction between normal and disordered personality and that disorders are defined by arbitrary values on a continuum (Markon, Krueger, & Watson, 2005), (b) the Big Five traits account for most of the meaningful variance in personality disorders (Costa & McCrae, 1980), and (c) self-reports of personality disorders are valid measures of actual personality disorders (Markon, Quilty, Bagby, & Krueger, 2013).

Over the past two decades, paradigmatic research within the PDP has examined personality disorders using the following paradigmatic steps: (a) write items that are intended to measure personality disorders or maladaptive traits, (b) administer these items to a sample of participants, (c) demonstrate that these items have internal consistency and can be summed to create scale scores, and (d) examine the correlations among scale scores. These studies typically show that personality disorder scales (PDS) are correlated and that five factors can represent most, but not all, of these correlations (Kotov et al., 2017). Four of these five factors appear to be similar to four of the Big Five factors, but Openness is not represented in factor models of personality disorders and Psychoticism is not represented in the Big Five. This has led to paradigmatic questions about the relationship between Openness and Psychoticism among personality disorder researchers.

Another finding has been that factor analytic models that allow for correlations among factors show replicable patterns of correlations. This finding is surprising because the Big Five factors and measures of the Big Five were developed using factor models that impose independence on factors and items were selected to be representative of these orthogonal factors. The correlations among personality scales have been the topic of various articles in the Big Five paradigm and the personality disorder paradigm. This research has led to hierarchical models of personality with a single factor at the top of the hierarchy (Musek, 2007). In the personality disorder paradigm, this factor is often called “general personality pathology,” the “general factor of personality pathology,” or simply the p-factor (Asadi, Bagby, Krueger, Pollock, & Quilty, 2021; Constantinou et al., 2022; Hopwood, Good, & Morey, 2018; Hyatt et al., 2021; McCabe, Oltmanns, & Widiger, 2022; Oltmanns, Smith, Oltmanns, & Widiger, 2018; Shields, Giljen, Espana, & Tackett, 2021; Uliaszek, Al-Dajani, & Bagby, 2015; Van den Broeck, Bastiaansen, Rossi, Dierckx, De Clercq, & Hofmans, 2014; Widiger & Oltmanns, 2017; Williams, Scalco, & Simms, 2018).

It is symptomatic of a pathological paradigm that researchers within the paradigm have uncritically accepted that the general factor in a factor analysis represents a valid construct and that alternative interpretations of this finding are ignored or dismissed with flawed arguments. Most of the aforementioned articles do not even mention alternative explanations for the general factor in self-ratings. Others mention, but dismiss, the possibility that this general factor at least partially reflects method variance in self-ratings. McCabe et al. (2022) note that “the results of the current study are consistent with, or at least don’t rule out, the social undesirability or evaluation bias hypothesis” (p. 151). They dismiss this alternative explanation with a reference to a single study from 1983 that showed “much of the variance in socially desirability scales was substantively meaningful individual differences (McCrae & Costa, 1983)” (p. 151). Notably, the authors cite several more recent articles that provided direct evidence for the presence of evaluative biases in self-ratings of personality (Anusic, Schimmack, Pinkus, & Lockwood, 2009; Backstrom, Bjorklund, & Larsson, 2009; Chang, Connelly, & Geeza, 2012; Pettersson, Turkheimer, Horn, & Menatti, 2012), but do not explain why these studies do not challenge their interpretation of the general factor in self-ratings of personality disorders.

The strongest evidence for the interpretation of the general factor as a method factor comes from multi-trait-multi-method studies (Campbell & Fiske, 1959). True traits should show convergent validity across raters. In contrast, method factors produce correlations among ratings by the same rater, but not across different raters. Most factors are likely to be a mixture of trait and method variance. Thus, it is essential to quantify the amount of method and trait variance and avoid general statements of validity (Cronbach & Meehl, 1955; Schimmack, 2021). A few studies of personality disorders have used multiple methods. However, most publications have not analyzed these data using a multi-trait-multi-method approach to separate trait and method variance. I could find only one article that modeled multi-method data (Blackburn, Donnelly, Logan, & Renwick, 2004). Consistent with multi-method studies of normal personality, the results showed modest convergent validity across raters and a clear method factor that often explained more variance in self-report scales than the trait factors. However, this finding has been ignored by subsequent researchers. To revisit this issue, I analyzed three multi-method datasets.
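The logic of the multi-trait-multi-method approach can be illustrated with simulated data. The sketch below is not an analysis of any of the datasets discussed here. It assumes two independent traits (both scored in the desirable direction), two raters, and a rater-specific evaluative bias; the weights of .6 for the trait and .5 for the bias are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=5000, trait_w=0.6, halo_w=0.5, seed=1):
    """Simulate ratings of traits A and B by a self-rater and an informant."""
    rng = random.Random(seed)
    error_w = math.sqrt(1 - trait_w ** 2 - halo_w ** 2)
    ratings = {key: [] for key in ("self_A", "self_B", "inf_A", "inf_B")}
    for _ in range(n):
        trait = {"A": rng.gauss(0, 1), "B": rng.gauss(0, 1)}      # independent true traits
        halo = {"self": rng.gauss(0, 1), "inf": rng.gauss(0, 1)}  # rater-specific bias
        for rater in ("self", "inf"):
            for t in ("A", "B"):
                score = (trait_w * trait[t] + halo_w * halo[rater]
                         + error_w * rng.gauss(0, 1))
                ratings[f"{rater}_{t}"].append(score)
    return ratings


data = simulate()
convergent = pearson(data["self_A"], data["inf_A"])   # same trait, different raters
same_rater = pearson(data["self_A"], data["self_B"])  # different traits, same rater
cross_rater = pearson(data["self_A"], data["inf_B"])  # different traits, different raters

print(f"same trait, different raters:      {convergent:.2f}")
print(f"different traits, same rater:      {same_rater:.2f}")
print(f"different traits, different raters: {cross_rater:.2f}")
```

Under these assumptions, the expected correlations are .6^2 = .36 for the same trait across raters, .5^2 = .25 for different traits rated by the same rater, and zero for different traits across raters. A general factor that appears only within raters is therefore a signature of method variance rather than trait variance.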

Study 1

Lisa M. Niemeyer, Michael P. Grosz, Johannes Zimmermann & Mitja D. Back (2022) Assessing Maladaptive Personality in the Forensic Context: Development and Validation of the Personality Inventory for DSM-5 Forensic Faceted Brief Form (PID-5-FFBF), Journal of Personality Assessment, 104:1, 30-43, DOI: 10.1080/00223891.2021.1923522

This study was conducted in Germany with male prisoners. Personality was measured with self-reports and informant ratings by the prisoners’ psychologist or social worker and a penal service officer. This made it possible to separate method and trait variance, where trait variance is defined as variance that is shared among the three raters.

Normal personality was assessed with a 15-item measure of the Big Five. Given the small sample, scale scores rather than items were used as indicators of normal personality. Maladaptive personality was measured with a forensic adaptation of the German version of the PID-5 Faceted Brief Form. This measure has 25 scales. Given the small sample size, the focus was on scales that can serve as indicators of four higher-order factors that are related to neuroticism, extraversion, agreeableness, and conscientiousness. A fifth psychoticism factor could not be identified in this dataset. The scales were Anxiousness, Separation Insecurity, and Depression for Neuroticism (Negative Affectivity), Withdrawal, Intimacy Avoidance, and Anhedonia for low Extraversion (Detachment), Manipulativeness, Deceitfulness, and Grandiosity for low Agreeableness (Antagonism), and Impulsivity, Irresponsibility, and Distractibility for low Conscientiousness (Disinhibition).

There are multiple ways to separate method and trait variance in hierarchical multi-trait-multi-method models. I used Anusic et al.’s (2009) approach that first modeled the hierarchical structure separately for each rater and then defined trait factors at the highest level of the hierarchy. This approach makes it possible to examine the amount of convergent validity for the higher-order factors that are the primary focus in this analysis. Additional agreement for the unique variance in facets was modeled using a bi-factor approach where additional facet factors reflect only the unique variance in facets.

The first model assumed that there are no secondary loadings, no correlations among the four trait factors, and no correlations among the rater-specific indicators. This model had poor fit, CFI = .732, RMSEA = .079.

The second model added correlations among the four Big Five factors. The general personality pathology model predicts correlations among the four factors. Neuroticism should be negatively correlated with the other three factors and the other three factors should be positively correlated. Allowing for these correlations improved model fit, CFI = .754, RMSEA = .076, but overall fit was still poor. Furthermore, the pattern of correlations did not conform to predictions. In particular, neuroticism was positively correlated with agreeableness, r = .24, and extraversion was negatively correlated with agreeableness, r = -.12.

Exploration of the other relationships suggested that secondary loadings accounted for some of the correlations among the trait factors. Namely, Anxiousness and Depression had negative loadings on Extraversion and Anhedonia had a secondary loading on Neuroticism (N-E); Deceitfulness had a secondary loading on Conscientiousness and Impulsivity and Irresponsibility had secondary loadings on Agreeableness (A-C); finally, Impulsivity, Irresponsibility, and Distractibility had secondary loadings on Neuroticism (N-C). Adding these secondary loadings to the measurement model of each rater improved model fit and RMSEA suggested acceptable fit, CFI = .853, RMSEA = .060. In this model, none of the correlations among the trait factors were significant at the .01 level, and the pattern still did not conform to predictions of the g-factor model. However, it was possible to replace the correlations among the four factors with a general factor with a fixed loading pattern. This model had only slightly worse fit, CFI = .850, RMSEA = .060. Loadings on this factor ranged from .19 for extraversion to .46 for conscientiousness. At the same time, a model without correlations or a GFP factor fit the data equally well, CFI = .850, RMSEA = .060.

The next models examined potential method factors. Evaluative bias factors were added for each of the three raters with fixed loadings. This improved model fit, CFI = .854, RMSEA = .059. The standardized loadings for the halo factor in self-ratings ranged from r = .31 (Extraversion) to .67 (Conscientiousness). A general factor could not be identified for one of the informant factors and none of the loadings on the other informant factor were significant at alpha = .01. This suggests that evaluative biases were mostly present in self-ratings. Removing the method factors for the two informants did not change model fit, CFI = .854, RMSEA = .059.

To examine the unconstrained relationships among the self-rating factors, I replaced the method factor with free correlations among the four self-rating factors. Model fit decreased a bit, indicating that the more parsimonious model did not severely distort the pattern of correlations, CFI = .854, RMSEA = .060. The pattern of correlations matched predictions of the halo model, but only one of the correlations was significant at alpha = .01.

In conclusion, model comparisons suggested the presence of an evaluative bias factor in self-ratings of German male prisoners and provided no evidence that a general personality pathology factor produces correlations among measures of normal and maladaptive personality. Of course, these results from a relatively small sample drawn from a unique population cannot be generalized to other populations, but the results are consistent with multi-method studies of normal personality in various samples.

Study 2

Brauer, K., Sendatzki, R., & Proyer, R. T. (2022). Localizing gelotophobia, gelotophilia, and katagelasticism in domains and facets of maladaptive personality traits: A multi-study report using self- and informant ratings. Journal of Research in Personality, 98, Article 104224.

This study examined personality disorders in a German community sample. For each target, one close other provided informant ratings. Personality disorders were assessed with the German version of the brief PID-5 that is designed to measure five higher-order dimensions of personality pathology, namely Negative Affectivity (Neuroticism), Detachment (low Extraversion), Antagonism (low Agreeableness), Disinhibition (low Conscientiousness), and Psychoticism (not represented in the Big Five model of normal personality).

With only two methods, it is necessary to make assumptions about the validity of each rater. A simple way of doing so is to constrain the unstandardized loadings of self-ratings and informant ratings to be equal, under the assumption that self-ratings and informant ratings are approximately equally valid. The first model assumed that there is no method variance and that the five factors are independent. This model had poor fit, CFI = .295, RMSEA = .236. I then allowed the five factors to correlate freely with each other. Model fit improved, but remained low, CFI = .605, RMSEA = .204. The pattern of correlations conformed to the predictions of the general personality pathology model. I then added method factors for self-ratings and informant ratings to the model. Loadings on these factors were constrained to be equal for all five scales. This modification increased model fit considerably, CFI = .957, RMSEA = .067. Loadings on the method factors were substantial (> .4). Furthermore, several of the trait correlations were no longer significant at the .01 level, suggesting that some of these correlations were spurious and reflected unmodeled method variance.

The next model examined whether some of the variance in the method factors reflected an actual g-factor of personality pathology. To do so, I removed the correlations among the trait factors and let the two method factors correlate. A correlation between these method factors can be interpreted as convergent validity for independent measures of the g-factor. This modification produced a reduction in model fit, CFI = .865, RMSEA = .107, and showed a significant correlation between the two method factors, r = .33. This finding suggests that one-third of the variance in these factors may reflect a real g-factor. However, model fit suggests that this model misrepresents the actual pattern of correlations in the data. Exploratory analyses suggested that Extraversion and Psychoticism were negatively related to Conscientiousness and that Psychoticism had a stronger loading on the method factor. Adding these modifications raised model fit to acceptable levels, CFI = .957, RMSEA = .064. In this model the two method factors remained correlated, r = .27, 95%CI = .08 to .44, but the confidence interval shows that a substantial amount of the variance is unique to each method factor, and its lower bound is close to zero. In sum, these results provide further evidence that the general factor in self-ratings of personality pathology partially reflects method variance rather than a general disposition to have many personality disorders.
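The step from r = .33 to "one-third of the variance" may not be obvious, because shared variance is usually obtained by squaring a correlation. The sketch below spells out the reasoning, assuming for illustration that both method factors load equally on a single g-factor and share nothing else. Under that assumption the correlation equals the squared loading, so the correlation itself is the proportion of g-variance in each method factor. The function name is hypothetical.

```python
import math


def g_factor_decomposition(r_between_method_factors):
    """Decompose each method factor under the equal-loading assumption.

    If both method factors load L on a common g-factor and are otherwise
    independent, their correlation is L * L, so r is already a variance share.
    """
    loading = math.sqrt(r_between_method_factors)
    g_share = loading ** 2
    return {
        "loading_on_g": loading,
        "g_share": g_share,
        "method_specific_share": 1.0 - g_share,
    }
```

For r = .33 this gives a g-share of .33 and a method-specific share of .67, which is why most of the variance in each method factor remains unique to the rater even when the correlation is significant.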

Study 3

Oltmanns, J. R., & Widiger, T. A. (2021). The self- and informant-personality inventories for ICD-11: Agreement, structure, and relations with health, social, and satisfaction variables in older adults. Psychological Assessment, 33(4), 300–310.

The data of this study are from a longitudinal study of personality disorders. Participants nominated close relatives who provided informant ratings. Personality disorders were measured using the Personality Inventory for ICD-11 and an informant version of the same questionnaire. The questionnaire assesses 5 dimensions. Four of these dimensions correspond to the Big Five, namely Negative Affectivity (Neuroticism), Detachment (low Extraversion), Dissociality (low Agreeableness), and Disinhibition (low Conscientiousness). The fifth dimension is called Anankastia, which may be related to a maladaptive form of high Conscientiousness (e.g., Perfectionism).

I used the published MTMM matrix in Table 2 to examine the presence of a general personality factor and method variance. Like the original authors, I fitted a four-factor model with Disinhibition and Anankastia as opposite indicators of a bipolar Conscientiousness factor. The four factors were clearly identified, but the model without method factors and correlations among the four factors did not fit the data, CFI = .279, RMSEA = .243. Allowing for correlations among the four factors improved model fit, but fit remained poor, CFI = .443, RMSEA = .232. The pattern of correlations was consistent with the p-factor predictions. The next model added method factors for self-ratings and informant ratings. All loadings except those for the Anankastia scales were fixed. The Anankastia loadings were free because high conscientiousness is desirable and should load less on a factor that reflects undesirable content in other scales. The inclusion of method factors improved model fit, CFI = .844, RMSEA = .131, but fit was not acceptable. All scales except the Anankastia scales had notable (> .4) loadings on the method factors. Only the trait correlation between agreeableness and conscientiousness was significant at alpha = .01, r = .22. The next model removed all of the other correlations and allowed for a correlation between the two method factors. This model had similar fit, CFI = .845, RMSEA = .123. The correlation between the two method factors was r = .24. Exploratory analysis showed rater specific correlations between the Disinhibition and Anankastia (i.e., low and high conscientiousness) scales. Adding these parameters to the model improved model fit, CFI = .920, RMSEA = .091, but did not alter the correlation between the two method factors, r = .26. Freeing the loading of the Negative Affectivity scale on the method factors further improved model fit, CFI = .948, RMSEA = .076, but did not alter the correlation between the two method factors. 
Freeing the loading of the Disinhibition scales on the method factors further improved model fit, CFI = .972, RMSEA = .058, but the correlation between the two method factors remained the same, r = .24. The 95% confidence interval ranged from .14 to .33. These results are consistent with Study 2.

General Discussion

Factor analyses of personality disorder questionnaires have suggested the presence of one general factor that predicts higher scores on all measures of maladaptive personality traits. A major limitation of these studies was the reliance on a single method, typically self-reports. Mono-method studies are unable to distinguish between method and trait variance (Campbell & Fiske, 1959). Although a few studies have used multiple methods to measure personality traits, personality disorder researchers did not analyze these data with multi-method models. I provide results of multi-method modeling of three datasets. All three datasets show a method factor in self-reports of personality disorders that is either independent of informant ratings by observers (Study 1) or only weakly related to informant ratings by close others (Studies 2 and 3). These results are consistent with the presence of method factors in self-ratings of normal personality (Anusic et al., 2009; Biesanz & West, 2004; Chang et al., 2012). It is therefore not surprising that the same results were obtained. However, it is surprising that personality disorder researchers have ignored this evidence. The reason does not appear to be a lack of awareness that multi-method data are important. For example, nearly a decade ago several prominent personality disorder researchers noted that “because most of these studies (including our own study) are based on self-report measures, substantial parts of the multitrait-multimethod matrix currently remain unexplored” (Zimmermann, Altenstein, Krieger, Holtforth, Pretsch, Alexopoulos, Spitzer, Benecke, Krueger, Markon, & Leising, 2014). One possible explanation for the lack of multi-method analyses might be that they threaten the construct validity of personality disorder instruments if a substantial portion of the variance in personality disorder scales reflects method factors.
Nearly two decades ago, a multi-method model of personality showed that method factors explained more variance than trait factors (Blackburn, Donnelly, Logan, & Renwick, 2004). However, this article received only 43 citations, whereas articles that interpret method variance as a general trait have garnered over 1,000 citations (Kotov et al., 2017). The uncritical reliance on self-ratings reveals a pathological paradigm that is built on false assumptions. Self-reports of personality disorders are not highly valid measures of personality disorders. Even if scale scores are internally consistent and reliable over time, they cannot be accepted as unbiased measures of actual pathology. The limitations of self-reports are well known in other domains and have led to reforms in clinical assessments of other disorders. For example, the DSM-5 now explicitly states that the diagnosis of Attention Deficit Hyperactivity Disorder (ADHD) requires assessment with symptom ratings by at least two raters (Martel, Schimmack, Nikolas, & Nigg, 2015). The present results show the importance of a multi-rater approach to assess personality disorders.

In contrast, the assessment of personality disorders in the DSM-5 is not clearly specified. A traditional system is still in place, but two alternative approaches are also mentioned. One approach is called Criterion A. It is based on the assumption that distinct personality disorders are highly correlated and that it is sufficient to measure a single dimension that is assumed to reflect severity of dysfunction (Sharp & Wall, 2021). A popular self-report measure of Criterion A is the Levels of Personality Functioning Scale (LPFS; Morey et al., 2011). Aside from conceptual problems, it has been shown that the LPFS is nearly identical to measures of evaluative bias in self-ratings of normal personality (Schimmack, 2022). The present results show that most of this variance is rater-specific and reflects method factors. Thus, there is currently no evidence for the construct validity of measures of general personality functioning. Unless such evidence can be provided with multi-method data, Criterion A should be removed from the next version of the DSM.

Meta-Psychological Reflections

Psychology is not a paradigmatic science that rests on well-established assumptions. In contrast, psychology is best characterized as a collection of mini-paradigms that are based on assumptions that are not shared across paradigms. For example, many experimentalists would reject evidence based on cross-sectional studies of self-reports. In contrast, research on personality disorders rests nearly entirely on the assumption that self-reports provide valid information about personality disorders. While most researchers would probably acknowledge that method factors exist, there are no scientific attempts to assess and minimize their impact. Instead, method effects are minimized by false appeals to the validity of self-reports. For example, without any empirical evidence and in total disregard of existing evidence, Widiger and Oltmanns (2017) state “It is evident that most persons are providing reasonably accurate and honest self-descriptions. It would be quite unlikely that such a large degree of variance would reflect simply impression management” (p. 182). The need for a multi-method assessment is often acknowledged in the limitation section and delegated to future research that is never conducted, even if these data are available (Asadi, Bagby, Krueger, Pollock, & Quilty, 2021). These examples are symptomatic of a pathological paradigm in need of a scientific revolution. Unlike other pathological paradigms in psychology, the assessment of mental illnesses has huge practical implications that require a thorough examination of the assumptions that underpin clinical diagnoses. At present, the diagnosis of personality disorder is not based on scientific evidence and claims about validity of personality disorder measures are unscientific. Most likely a valid assessment of personality disorders requires a multi-method approach that follows the steps outlined by Cronbach and Meehl (1955). To make progress, clinical psychologists need better training in psychometrics and journals need to prioritize multi-method studies over quick and easy studies that rely exclusively on self-reports with online samples (e.g., Hopwood et al., 2018). Most importantly, it is necessary to overcome the human tendency to confirm prior beliefs and to allow empirical data to disconfirm fundamental assumptions. It is also important to listen to scientific criticism, even if it challenges fundamental assumptions of a paradigm. Too often scientists take these criticisms as personal attacks because their self-esteem and identity are wrapped up in their paradigmatic achievements. As a result, threats to the paradigm become existential threats. A healthy distinction between self and work is needed to avoid defensive reactions to valid criticism. Clinical psychologists should be the first to realize the importance of this recommendation for the long-term well-being of their science and their own well-being.

The Evaluative Factor in Self-Ratings of Personality Disorder: A Threat to Construct Validity


This blog post reanalyzes data from a study that examined the validity of the Levels of Personality Functioning Scale and the Personality Inventory for DSM-5. I show that the halo-Big Five model fits the correlations among the 25 DSM-5 scales and that the halo factor shows high convergent validity with the Levels of Personality Functioning factor. Whereas personality disorder researchers interpret the general factor in self-report measures of personality functioning as a broad disposition, research on this factor with measures of normal personality suggests that it reflects a response style that is unique to self-ratings. While the evidence is not conclusive, it is problematic that personality disorder researchers ignore the potential contribution of response styles in the assessment of personality disorders with self-reports.


Concerns about the validity of self-reports are as old as self-reports themselves. Some psychologists distrust self-reports so much that they interpret low correlations between self-ratings and behavioral measures as evidence that the behavioral measure is valid (Greenwald et al., 1998). On the other hand, other psychologists often uncritically accept self-reports as valid measures (Baumeister, Campbell, Krueger, & Vohs, 2003). The uncritical acceptance of self-reports may be traced back to the philosophy of operationalism in psychology. Accordingly, constructs are defined by methods as in the infamous saying that intelligence is whatever an IQ test measures. Similarly, personality traits like extraversion might be operationalized by self-reports of personality. Accordingly, extraversion is whatever a self-report measure of extraversion measures.

Most psychometricians today would reject operationalism and distinguish between constructs and measures. As a result, it is possible to critically examine whether a measure measures the construct it was designed to measure. This property of a measure is called construct validity (Cronbach & Meehl, 1955). From this perspective, it is possible that an IQ test may be a biased measure of intelligence or that a self-report measure of extraversion is an imperfect measure of extraversion. To examine construct validity, it is necessary to measure the same construct with multiple (i.e., at least two) independent methods (Campbell & Fiske, 1959). If two independent measures measure the same construct, they should be positively correlated. This property of measures is called convergent validity.

The most common approach to measure personality with multiple methods is to ask acquaintances to provide informant reports of personality. This approach has been used to demonstrate that self-ratings of many personality traits have convergent validity with reports by others (Connelly & Ones, 2010). The same method has also been used to demonstrate convergent validity for the measurement of maladaptive personality traits (Markon, Quilty, Bagby, & Krueger, 2013). These studies also show that convergent validity is lower than the reliability of self-ratings and informant ratings. This finding indicates that some of the reliable variance in these ratings is method variance (Campbell & Fiske, 1959). To increase the validity of self-ratings of personality, it is necessary to examine the factors that produce method variance and minimize their contribution to the variance in self-ratings.
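A rough illustration of this logic, assuming that self-ratings and informant ratings are equally valid and that their method factors are independent: the convergent correlation then estimates the share of valid trait variance, and the gap between reliability and that correlation estimates reliable method variance. The function name and the numbers in the usage note are hypothetical and not taken from the cited studies.

```python
def decompose_rating_variance(reliability, convergent_r):
    """Split the variance of a rating into trait, method, and random-error shares.

    Assumes two equally valid raters with independent method factors, so the
    self-informant correlation equals the share of valid trait variance.
    """
    if not 0.0 <= convergent_r <= reliability <= 1.0:
        raise ValueError("expected 0 <= convergent_r <= reliability <= 1")
    return {
        "trait": convergent_r,                  # variance shared across raters
        "method": reliability - convergent_r,   # reliable but rater-specific
        "error": 1.0 - reliability,             # unreliable variance
    }
```

For example, `decompose_rating_variance(0.80, 0.45)` attributes .45 of the variance to the trait, .35 to method factors, and .20 to random error.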

Research on the unique variance in self-ratings of personality has demonstrated that a large portion of this variance reflects a general evaluative factor (Anusic, Schimmack, Lockwood, & Pinkus, 2009). This factor is present in self-ratings and informant ratings, but does not show convergent validity across raters (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). Moreover, it is related to other measures of self-enhancement such as inflated ratings of desirable traits like attractiveness and intelligence (Anusic et al., 2009). It also predicts self-ratings of well-being, but not informant ratings of well-being (Kim, Schimmack, & Oishi, 2012; Schimmack & Kim, 2012), suggesting that it is not a substantive trait. Finally, using personality items that are less evaluative reduces correlations among personality factors (Bäckström & Björklund, 2020). Taken together, these findings suggest that self-reports are influenced by the desirability of traits and that a consistent bias produces artificial correlations between items. This bias is often called halo bias (Thorndike, 1920) or socially desirable responding (Campbell & Fiske, 1959).
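A small simulation can make the idea of artificial correlations concrete. The sketch below is a toy model, not a reanalysis of any cited dataset: two truly independent traits are rated by a self-rater and an informant, each with their own independent halo factor. All names, sample sizes, and variance values are assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_halo(n=5000, halo_sd=0.7, noise_sd=0.7, seed=1):
    """Simulate ratings of two independent traits with rater-specific halo."""
    rng = random.Random(seed)
    self_e, self_c, inf_e, inf_c = [], [], [], []
    for _ in range(n):
        e, c = rng.gauss(0, 1), rng.gauss(0, 1)   # independent true traits
        halo_self = rng.gauss(0, halo_sd)         # evaluative bias of the self-rater
        halo_inf = rng.gauss(0, halo_sd)          # independent bias of the informant
        self_e.append(e + halo_self + rng.gauss(0, noise_sd))
        self_c.append(c + halo_self + rng.gauss(0, noise_sd))
        inf_e.append(e + halo_inf + rng.gauss(0, noise_sd))
        inf_c.append(c + halo_inf + rng.gauss(0, noise_sd))
    return {
        "same_rater_different_trait": pearson(self_e, self_c),
        "different_rater_same_trait": pearson(self_e, inf_e),
        "different_rater_different_trait": pearson(self_e, inf_c),
    }
```

In this toy model, the two traits correlate within a rater only because they share that rater's halo factor, whereas the correlation between different traits across raters is expected to be close to zero. That is the pattern a multi-rater design is meant to detect.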

It seems plausible that socially desirable responding is an even bigger problem for the use of self-reports in the measurement of personality disorders that are intrinsically undesirable (i.e., nobody wants to have a disorder). Yet, researchers of personality disorders have largely ignored the possibility that socially desirable responding biases self-ratings of personality. Rather, they have interpreted the presence of a general evaluative factor as evidence for a substantive factor that is either interpreted as a broad risk factor in a hierarchical model of factors that contribute to personality disorders (Morey, Krueger, & Skodol, 2013) or as an independent factor that reflects severity of personality disorders (Morey, 2017). These substantive interpretations have been challenged by evidence that the general factor in self-reports of personality disorders is highly correlated with the halo factor in self-ratings of normal personality (McCabe, Oltmanns, & Widiger, 2022). Using existing data, I was able to show that the halo factor in self-ratings of the Big Five personality factors was highly correlated with the general factor in the Levels of Personality Functioning Scale (Morey, 2017), r = .88 (Schimmack, 2022a), and the general factor in the Computerized Adaptive Test of Personality Disorders (CAT-PD), r = .94 (Schimmack, 2022b). In addition, the general factor in the Levels of Personality Functioning items is highly correlated with the general factor in the CAT-PD items, r = .86. These results suggest that the same factor contributes to correlations among self-ratings of personality and that this factor reflects the desirability of the items.

In this post, I extend this investigation to another measure of maladaptive personality traits, namely the Personality Inventory for DSM-5 (PID-5; Krueger, Derringer, Markon, Watson, & Skodol, 2012). I also provide further evidence about the amount of variance in PID-5 scales that is explained by the general factor. McCabe et al.’s (2022) findings suggested that a large amount of the variance in some scales reflects mostly the general factor. For example, the general factor explained over 60% of the variance in Perceptual Dysregulation, Unusual Beliefs, Deceitfulness, Irresponsibility, Distractibility, and Impulsivity. If these self-report measures were used to diagnose personality disorders, it is vital to examine whether this variance reflects substantive problems or a mere response style to agree or disagree with desirable items.

I used Hopwood et al.’s (2018) data from Study 2 to fit a model to the correlations among the 25 PID-5 scales. The dataset also included ratings for the Levels of Personality Functioning Scale (Morey, 2017). Based on previous analyses, I used 10 items to validate the general factor in the PID-5 (Schimmack, 2022a). To ensure robustness of the model, I fitted the same model to two random splits of the full dataset and retained only parameters that were statistically significant across both models. The final model had acceptable fit, CFI = .928, RMSEA = .06, and better fit than exploratory factor analyses of the PID-5 (Markon, Quilty, Bagby, & Krueger, 2013).
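The split-sample logic can be sketched as follows. This is a generic illustration, not the code used for the analysis: the helper names are hypothetical, the p-values would come from fitting the same model separately to each half in whatever SEM software is used, and the significance threshold of .01 is an assumption because the criterion for this step is not stated.

```python
import random


def random_split(case_ids, seed=0):
    """Shuffle case identifiers and split them into two halves."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def retain_replicated(p_values_a, p_values_b, alpha=0.01):
    """Keep only parameters that are significant in both halves.

    Each argument maps a parameter name to its p-value in one half.
    """
    return sorted(
        name
        for name in p_values_a
        if name in p_values_b
        and p_values_a[name] < alpha
        and p_values_b[name] < alpha
    )
```

A parameter that reaches significance in only one half is dropped, which guards against capitalizing on chance when exploratory modifications are added to a model.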

The main finding was that the general factor in the PID-5 correlated r = .837, SE = .016, with the factor based on the 10 LPFS items. This finding supports the hypothesis that the same, or at least highly similar factors, influence self-ratings on measures of normal personality and maladaptive personality. Table 1 shows the loadings of the 25 PID-5 scales on the general factor and on five additional factors that are likely to correspond to the Big-Five factors of normal personality. It also shows the contribution of unique factors to each scale that may be valid unique variance of dysfunctional personality.

The results replicate McCabe et al.’s (2022) finding that all PID-5 scales load on a general factor. Although the loadings are not as high, they are substantial and 23 out of the 25 loadings are above .5. Table 1 also shows that all scales have unique variance that is not explained by the factors of the halo-Big5 model. Eighteen of the 25 loadings on the uniqueness factor are above .5. Finally, the loadings on the Big Five factors are consistent with factor analyses of the PID-5, but the magnitude of these loadings is relatively modest. Only 7 of the 25 loadings are above .5, and only 5 of the 25 scales have a higher loading on a Big-Five factor than on the general factor. These results are consistent with a similar analysis of the CAT-PD scales (Schimmack, 2022b).
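To see how loadings translate into explained variance, the following sketch decomposes the variance of a single scale. It assumes standardized loadings, independent (orthogonal) factors, and a single Big Five loading per scale; the function name and the example loadings are hypothetical and are not values from Table 1.

```python
import math


def variance_shares(general_loading, big_five_loading):
    """Variance of one standardized scale split across independent factors."""
    general = general_loading ** 2
    big_five = big_five_loading ** 2
    unique = 1.0 - general - big_five
    if unique < 0.0:
        raise ValueError("squared loadings cannot sum to more than 1")
    return {
        "general": general,
        "big_five": big_five,
        "unique": unique,
        "unique_loading": math.sqrt(unique),  # loading on the uniqueness factor
    }
```

For a scale with a general-factor loading of .6 and a Big Five loading of .4, the general factor explains 36% of the variance, the Big Five factor 16%, and 48% remains unique, which corresponds to a uniqueness loading of about .69.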

In conclusion, self-reports of maladaptive personality that have been proposed as instruments for clinical diagnoses of personality disorders are strongly influenced by a general factor that is common to different instruments such as the Levels of Personality Functioning Scale (Morey, 2017), the Computerized Adaptive Test of Personality Disorders (CAT-PD; Simms, Goldberg, Roberts, Watson, Welte, & Rotterman, 2011), and the Personality Inventory for DSM-5 (PID-5; Krueger, Derringer, Markon, Watson, & Skodol, 2012).


Concerns about the influence of response styles on self-ratings are as old as self-reports. Campbell and Fiske (1959) demonstrated that self-ratings of personality traits are more highly correlated with each other than with informant ratings of the same traits. Confirmatory factor analyses of multi-trait-multi-rater data revealed the presence of an evaluative factor that is largely independent across different raters (Anusic et al., 2009). This factor has several names, but it reflects the desirability of items independent of their descriptive meaning. Clinical researchers interested in personality disorder also observed a general factor, but interpreted it as a substantive factor. As I demonstrated in several studies, the halo factor in ratings of normal personality is strongly correlated with the general factor in self-report instruments to diagnose personality disorders. This finding challenges the prevalent interpretation of the general factor as a valid dimension of personality disorder. At a minimum, the results suggest that ratings of maladaptive and undesirable traits are influenced by socially desirable responding. Despite the long history of research on social desirability, researchers who study personality disorders have downplayed or ignored this possibility. For example, McCabe et al. (2022) dismiss the response style explanation with the argument that “it is more likely that persons are providing reasonably forthright self-descriptions,” while ignoring the finding that the evaluative factor in self-ratings lacks convergent validity with informant ratings (Anusic et al., 2009), even though they mention earlier that Anusic et al.’s (2009) results provided support for this hypothesis. They also admit that their results “are consistent with, or at least don’t rule out, the social undesirability or evaluative bias hypothesis” (p. 151), but then conclude that “some persons do indeed have many undesirable traits whereas other persons have many desirable traits” (p. 151) without citing any evidence for this claim. In fact, it is much more common for respondents to have none of the personality disorders (44%; a disorder being defined as a score in the upper 10% of a scale), whereas few participants have more than 10 disorders (5%). This asymmetry is more consistent with a response style that attenuates scores on undesirable traits than with a broad disposition that makes some people have many undesirable traits.

Markon et al. (2013) advocate the use of informant ratings to control for response styles in self-ratings on the PID-5, but do not use the informant ratings to examine the presence of biases in self-ratings. More broadly, numerous articles claim to examine the validity of personality disorder instruments and all of these articles conclude that these instruments are valid (Krueger et al., 2012; Long, Reinhard, Sellbom, & Anderson, 2021; Morey, 2017; Simms et al., 2011). Some authors are also inconsistent across articles. For example, Ringwald, Manuck, Marsland, and Wright (2022) note that “many studies suggest it [the general factor] is primarily the product of rater-specific variance” (p. 1316), but Ringwald, Emery, Khoo, Clark, Kotelnikova, Scalco, Watson, Wright, and Simms (2022) neither model the general factor, nor mention that response styles could influence scores on the CAT-PD scales. Evidence that the general factor in personality disorder instruments is strongly correlated with the evaluative factor in ratings of normal personality requires further investigation. Claims that personality disorder measures are valid are misleading and fail to acknowledge the possibility that response styles produce method variance. The presence of method variance does not invalidate measures because validity is a quantitative construct (Cronbach & Meehl, 1955). Markon et al. (2013) demonstrate convergent validity between self-reports and informant ratings of the PID-5 traits. Thus, there is evidence that self-ratings have some validity. The goal of future validation research should be to identify method factors and develop revised measures with higher validity, just as personality researchers are trying to reduce the evaluative bias in measures of normal personality (Bäckström & Björklund, 2020; Wood, Anglim, & Horwood, 2022). However, this might be more difficult for measures of disorders because disorders are intrinsically undesirable. Thus, it may be necessary to use statistical controls or a multi-rater assessment to increase the validity of self-report instruments designed to measure maladaptive traits.

Meanwhile, personality disorder researchers continue to disregard the possibility that a large portion of the variance in self-report measures is merely a response style and make claims about construct validity based on inappropriate methods to separate valid construct variance from method variance (e.g., Widiger & Crego, 2019). Most of the claims that personality disorder instruments are valid are based on correlations of one self-report measure with another or evidence that factor analyses of personality disorder scales have a similar factor structure to factor analysis of normal personality traits (i.e., the Big Five). Neither finding warrants the claim that maladaptive personality scales measure maladaptive personality traits. Instead, the finding that the halo-Big Five model can be fitted to correlations among personality disorder scales suggests that these scales merely have more evaluative content and are more strongly influenced by socially desirable responding. Multi-method evidence is needed to demonstrate that the general factor reflects a substantive trait and that specific traits are maladaptive; that is, produce intrapersonal or interpersonal problems for individuals with these traits. For now, claims about the validity of personality disorder instruments are invalid because they fail to meet basic standards of construct validity and fail to quantify the amount of method variance in these scales (Campbell & Fiske, 1959; Cronbach & Meehl, 1955).

Validity of the Computerized Adaptive Test of Personality Disorders (CAT-PD)


Clinical psychologists have developed alternative models of personality disorders to the traditional model that was developed by psychiatrists in the 20th century. The new model aims to integrate modern research on normal personality with clinical observations of maladaptive traits. The alternative model aims to explain why specific personality disorders are related, which is called co-morbidity in the medical literature. Ringwald et al. (2022) tested this assumption with the Computerized Adaptive Test of Personality Disorders (CAT-PD). They fitted a model with six correlated factors to the covariance matrix among the 33 CAT-PD scales. Four of these six factors overlapped with Big Five factors. They concluded that their study supports “the validity of the CAT-PD for assessing multiple levels of the pathological trait hierarchy.” I fitted a model of normal personality to the data. This model assumes that self-ratings of personality are influenced by six independent traits, the Big Five and a general evaluative factor called halo. I was able to identify all six factors in the CAT-PD, although additional relationships among the 33 scales were also present. I cross-validated this model and showed high (r > .8) correlations of the factors with factors in a Big Five questionnaire. I also show that Big Five factors explain only a modest amount of variance in most CAT-PD scales. Based on these results, I conclude that these factors reflect normal variation in personality rather than a distinct level in a hierarchical model of pathological traits. Rather, the Big Five traits are normal traits that are risk factors for specific types of personality disorders, but extreme levels of a normal trait are normal and not pathological. Furthermore, a large portion of the variance in self-ratings of traits is method variance. Thus, valid assessment of personality disorders requires a multi-rater approach.


The notion of personality disorders has a long history in psychiatry that is based on clinical observations and psychoanalytic theories. It is currently recognized that the old system to diagnose personality disorders is no longer compatible with modern theories of personality, but there is no consensus among clinical psychologists and psychiatrists about the definition and assessment of personality disorders. This confusing state of affairs is reflected in the presence of several competing conceptualizations of personality disorders in the Diagnostic and Statistical Manual of Mental Disorders (DSM-5).

Simms et al. (2011) introduced the Computerized Adaptive Test of Personality Disorders (CAT-PD) as one potential model of personality disorders. The CAT-PD aims to measure 33 maladaptive personality traits (CAT-PD-SF). In this blog post, I take a critical look at the claim that the CAT-PD is capable of measuring personality disorders at various levels in a hierarchical model of personality functioning (Ringwald, Emery, Khoo, Clark, Kotelnikova, Scalco, Watson, Wright, & Simms, 2022).

The notion of a hierarchy of disorders implies that the 33 dimensions of the CAT-PD measure distinct disorders, where more extreme levels on these dimensions indicate higher levels of personality dysfunction. Correlations among the scales measuring the 33 dimensions suggest that they share common causes. These common causes would explain why some primary disorders covary (i.e., comorbidity in the terminology of categorical diagnoses). They may also reflect broader dimensions of disorders. Ringwald et al. (2022) used confirmatory factor analysis to test this hypothesis. They tested one model with five factors and another model with six factors. The six-factor model had better fit. Thus, I focus on the six-factor model. The six factors are called Negative Affectivity, Detachment, Disinhibition, Antagonism, Psychoticism, and Anankastia.

Table 1 shows the CAT-PD scales with the highest loadings on these factors.

Table 1
Negative Affectivity: Anxious, Affect Lability
Detachment: Social Withdrawal, Anhedonia
Antagonism: Callousness, Domineering
Disinhibition: Non-Planfulness, Irresponsibility
Psychoticism: Unusual Experiences, Unusual Beliefs
Anankastia: Workaholism, Perfectionism

The CFA model imposed no restrictions on the correlations among the six factors and an inspection of the correlation matrix showed that the six factors are correlated to varying degrees (Table 2).

Although some of these correlations are moderate to strong, the results are consistent with the assumption that all six dimensions reflect different constructs (discriminant validity). The authors discuss the surprising finding that Disinhibition (e.g., Non-Planfulness) and Anankastia (e.g., Perfectionism) appear as independent factors. This would suggest that some people have both disorders although they seem to be related to low and high conscientiousness, respectively. Correlations with an independent measure of conscientiousness show that Disinhibition is negatively correlated with conscientiousness, r = -.64, but Anankastia is only weakly correlated with conscientiousness, r = .11. There are two explanations for these results. One explanation is that there is a general factor of personality functioning that influences all personality disorders in the same direction, even if they seem to express themselves in opposite ways. That is, general dysfunction increases the risk of being both irresponsible and perfectionistic. An alternative explanation is that self-reports of personality disorders are influenced by the same response styles that influence self-ratings of normal personality (Anusic et al., 2009). Either interpretation implies that a general factor contributes to the pattern of correlations among the 33 CAT-PD scales. The authors discuss this issue in their limitations section.

“A challenge facing modeling bipolar factors is the shared impairment that generally creates positive manifolds in the correlations among maladaptive scales. This can be circumvented with separate modeling of impairment in a distress or dysfunction factor, as has long been done in the IIP-SC (Alden et al., 1990) but has not been attempted in any comprehensive way in a published five- or six-factor pathological trait inventory.” (p. 26)

It is not clear why the authors did not fit this model to their data. I pursued this avenue of future research based on measurement models of normal personality. Accordingly, it is possible to distinguish a general evaluative factor (halo) that produces positive correlations among desirable traits and the Big Five as largely independent factors. I refer to this model as the Halo-Big-Five model. Four of the Big Five factors are closely related to, if not identical with, four of the CAT-PD factors: Neuroticism corresponds to Negative Affectivity, Introversion (low Extraversion) corresponds to Detachment, low Agreeableness corresponds to Antagonism, and low Conscientiousness corresponds to Disinhibition. Openness is not strongly related to personality disorder, but the item content of the Fantasy Proneness scale (e.g., “I sometimes get lost in daydreams”) and the Peculiarity scale (e.g., “I am considered to be eccentric”) might be related to Openness. Furthermore, the correlation between the two scales was high, r = .588. This was the highest correlation for both scales, except for equally high correlations with the Cognitive Problems scale (e.g., “I often space out and lose track of what’s going on.”), r = .64, .57. Thus, these scales were used as markers of an Openness to Experience factor. Modification indices were used to allow for additional loadings on these predefined theoretical factors, but model fit remained lower than the fit of the six-factor model, suggesting additional factors were present. I split the dataset into random halves and created a model that generalized across the two halves. The model is a lot simpler than the six-factor model (405 vs. 345 degrees of freedom) because it did not use free parameters for theoretically unimportant and small parameters. This explains why the model has superior fit to the six-factor model for fit indices that take simplicity into account.
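The difference in degrees of freedom can be translated into a difference in the number of free parameters. The following is a minimal sketch, assuming that both models were fitted only to the covariance matrix of the 33 scales (no mean structure), so that the number of unique variances and covariances is p(p + 1)/2. The function names are illustrative and not taken from any software that was used for the analyses.

```python
def unique_moments(n_variables: int) -> int:
    """Number of unique variances and covariances among n_variables."""
    return n_variables * (n_variables + 1) // 2


def free_parameters(n_variables: int, degrees_of_freedom: int) -> int:
    """Free parameters implied by the reported degrees of freedom.

    Assumption: df = unique moments - free parameters (covariance structure only).
    """
    return unique_moments(n_variables) - degrees_of_freedom


if __name__ == "__main__":
    n_scales = 33
    halo_big_five = free_parameters(n_scales, 405)  # model with 405 df
    six_factor = free_parameters(n_scales, 345)     # model with 345 df
    print("unique moments:", unique_moments(n_scales))
    print("free parameters (405 df):", halo_big_five)
    print("free parameters (345 df):", six_factor)
    print("difference:", six_factor - halo_big_five)
```

Under this assumption, the model with 405 degrees of freedom uses 60 fewer free parameters than the model with 345 degrees of freedom, which is why fit indices that reward parsimony can favor it.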

The main limitation is the lack of empirical evidence that these factors correspond to the factors that can be found with self-ratings of normal personality traits. To examine this question, I used a dataset that included ratings of normal personality on the Big Five Inventory-2 (BFI-2) and ratings on the Levels of Personality Functioning Scale (Hopwood et al., 2018). I first fitted the halo-Big Five model to the covariances among the 33 CAT-PD scales. Overall model fit was lower, indicating some differences between the datasets, but it was still acceptable, CFI = .939, RMSEA = .060. All parameters replicated as statistically significant with alpha = .05. Thus, the model shows some generalizability across datasets. Then I combined this model with a prior model of the Big Five and a Level of Personality Functioning factor (Schimmack, 2022). Given the large number of items, I further simplified the model by using the BFI-2 facet scales as indicators for the Big Five factors. This reduced the number of normal personality variables from 60 items to 15 scales. This model had acceptable fit, CFI = .902, RMSEA = .066.
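For readers who are less familiar with the fit indices reported here, the following sketch shows the standard textbook formulas for CFI and RMSEA. The chi-square values and the sample size in the example are hypothetical, because these statistics are not reported in this post, and some programs use N rather than N - 1 in the RMSEA denominator.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation (N - 1 variant)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model: float, df_model: int, chi2_base: float, df_base: int) -> float:
    """Comparative fit index relative to a baseline (independence) model."""
    misfit_model = max(chi2_model - df_model, 0.0)
    misfit_base = max(chi2_base - df_base, misfit_model, 0.0)
    if misfit_base == 0.0:
        return 1.0  # neither model shows misfit beyond its degrees of freedom
    return 1.0 - misfit_model / misfit_base


if __name__ == "__main__":
    # Hypothetical inputs for illustration only; not the values from the analyses.
    print(round(rmsea(chi2=800.0, df=400, n=101), 3))
    print(round(cfi(chi2_model=505.0, df_model=405, chi2_base=1528.0, df_base=528), 3))
```

Both indices are based on the excess of the chi-square statistic over its degrees of freedom, which is why a model with more degrees of freedom can have better values on these indices even when its raw chi-square is larger.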

I combined this model with the CAT-PD model by allowing the general factors and the corresponding Big Five traits to correlate. In addition, I allowed for correlated residuals between facets and related CAT-PD scales. For example, I allowed for unique relations between the Depressiveness facet in the BFI-2 and the Depression scale of the CAT-PD. The combined model retained acceptable fit, CFI = .892, RMSEA = .058. The key finding was that the CAT-PD factors were highly correlated with the BFI-2 factors and the general CAT-PD factor was highly correlated with the LPFS factor (Table 3).

These results provide empirical evidence for the interpretation of the CAT-PD factors as the Big Five factors of normal personality. As a result, it is possible to describe the variance in CAT-PD scales as a function of variation in (a) a general factor that reflects desirability of a trait, (b) variance that is explained by variation in normal personality, and (c) residual variance that may reflect maladaptive expressions of normal personality. Table 4 shows how much these different factors contribute to variance in the 33 CAT-PD scales.

Note. Only effect sizes > .2 (4% explained variance) are shown.

The general factor makes a strong contribution to most of the 33 CAT-PD scales. Twenty-one of the 33 effect sizes are larger than .6 (36% explained variance), and only 4 effect sizes are below .4 (16% explained variance), namely Exhibitionism, .33, Romantic Disinterest, .28, Perfectionism, .37, and Workaholism, .32. In comparison, the effect sizes for the Big Five traits are more moderate. Only 3 effect sizes reach .6, namely Anxiousness (N, .60), Exhibitionism (E, .68), and Social Withdrawal (E, -.66). As a result, most CAT-PD scales have a substantial amount of unique variance that is not explained by the general or the Big Five factors. Nineteen of the 33 effect sizes for the unique variance were above .6 (36% explained variance), and not a single effect size was below .4 (16% explained variance). Although these effect sizes may be inflated by random and systematic measurement error, the results suggest that the constructs that are measured with the CAT-PD scales are related, but not identical, to the factors that produce variation in normal personality.
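The conversion between effect sizes and explained variance used in this section is simply the square of the standardized effect size. The sketch below makes this explicit and also shows how a residual effect size follows from the other two, assuming that the general factor, the Big Five factor, and the residual are uncorrelated and that the scale is standardized. The function names are illustrative, and the residual value it produces for a given scale is only an approximation of Table 4, because scales can have additional loadings that are not considered here.

```python
import math


def explained_variance(effect_size: float) -> float:
    """Proportion of variance explained by a standardized effect size."""
    return effect_size ** 2


def residual_effect_size(*effect_sizes: float) -> float:
    """Effect size of the unique variance, assuming uncorrelated sources."""
    remaining = 1.0 - sum(explained_variance(es) for es in effect_sizes)
    if remaining < 0.0:
        raise ValueError("effect sizes explain more than 100% of the variance")
    return math.sqrt(remaining)


if __name__ == "__main__":
    # Thresholds used in the text: .2, .4, and .6
    for es in (0.2, 0.4, 0.6):
        print(f"effect size {es:.1f} -> {100 * explained_variance(es):.0f}% explained variance")

    # Exhibitionism: general factor .33, Extraversion .68 (values from the text)
    print(f"residual effect size: {residual_effect_size(0.33, 0.68):.2f}")
```

Because variance shares are squared effect sizes, an effect size of .6 explains only about a third of the variance, which leaves room for substantial unique variance even in scales with strong loadings on the general factor.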


Normal Personality Factors and Maladaptive Personality Traits

Correlations of personality measures with other measures provide valuable information about the construct validity of personality measures (Cronbach & Meehl, 1955; Schimmack, 2021). Unfortunately, there are no generally accepted psychometric standards to evaluate construct validity. Ringwald et al. (2022) claim that their results provide evidence of construct validity and evidence that the CAT-PD can be used to measure personality pathology at multiple levels of a hierarchy. I think this conclusion is premature and ignores key steps in a program of validation research. First, construct validation requires a clear definition of a construct that is the target of psychological measurement. After all, it is impossible to evaluate whether a measure measures an intended construct, if the construct is not properly defined. Moreover, the CAT-PD has 33 scales and each scale is intended to measure a distinct construct. Thus, construct validation of the CAT-PD requires clear definitions of 33 constructs. The concepts are well-defined, but it is questionable that all of these constructs can be considered disorders (CAT-PD Manual). For example, the Domineering scale is intended to measure “a general need for power and the tendency to be controlling, dominant, and forceful in interpersonal relationships” and the Submissiveness scale is intended to measure “the yielding of power to others, over-accommodation of others’ needs and wishes, exploitation by others, and lack of self-confidence in decision-making, often to the extent that one’s own needs are ignored, minimized, or undermined.” The labeling of these scales as measures of personality pathology implies that variation along these dimensions is pathological and that the scales are valid measures of actual behavioral tendencies that cause intrapersonal or interpersonal problems for individuals who score high on these scales. As the CAT-PD is a relatively novel questionnaire, there is insufficient evidence to show that the CAT-PD scales assess pathology rather than normal variation in personality.

An even bigger problem is the claim that the CAT-PD can be used to measure multiple levels in a hierarchy of personality disorders. The first problem is the assumption that disorders have a hierarchical structure. It would be difficult to understand the notion of a hierarchy for physical disorders. Let’s take cancer as an example (Fowler et al., 2020). Cancers are distinguished by their location such as lung cancer, breast cancer, brain cancer and so on. As cancer can spread, some patients may have more than one cancer (i.e., cancer in multiple locations). While some cancers are more likely to co-occur, nobody has proposed a hierarchy of cancers, in which known or unknown causes of co-morbidity are considered a disorder. In probabilistic models it is also problematic to call all causes of a disorder a disorder. For example, not having sickle cell anemia is a risk factor for malaria infection, but it would be questionable at best to call normal blood cells a disorder.

I demonstrated that the Big Five factors explain some, but not all, of the correlations among the CAT-PD scales. Ringwald et al.’s factor model is similar and the authors come to a similar conclusion that four of their six factors correspond to four of the Big Five scales. They even claim that there is convergent validity between measures of the Big Five and the CAT-PD factors. The problem is that convergent validity implies that two measures measure the same construct (Campbell & Fiske, 1959). However, the Big Five factors are used to describe the correlations among traits that describe normal variation in personality. In contrast, Ringwald et al. (2022) claim that their factors reflect a level in a hierarchy of personality pathology. Unless we pathologize normal personality or normalize pathology, these are different constructs. Thus, high correlations between Big Five factors and CAT-PD factors do not show convergent validity. Rather they show a lack of discriminant validity (Campbell & Fiske, 1959). There is, however, a simple way to reconcile the notion of personality disorders with the finding that Big Five factors are related to measures of personality disorders. It is possible to consider the Big Five factors as risk factors for specific personality disorders. For example, high agreeableness may be a risk factor for dysfunctional forms of submissiveness and low agreeableness could be a risk factor for dysfunctional forms of dominance. Importantly, high or low agreeableness alone is not sufficient to be considered a disorder. This model is consistent with the substantial amount of variance in CAT-PD scales that is not explained by variation in normal personality. The key difference between this model and Ringwald et al.’s model is that covariance among CAT-PD scales does not reflect a broader disorder, but normal variation in personality.

One advantage of this model is that it can explain the weak correlations between Big Five traits, with the exception of Neuroticism, and well-being (Schimmack, 2022). If the Big Five were broader pathological traits, we should expect that they lower quality of life. It is more likely that personality traits are risk factors, but that the actual manifestation of a disorder lowers well-being. This model predicts that the unique variance in CAT-PD scales is related to lower well-being and mental health problems. This needs to be examined in future studies.

General Factor

The other main finding was that a general factor explains a large amount of variance in many CAT-PD scales and that this factor is strongly correlated with the halo factor in self-ratings of normal personality. Some researchers interpret this factor as a substantive factor, whereas others view it as a response artifact. The present findings create some problems for the interpretation of this general factor as a substantive cause of co-morbidity among personality disorders because it is related to opposing disorders. Taking Domineering and Submissiveness as an example, the general factor is positively related to both Domineering, .65, and Submissiveness, .61. It is unclear how a substantive trait could make somebody dominant and forceful in interpersonal relationships and, at the same time, over-accommodating of others’ needs. A more plausible explanation is that some respondents respond to the negative description of these traits and present themselves in an overly positive manner. This is consistent with multi-rater studies of normal personality that show low correlations for the general factors of different raters (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). Similar studies with measures of personality disorders are lacking. Markon, Quilty, Bagby, and Krueger (2013) compared self-ratings and informant ratings and found moderate agreement. However, the sources of disagreement remained unknown. A multi-trait-multi-rater analysis of these data could reveal the amount of rater agreement for the general factor in PD ratings.
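The consequence of these two loadings can be made concrete with the tracing rule for standardized factor models. The sketch below computes the correlation between two scales that is implied by a shared factor alone, under the simplifying assumption that the general factor is the only source of covariance between them. The function name is illustrative.

```python
def implied_correlation(loading_a: float, loading_b: float) -> float:
    """Correlation implied by one shared factor (product of standardized loadings)."""
    return loading_a * loading_b


if __name__ == "__main__":
    domineering = 0.65      # loading on the general factor (from the text)
    submissiveness = 0.61   # loading on the general factor (from the text)
    r = implied_correlation(domineering, submissiveness)
    print(f"implied correlation from the general factor alone: {r:.2f}")
```

In other words, the general factor by itself pushes two conceptually opposite scales toward a positive correlation of about .40, which is easier to reconcile with a shared response style than with a shared substantive trait.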


I presented evidence that the halo-Big-Five model fits self-ratings of normal personality and ratings of personality disorders and that the corresponding factors are very highly (r > .8) correlated with each other. This finding raises concerns about hierarchical models of personality disorders. I present an alternative model that considers normal personality as a risk factor for specific personality disorders and the halo factor as a rating bias in self-ratings. Future research needs to go beyond self-ratings to separate substance from style. Furthermore, indicators of mental health and well-being are needed to distinguish normal personality from personality disorders.

The Levels of Personality Functioning Scale Lacks Construct Validity


An influential model of personality disorders assumes a general factor of personality functioning that underlies the presence of personality disorder symptoms. To measure this factor, Morey (2017) developed the Levels of Personality Functioning Scale (LPFS). The construct and the measure of general personality functioning, however, remain controversial. Here I use structural equation modeling to analyze data that were used to claim validity of the LPFS. I demonstrate that two factors account for 88% of the variance in LPFS scores. One factor reflects desirability of items (70%) and the other factor reflects scoring of the items (12%). I then show that the evaluative factor in the LPFS correlates highly, r = .9, with a similar evaluative factor in ratings of normal personality. Based on previous evidence from multi-method studies of normal personality, I interpret this factor as a response style that is unique to individual raters. Thus, most of the variance in LPFS scores reflects evaluative rating biases rather than levels of personality functioning. I also identified 10 items from the LPFS that are mostly free of actual personality variance, but correlate strongly with the evaluative factor. These items can be used as an independent measure of evaluative biases in self-ratings. The main conclusion of this article is that theories of personality disorders lack a clear concept and that self-report measures of personality disorders lack construct validity. Future research on personality disorders needs to conduct more rigorous construct validation research with philosophically justifiable definitions of disorders and multi-method validation studies.


A major problem in psychology is that it is too easy to make up concepts and theories about human behaviors that are based on overgeneralizations from single incidents or individuals to humans in general. A second problem is that pre-existing theories and beliefs often guide research and produce results that appear to confirm those pre-existing beliefs. A third problem is that psychology lacks a coherent set of rules to validate measures of psychological constructs (Markus & Borsboom, 2013). As a result, it is possible that large literatures are based on invalid measures (e.g., Schimmack, 2021). In this blog post, I will present evidence that an influential model of personality disorders is equally based on flawed measures.

What Are Personality Disorders?

The notion of personality disorders has a long history that predates modern conceptions of personality (Zachar, 2017). An outdated view equated personality disorders with extreme (statistically abnormal) scores on measures of personality (Schneider, 1923). The problem with this definition of disorders is that abnormality can even be a sign of perfect functioning as in the performance of a Formula 1 race car or an Olympic athlete.

Personality disorders were formalized in the third Diagnostic and Statistical Manual of Mental Disorders, but the diagnosis of personality disorders remained controversial; at least, much more controversial than the diagnosis of mental disorders with clear symptoms of dysfunction such as delusions and hallucinations. The current DSM-5 contains two competing models of personality disorders. Without a clear conception of personality disorders, the diagnosis of personality disorders remains controversial (Zachar, 2017).

A main obstacle in developing a scientific model of personality disorders is that historic models of personality disorders are difficult to reconcile with the contemporary models of normal personality functioning that have emerged in recent decades. To achieve this goal, it may be necessary to start with a blank slate and rethink the concept of personality disorders.

Distinguishing Personality Disorders from (Normal) Personality

There is no generally accepted theory of personality. However, an influential model of personality assumes that individuals have different dispositions to respond to the same situation. These dispositions develop during childhood and adolescence in complex interactions between genes and environments that are poorly understood. By the beginning of early adulthood, these dispositions are fairly stable and change relatively little throughout adulthood. While there are hundreds of dispositions that influence specific behaviors in specific situations, these dispositions are related to one or more of five broad personality dispositions that are called the Big Five. Neuroticism is a general disposition to experience more negative feelings such as anxiety, anger, or sadness. Extraversion is a broad disposition to be more engaged that is reflected in sociability, assertiveness, and vigor. Openness is a general disposition to engage in mental activities. Agreeableness is a general disposition to care about others. Finally, conscientiousness is a general disposition to control impulses and persist in the pursuit of long-term goals. Variation along these personality traits is considered to be normal. Variation along these traits exists because it has no major effect on life outcomes, because the genetic effects are too complex to be subjected to selection, or because traits have different costs and benefits. This short description of normal personality is sufficient to discuss various models of personality disorders (Zachar & Krueger, 2013).

The vulnerability model of personality disorders can be illustrated with high neuroticism. High neuroticism is a predictor of lower well-being and a risk factor for the development of mood disorders. Even during times when individuals do not have clinical levels of anxiety or depression, they report elevated levels of negative moods. Thus, one could argue that high neuroticism is a personality disorder because it makes individuals vulnerable to suffer mental health problems. However, even in this example it is not clear whether neuroticism should be considered a risk factor for a disorder or a disorder itself. As many mood disorders are episodic, while neuroticism is stable, one could argue that neuroticism is a risk factor that triggers a disorder only in combination with other factors (e.g., stress). The same is even more true for other personality traits. For example, low conscientiousness is one of several predictors of some criminal behaviors. This finding might be used to argue that low conscientiousness is a criterion to diagnose a personality disorder (e.g., psychopathy). However, it is also possible to think about low conscientiousness as a risk factor rather than a diagnostic feature of a personality disorder. In line with this argument, Zachar and Krueger (2013) suggest that “vulnerabilities are not disorders” (p. 1020). A simple analogy may suffice. White skin is a risk factor for skin cancer. This does not mean that White skin is a skin disease, and it is possible to avoid the clinically relevant outcome of skin cancer by staying out of the sun, wearing proper clothing, or applying sunscreen. Even if we recognize that personality can be a risk factor for various disorders, this does not justify the label of a personality disorder. The term implies that something about a person’s personality impedes their proper functioning. In contrast, the term risk factor merely implies that personality can contribute to the dysfunction of something else.

The pathoplasticity model uses the term personality disorder for personality traits that influence the outcome of other psychiatric disorders. Zachar and Krueger (2013) suggest that people with a personality disorder develop mental health problems earlier in life or more often. This merely makes these traits risk factors, which were already discussed under the vulnerability model. More broadly, personality traits may influence specific behaviors of patients suffering from mental health problems. For example, personality may influence whether depressed patients commit suicide or not. For instance, men are more likely to commit suicide than women despite similar levels of depression. Understanding these personality effects is surely important for the treatment of patients, but it does not justify the label of personality disorders. In this example, the disorder is depression and treatment has to assess suicidality. The personality factors that influence suicidality are not part of the disorder.

The spectrum model views personality disorders as milder manifestations of more severe mental health problems that share a common cause. This model blurs the distinction between normal and disordered personality. At what level is anxiety still normal and at what level is it a mild manifestation of an anxiety disorder? A more reasonable distinction between normal and clinical anxiety is whether anxiety is rational (e.g., gunfire at a mall) or irrational (e.g., fear of being abducted by aliens). Models of normal personality traits are not able to capture these distinctions.

The decline-in-functioning model assumes that personality disorders are the result of traumatic brain injury, severe emotional trauma, or severe psychiatric disorder. As all behavior is regulated by the brain, brain damage can lead to dramatic changes in behavior. However, it seems odd to call these changes in behavior a personality disorder. With regard to traumatic life events, it is not clear that they reliably produce major changes in personality. Avoidance after a traumatic injury is typically situation specific rather than a change in a broader general disposition. This model also ignores that the presence of a brain injury, other mental illnesses, or drug use is an exclusion criterion for the diagnosis of a personality disorder (Skodol et al., 2011).

The impairment-distress model more directly links personality to disorder or dysfunction. The basic assumption is that personality is associated with clinically significant impairment or distress. I think association is insufficient. For example, gender is correlated with neuroticism and the prevalence of anxiety disorders. It would be difficult to argue that this makes gender a personality disorder. To justify the notion of a personality disorder, personality needs to be a cause of distress and treatment of personality disorders should alleviate distress. Once more, high neuroticism might be the best candidate for a personality disorder. High neuroticism predicts higher levels of distress and treatment with anti-depressant medication or psychotherapy can lower neuroticism levels and distress levels. However, the impairment-distress model does not solve the problems of the vulnerability model. Is high neuroticism sufficient to be considered an impairment or is it merely a risk factor that can lead to impairment in combination with other factors?

This leaves the capacity-failure model as the most viable conceptualization of a personality disorder (Zachar, 2017). The capacity-failure model postulates that personality disorders represent dysfunctional deviations from the normal functions of personality. This model is a straightforward extension of conceptions of bodily functioning to personality. Organs and other body parts have clear functions and can be assessed in terms of their ability to carry out these functions (e.g., hearts pump blood). When organs are unable to perform these functions, patients are sick and suffer. Zachar (2017) points out a key problem of the extension of biological functions to personality. “The difficulty with all capacity failure models is that they rely on speculative inferences about normal, healthy functioning” (p. 1020). The reason is that personality refers to variation in systems and processes that serve a specific function. While the processes have a clear function, it is often less clear what function variation in these processes serves. Take anxiety as an example. Anxiety is a universal human emotion that evolved to alert people to potential danger. Humans without this mechanism might be considered to have a disorder. However, neuroticism reflects variation in the process that elicits anxiety. Some people are more sensitive and others are less sensitive to danger. To justify the notion of a personality disorder, it is not sufficient to specify the function of anxiety. It is also necessary to specify the function of variation in anxiety across individuals. This is a challenging task and current research on personality disorders has failed to specify personality functions to measure and diagnose personality disorders from a capacity-failure model.

To summarize, the reviewed conceptualizations of personality disorders provide insufficient justification for a distinction between normal personality and personality disorders. While some personality types may be associated with some negative outcomes, these correlations do not provide an empirical basis for a categorical distinction between personality and personality disorders. The capacity-failure model is the last option, but it inherits the problem that Zachar (2017) identified: while it is relatively easy to specify the function of body parts, it is difficult to specify the functions of personality traits. What is the function of extraversion or introversion? The key problem is that personality refers to variation in basic psychological processes. While we can specify the function of being selfish or altruistic, it is much harder to specify the function of having a disposition to be more selfish or more altruistic (agreeableness). However, without a clear function of these personality dispositions, it is impossible to define personality dysfunction. Specifying these functions is a challenging task and current research on personality disorders has failed to specify personality functions that could serve as a foundation for theories of personality disorders.

The Criterion-A Model of Personality Disorders

Given the lack of a theory of personality disorders, it is not surprising that personality disorder researchers have conflicting views about the measurement of personality disorders (it is difficult to measure something if you do not know what you are trying to measure). One group of researchers argues for a one-dimensional model of personality disorders that is called personality pathology severity (Morey, 2017; Morey et al., 2022). This model is based on the assumption that specific items or symptoms that are used to diagnose personality disorders are correlated and “show a substantial first or general factor” (p. 650). To measure this general dimension of personality disorder with self-ratings, Morey (2017) developed the Levels of Personality Functioning Scale–Self Report (LPFS–SR).

A major problem of this measure is the lack of a sound conceptual basis. That is, it is not clear what levels of personality functioning are. As noted before, it is not even clear what function individual personality traits have. It is much less clear what personality functioning is because personality is not a unidimensional trait. Take a car as an analogy. One could evaluate the functioning of a car and order cars in terms of their level of functioning. However, to do so, we would evaluate the functioning of all of the car’s parts and the level of functioning would be a weighted sum of the checks for each individual part. The level of functioning does not exist independent of the functioning of the parts. For the diagnosis of cars it is entirely irrelevant whether functioning of one part is related to functioning of another part. A general factor of dysfunction might be present (newer cars are more likely to have functioning parts than older cars), but the general factor is not the construct of interest. The construct of dysfunction requires assessing the functioning of all parts that are essential for a car to carry out its function.

In short, the concept of levels of personality functioning is fundamentally flawed. Yet, validation studies claim that the Levels of Personality Functioning Scale is a valid measure of the severity of personality disorders (Hopwood et al., 2018). Unfortunately, validation research by authors who developed a test is often invalid because they only look for information that confirms their beliefs (Cronbach, 1989; Zimmermann, 2022). Ideally, validation research would be carried out by measurement experts who do not have a conflict of interest because they are not attached to a particular theory. In this spirit, I examined the construct validity of the Levels of Personality Functioning Scale, using Hopwood et al.’s (2018) data (

Structure of the LPFS-SR

Hopwood et al. (2018) did not conduct a factor analysis of the 80 LPFS-SR items. The omission of such a basic psychometric analysis is problematic even by the low standards of test validation in psychology (Markus & Borsboom, 2013). The reason might be that other researchers have already demonstrated that the assumed structure of the questionnaire does not fit the data (Sleep et al., 2020). Sleep et al. were also unable to find a model that fits the data. Thus, my analyses provide the first viable model of the correlations among the LPFS-SR items. Viable, of course, does not mean perfect or true. However, the model provides important insights into the structure of the LPFS-SR and shows that many of the assumptions made by Morey (2017) are not supported by evidence.

I started with an exploratory factor analysis to examine the dimensionality of the LPFS-SR. Consistent with other analyses, I found that the LPFS-SR is multidimensional (Sleep et al., 2020). However, whereas Sleep et al. (2020) suggest that three or four factors might be sufficient, I found that even the Bayesian Information Criterion suggested 7 factors. Less parsimonious criteria suggested even more factors (Table 1).

I next examined whether the four-factor model corresponds to the theoretical assignment of items to the four scales. The criterion for model fit was that an item had the highest loading on the predicted factor and the factor loading was greater than .3. Using this criterion, only 33 of the 80 items had the expected factor loadings. Moreover, the correlations among the four factors were low. One factor had nearly zero correlations with the other three factors, r = .05 to .13. The correlations among the other three factors were moderate, r = .30 to .56, but do not support the notion of a strong general factor.
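The item-level criterion described above (highest loading on the predicted factor, and that loading greater than .3) can be written down in a few lines. This is only a sketch: the function name and the loadings are made up for illustration and are not the LPFS-SR estimates, and it assumes that "highest" refers to the raw (signed) loading.

```python
def fits_predicted_factor(loadings, predicted, cutoff=0.3):
    """True if the item's highest loading is on the predicted factor and exceeds the cutoff."""
    strongest = max(range(len(loadings)), key=lambda k: loadings[k])
    return strongest == predicted and loadings[predicted] > cutoff

# Hypothetical loadings of three items on four factors (illustrative numbers only).
# Each entry is (loadings, index of the factor the item is supposed to measure).
example_items = [
    ([0.55, 0.10, 0.05, 0.02], 0),  # highest loading on the predicted factor and above .3
    ([0.25, 0.12, 0.08, 0.01], 0),  # highest loading on the predicted factor, but too weak
    ([0.15, 0.48, 0.05, 0.03], 0),  # highest loading on a different factor
]

n_fit = sum(fits_predicted_factor(loadings, predicted) for loadings, predicted in example_items)
print(f"{n_fit} of {len(example_items)} items meet the criterion")
```

Applied to the actual four-factor solution, a count like this is what yields the number of items with the expected loadings.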

Exploratory factor analysis has serious limitations as a validation tool. For example, it is unable to model hierarchical structures, although Morey (2017) assumed a hierarchical structure with four primary factors and one higher-order factor. The most direct test of this model requires structural equation modeling (SEM). EFA also has problems separating content and method factors. As some of the items are reverse scored, it is likely that acquiescence bias distorts the pattern of correlations. SEM can be used to specify an independent acquiescence factor to control for this bias (Anusic et al., 2009). Thus, I conducted more informative analyses with SEM, which in this context is often called confirmatory factor analysis (CFA). However, the label confirmatory is misleading because it seems to imply that SEM can only be used to confirm theoretical structures. In fact, the main advantage of SEM is that it is a highly flexible tool that can represent hierarchies, model method factors, and reveal residual correlations among items with similar content. This statistical tool can be used to explore data and to confirm models. A danger in exploratory use of CFA is overfitting. However, overfitting is mainly a problem for weak parameters that have little effect on the main conclusions. In my explorations, I set the minimum modification index to 20, which limits the type-I error probability to 1/129,128. Most parameters in the final model meet the 5-sigma criterion (z = 5, chi-square(1) = 25) that is used in particle physics to guard against type-I errors. Moreover, I posted all exploratory models ( and I encourage others to improve on my model.
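The link between the modification index cutoff and the type-I error probability can be checked directly. A modification index is evaluated against a chi-square distribution with one degree of freedom, and the upper tail of that distribution can be computed with the complementary error function. The sketch below uses only the Python standard library; the function name is mine.

```python
import math

def chi2_df1_pvalue(x):
    """Upper-tail probability that a chi-square variable with 1 df exceeds x.

    For 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))

p_mi20 = chi2_df1_pvalue(20.0)    # minimum modification index used here
p_5sigma = chi2_df1_pvalue(25.0)  # z = 5, i.e. chi-square(1) = 25

print(f"MI = 20: p = {p_mi20:.3g}, about 1 in {1 / p_mi20:,.0f}")
print(f"z = 5:   p = {p_5sigma:.3g}")
```

The first line should land close to the 1/129,128 figure quoted above; the second shows how much stricter the 5-sigma criterion is.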

The final model ( had acceptable fit according to the standard of .06 for the Root Mean Square Error of Approximation, RMSEA = .030. However, the Comparative Fit Index was below the criterion value of .95 that is often used to evaluate overall model fit, CFI = .922. Another way to evaluate the model is to compare it to the fit of the EFA models in Table 1. Accordingly, the model had better fit in a comparison of the Bayesian Information Criterion (179,033.304 vs. 181,643.255), Akaike’s Information Criterion (177,270.345 vs. 177,485.265), and RMSEA (.030 vs. .031), but not the CFI (.922 vs. .932). The difference between fit indices is explained by the trade-off between parsimony and precision. The CFA model is more parsimonious (2958 degrees of freedom) than the EFA model with 10 factors (2405 degrees of freedom). Using the remaining 554 degrees of freedom would produce even better fit, but at the risk of overfitting, and none of the smaller modification indices suggested substantial changes to the model. The final model had 12 factors that I will describe in order of their contribution to the variance in LPFS scale scores.
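The degrees of freedom reported for the 10-factor EFA model can be reproduced from the standard formula for an exploratory factor model fitted to a covariance matrix, df = ((p - m)^2 - (p + m)) / 2 with p items and m factors. The following is a small sketch; the function names are mine, and the parameter count for the CFA model rests on the assumption that only the covariance structure (no item means) was modeled.

```python
def unique_moments(n_items):
    """Number of distinct variances and covariances among n_items variables."""
    return n_items * (n_items + 1) // 2

def efa_degrees_of_freedom(n_items, n_factors):
    """Degrees of freedom of an unrestricted EFA model (covariance structure only)."""
    return ((n_items - n_factors) ** 2 - (n_items + n_factors)) // 2

efa_df = efa_degrees_of_freedom(80, 10)  # 80 LPFS-SR items, 10 factors
moments = unique_moments(80)
# 2958 is the CFA degrees of freedom reported in the text; the difference is the
# number of freely estimated parameters under the covariance-only assumption.
cfa_free_parameters = moments - 2958

print(f"EFA df with 10 factors: {efa_df}")
print(f"Unique variances/covariances: {moments}")
print(f"Implied free parameters in the CFA model: {cfa_free_parameters}")
```

The comparison makes the parsimony argument concrete: the CFA model spends far fewer parameters than the 10-factor EFA model to describe the same covariance matrix.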

The most important factor is a general factor that showed notable positive loadings (> .3) for 64 of the 80 items (80%). This factor correlated r = .837 with the LPFS scale scores. Thus, 70% of the variance in scale scores reflects a single factor. This finding is consistent with the aim of the LPFS to measure predominantly a single construct of severity of personality functioning (Morey, 2017; Morey et al., 2022). However, the presence of this factor does not automatically validate the measure because it is not clear whether this factor represents core personality functioning. An alternative interpretation of this factor assumes that it reflects a response style to agree more with desirable items that is known as socially desirable responding or halo bias (Anusic et al., 2009). I will examine this question later on when I relate LPFS factors to factors of normal personality.

The second factor reflects scoring of the items. All items were coded as directly coded (68) or reverse coded (12). For the sake of parsimony and identifiability, loadings on this factor were fixed to 1 or -1. Thus, all items loaded on this factor by definition. More important, this factor correlated r = .428 with LPFS scores. Thus, response sets explained another 18% of the variance in LPFS scores. Together, these two factors explained 70 + 18 = 88% of the total variance in LPFS scores.
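The variance shares quoted for each factor are the squared correlations between the factor and the LPFS scale scores. A minimal sketch of this arithmetic follows; it assumes that the general factor and the scoring factor are uncorrelated, so that their shares can simply be added. The function name is mine; the correlations are the ones reported in the text.

```python
def variance_share(r):
    """Proportion of scale-score variance explained by a factor correlating r with the score."""
    return r ** 2

general = variance_share(0.837)  # general factor
scoring = variance_share(0.428)  # item-scoring (response set) factor
content1 = variance_share(0.154) # first content factor, for comparison

print(f"General factor: {general:.1%}")
print(f"Scoring factor: {scoring:.1%}")
print(f"Both together:  {general + scoring:.1%}")
print(f"Content factor 1 (upper bound): {content1:.1%}")
```

The contrast between the first two lines and the last one is the main point: the content factors add very little once the two major factors are accounted for.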

The first content factor had 13 notable loadings (> .3). The highest loadings were for the items “Sometimes I am too harsh on myself” (.61), “The standards that I set for myself often seem to be too demanding, or not demanding enough” (.51), and “I tend to feel either really good or really bad about myself” (.483). This factor correlated only r = .154 with the LPFS scale scores. Thus, it adds at most 2% to the explained variance in LPFS scale scores. The contribution could be less because this factor is correlated with other content factors.

The second content factor had 8 notable loadings (> .3). The highest loadings were for the items “I have many satisfying relationships, both personally and on the job” (.487), “I work on my social relationships because they are important to me” (.445), and “Getting close to others just leaves me vulnerable and isn’t worth the risk” (.440). This factor seems to capture investment in social relationships. The correlation of this factor with LPFS scores is r = .120 and the factor contributes at most 1.4% to the total variance of LPFS scores.

The third content factor had 6 notable loadings (> .3). The highest loadings were for the items “The key to a successful relationship is whether I get my needs met” (.490), “I’m only interested in relationships that can provide me with some comfort” (.476), and “I can only get close to someone who can acknowledge and address my needs” (.416). This factor seems to reflect a focus on exchange versus communal relationships. It correlated r = .098 with LPFS scale scores and contributes less than 1% of the total variance in LPFS scores.

The 4th content factor had 7 notable loadings (> .3). The highest loadings were for the items “I have some difficulty setting goals” (.683), “I have difficulties setting and completing goals” (.639), and “I have trouble deciding between different goals” (.534). The item content suggests that this factor reflects problems with implementing goals. It correlates r = .070 with LPFS scores and explains less than 1% of the total variance in LPFS scores.

The 5th factor had only 3 notable loadings (> .3). The three items were “When others disapprove of me, it’s difficult to keep my emotions under control” (.572), “I have a strong need for others to approve of me” (.498), and “In close relationships, it is as if I cannot live with the other person” (.334). This factor might be related to need for approval or anxious attachment. It correlates r = .057 with LPFS scores and explains less than 1% of the total variance in these scores.

The 6th factor had 4 notable loadings (> .3). The highest loadings were for the items “Feedback from others plays a big role in determining what is important to me” (.427), “My personal standards change quite a bit depending upon circumstances.” (.365), and “My motives are mainly imposed upon me, rather than being a personal choice.” (.322). This factor seems to capture a strong dependence on others. It correlates r = .050 with LPFS scores and contributes less than 1% of the total variance.

The 7th factor was a mini-factor with only three items and only one item had a loading greater than .3. The item was “My life is basically controlled by others.” The items of this factor all had secondary loadings on the previous factor, suggesting that it may be a method artifact and not a specific content factor. It correlated only r = .037 with LPFS scale scores and has a negligible contribution to the total variance in LPFS scores.

The 8th factor is also a mini-factor with only three items. Two items had notable loadings (> .3), namely “I can appreciate the viewpoint of other people even when I disagree with them” (.484) and “I can’t stand it when there are sharp differences of opinion” (.379).

The 9th factor had 4 items with notable loadings (> .3), but two loadings were negative. The two items with positive loadings were “I don’t pay much attention to, or care very much about, the effect I have on other people” (.351) and “I don’t waste time thinking about my experiences, feelings, and actions” (.301). The two items with negative loadings were “My emotions rapidly shift around” (-.381) and “Although I value close relationships, sometimes strong emotions get in the way” (-.319). This factor seems to capture emotionality. The correlation with LPFS scores is trivial, r = .008.

The 10th factor is also a mini-factor with only three items. Two items had notable loadings, namely “People think I am pretty good at reading the feelings and motives of others in most situations” (-.567) and “I typically understand other peoples’ feelings better than they do” (-.633). The content of these items suggests that the factor is related to emotional intelligence. Its correlation with LPFS scores is trivial, r = -.007.

In addition, there were 41 correlated residuals. Correlated residuals are essentially mini-factors with two items, but it is impossible to determine the loadings of items on these factors. Most of these correlated residuals were small (.1 to .2). Only two item pairs had correlated residuals greater than .3, namely “I don’t have a clue about why other people do what they do” correlated with “I don’t understand what motivates other people at all” (.453) and “I can only get close to somebody who understands me very well” correlated with “I can only get close to someone who can acknowledge and address my needs” (.367). Whether these correlated residuals reflect important content that requires more items or whether they are merely method factors due to similar wording is an open question, but it does not affect the interpretation of the LPFS scores because these mini-factors do not substantially contribute to the variance in LPFS scores.

The main finding is that the factor analysis of the LPFS items revealed 2 major factors and many minor factors. One of the major factors is a method factor that reflects scoring of the items. The other factor reflects a general disposition to score higher or lower on desirable attributes. This factor accounts for 70% of the total variance in LPFS scores. The important question is whether this factor reflects actual personality functioning – whatever this might be – or a response style to agree more strongly with desirable items and to disagree more with undesirable items.

Validation of the General Factor of the LPFS

A basic step in construct validation research is to demonstrate that correlations with other measures are consistent with theoretical expectations (Cronbach & Meehl, 1955; Markus & Borsboom, 2013; Schimmack, 2021). The focus is not only on positive correlations with related measures, but also on the absence of correlations with measures that are not expected to be correlated. This is often called convergent and discriminant validity (Campbell & Fiske, 1959). Moreover, validity is a quantitative construct and the magnitude of correlations is also important. If the LPFS is a measure of core personality functioning, it should correlate with life outcomes (convergent validity). This hypothesis could not be examined with these data because no life outcomes were measured. Another prediction is that LPFS scores should not correlate with measures of response styles (discriminant validity). This hypothesis could be examined because the dataset contained a measure of the Big Five personality traits and it is possible to separate content and response styles in Big Five measures because multi-method studies show that the Big Five are largely independent (Anusic et al., 2009; Biesanz & West, 2004; Chang, Connelly, & Geeza, 2012; DeYoung, 2006). Additional evidence shows that the evaluative factor in personality ratings predicts self-ratings of well-being, but is a weak or no predictor of informant ratings of well-being (Kim, Schimmack, & Oishi, 2012; Schimmack & Kim, 2020). This is a problem for the interpretation of this factor as a measure of personality functioning because low functioning should produce distress that is notable to others. Thus, a high correlation between the evaluative factor in ratings of personality and personality disorder would suggest that the factor reflects a rating bias rather than personality functioning.

I first fitted a measurement model to the Big Five Inventory – 2 (Soto & John, 2017). In this case, it was possible to use a confirmatory approach because the structure of the BFI-2 is well-known. I modeled 15 primary factors with loadings on the Big Five factors as higher-order factors. In addition, the model included one factor for evaluative bias and one factor for acquiescence bias based on the scoring of items. This model had reasonable fit, but some problems were apparent. The conscientiousness facet “Responsibility” seemed to combine two separate facets that were represented by two items each. I also had problems with the first two items of the Agreeableness facet Trust. Thus, these items were omitted from the model. These modifications are not substantial and do not undermine the interpretation of the factors in the model. The model also included several well-known secondary relationships. Namely, Anxiety (N) and Depression (N) had negative loadings on Extraversion, Respectfulness (A) had a negative loading on Extraversion, Assertiveness (E) had a negative loading on Agreeableness, Compassion (A) had a positive loading on Neuroticism, and Productiveness (C) had a positive loading on Extraversion. Finally, there were 5 pairs of correlated residuals due to similar item content. The fit of this final model ( was acceptable, CFI = .906, RMSEA = .045. Only two primary loadings on the 15 facet factors were less than .4, but still greater than .3.

I then combined the two models without making any modifications to either model. The only additional parameters were used to relate the two models to each other. One parameter regressed the general factor of the LPFS model on the evaluative bias factor in the BFI model. Another one did the same for the two response style factors. Modification indices suggested several additional relationships that were added to the model. The fit of the final model ( was acceptable, CFI = .875, RMSEA = .032. Difficulties with goal setting (LPFS content factor 4) were strongly negatively related to the productiveness facet of conscientiousness, r = -.81, and slightly positively related to the compassion facet of agreeableness, r = .178. The Emotionality factor (LPFS content factor 9) was strongly correlated with Neuroticism, r = .776. The first content factor was also strongly correlated with the depression facet of neuroticism, r = .72, and moderately negatively correlated with agreeableness, r = -.264. The need for approval factor (content factor 5) was also strongly correlated with neuroticism, r = .608, and moderately negatively related to the assertiveness facet of extraversion, r = -.249. Content factor 2 (“close relationships”) was moderately negatively related to the trust facet of agreeableness, r = -.408, and weakly negatively related to the assertiveness facet of extraversion, r = -.117. A focus on exchange relationships (content factor 3) was moderately negatively correlated with agreeableness, r = -.379. Finally, content factor 10 had a moderate correlation with extraversion. In addition, 14 LPFS items had small to moderate loadings on some Big Five factors. Only three items had loadings greater than .3, namely “My emotions rapidly shift around” on Neuroticism, r = .404, “Sometimes I’m not very cooperative because other people don’t live up to my standards” on Agreeableness, and “It seems as if most other people have their life together more than I” on the depression facet of Neuroticism, r = .310.

These relationships imply that some of the variance in LPFS scores can be predicted from the BFI factors, but the effect sizes are small. Neuroticism correlates only r = .123 and explains only 1.5% of the variance in LPFS total scores. Correlations are also weak for Extraversion, r = -.104, Agreeableness, r = -.096, and Conscientiousness, r = -.045. Thus, if the LPFS is a measure of core personality functioning, we would have to assume that core personality functioning is largely independent of variation along the Big Five factors of normal personality.

In contrast to these weak relationships, the evaluative bias factor in self-ratings of normal personality is strongly correlated with the general factor of the LPFS, r = .901. Given the strong contribution of the general factor to LPFS scores, it is not surprising that the evaluative factor of the Big Five explains a large amount of the variance in LPFS scores, r = .748. In this case, it is not clear whether the correlation coefficient should be squared because evaluative bias in BFI ratings is not a pure measure of evaluative bias. A model with more than two measures of evaluative bias would be needed to quantify how much a general – questionnaire independent – evaluative bias factor contributes to LPFS scores. Nevertheless, the present results confirm that the evaluative factor in ratings of normal personality is strongly related to the evaluative factor in ratings of personality disorders (McCabe, Oltmanns, & Widiger, 2022).

Making Lemonade: A New Evaluative Bias Measure

My analyses provide clear evidence that most of the variance in LPFS scores reflects a general evaluative factor that correlates strongly with an evaluative factor in ratings of normal personality. In addition, the analyses showed that only some items in the LPFS are substantially related to normal personality. This implies that many LPFS items measure desirability without measuring normal personality. This provides an opportunity to develop a measure of evaluative bias that is independent of normal personality. This measure can be used to control for evaluative bias in self-ratings. A new measure of evaluative bias would be highly welcome (to avoid the pun desirable) because existing social desirability scales lack validity, in part because they confound bias and actual personality content.

To minimize the influence of acquiescence bias, I tried to find an equal number of direct and reverse coded items. I selected items with high loadings on the evaluative factor and low loadings on the LPFS content factors or the Big Five factors. This produced a 10-item scale with 6 negative and 4 positive items.

Almost no close relationship turns out well in the end.
I can’t even imagine living a life that I would find satisfying.
I don’t have many positive interactions with other people.
I have little understanding of how I feel or what I do.
I tend to let others set my goals for me, rather than come up with them on my own.
I’m not sure exactly what standards I’ve set for myself.

I can appreciate the viewpoint of other people even when I disagree with them.
I work on my close relationships, because they are important to me.
I’m very aware of the impact I’m having on other people.
I’ve got goals that are reasonable given my abilities.

I added these 10 items to the Big Five model and specified a social desirability (SD) factor and an acquiescence factor. This model ( had acceptable fit, CFI = .892, RMSEA = .043. Three items had weak (< .3) loadings on one of the Big Five factors, indicating that the SD items were mostly independent of actual Big Five content. Thus, SD scores are practically independent of variance in normal personality as measured with the BFI-2. The correlation between the evaluative factor and the SD factor was r = .877 and the correlation with the SD scale was r = .79. This finding suggests that it is possible to capture a large portion of the evaluative variance in self-ratings of personality with the new 10-item social desirability scale. Future research with other measures of evaluative bias (cf. Anusic et al., 2009) and multi-method assessment of personality is needed before this measure can be used to control for socially desirable responding.
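For readers who want to try the 10-item scale listed above, a scale score could be computed roughly as follows. This is a sketch under assumptions that are not stated in the text: a 1-to-4 response format, the item order of the list above (six negatively worded items followed by four positively worded items), and a simple mean after reversing the negative items so that higher scores indicate more desirable self-descriptions. The function name is hypothetical.

```python
def score_sd_scale(responses, low=1, high=4):
    """Mean score on the 10-item scale after reverse-scoring the negatively worded items.

    responses: ten ratings in the order of the item list (6 negative items, then 4 positive).
    low, high: endpoints of the response scale (assumed here to be 1 to 4).
    """
    if len(responses) != 10:
        raise ValueError("expected exactly 10 item responses")
    if any(r < low or r > high for r in responses):
        raise ValueError("response outside the assumed scale range")
    negative, positive = responses[:6], responses[6:]
    reversed_negative = [low + high - r for r in negative]
    return sum(reversed_negative + list(positive)) / 10

desirable = score_sd_scale([1] * 6 + [4] * 4)    # rejects negative items, endorses positive ones
undesirable = score_sd_scale([4] * 6 + [1] * 4)  # the opposite pattern
yea_sayer = score_sd_scale([4] * 10)             # agrees with everything
print(desirable, undesirable, yea_sayer)
```

Because the scale has six negative and four positive items, a respondent who agrees with every item does not land exactly at the midpoint, so a simple mean only partly cancels acquiescence. This is one reason to model acquiescence as a separate factor, as in the analysis above.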


Morey (2017) introduced the Levels of Personality Functioning Scale (LPFS) as a self-report measure of general personality pathology, core personality functioning, or the severity of personality dysfunction. Hopwood et al. (2018) conducted a validation study of the LPFS and concluded that their results support the validity of the LPFS. More recently, Morey et al. (2022) reiterated the claim that the LPFS has demonstrated strong validity. However, several commentaries pointed out problems with these claims (Sleep & Lynam, 2022). Sleep and Lynam (2022) suggested that the “LPFS may be assessing little more than general distress” (p. 326). They also suggested that overlap between LPFS content and normal personality content is a problem. As shown here as well, some LPFS items relate to neuroticism, conscientiousness, or agreeableness. However, it is not clear why this is a problem. It would be rather odd if core personality functioning were unrelated to normal personality. Moreover, the fact that some items are related to Big Five factors does not imply that the LPFS measures little more than normal personality. The present results show that LPFS scores are only weakly related to the Big Five factors. The real problem is that LPFS scores are much more strongly related to the evaluative factor in normal personality ratings than to measures of distress such as neuroticism or its depression facet.

A major shortcoming in the debate among clinical researchers interested in personality disorders is the omission of research on the measurement of normal personality. Progress in the measurement of normal personality was made in the early 2000s, when some articles combined multi-method measurement with latent variable modeling (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). These studies show that the general evaluative factor is unique to individual raters. Thus, it lacks convergent validity as a measure of a personality trait that is reflected in observable behaviors. The high correlation between this factor and the general factor in measures of personality disorders provides further evidence that the factor is a rater-specific bias rather than a disposition to display symptoms of severe personality disorders, because dysfunction of personality is visible in social situations.

One limitation of the present study is that it used only self-report data. The interpretation of the general factor in self-ratings of normal personality is based on previous validation studies with multiple raters, but it would be preferable to conduct a multi-method study of the LPFS. The main prediction is that the general factor in the LPFS should show low convergent validity across raters. One study with self and informant ratings of personality disorders provided initial evidence for this hypothesis, but structural equation modeling would be needed to quantify the amount of convergent validity in evaluative variance across raters (Quilty, Cosentino, & Bagby, 2018).

In conclusion, while it is too early to dismiss the presence of a general factor of personality disorders, the present results raise serious concerns about the construct validity of the Levels of Personality Functioning Scale. While LPFS scores reflect a general factor, it is not clear that this general factor corresponds to a general disposition of personality functioning. First, conceptual analysis questions the construct of personality functioning. Second, empirical analysis shows that the general factor correlates highly with evaluative bias in personality ratings. As a result, researchers interested in personality disorders need to rethink the concept of personality disorders, use a multi-method approach to the measurement of personality disorders, and develop measurement models that separate substantive variance from response artifacts. They also need to work more closely with personality researchers because a viable theory of personality disorders has to be grounded in a theory of normal personality functioning.


Biesanz, J. C., & West, S. G. (2004). Towards Understanding Assessments of the Big Five: Multitrait-Multimethod Analyses of Convergent and Discriminant Validity Across Measurement Occasion and Type of Observer. Journal of Personality, 72(4), 845–876.

Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement theory and public policy: Proceedings of a symposium in honor of Lloyd G. Humphreys (pp. 147–171). Urbana: University of Illinois Press.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.

Chang, L., Connelly, B. S., & Geeza, A. A. (2012). Separating method factors and higher order traits of the Big Five: A meta-analytic multitrait–multimethod approach. Journal of Personality and Social Psychology, 102(2), 408–426.

DeYoung, C. G. (2006). Higher-order factors of the Big Five in a multi-informant sample. Journal of Personality and Social Psychology, 91(6), 1138–1151.

Hopwood, C. J., Good, E. W., & Morey, L. C. (2018). Validity of the DSM–5 Levels of Personality Functioning Scale–Self Report. Journal of Personality Assessment, 100(6), 650–659. DOI: 10.1080/00223891.2017.1420660

Quilty, L. C., Cosentino, N., & Bagby, R. M. (2018). Response bias and the Personality Inventory for DSM-5: Contrasting self- and informant-report. Personality Disorders: Theory, Research, and Treatment, 9(4), 346–353.

Kim, H., Schimmack, U., & Oishi, S. (2012). Cultural differences in self- and other-evaluations and well-being: A study of European and Asian Canadians. Journal of Personality and Social Psychology, 102(4), 856–873.

Markus, K. A., & Borsboom, D. (2013). Frontiers of test validity theory: Measurement, causation, and meaning. Routledge/Taylor & Francis Group.

McCabe, G. A., Oltmanns, J. R., & Widiger, T. A. (2022). The General Factors of Personality Disorder, Psychopathology, and Personality. Journal of Personality Disorders, 36(2), 129–156.

Morey, L. C. (2017). Development and initial evaluation of a self-report form of the DSM–5 Level of Personality Functioning Scale. Psychological Assessment, 29(10), 1302–1308.

Morey, L. C., McCredie, M. N., Bender, D. S., & Skodol, A. E. (2022). Criterion A: Level of personality functioning in the alternative DSM–5 model for personality disorders. Personality Disorders: Theory, Research, and Treatment, 13(4), 305–315.

Schimmack, U., & Kim, H. (2020). An integrated model of social psychological and personality psychological perspectives on personality and wellbeing. Journal of Research in Personality, 84, Article 103888.

Sleep, C. E., & Lynam, D. R. (2022). The problems with Criterion A: A comment on Morey et al. (2022). Personality Disorders: Theory, Research, and Treatment, 13(4), 325–327.

Sleep, C. E., Weiss, B., Lynam, D. R., & Miller, J. D. (2020). The DSM-5 Section III personality disorder Criterion A in relation to both pathological and general personality traits. Personality Disorders: Theory, Research, and Treatment, 11(3), 202–212.

Skodol, A. E. (2011). Scientific issues in the revision of personality disorders for DSM-5. Personality and Mental Health, 5, 97–111.

Zachar, P. (2017). Personality disorder: Philosophical problems. In T. Schramme & S. Edwards (Eds.), Handbook of the philosophy of medicine. Dordrecht: Springer.

Zachar, P., & Krueger, R. F. (2013). Personality disorder and validity: A history of controversy. In K. W. M. Fulford, M. Davies, R. G. T. Gipps, G. Graham, J. Z. Sadler, G. Stanghellini, & T. Thornton (Eds.), The Oxford handbook of philosophy and psychiatry (pp. 889–910). Oxford University Press.

Zimmermann, J. (2022). Beyond defending or abolishing Criterion A: Comment on Morey et al. (2022). Personality Disorders: Theory, Research, and Treatment, 13(4), 321–324.