Dan Ariely and the Credibility of (Social) Psychological Science

It was relatively quiet on academic twitter when most academics were enjoying the last weeks of summer before the start of a new, new-normal semester. This changed on August 17, when the datacolada crew published a new blog post that revealed fraud in a study of dishonesty (http://datacolada.org/98). Suddenly, the integrity of social psychology was once again discussed on twitter, in several newspaper articles, and an article in Science magazine (O’Grady, 2021). The discovery of fraud in one dataset raises questions about other studies in articles published by the same researcher as well as in social psychology in general (“some researchers are calling Ariely’s large body of work into question”; O’Grady, 2021).

The brouhaha about the discovery of fraud is understandable because fraud is widely considered an unethical behavior that violates standards of academic integrity that may end a career (e.g., Stapel). However, there are many other reasons to be suspect of the credibility of Dan Ariely’s published results and those by many other social psychologists. Over the past decade, strong scientific evidence has accumulated that social psychologists’ research practices were inadequate and often failed to produce solid empirical findings that can inform theories of human behavior, including dishonest ones.

Arguably, the most damaging finding for social psychology was the finding that only 25% of published results could be replicated in a direct attempt to reproduce original findings (Open Science Collaboration, 2015). With such a low base-rate of successful replications, all published results in social psychology journals are likely to fail to replicate. The rational response to this discovery is to not trust anything that is published in social psychology journals unless there is evidence that a finding is replicable. Based on this logic, the discovery of fraud in a study published in 2012 is of little significance. Even without fraud, many findings are questionable.

Questionable Research Practices

The idealistic model of a scientist assumes that scientists test predictions by collecting data and then let the data decide whether the prediction was true or false. Articles are written to follow this script with an introduction that makes predictions, a results section that tests these predictions, and a conclusion that takes the results into account. This format makes articles look like they follow the ideal model of science, but it only covers up the fact that actual science is produced in a very different way; at least in social psychology before 2012. Either predictions are made after the results are known (Kerr, 1998) or the results are selected to fit the predictions (Simmons, Nelson, & Simonsohn, 2011).

This explains why most articles in social psychology support authors’ predictions (Sterling, 1959; Sterling et al., 1995; Motyl et al., 2017). This high success rate is not the result of brilliant scientists and deep insights into human behaviors. Instead, it is explained by selection for (statistical) significance. That is, when a result produces a statistically significant result that can be used to claim support for a prediction, researchers write a manuscript and submit it for publication. However, when the result is not significant, they do not write a manuscript. In addition, researchers will analyze their data in multiple ways. If they find one way that supports their predictions, they will report this analysis, and not mention that other ways failed to show the effect. Selection for significance has many names such as publication bias, questionable research practices, or p-hacking. Excessive use of these practices makes it easy to provide evidence for false predictions (Simmons, Nelson, & Simonsohn, 2011). Thus, the end-result of using questionable practices and fraud can be the same; published results are falsely used to support claims as scientifically proven or validated, when they actually have not been subjected to a real empirical test.

Although questionable practices and fraud have the same effect, scientists make a hard distinction between fraud and QRPs. While fraud is generally considered to be dishonest and punished with retractions of articles or even job losses, QRPs are tolerated. This leads to the false impression that articles that have not been retracted provide credible evidence and can be used to make scientific arguments (studies show ….). However, QRPs are much more prevalent than outright fraud and account for the majority of replication failures, but do not result in retractions (John, Loewenstein, & Prelec, 2012; Schimmack, 2021).

The good news is that the use of QRPs is detectable even when original data are not available, whereas fraud typically requires access to the original data to reveal unusual patterns. Over the past decade, my collaborators and I have worked on developing statistical tools that can reveal selection for significance (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). I used the most advanced version of these methods, z-curve.2.0, to examine the credibility of results published in Dan Ariely’s articles.

Data

To examine the credibility of results published in Dan Ariely’s articles I followed the same approach that I used for other social psychologists (Replicability Audits). I selected articles based on authors’ H-Index in WebOfKnowledge. At the time of coding, Dan Ariely had an H-Index of 47; that is, he published 47 articles that were cited at least 47 times. I also included the 48th article that was cited 47 times. I focus on the highly cited articles because dishonest reporting of results is more harmful, if the work is highly cited. Just like a falling tree may not make a sound if nobody is around, untrustworthy results in an article that is not cited have no real effect.

For all empirical articles, I picked the most important statistical test per study. The coding of focal results is important because authors may publish non-significant results when they made no prediction. They may also publish a non-significant result when they predict no effect. However, most claims are based on demonstrating a statistically significant result. The focus on a single result is needed to ensure statistical independence which is an assumption made by the statistical model. When multiple focal tests are available, I pick the first one unless another one is theoretically more important (e.g., featured in the abstract). Although this coding is subjective, other researchers including Dan Ariely can do their own coding and verify my results.

Thirty-one of the 48 articles reported at least one empirical study. As some articles reported more than one study, the total number of studies was k = 97. Most of the results were reported with test-statistics like t, F, or chi-square values. These values were first converted into two-sided p-values and then into absolute z-scores. 92 of these z-scores were statistically significant and used for a z-curve analysis.

Z-Curve Results

The key results of the z-curve analysis are captured in Figure 1.

Figure 1

Visual inspection of the z-curve plot shows clear evidence of selection for significance. While a large number of z-scores are just statistically significant (z > 1.96 equals p < .05), there are very few z-scores that are just shy of significance (z < 1.96). Moreover, the few z-scores that do not meet the standard of significance were all interpreted as sufficient evidence for a prediction. Thus, Dan Ariely’s observed success rate is 100% or 95% if only p-values below .05 are counted. As pointed out in the introduction, this is not a unique feature of Dan Ariely’s articles, but a general finding in social psychology.

A formal test of selection for significance compares the observed discovery rate (95% z-scores greater than 1.96) to the expected discovery rate that is predicted by the statistical model. The prediction of the z-curve model is illustrated by the blue curve. Based on the distribution of significant z-scores, the model expected a lot more non-significant results. The estimated expected discovery rate is only 15%. Even though this is just an estimate, the 95% confidence interval around this estimate ranges from 5% to only 31%. Thus, the observed discovery rate is clearly much much higher than one could expect. In short, we have strong evidence that Dan Ariely and his co-authors used questionable practices to report more successes than their actual studies produced.

Although these results cast a shadow over Dan Ariely’s articles, there is a silver lining. It is unlikely that the large pile of just significant results was obtained by outright fraud; not impossible, but unlikely. The reason is that QRPs are bound to produce just significant results, but fraud can produce extremely high z-scores. The fraudulent study that was flagged by datacolada has a z-score of 11, which is virtually impossible to produce with QRPs (Simmons et al., 2001). Thus, while we can disregard many of the results in Ariely’s articles, he does not have to fear to lose his job (unless more fraud is uncovered by data detectives). Ariely is also in good company. The expected discovery rate for John A. Bargh is 15% (Bargh Audit) and the one for Roy F. Baumester is 11% (Baumeister Audit).

The z-curve plot also shows some z-scores greater than 3 or even greater than 4. These z-scores are more likely to reveal true findings (unless they were obtained with fraud) because (a) it gets harder to produce high z-scores with QRPs and replication studies show higher success rates for original studies with strong evidence (Schimmack, 2021). The problem is to find a reasonable criterion to distinguish between questionable results and credible results.

Z-curve make it possible to do so because the EDR estimates can be used to estimate the false discovery risk (Schimmack & Bartos, 2021). As shown in Figure 1, with an EDR of 15% and a significance criterion of alpha = .05, the false discovery risk is 30%. That is, up to 30% of results with p-values below .05 could be false positive results. The false discovery risk can be reduced by lowering alpha. Figure 2 shows the results for alpha = .01. The estimated false discovery risk is now below 5%. This large reduction in the FDR was achieved by treating the pile of just significant results as no longer significant (i.e., it is now on the left side of the vertical red line that reflects significance with alpha = .01, z = 2.58).

With the new significance criterion only 51 of the 97 tests are significant (53%). Thus, it is not necessary to throw away all of Ariely’s published results. About half of his published results might have produced some real evidence. Of course, this assumes that z-scores greater than 2.58 are based on real data. Any investigation should therefore focus on results with p-values below .01.

The final information that is provided by a z-curve analysis is the probability that a replication study with the same sample size produces a statistically significant result. This probability is called the expected replication rate (ERR). Figure 1 shows an ERR of 52% with alpha = 5%, but it includes all of the just significant results. Figure 2 excludes these studies, but uses alpha = 1%. Figure 3 estimates the ERR only for studies that had a p-value below .01 but using alpha = .05 to evaluate the outcome of a replication study.

Figur e3

In Figure 3 only z-scores greater than 2.58 (p = .01; on the right side of the dotted blue line) are used to fit the model using alpha = .05 (the red vertical line at 1.96) as criterion for significance. The estimated replication rate is 85%. Thus, we would predict mostly successful replication outcomes with alpha = .05, if these original studies were replicated and if the original studies were based on real data.

Conclusion

The discovery of a fraudulent dataset in a study on dishonesty has raised new questions about the credibility of social psychology. Meanwhile, the much bigger problem of selection for significance is neglected. Rather than treating studies as credible unless they are retracted, it is time to distrust studies unless there is evidence to trust them. Z-curve provides one way to assure readers that findings can be trusted by keeping the false discovery risk at a reasonably low level, say below 5%. Applying this methods to Ariely’s most cited articles showed that nearly half of Ariely’s published results can be discarded because they entail a high false positive risk. This is also true for many other findings in social psychology, but social psychologists try to pretend that the use of questionable practices was harmless and can be ignored. Instead, undergraduate students, readers of popular psychology books, and policy makers may be better off by ignoring social psychology until social psychologists report all of their results honestly and subject their theories to real empirical tests that may fail. That is, if social psychology wants to be a science, social psychologists have to act like scientists.

Aber bitte ohne Sanna

Abstract

Social psychologists have failed to clean up their act and their literature. Here I show unusually high effect sizes in non-retracted articles by Sanna, who retracted several articles. I point out that non-retraction does not equal credibility and I show that co-authors like Norbert Schwarz lack any motivation to correct the published record. The inability of social psychologists to acknowledge and correct their mistakes renders social psychology a para-science that lacks credibility. Even meta-analyses cannot be trusted because they do not correct properly for the use of questionable research practices.

Introduction

When I grew up, a popular German Schlager was the song “Aber bitte mit Sahne.” The song is about Germans love of deserts with whipped cream. So, when I saw articles by Sanna, I had to think about whipped cream, which is delicious. Unfortunately, articles by Sanna are the exact opposite. In the early 2010s, it became apparent that Sanna had fabricated data. However, unlike the thorough investigation of a similar case in the Netherlands, the extent of Sanna’s fraud remains unclear (Retraction Watch, 2012). The latest count of Sanna’s retracted articles was 8 (Retraction Watch, 2013).

WebOfScience shows 5 retraction notices for 67 articles, which means 62 articles have not been retracted. The question is whether these article can be trusted to provide valid scientific information? The answer to this question matters because Sanna’s articles are still being cited at a rate of over 100 citations per year.

Meta-Analysis of Ease of Retrieval

The data are also being used in meta-analyses (Weingarten & Hutchinson, 2018). Fraudulent data are particularly problematic for meta-analysis because fraud can produce large effect size estimates that may inflate effect size estimates. Here I report the results of my own investigation that focusses on the ease-of-retrieval paradigm that was developed by Norbert Schwarz and colleagues (Schwarz et al., 1991).

The meta-analysis included 7 studies from 6 articles. Two studies produced independent effect size estimates for 2 conditions for a total of 9 effect sizes.

Sanna, L. J., Schwarz, N., & Small, E. M. (2002). Accessibility experiences and the hindsight bias: I knew it all along versus it could never have happened. Memory & Cognition, 30(8), 1288–1296. https://doi.org/10.3758/BF03213410 [Study 1a, 1b]

Sanna, L. J., Schwarz, N., & Stocker, S. L. (2002). When debiasing backfires: Accessible content and accessibility experiences in debiasing hindsight. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(3), 497–502. https://doi.org/10.1037/0278-7393.28.3.497
[Study 1 & 2]

Sanna, L. J., & Schwarz, N. (2003). Debiasing the hindsight bias: The role of accessibility experiences and (mis)attributions. Journal of Experimental Social Psychology, 39(3), 287–295. https://doi.org/10.1016/S0022-1031(02)00528-0 [Study 1]

Sanna, L. J., Chang, E. C., & Carter, S. E. (2004). All Our Troubles Seem So Far Away: Temporal Pattern to Accessible Alternatives and Retrospective Team Appraisals. Personality and Social Psychology Bulletin, 30(10), 1359–1371. https://doi.org/10.1177/0146167204263784
[Study 3a]

Sanna, L. J., Parks, C. D., Chang, E. C., & Carter, S. E. (2005). The Hourglass Is Half Full or Half Empty: Temporal Framing and the Group Planning Fallacy. Group Dynamics: Theory, Research, and Practice, 9(3), 173–188. https://doi.org/10.1037/1089-2699.9.3.173 [Study 3a, 3b]

Carter, S. E., & Sanna, L. J. (2008). It’s not just what you say but when you say it: Self-presentation and temporal construal. Journal of Experimental Social Psychology, 44(5), 1339–1345. https://doi.org/10.1016/j.jesp.2008.03.017 [Study 2]

When I examined Sanna’s results, I found that all 9 of these 9 effect sizes were extremely large with effect size estimates being larger than one standard deviation. A logistic regression analysis that predicted authorship (With Sanna vs. Without Sanna) showed that the large effect sizes in Sanna’s articles were unlikely to be due to sampling error alone, b = 4.6, se = 1.1, t(184) = 4.1, p = .00004 (1 / 24,642).

These results show that Sanna’s effect sizes are not typical for the ease-of-retrieval literature. As one of his retracted articles used the ease-of retrieval paradigm, it is possible that these articles are equally untrustworthy. As many other studies have investigated ease-of-retrieval effects, it seems prudent to exclude articles by Sanna from future meta-analysis.

These articles should also not be cited as evidence for specific claims about ease-of-retrieval effects for the specific conditions that were used in these studies. As the meta-analysis shows, there have been no credible replications of these studies and it remains unknown how much ease of retrieval may play a role under the specified conditions in Sanna’s articles.

Discussion

The blog post is also a warning for young scientists and students of social psychology that they cannot trust researchers who became famous with the help of questionable research practices that produced too many significant results. As the reference list shows, several articles by Sanna were co-authored by Norbert Schwarz, the inventor of the ease-of-retrieval paradigm. It is most likely that he was unaware of Sanna’s fraudulent practices. However, he seemed to lack any concerns that the results might be too good to be true. After all, he encountered replicaiton failures in his own lab.

of course, we had studies that remained unpublished. Early on we experimented with different manipulations. The main lesson was: if you make the task too blatantly difficult, people correctly conclude the task is too difficult and draw no inference about themselves. We also had a couple of studies with unexpected gender differences” (Schwarz, email communication, 5/18,21).

So, why was he not suspicious when Sanna only produced successful results? I was wondering whether Schwarz had some doubts about these studies with the help of hindsight bias. After all, a decade or more later, we know that he committed fraud for some articles on this topic, we know about replication failures in larger samples (Yeager et al., 2019), and we know that the true effect sizes are much smaller than Sanna’s reported effect sizes (Weingarten & Hutchinson, 2018).

Hi Norbert, 
   thank you for your response. I am doing my own meta-analysis of the literature as I have some issues with the published one by Evan. More about that later. For now, I have a question about some articles that I came across, specifically Sanna, Schwarz, and Small (2002). The results in this study are very strong (d ~ 1).  Do you think a replication study powered for 95% power with d = .4 (based on meta-analysis) would produce a significant result? Or do you have concerns about this particular paradigm and do not predict a replication failure?
Best, Uli (email

His response shows that he is unwilling or unable to even consider the possibility that Sanna used fraud to produce the results in this article that he co-authored.

Uli, that paper has 2 experiments, one with a few vs many manipulation and one with a facial manipulation.  I have no reason to assume that the patterns won’t replicate. They are consistent with numerous earlier few vs many studies and other facial manipulation studies (introduced by Stepper & Strack,  JPSP, 1993). The effect sizes always depend on idiosyncracies of topic, population, and context, which influence accessible content and accessibility experience. The theory does not make point predictions and the belief that effect sizes should be identical across decades and populations is silly — we’re dealing with judgments based on accessible content, not with immutable objects.  

This response is symptomatic of social psychologists response to decades of research that has produced questionable results that often fail to replicate (see Schimmack, 2020, for a review). Even when there is clear evidence of questionable practices, journals are reluctant to retract articles that make false claims based on invalid data (Kitayama, 2020). And social psychologist Daryl Bem wants rather be remembered as loony para-psychologists than as real scientists (Bem, 2021).

The problem with these social psychologists is not that they made mistakes in the way they conducted their studies. The problem is their inability to acknowledge and correct their mistakes. While they are clinging to their CVs and H-Indices to protect their self-esteem, they are further eroding trust in psychology as a science and force junior scientists who want to improve things out of academia (Hilgard, 2021). After all, the key feature of science that distinguishes it from ideologies is the ability to correct itself. A science that shows no signs of self-correction is a para-science and not a real science. Thus, social psychology is currently para-science (i.e., “Parascience is a broad category of academic disciplines, that are outside the scope of scientific study, Wikipedia).

The only hope for social psychology is that young researchers are unwilling to play by the old rules and start a credibility revolution. However, the incentives still favor conformists who suck up to the old guard. Thus, it is unclear if social psychology will ever become a real science. A first sign of improvement would be to retract articles that make false claims based on results that were produced with questionable research practices. Instead, social psychologists continue to write review articles that ignore the replication crisis (Schwarz & Strack, 2016) as if repression can bend reality.

Nobody should believe them.

Awkward Silences: Technical Delays Can Diminish Feelings of Unity and Belonging

France_in_XXI_Century._Correspondance_cinema

Smooth social interaction is fundamental to a sense of togetherness. We’ve all experienced disrupted conversations—some caused by human awkwardness and others by breakdowns in technology. The content of our interactions does influence our connection to each other, but the form and process of communication also play a role.  Technical delays that occur below our conscious detection can still make us feel like we don’t quite click with the person we are trying to communicate with. The authors of a recently published PLOS ONE article, funded by a Google Research Award, investigated how delays introduced into technologically mediated conversations affected participants’ sense of solidarity with each other, defined as unity, belongingness, and shared reality.

For this research, conducted at University of Groningen, The Netherlands, participants in three sets of experiments sat in cubicles with headsets connected to computers (conditions that many of us with desk jobs can relate to) and were asked to talk about holidays for five minutes with an assigned partner. Some conversations were uninterrupted. Others were manipulated by introducing a one-second auditory delay. Some pairs knew about the delay and others did not. Afterward, the conversationalists completed a questionnaire about their sense of unity, belonging, understanding, and agreement with their partners.

journal.pone.0078363.s001

Researchers found that those participants whose conversations were interrupted expressed significantly diminished feelings of unity and belonging. Awareness of technical problems had no apparent effect on perceived solidarity.  Even acquaintances stated that they felt a disconnect, though to a lesser degree, than participants who did not know each other. Despite participants expressing that they felt less unity and belongingness with their partner even when they had the opportunity to attribute it to technical problems, technology did not get a free pass on the delayed signal. Those with an interrupted connection also expressed less satisfaction with the technology. Points may have been lost for both relationships and telecommunications.

In a world where our interactions are increasingly mediated by computers and mobile phones with less than perfect signals, the authors suggest that this research provides insight into how our daily interactions may be affected. The method of communication we choose may influence our personal and business relationships, especially among strangers. The authors also posit that technology meant to improve long distance communication by imitating face-to-face interaction may not measure up to expectations if it is not executed without interruptions or delays. Perhaps this is something to consider during your next awkward phone call or video conference— though your awareness of technology as a possible barrier ultimately may not make a difference in how you perceive the person on the other end of the line.

Citation: Koudenburg N, Postmes T, Gordijn EH (2013) Conversational Flow Promotes Solidarity. PLoS ONE 8(11): e78363. doi:10.1371/journal.pone.0078363

Images: First image by Villemard is in the public domain. Second image is Supplemetary Figure 1 from the article.