Replicability of Research in Frontiers in Psychology

Summary

The z-curve analysis of results in this journal shows (a) that many published results are based on studies with low to modest power, (b) selection for significance inflates effect size estimates and the discovery rate of reported results, and (c) there is no evidence that research practices have changed over the past decade. Readers should be careful when they interpret results and recognize that reported effect sizes are likely to overestimate real effect sizes, and that replication studies with the same sample size may fail to produce a significant result again. To avoid misleading inferences, I suggest using alpha = .005 as a criterion for valid rejections of the null-hypothesis. Using this criterion, the risk of a false positive result is below 2%. I also recommend computing a 99% confidence interval rather than the traditional 95% confidence interval for the interpretation of effect size estimates.

Given the low power of many studies, readers also need to avoid the fallacy of interpreting non-significant results as evidence for the absence of an effect. With 50% power, results can easily switch in a replication study, so that a significant result becomes non-significant or a non-significant result becomes significant. However, because of selection for significance, it is more likely that a significant result becomes non-significant in a replication than that a change occurs in the opposite direction.

The average power of studies in a heterogeneous journal like Frontiers in Psychology provides only circumstantial evidence for the evaluation of individual results. When more specific information is available (e.g., a z-curve analysis of a discipline, author, or topic), it may be more appropriate to rely on that information.

Report

Frontiers in Psychology was created in 2010 as a new online-only journal for psychology. It covers many different areas of psychology, although some areas have specialized Frontiers journals like Frontiers in Behavioral Neuroscience.

The business model of Frontiers journals relies on publication fees paid by authors, while published articles are freely available to readers.

The number of articles in Frontiers in Psychology has increased quickly from 131 articles in 2010 to 8,072 articles in 2022 (source: Web of Science). With over 8,000 published articles a year, Frontiers in Psychology is an important outlet for psychological researchers to publish their work. Many specialized print journals publish fewer than 100 articles a year. Thus, Frontiers in Psychology offers a broad and large sample of psychological research that is equivalent to a composite of 80 or more specialized journals.

Another advantage of Frontiers in Psychology is its relatively low rejection rate compared to specialized journals with limited journal space. While high rejection rates may allow journals to prioritize exceptionally good research, articles published in Frontiers in Psychology are more likely to reflect the common research practices of psychologists.

To examine the replicability of research published in Frontiers in Psychology, I downloaded all published articles as PDF files, converted the PDF files to text files, and extracted test statistics (F, t, and z-tests) from the published articles. Although this method does not capture all published results, there is no a priori reason that results reported in this format differ from other results. More importantly, changes in research practices, such as higher power due to larger samples, would be reflected in all statistical tests.
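To illustrate the conversion step, the sketch below pulls t-tests of the form "t(df) = value" out of a text string with a regular expression and converts them into absolute z-scores; the regular expression and the example sentence are simplified illustrations, not the actual extraction code:

# Hypothetical example sentence with two reported t-tests
text <- "The effect was significant, t(48) = 2.31, p = .025, but did not replicate, t(95) = 1.44, p = .15."

# Extract strings of the form "t(df) = value"
matches <- regmatches(text, gregexpr("t\\(\\d+\\)\\s*=\\s*-?\\d+\\.?\\d*", text))[[1]]
df   <- as.numeric(sub("t\\((\\d+)\\).*", "\\1", matches))
tval <- as.numeric(sub(".*=\\s*", "", matches))

# Convert each t-value to a two-sided p-value and then to an absolute z-score
p <- 2 * pt(abs(tval), df, lower.tail = FALSE)
z <- qnorm(1 - p / 2)

# Observed discovery rate: proportion of results that are significant at alpha = .05
mean(z > qnorm(1 - .05 / 2))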

Because Frontiers in Psychology started only shortly before the replication crisis in psychology increased awareness about the problems of low statistical power and selection for significance (publication bias), I was not able to examine replicability before 2011. I also found little evidence of changes in the years from 2010 to 2015. Therefore, I use this time period as the starting point and benchmark for later years.

Figure 1 shows a z-curve plot of results published from 2010 to 2014. All test-statistics are converted into z-scores. Z-scores greater than 1.96 (the solid red line) are statistically significant at alpha = .05 (two-sided) and typically used to claim a discovery (rejection of the null-hypothesis). Sometimes even z-scores between 1.65 (the dotted red line) and 1.96 are used to reject the null-hypothesis either as a one-sided test or as marginal significance. Using alpha = .05, the plot shows 71% significant results, which is called the observed discovery rate (ODR).

Visual inspection of the plot shows a peak of the distribution right at the significance criterion. It also shows that z-scores drop sharply on the left side of the peak when the results do not reach the criterion for significance. This wonky distribution cannot be explained with sampling error. Rather, it shows a selective bias to publish significant results by means of questionable practices such as not reporting failed replication studies or inflating effect sizes by means of statistical tricks. To quantify the amount of selection bias, z-curve fits a model to the distribution of significant results and estimates the distribution of non-significant results (i.e., the grey curve in the range of non-significant results). The discrepancy between the observed distribution and the expected distribution shows the file-drawer of missing non-significant results. Z-curve estimates that the reported significant results are only 31% of the estimated distribution. This is called the expected discovery rate (EDR). Thus, there are more than twice as many significant results as the statistical power of studies justifies (71% vs. 31%). Confidence intervals around these estimates show that the discrepancy is not just due to chance, but reflects active selection for significance.
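Readers who want to run such an analysis on their own set of z-scores can use the zcurve R package (Bartoš & Schimmack); the sketch below is an assumption about its basic interface, and the exact arguments may differ across versions:

# install.packages("zcurve")
library(zcurve)

z <- abs(rnorm(1000, mean = 2, sd = 1))   # placeholder; replace with z-scores extracted from articles

fit <- zcurve(z)   # fits the model to the distribution of significant z-scores
summary(fit)       # reports the EDR and ERR with confidence intervals
plot(fit)          # draws a z-curve plot like the figures discussed here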

Using a formula developed by Soric (1989), it is possible to estimate the false discovery risk (FDR), that is, the probability that a significant result was obtained without a real effect (a type-I error). The estimated FDR is 12%. This may not be alarming, but the risk varies as a function of the strength of evidence (the magnitude of the z-score). Z-scores that correspond to p-values close to p = .05 have a higher false positive risk, and large z-scores have a smaller false positive risk. Moreover, even true results are unlikely to replicate when significance was obtained with inflated effect sizes. The most optimistic estimate of replicability is the expected replication rate (ERR) of 69%. This estimate, however, assumes that a study can be replicated exactly, including the same sample size. Actual replication rates are often lower than the ERR and tend to fall between the EDR and ERR. Thus, the predicted replication rate is around 50%. This is somewhat higher than the replication rate in the Open Science Collaboration replication of 100 studies, which was 37%.
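For readers who want to verify this number, Soric's upper bound can be computed directly from the EDR; a one-line sketch in R:

# Soric's (1989) upper bound on the false discovery risk
soric_fdr <- function(edr, alpha = .05) (1 / edr - 1) * alpha / (1 - alpha)
soric_fdr(edr = .31)   # approximately .12, the 12% reported above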

Figure 2 examines how things have changed in the next five years.

The observed discovery rate decreased slightly, but statistically significantly, from 71% to 66%. This shows that researchers reported more non-significant results. The expected discovery rate increased from 31% to 40%, but the overlapping confidence intervals imply that this is not a statistically significant increase at the alpha = .01 level (if two 95% confidence intervals do not overlap, the difference is significant at roughly alpha = .01). Although smaller, the difference between the ODR of 66% and the EDR of 40% is statistically significant and shows that selection for significance continues. The ERR estimate did not change, indicating that significant results are not obtained with more power. Overall, these results show only modest improvements, suggesting that most researchers who publish in Frontiers in Psychology continue to conduct research in the same way as they did before, despite ample discussions about the need for methodological reforms such as a priori power analysis and reporting of non-significant results.

The results for 2020 show that the increase in the EDR was a statistical fluke rather than a trend. The EDR returned to the level of 2010-2015 (29% vs. 31%), but the ODR remained lower than in the beginning, showing slightly more reporting of non-significant results. The size of the file drawer remains large, with an ODR of 66% and an EDR of 29%.

The EDR results for 2021 look better again, but the difference from 2020 is not statistically significant. Moreover, the results in 2022 show a lower EDR that matches the EDR at the beginning.

Overall, these results show that results published in Frontiers in Psychology are selected for significance. While the observed discovery rate is in the upper 60%s, the expected discovery rate is around 35%. Thus, results are reported as significant at nearly twice the rate that the power of the studies can justify. Most concerning is that a decade of meta-psychological discussions about research practices has not produced any notable changes in the amount of selection bias or in the power of studies to produce replicable results.

How should readers of Frontiers in Psychology articles deal with this evidence that some published results were obtained with low power and inflated effect sizes that will not replicate? One solution is to retrospectively change the significance criterion. Comparisons of the evidence in original studies and replication outcomes suggest that studies with a p-value below .005 tend to replicate at a rate of 80%, whereas studies with just significant p-values (.050 to .005) replicate at a much lower rate (Schimmack, 2022). Demanding stronger evidence also reduces the false positive risk. This is illustrated in the last figure that uses results from all years, given the lack of any time trend.

In the figure, the solid red line has moved to z = 2.8, the value that corresponds to p = .005 (two-sided). Using this more stringent criterion for significance, only 45% of the z-scores are significant. Another 25% were significant with alpha = .05 but are no longer significant with alpha = .005. Because power decreases when alpha is set to a more stringent, lower level, the EDR is also reduced, to only 21%. Thus, there is still selection for significance. However, the more effective significance filter also selects for studies with higher power, and the ERR remains at 72%, even with alpha = .005 for the replication study. If the replication study used the traditional alpha level of .05, the ERR would be even higher, which explains the finding that the actual replication rate for studies with p < .005 is about 80%.

The lower alpha also reduces the risk of false positive results, even though the EDR is reduced. The FDR is only 2%. Thus, the null-hypothesis is unlikely to be true. The caveat is that the standard null-hypothesis in psychology is the nil-hypothesis, and the population effect size might be too small to be of practical significance. Thus, readers who interpret results with p-values below .005 should also evaluate the confidence interval around the reported effect size, using a more conservative 99% confidence interval rather than the traditional 95% confidence interval. In many cases, this confidence interval is likely to be wide and to provide insufficient information about the strength of an effect.
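As a quick illustration of these recommendations, the sketch below computes the z-value that corresponds to alpha = .005 (two-sided) and a 99% confidence interval for a hypothetical effect size estimate; the estimate and its standard error are made-up numbers:

# z-value corresponding to alpha = .005 (two-sided), the threshold used in the last figure
qnorm(1 - .005 / 2)   # about 2.81

# Hypothetical effect size estimate and standard error
est <- 0.40
se  <- 0.15

# 99% confidence interval instead of the traditional 95% interval
est + c(-1, 1) * qnorm(1 - .01 / 2) * se   # roughly 0.01 to 0.79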

Dan Ariely and the Credibility of (Social) Psychological Science

It was relatively quiet on academic twitter when most academics were enjoying the last weeks of summer before the start of a new, new-normal semester. This changed on August 17, when the datacolada crew published a new blog post that revealed fraud in a study of dishonesty (http://datacolada.org/98). Suddenly, the integrity of social psychology was once again discussed on twitter, in several newspaper articles, and an article in Science magazine (O’Grady, 2021). The discovery of fraud in one dataset raises questions about other studies in articles published by the same researcher as well as in social psychology in general (“some researchers are calling Ariely’s large body of work into question”; O’Grady, 2021).

The brouhaha about the discovery of fraud is understandable because fraud is widely considered an unethical behavior that violates standards of academic integrity and may end a career (e.g., Stapel). However, there are many other reasons to be suspicious of the credibility of Dan Ariely's published results and those of many other social psychologists. Over the past decade, strong scientific evidence has accumulated that social psychologists' research practices were inadequate and often failed to produce solid empirical findings that can inform theories of human behavior, including dishonest behavior.

Arguably, the most damaging finding for social psychology was the finding that only 25% of published results could be replicated in a direct attempt to reproduce original findings (Open Science Collaboration, 2015). With such a low base rate of successful replications, any given published result in a social psychology journal is more likely than not to fail to replicate. The rational response to this discovery is to not trust anything that is published in social psychology journals unless there is evidence that a finding is replicable. Based on this logic, the discovery of fraud in a study published in 2012 is of little significance. Even without fraud, many findings are questionable.

Questionable Research Practices

The idealistic model of a scientist assumes that scientists test predictions by collecting data and then let the data decide whether the prediction was true or false. Articles are written to follow this script with an introduction that makes predictions, a results section that tests these predictions, and a conclusion that takes the results into account. This format makes articles look like they follow the ideal model of science, but it only covers up the fact that actual science is produced in a very different way; at least in social psychology before 2012. Either predictions are made after the results are known (Kerr, 1998) or the results are selected to fit the predictions (Simmons, Nelson, & Simonsohn, 2011).

This explains why most articles in social psychology support authors' predictions (Sterling, 1959; Sterling et al., 1995; Motyl et al., 2017). This high success rate is not the result of brilliant scientists and deep insights into human behaviors. Instead, it is explained by selection for (statistical) significance. That is, when a study produces a statistically significant result that can be used to claim support for a prediction, researchers write a manuscript and submit it for publication. However, when the result is not significant, they do not write a manuscript. In addition, researchers will analyze their data in multiple ways. If they find one way that supports their predictions, they will report this analysis and not mention that the other ways failed to show the effect. Selection for significance has many names, such as publication bias, questionable research practices, or p-hacking. Excessive use of these practices makes it easy to provide evidence for false predictions (Simmons, Nelson, & Simonsohn, 2011). Thus, the end result of questionable practices and of fraud can be the same: published results are falsely presented as scientifically proven or validated when they have not actually been subjected to a real empirical test.

Although questionable practices and fraud have the same effect, scientists make a hard distinction between fraud and QRPs. While fraud is generally considered to be dishonest and punished with retractions of articles or even job losses, QRPs are tolerated. This leads to the false impression that articles that have not been retracted provide credible evidence and can be used to make scientific arguments (studies show ….). However, QRPs are much more prevalent than outright fraud and account for the majority of replication failures, but do not result in retractions (John, Loewenstein, & Prelec, 2012; Schimmack, 2021).

The good news is that the use of QRPs is detectable even when original data are not available, whereas fraud typically requires access to the original data to reveal unusual patterns. Over the past decade, my collaborators and I have worked on developing statistical tools that can reveal selection for significance (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). I used the most advanced version of these methods, z-curve.2.0, to examine the credibility of results published in Dan Ariely’s articles.

Data

To examine the credibility of results published in Dan Ariely's articles, I followed the same approach that I used for other social psychologists (Replicability Audits). I selected articles based on the author's H-Index in WebOfKnowledge. At the time of coding, Dan Ariely had an H-Index of 47; that is, he had published 47 articles that were cited at least 47 times. I also included the 48th article, which was cited 47 times. I focus on highly cited articles because dishonest reporting of results is more harmful if the work is highly cited. Just like a falling tree may not make a sound if nobody is around, untrustworthy results in an article that is not cited have little real effect.

For all empirical articles, I picked the most important statistical test per study. The coding of focal results is important because authors may publish non-significant results when they made no prediction. They may also publish a non-significant result when they predict no effect. However, most claims are based on demonstrating a statistically significant result. The focus on a single result is needed to ensure statistical independence which is an assumption made by the statistical model. When multiple focal tests are available, I pick the first one unless another one is theoretically more important (e.g., featured in the abstract). Although this coding is subjective, other researchers including Dan Ariely can do their own coding and verify my results.

Thirty-one of the 48 articles reported at least one empirical study. As some articles reported more than one study, the total number of studies was k = 97. Most of the results were reported with test statistics like t, F, or chi-square values. These values were first converted into two-sided p-values and then into absolute z-scores. Ninety-two of these z-scores were statistically significant and were used for the z-curve analysis.

Z-Curve Results

The key results of the z-curve analysis are captured in Figure 1.

Figure 1

Visual inspection of the z-curve plot shows clear evidence of selection for significance. While a large number of z-scores are just statistically significant (z > 1.96, which equals p < .05), there are very few z-scores that are just shy of significance (z < 1.96). Moreover, the few z-scores that do not meet the standard of significance were all interpreted as sufficient evidence for a prediction. Thus, Dan Ariely's observed success rate is 100%, or 95% if only p-values below .05 are counted. As pointed out in the introduction, this is not a unique feature of Dan Ariely's articles, but a general finding in social psychology.

A formal test of selection for significance compares the observed discovery rate (95% of z-scores greater than 1.96) to the expected discovery rate that is predicted by the statistical model. The prediction of the z-curve model is illustrated by the blue curve. Based on the distribution of significant z-scores, the model expects a lot more non-significant results. The estimated expected discovery rate is only 15%. Even though this is just an estimate, the 95% confidence interval around this estimate ranges from 5% to only 31%. Thus, the observed discovery rate is clearly much higher than one could expect. In short, we have strong evidence that Dan Ariely and his co-authors used questionable practices to report more successes than their actual studies produced.
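To make this comparison concrete, a rough way to put a confidence interval around the observed discovery rate is a simple binomial proportion test on the 92 significant results among the 97 focal tests; this is a sketch of the logic, not the exact test implemented in z-curve:

# Confidence interval for the observed discovery rate (92 of 97 focal tests significant)
prop.test(92, 97)$conf.int   # roughly .88 to .98
# The z-curve confidence interval for the EDR is about .05 to .31; the two intervals
# do not overlap, which is the statistical signature of selection for significance.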

Although these results cast a shadow over Dan Ariely's articles, there is a silver lining. It is unlikely that the large pile of just significant results was obtained by outright fraud; not impossible, but unlikely. The reason is that QRPs are bound to produce just significant results, whereas fraud can produce extremely high z-scores. The fraudulent study that was flagged by datacolada has a z-score of 11, which is virtually impossible to produce with QRPs (Simmons et al., 2011). Thus, while we can disregard many of the results in Ariely's articles, he does not have to fear losing his job (unless more fraud is uncovered by data detectives). Ariely is also in good company. The expected discovery rate for John A. Bargh is 15% (Bargh Audit) and the one for Roy F. Baumeister is 11% (Baumeister Audit).

The z-curve plot also shows some z-scores greater than 3 or even greater than 4. These z-scores are more likely to reveal true findings (unless they were obtained with fraud) because (a) it gets harder to produce high z-scores with QRPs and (b) replication studies show higher success rates for original studies with strong evidence (Schimmack, 2021). The problem is to find a reasonable criterion to distinguish between questionable results and credible results.

Z-curve makes it possible to do so because the EDR estimate can be used to estimate the false discovery risk (Schimmack & Bartos, 2021). As shown in Figure 1, with an EDR of 15% and a significance criterion of alpha = .05, the false discovery risk is 30%. That is, up to 30% of results with p-values below .05 could be false positive results. The false discovery risk can be reduced by lowering alpha. Figure 2 shows the results for alpha = .01. The estimated false discovery risk is now below 5%. This large reduction in the FDR was achieved by treating the pile of just significant results as no longer significant (i.e., these results now fall on the left side of the vertical red line that reflects significance with alpha = .01, z = 2.58).

With the new significance criterion only 51 of the 97 tests are significant (53%). Thus, it is not necessary to throw away all of Ariely’s published results. About half of his published results might have produced some real evidence. Of course, this assumes that z-scores greater than 2.58 are based on real data. Any investigation should therefore focus on results with p-values below .01.

The final information that is provided by a z-curve analysis is the probability that a replication study with the same sample size produces a statistically significant result. This probability is called the expected replication rate (ERR). Figure 1 shows an ERR of 52% with alpha = 5%, but it includes all of the just significant results. Figure 2 excludes these studies, but uses alpha = 1%. Figure 3 estimates the ERR only for studies that had a p-value below .01 but using alpha = .05 to evaluate the outcome of a replication study.

Figure 3

In Figure 3, only z-scores greater than 2.58 (p = .01; to the right of the dotted blue line) are used to fit the model, using alpha = .05 (the red vertical line at 1.96) as the criterion for significance. The estimated replication rate is 85%. Thus, we would predict mostly successful replication outcomes with alpha = .05, if these original studies were replicated and if the original studies were based on real data.

Conclusion

The discovery of a fraudulent dataset in a study on dishonesty has raised new questions about the credibility of social psychology. Meanwhile, the much bigger problem of selection for significance is neglected. Rather than treating studies as credible unless they are retracted, it is time to distrust studies unless there is evidence to trust them. Z-curve provides one way to assure readers that findings can be trusted, by keeping the false discovery risk at a reasonably low level, say below 5%. Applying this method to Ariely's most cited articles showed that nearly half of Ariely's published results can be discarded because they entail a high false positive risk. This is also true for many other findings in social psychology, but social psychologists try to pretend that the use of questionable practices was harmless and can be ignored. Instead, undergraduate students, readers of popular psychology books, and policy makers may be better off ignoring social psychology until social psychologists report all of their results honestly and subject their theories to real empirical tests that may fail. That is, if social psychology wants to be a science, social psychologists have to act like scientists.

Requiring high-powered studies from scientists with resource constraints

This blog post is now included in the paper “Sample size justification” available at PsyArXiv. 

Underpowered studies make it very difficult to learn something useful from the studies you perform. Low power means you have a high probability of finding non-significant results, even when there is a true effect. Hypothesis tests with high rates of false negatives (concluding there is nothing when there is something) become a malfunctioning tool. Low power is even more problematic when combined with publication bias (shiny app). After repeated warnings over at least half a century, high quality journals are starting to ask authors who rely on hypothesis tests to provide a sample size justification based on statistical power.
The first time researchers use power analysis software, they typically think they are making a mistake, because the sample sizes required to achieve high power for hypothesized effects are much larger than the sample sizes they collected in the past. After double-checking their calculations and realizing the numbers are correct, a common response is that there is no way they are able to collect this number of observations.
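For readers who have not run such a calculation before, here is a minimal sketch in R; the assumed effect size of d = 0.4 is just an illustration, not a recommendation:

# Required sample size per group for an independent-samples t-test,
# assuming a true effect of d = 0.4, 90% power, and alpha = .05 (two-sided)
power.t.test(delta = 0.4, sd = 1, power = 0.90, sig.level = 0.05)
# n comes out at about 133 participants per group, i.e., over 260 in total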
Published articles on power analysis rarely tell researchers what they should do if they are hired on a 4-year PhD project where the norm is to perform between 4 and 10 studies that can cost at most 1000 euro each, learn about power analysis, and realize there is absolutely no way they will have the time and resources to perform high-powered studies, given that an effect size estimate from an unbiased registered report suggests the effect they are examining is half as large as they were led to believe based on a published meta-analysis from 2010. Facing a job market that under the best circumstances is a nontransparent marathon for uncertainty-fetishists, the prospect of high quality journals rejecting your work due to a lack of a solid sample size justification is not pleasant.
The reason that published articles do not guide you towards practical solutions for a lack of resources is that there are no solutions for a lack of resources. Regrettably, the mathematics do not care about how small the participant payment budget is that you have available. This is not to say that you cannot improve your current practices by reading up on best practices to increase the efficiency of data collection. Let me give you an overview of some things that you should immediately implement if you use hypothesis tests and data collection is costly.
1) Use directional tests where relevant. Just following statements such as ‘we predict X is larger than Y’ up with a logically consistent test of that claim (e.g., a one-sided t-test) will easily give you an increase of 10% power in any well-designed study (see the sketch after this list). If you feel you need to give effects in both directions a non-zero probability, then at least use lopsided tests.
2) Use sequential analysis whenever possible. It’s like optional stopping, but without the questionable inflation of the false positive rate. The efficiency gains are so great that, if you complain about the recent push towards larger sample sizes without already having incorporated sequential analyses, I will have a hard time taking you seriously.
3) Increase your alpha level. Oh yes, I am serious. Contrary to what you might believe, the recommendation to use an alpha level of 0.05 was not the sixth of the ten commandments – it is nothing more than, as Fisher calls it, a ‘convenient convention’. As we wrote in our Justify Your Alpha paper as an argument to not require an alpha level of 0.005: “without (1) increased funding, (2) a reward system that values large-scale collaboration and (3) clear recommendations for how to evaluate research with sample size constraints, lowering the significance threshold could adversely affect the breadth of research questions examined.” If you *have* to make a decision, and the data you can feasibly collect is limited, take a moment to think about how problematic Type 1 and Type 2 error rates are, and maybe minimize combined error rates instead of rigidly using a 5% alpha level.
4) Use within designs where possible. Especially when measurements are strongly correlated, this can lead to a substantial increase in power (also illustrated in the sketch after this list).
5) If you read this blog or follow me on Twitter, you’ll already know about 1-4, so let’s take a look at a very sensible paper by Allison, Allison, Faith, Paultre, & Pi-Sunyer from 1997: Power and money: Designing statistically powerful studies while minimizing financial costs (link). They discuss I) better ways to screen participants for studies where participants need to be screened before participation, II) assigning participants unequally to conditions (if the control condition is much cheaper than the experimental condition, for example), III) using multiple measurements to increase measurement reliability (or use well-validated measures, if I may add), and IV) smart use of (preregistered, I’d recommend) covariates.
6) If you are really brave, you might want to use Bayesian statistics with informed priors, instead of hypothesis tests. Regrettably, almost all approaches to statistical inference become very limited when the number of observations is small. If you are very confident in your predictions (and your peers agree), incorporating prior information will give you a benefit. For a discussion of the benefits and risks of such an approach, see this paper by van de Schoot and colleagues.
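To make points 1 and 4 concrete, here is a small sketch in R with made-up numbers (a true effect of d = 0.4, 50 participants per group or 50 pairs, and a within-subject correlation of .7); your own numbers will obviously differ:

# Point 1: one-sided versus two-sided test for the same design
power.t.test(n = 50, delta = 0.4, sd = 1, sig.level = .05, alternative = "two.sided")$power  # roughly .51
power.t.test(n = 50, delta = 0.4, sd = 1, sig.level = .05, alternative = "one.sided")$power  # roughly .63

# Point 4: within design with strongly correlated measurements (r = .7)
# The standard deviation of the difference scores shrinks to sd * sqrt(2 * (1 - r))
sd_diff <- 1 * sqrt(2 * (1 - .7))
power.t.test(n = 50, delta = 0.4, sd = sd_diff, sig.level = .05, type = "paired")$power      # roughly .95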
Now if you care about efficiency, you might already have incorporated all these things. There is no way to further improve the statistical power of your tests, and by all plausible estimates of the effect sizes you can expect, or the smallest effect size you would be interested in, statistical power is low. Now what should you do?
What to do if best practices in study design won’t save you?
The first thing to realize is that you should not look at statistics to save you. There are no secret tricks or magical solutions. Highly informative experiments require a large number of observations. So what should we do then? The solutions below are, regrettably, a lot more work than making a small change to the design of your study. But it is about time we start to take them seriously. This is a list of solutions I see – but there is no doubt more we can/should do, so by all means, let me know your suggestions on twitter or in the comments.
1) Ask for a lot more money in your grant proposals.
Some grant organizations distribute funds to be awarded as a function of how much money is requested. If you need more money to collect informative data, ask for it. Obviously grants are incredibly difficult to get, but if you ask for money, include a budget that acknowledges that data collection is not as cheap as you hoped some years ago. In my experience, psychologists are often asking for much less money to collect data than other scientists. Increasing the requested funds for participant payment by a factor of 10 is often reasonable, given the requirements of journals to provide a solid sample size justification, and the more realistic effect size estimates that are emerging from preregistered studies.
2) Improve management.
If the implicit or explicit goals that you should meet are still the same now as they were 5 years ago, and you did not receive a miraculous increase in money and time to do research, then an update of the evaluation criteria is long overdue. I sincerely hope your manager is capable of this, but some ‘upward management’ might be needed. In the coda of Lakens & Evers (2014) we wrote “All else being equal, a researcher running properly powered studies will clearly contribute more to cumulative science than a researcher running underpowered studies, and if researchers take their science seriously, it should be the former who is rewarded in tenure systems and reward procedures, not the latter.” and “We believe reliable research should be facilitated above all else, and doing so clearly requires an immediate and irrevocable change from current evaluation practices in academia that mainly focus on quantity.” After publishing this paper, and despite the fact I was an ECR on a tenure track, I thought it would be at least principled if I sent this coda to the head of my own department. He replied that the things we wrote made perfect sense, instituted a recommendation to aim for 90% power in studies our department intends to publish, and has since then tried to make sure quality, and not quantity, is used in evaluations within the faculty (as you might have guessed, I am not on the job market, nor do I ever hope to be).
3) Change what is expected from PhD students.
When I did my PhD, there was the assumption that you performed enough research in the 4 years you were employed as a full-time researcher to write a thesis with 3 to 5 empirical chapters (with some chapters having multiple studies). These studies were ideally published, but at least publishable. If we consider it important for PhD students to produce multiple publishable scientific articles during their PhDs, this will greatly limit the types of research they can do. Instead of evaluating PhD students based on their publications, we can see the PhD as a time when researchers learn skills to become an independent researcher, and evaluate them not based on publishable units, but in terms of clearly identifiable skills. I personally doubt data collection is particularly educational after the 20th participant, and I would probably prefer to hire a post-doc who has well-developed skills in programming and statistics and who has read the literature broadly, rather than someone who used that time to collect participants 21 to 200. If we make it easier for PhD students to demonstrate their skill level (which would include at least one well-written article, I personally think), we can evaluate what they have learned in a more sensible manner than we do now. Currently, differences in the resources PhD students have at their disposal are a huge confound when we try to judge their skill based on their resume. Researchers at rich universities obviously have more resources – it should not be difficult to develop tools that allow us to judge the skills of people in a way where resources are much less of a confound.
4) Think about the questions we collectively want answered, instead of the questions we can individually answer.
Our society has some serious issues that psychologists can help address. These questions are incredibly complex. I have long lost faith in the idea that a bottom-up organized scientific discipline that rewards individual scientists will manage to generate reliable and useful knowledge that can help to solve these societal issues. For some of these questions we need well-coordinated research lines where hundreds of scholars work together, pool their resources and skills, and collectively pursue answers to these important questions. And if we are going to limit ourselves in our research to the questions we can answer in our own small labs, these big societal challenges are not going to be solved. Call me a pessimist. There is a reason we resort to forming unions and organizations that have the goal of collectively coordinating what we do. If you greatly dislike team science, don’t worry – there will always be options to make scientific contributions by yourself. But right now, there are almost no ways for scientists who want to pursue huge challenges in large, well-organized collectives of hundreds or thousands of scholars (for a recent exception that proves my rule by remaining unfunded, see the Psychological Science Accelerator). If you honestly believe your research question is important enough to be answered, then get together with everyone who also thinks so, and pursue answers collectively. Doing so should, eventually (I know science funders are slow), also be more convincing when you ask for more resources to do the research (as in point 1).
If you are upset that as a science we lost the blissful ignorance surrounding statistical power, and are requiring researchers to design informative studies, which hits substantially harder in some research fields than in others: I feel your pain. I have argued against universally lower alpha levels for you, and have tried to write accessible statistics papers that make you more efficient without increasing sample sizes. But if you are in a research field where even best practices in designing studies will not allow you to perform informative studies, then you need to accept the statistical reality you are in. I have already written too long a blog post, even though I could keep going on about this. My main suggestions are to ask for more money, get better management, change what we expect from PhD students, and self-organize – but there is much more we can do, so do let me know your top suggestions. This will be one of the many challenges our generation faces, but if we manage to address it, it will lead to a much better science.

Calculating Confidence Intervals around Standard Deviations

Power analyses require accurate estimates of the standard deviation. In this blog, I explain how to calculate confidence intervals around standard deviation estimates obtained from a sample, and show how much sample sizes in an a-priori power analysis can differ based on variation in estimates of the standard deviation.
If we calculate a standard deviation from a sample, this value is an estimate of the true value in the population. In small samples, our estimate can be quite far off, but due to the law of large numbers, as our sample size increases, we will be measuring the standard deviation more accurately. Since the sample standard deviation is an estimate with uncertainty, we can calculate a 95% confidence interval around it.
Expressing the uncertainty in our estimate of the standard deviation can be useful. When researchers plan to simulate data, or perform an a-priori power analysis, they need accurate estimates of the standard deviation. For simulations, the standard deviation needs to be accurate because we want to generate data that will look like the real data we will eventually collect. For power analyses we often want to think about the smallest effect size of interest, which can be specified as the difference in means you care about. To perform a power analysis we also need to specify the expected standard deviation of the data. Sometimes researchers will use pilot data to get an estimate of the standard deviation. Since the estimate of the population standard deviation based on a pilot study has some uncertainty, the width of confidence intervals around the standard deviation might be a useful way to show how much variability one can expect.
Below is the R code to calculate the confidence interval around a standard deviation from a sample, but you can also use this free GraphPad calculator. The R code then calculates an effect size based on a smallest effect size of interest of half a scale point (0.5) for a scale that has a true standard deviation of 1. The 95% confidence interval for the standard deviation based on a sample of 100 observations ranges from 0.878 to 1.162. If we draw a sample of 100 observations and happen to observe a value on the lower or upper bound of the 95% CI, the effect size we calculate will be a Cohen’s d of 0.5/0.878 = 0.57 or 0.5/1.162 = 0.43. This is quite a difference in the effect size we might use for a power calculation. If we enter these effect size estimates in an a-priori power analysis where we aim to get 90% power using an alpha of 0.05, we will estimate that we need either 66 participants in each group or 115 participants in each group.
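A minimal sketch of that calculation in R (not necessarily the original code, but it reproduces the numbers above):

# 95% confidence interval around a sample standard deviation, based on the chi-square distribution
n <- 100   # sample size
s <- 1     # observed standard deviation
sd_lower <- sqrt((n - 1) * s^2 / qchisq(.975, df = n - 1))   # about 0.878
sd_upper <- sqrt((n - 1) * s^2 / qchisq(.025, df = n - 1))   # about 1.162

# Effect sizes for a smallest effect of interest of 0.5 scale points at each bound
0.5 / sd_upper   # Cohen's d of about 0.43
0.5 / sd_lower   # Cohen's d of about 0.57

# A-priori power analyses (90% power, alpha = .05, two-sided, independent groups)
power.t.test(delta = 0.5, sd = sd_lower, power = .90)$n   # about 66 per group
power.t.test(delta = 0.5, sd = sd_upper, power = .90)$n   # about 115 per group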
It is clear that sample sizes from a-priori power analyses depend strongly on an accurate estimate of the standard deviation. Take into account that estimates of the standard deviation have uncertainty. Use validated or existing measures for which accurate estimates of the standard deviation in your population of interest are available, so that you can rely on a better estimate of the standard deviation in power analyses.
Some people argue that if you have such a limited understanding of the measures you are using that you do not even know the standard deviation of the measure in your population of interest, you are not ready to use that measure to test a hypothesis. If you are doing a power analysis but realize you have no idea what the standard deviation is, maybe you first need to spend more time validating the measures you are using.
When performing simulations or power analyses the same cautionary note can be made for estimates of correlations between dependent variables. When you are estimating these values from a sample, and want to perform simulations and/or power analyses, be aware that all estimates have some uncertainty. Try to get as accurate estimates as possible from pre-existing data. If possible, be a bit more conservative in sample size calculations based on estimated parameters, just to be sure.