"At what sample size do correlations stabilize?", Schönbrodt & Perugini 2013 http://www.psy.lmu.de/allg2/download/schoenbrodt/pub/stable_correlations.pdf
Sample correlations converge to the population value with increasing sample size, but the estimates are often inaccurate in small samples. In this report we use Monte Carlo simulations to determine the critical sample size beyond which the magnitude of a correlation can be expected to be stable. The sample size necessary to achieve stable estimates for correlations depends on the effect size, the width of the corridor of stability (i.e., a corridor around the true value within which deviations are tolerated), and the requested confidence that the trajectory does not leave this corridor again. Results indicate that in typical scenarios the sample size should approach 250 for stable estimates.
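The simulation idea can be sketched in a few lines of Python (a minimal illustration of the method described in the abstract, not the authors' actual code; the function name `point_of_stability` and all parameter values are my own):

```python
import math
import random

def corr(x, y):
    """Pearson correlation, hand-rolled so the sketch is self-contained."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def point_of_stability(rho, n_max=300, w=0.10, seed=0):
    """Simulate one bivariate-normal sample, adding one case at a time,
    and return the smallest n after which r stays inside rho +/- w."""
    rng = random.Random(seed)
    x, y = [], []
    pos = 3  # earliest n at which r is computed
    for n in range(1, n_max + 1):
        a = rng.gauss(0, 1)
        # b is correlated with a at the population value rho
        b = rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        x.append(a)
        y.append(b)
        if n >= 3 and abs(corr(x, y) - rho) > w:
            pos = n + 1  # corridor left at n; stability can only start later
    return pos  # a value > n_max means: not yet stable at n_max

# Critical n at 80% confidence: the 80th percentile of many POS values
pos_values = sorted(point_of_stability(0.21, seed=s) for s in range(50))
print(pos_values[int(0.80 * len(pos_values))])
```

Fifty trajectories are far fewer than a real Monte Carlo run would use, but the structure is the point: simulate many evolving correlations and take the requested quantile of their points of stability.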
...Consider, for example, a correlation of r = .40 in a sample of 25 participants. This correlation is significantly different from zero (p = .047). Hence, it might be concluded with some confidence that there is "something > 0" in the population, and the study would be counted as a success from the NHST perspective. However, plausible values of the true correlation ρ, as expressed by a 90% confidence interval, range from .07 to .65. The estimate is quite unsatisfactory from an accuracy point of view – in any scenario beyond the NHST ritual it will make a huge difference whether the true correlation in the population is .07, which would be regarded as a very small effect in most research contexts, or .65, which would be a very large effect in many contexts. Moreover, precise point estimates are relevant for a priori sample size calculations. Given the huge uncertainty in the true magnitude of the effect, it is hard to determine the necessary sample size to replicate the effect (e.g., for an intended power of 80% and ρ = .07: n = 1599 [!]; ρ = .40: n = 46; and for ρ = .65: n = 16).
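The 90% interval quoted above can be reproduced with the standard Fisher z-approximation (a quick check of the arithmetic, not the authors' code; `pearson_ci` and `n_for_power` are my own helpers):

```python
import math

def pearson_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r via Fisher's z.
    z_crit = 1.645 gives a 90% interval, 1.96 a 95% interval."""
    z = math.atanh(r)             # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)   # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)

def n_for_power(rho, z_alpha=1.96, z_power=0.8416):
    """Normal-approximation sample size for a two-sided test at 80% power;
    agrees with the replication figures in the text to within one participant."""
    return math.ceil(((z_alpha + z_power) / math.atanh(rho)) ** 2 + 3)

lo, hi = pearson_ci(0.40, 25, z_crit=1.645)
print(round(lo, 2), round(hi, 2))  # → 0.07 0.65
```

The same approximation makes the replication problem concrete: `n_for_power(0.07)` is around 1600, while `n_for_power(0.65)` is around 17 – nearly a hundredfold difference between two values inside the same confidence interval.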
...the following empirical example demonstrates. Multiple questionnaire scales were administered in an open online study (Schönbrodt & Gerstenberg, 2012; Study 3). The thick black line in Figure 1 shows the evolution of the correlation between two scales, “hope of power” and “fear of losing control”, when the correlation is recalculated after each new participant. The correlation evolved from r = .69 (n = 20, p < .001) to r = .26 (n = 274, p < .001). From visual inspection, the trajectory did not stabilize until a sample size of around 150. The data have not been rearranged – this is simply the order in which participants entered the study. Some other correlations in this data set evolved from significantly negative to non-significant, others changed from one significant direction to the significant opposite, and some were stable right from the beginning, with only small fluctuations around the final estimate. But how do we know when a correlation estimate is sufficiently stable?
http://www.nicebread.de/WP/wp-content/uploads/2013/06/evolDemo.jpg
Figure 1: Actual (thick black line) and bootstrapped (thin gray lines) trajectories of a correlation. The dotted curved lines show the 95% confidence interval for the final correlation of r = .26 at each n. Dashed lines show the ± .10 corridor of stability (COS) around the final correlation. The point of stability (POS) is at n = 161. After that sample size the actual trajectory does not leave the COS.
...To assess the variability of possible trajectories, bootstrap samples of the final sample size can be drawn from the original raw data, and the evolution of the correlation is calculated for each new data set. Figure 1 shows some exemplary bootstrapped trajectories. Some trajectories start well above the final value (as the original trajectory does), some even start with a significant negative value, and some begin within the COS and never leave it.
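The bootstrap procedure is easy to sketch (illustrative only – the demo data below are synthetic stand-ins, not the Schönbrodt & Gerstenberg questionnaire scales, and `bootstrap_trajectories` is my own helper):

```python
import math
import random

def corr(x, y):
    """Pearson r; returns NaN when a prefix is constant (possible after resampling)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def bootstrap_trajectories(x, y, n_boot=100, seed=0):
    """Resample (x, y) pairs with replacement, then recompute r after each
    added case - one simulated 'evolution of the correlation' per resample."""
    rng = random.Random(seed)
    n = len(x)
    trajectories = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # one bootstrap sample
        bx = [x[i] for i in idx]
        by = [y[i] for i in idx]
        trajectories.append([corr(bx[:k], by[:k]) for k in range(3, n + 1)])
    return trajectories

# Demo with synthetic data standing in for the two questionnaire scales
rng = random.Random(7)
xs = [rng.gauss(0, 1) for _ in range(60)]
ys = [0.3 * v + rng.gauss(0, 1) for v in xs]
trajs = bootstrap_trajectories(xs, ys, n_boot=20)
```

Plotting each trajectory against n reproduces the fan of thin gray lines in Figure 1; checking where each one last exits the ± w corridor around the final estimate gives a distribution of points of stability for the observed data.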
...The desired width of the corridor depends on the specific research context (see Figure 1 for a COS with w = .10). In this paper, three widths are used: ± .10, ± .15, and ± .20. Following the rules of thumb proposed by Cohen (1992), a value of .10 for w corresponds to a small effect size. Hence, if the sample correlation r stays within a corridor with w = ± .10, the resulting deviations are at most of a small effect size.
...In an analysis of 440 large-scale real world data sets in psychology only 4.3% could be considered as reasonable approximations to a Gaussian normal distribution (Micceri, 1989). Hence, deviations from normality are rather the rule than the exception in psychology. [Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156–166. doi:10.1037/0033-2909.105.1.156 http://faculty1.ucmerced.edu/sdepaoli/docs/Micceri%201989.pdf ]
...If Table 1 is to be boiled down to simple answers, one can ask what effect size can typically be expected in personality research. In a meta-meta-analysis summarizing 322 meta-analyses with more than 25,000 published studies in the field of personality and social psychology, Richard, Bond, and Stokes-Zoota (2003) ["One Hundred Years of Social Psychology Quantitatively Described" http://jenni.uchicago.edu/Spencer_Conference/Representative%20Papers/Richard%20et%20al,%202003.pdf ] report that the average published effect is r = .21, less than 25% of all meta-analytic effect sizes are greater than .30, and only 5.28% of all effects are greater than .50. Hence, without any specific prior knowledge it would be sensible to assume an effect size of .21. Further, let's assume that a confidence level of 80% is requested (a level typically used for statistical power analyses) and that only deviations of at most a small effect size (w = .10) are considered acceptable fluctuations. Applying these values to Table 1 gives a required sample size of around n = 238.
Of course, what counts as a meaningful or expected correlation can vary depending on the research context and questions. In some research contexts even small correlations of .10 might be meaningful and have consequential implications; in this case, larger samples are needed for stable correlations. In other research contexts the expected correlation can be greater (e.g., convergent validity between different measures of the same trait), or the researcher may be willing to accept a slightly less stable estimate, perhaps compensating with an increased level of confidence; this would reduce the necessary sample size. But even under these conditions there are few occasions in which it is justifiable to go below n = 150, and for typical research scenarios reasonable trade-offs between accuracy and confidence start to be achieved when n approaches 250.