"Why We (Usually) Don’t Have to Worry About Multiple Comparisons"; Gelman et al 2012
abstract: Applied researchers often find themselves making statistical inferences in settings that would seem to require multiple comparisons adjustments. We challenge the Type I error paradigm that underlies these corrections. Moreover we posit that the problem of multiple comparisons can disappear entirely when viewed from a hierarchical Bayesian perspective. We propose building multilevel models in the settings where multiple comparisons arise. Multilevel models perform partial pooling (shifting estimates toward each other), whereas classical procedures typically keep the centers of intervals stationary, adjusting for multiple comparisons by making the intervals wider (or, equivalently, adjusting the p values corresponding to intervals of fixed width). Thus, multilevel models address the multiple comparisons problem and also yield more efficient estimates, especially in settings with low group-level variation, which is where multiple comparisons are a particular concern.
Our approach, as described in this article, has two key differences from the classical perspective. First, we are typically not terribly concerned with Type I error because we rarely believe that it is possible for the null hypothesis to be strictly true. Second, we believe that the problem is not multiple testing but rather insufficient modeling of the relationship between the corresponding parameters of the model. Once we work within a Bayesian multilevel modeling framework and model these phenomena appropriately, we are actually able to get more reliable point estimates. A multilevel model shifts point estimates and their corresponding intervals toward each other (by a process often referred to as “shrinkage” or “partial pooling”), whereas classical procedures typically keep the point estimates stationary, adjusting for multiple comparisons by making the intervals wider (or, equivalently, adjusting the p values corresponding to intervals of fixed width). In this way, multilevel estimates make comparisons appropriately more conservative, in the sense that intervals for comparisons are more likely to include zero. As a result we can say with confidence that those comparisons made with multilevel estimates are more likely to be valid. At the same time this “adjustment” does not sap our power to detect true differences as many traditional methods do.
http://i.imgur.com/roRFgLr.png
Figure 1. Treatment effect point estimates and 95% intervals across the eight Infant Health and Development Program sites. Note. The left panel displays classical estimates from a linear regression. The middle panel displays the same point estimates as in the left panel but with confidence intervals adjusted to account for a Bonferroni correction. The right panel displays posterior means and 95% intervals for each of the eight site-specific treatment effects from a fitted multilevel model.
...Multilevel modeling can be thought of as a compromise between two extremes. One extreme, complete pooling, would assume the treatment effects are the same across all sites, that is, δ_j = δ for all j. The other extreme, no pooling, would estimate treatment effects separately for each site. The compromise found in the multilevel model is often referred to as partial pooling. Figure 1 graphically illustrates this compromise with a plot of the multilevel intervals next to the classical estimates and intervals (with and without Bonferroni corrections). The horizontal dashed line in each plot displays the complete pooling estimate. We also display a horizontal solid line at zero to quickly show which estimates would be considered to be statistically significant. This process leads to point estimates that are closer to each other (and to the “main effect” across all sites) than the classical analysis produces. Rather than inflating our uncertainty estimates, which would not really reflect the information we have regarding the effect of the program, we have shifted the point estimates toward each other in ways that do reflect that information. (More generally, if the model has group-level predictors, the inferences will be partially pooled toward the fitted group-level regression surface rather than toward a common mean.)
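In the simplest normal–normal setting this compromise has a closed form. As a sketch (a standard result; the notation σ_j for the site-j standard error and σ_θ^2 for the between-site variance is assumed here, matching the excerpt's later usage), the partially pooled estimate is a precision-weighted average of the site's own estimate and the overall mean μ:

$$\hat{\delta}_j \;=\; \lambda_j \,\bar{y}_j + (1 - \lambda_j)\,\mu, \qquad \lambda_j \;=\; \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_j^2},$$

so complete pooling is the limit σ_θ^2 → 0 (λ_j → 0) and no pooling is the limit σ_θ^2 → ∞ (λ_j → 1).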
The Intuition. Why does partial pooling make sense at an intuitive level? Let’s start from the basics. The only reason we have to worry about multiple comparisons is that we have uncertainty about our estimates. If we knew the true (population-average) treatment effect within each site, we would not be making any probabilistic statements to begin with—we would simply know the true sign and true magnitude of each (and hence whether each was really different from 0 or from the others). Classical inference in essence uses only the information in each site to get the treatment effect estimate in that site and the corresponding standard error.
A multilevel model, however, recognizes that this site-specific estimate is actually ignoring some important information—the information provided by the other sites. While still allowing for heterogeneity across sites, the multilevel model also recognizes that because all the sites are measuring the same phenomenon it does not make sense to completely ignore what has been found in the other sites. Therefore each site-specific estimate gets “shrunk” or pulled toward the overall estimate (or, in a more general setting, toward a group-level regression fit). The greater the uncertainty in a site, the more it will get pulled toward the overall estimate. The less the uncertainty in a site, the more we trust that individual estimate and the less it gets shrunk.
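A minimal numeric sketch of this uncertainty-dependent shrinkage (the effect estimates, standard errors, and between-site standard deviation below are invented for illustration, not taken from the IHDP data):

```python
import numpy as np

# Hypothetical unpooled site estimates and their standard errors.
y = np.array([7.0, -2.0, 4.0, 1.0])   # site-specific treatment effect estimates
se = np.array([2.0, 6.0, 1.0, 10.0])  # larger se = more uncertainty in that site

sigma_theta = 3.0  # assumed between-site standard deviation (hyperparameter)

# Precision-weighted overall mean of the site estimates.
w = 1.0 / (se**2 + sigma_theta**2)
mu = np.sum(w * y) / np.sum(w)

# Shrinkage factor: the share of weight a site's own estimate keeps.
lam = sigma_theta**2 / (sigma_theta**2 + se**2)
partially_pooled = lam * y + (1 - lam) * mu

# Sites with small se keep lam near 1 (little shrinkage); sites with large
# se have lam near 0 and are pulled strongly toward the overall mean mu.
print(np.round(lam, 2), np.round(partially_pooled, 2))
```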
To illustrate this point we ran our multilevel model on slightly altered versions of the data set. In the first altered version we decreased the sample size in Site 3 from 138 to a random sample of 30; results are displayed in the center panel of Figure 3. In the second altered version we increased the sample size in Site 3 to 300 by bootstrapping the original observations in that site; results are displayed in the right panel of Figure 3. The leftmost panel displays the original multilevel model results. The key observation is that the shrinkage of the Site 3 treatment effect estimate changes drastically across these scenarios because the uncertainty of the estimate relative to that of the grand mean also changes drastically across these scenarios. Note, however, that the overall uncertainty increases in the rightmost plot even though the sample size in Site 3 increases. That is because we increased the sample size while keeping the point estimate the same. This leads to greater certainty about the level of treatment effect heterogeneity across sites, and thus greater uncertainty about the overall mean.
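The two data alterations are easy to mimic; here is a sketch using a synthetic stand-in for the Site 3 outcomes (the real IHDP data are not reproduced in this excerpt):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2012)
# Synthetic stand-in for the 138 original Site 3 observations.
site3 = pd.DataFrame({"y": rng.normal(loc=8.0, scale=15.0, size=138)})

site3_small = site3.sample(n=30, random_state=1)               # random subsample to n = 30
site3_big = site3.sample(n=300, replace=True, random_state=1)  # bootstrap up to n = 300

# Each altered Site 3 would then replace the original in the full data set
# before refitting the multilevel model.
```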
http://i.imgur.com/wsCJQxA.png
Figure 3. Multilevel model estimates and 95% intervals: original data (left), Site 3 subsampled to 30 observations (center), and Site 3 bootstrapped to 300 observations (right).
...The first factor in this last expression is the z score for the classical unpooled estimates; the second factor is the correction from partial pooling, a correction that is always less than 1 (i.e., it reduces the z score) and approaches zero as the group-level variance σ_θ^2 approaches zero; see Figure 4.
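The expression referred to is not reproduced in this excerpt. As a reconstruction under the simplifying assumption of a common within-group standard error σ (so the details may differ from the paper's general form), the normal–normal calculation conditional on the hyperparameters gives, for the comparison of groups j and k,

$$z_{\text{multilevel}} \;=\; \frac{\bar{y}_j - \bar{y}_k}{\sqrt{2}\,\sigma} \,\times\, \sqrt{\frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma^2}},$$

because the posterior means sit λ(ȳ_j − ȳ_k) apart, with λ = σ_θ^2/(σ_θ^2 + σ^2), and the posterior variance of their difference is 2λσ^2. The correction factor √λ is always less than 1 and goes to 0 as σ_θ^2 → 0, matching the description above.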
Greenland and Robins (1991) make a similar argument about the advantages of partial pooling, going so far as to frame the multiple comparisons problem as an
"opportunity to improve our estimates through judicious use of any prior information (in the form of model assumptions) about the ensemble of parameters being estimated. Unlike conventional multiple comparisons, EB [empirical Bayes] and Bayes approaches will alter and can improve point estimates and can provide more powerful tests and more precise (narrower) interval estimators." (p. 249)
To get further insight into this example, we perform repeated simulations of a world in which the true treatment effects in different schools come from a normal distribution with a standard deviation of 5 (a plausible estimate given the data in Table 1). For each replication, we simulate eight true values θ_1, ..., θ_8 from this distribution, then simulate data y_1, ..., y_8 from the eight separate normal distributions corresponding to each θ_j. The standard deviations σ_j for each of these distributions are given in Figure 1. Relative to the within-group standard deviations, the between-group standard deviation of 5 is small. We then perform both classical and hierarchical Bayesian analyses. For each analysis, we compute all (8 · 7)/2 = 28 comparisons and count the number that are statistically significant (i.e., where the difference between the estimates for two schools is more than 1.96 times the standard error for the difference), and of these, we count the number that have the correct sign.
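A runnable sketch of this simulation, with two labeled assumptions: the σ_j below are the familiar eight-schools standard errors (the figure giving them is not reproduced in this excerpt), and the Bayesian step is simplified by conditioning on τ = 5 rather than averaging over the paper's uniform hyperprior, so the counts will only roughly track the percentages reported next.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
tau = 5.0  # between-school sd used in the paper's simulation
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])  # assumed eight-schools SEs
pairs = list(combinations(range(8), 2))  # all (8*7)/2 = 28 pairwise comparisons

n_sims = 1000
counts = {"classical": [0, 0], "bayes": [0, 0]}  # [significant, significant & correct sign]
for _ in range(n_sims):
    theta = rng.normal(0.0, tau, size=8)  # true school effects
    y = rng.normal(theta, sigma)          # observed (unpooled) estimates

    # Classical: z test on each unpooled difference.
    for j, k in pairs:
        if abs(y[j] - y[k]) > 1.96 * np.hypot(sigma[j], sigma[k]):
            counts["classical"][0] += 1
            counts["classical"][1] += np.sign(y[j] - y[k]) == np.sign(theta[j] - theta[k])

    # Simplified Bayes, conditional on tau: shrink toward the precision-weighted mean.
    lam = tau**2 / (tau**2 + sigma**2)
    w = 1.0 / (sigma**2 + tau**2)
    mu_hat = np.sum(w * y) / np.sum(w)
    m = lam * y + (1 - lam) * mu_hat  # posterior means
    v = lam * sigma**2                # posterior variances (mu_hat uncertainty ignored)
    for j, k in pairs:
        if abs(m[j] - m[k]) > 1.96 * np.sqrt(v[j] + v[k]):
            counts["bayes"][0] += 1
            counts["bayes"][1] += np.sign(m[j] - m[k]) == np.sign(theta[j] - theta[k])

for name, (sig, correct) in counts.items():
    frac_sig = sig / (n_sims * len(pairs))
    frac_correct = correct / sig if sig else float("nan")
    print(f"{name}: {frac_sig:.1%} significant, {frac_correct:.0%} of those correct in sign")
```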
We performed 1,000 simulations. Across these simulations, 7% of the classical comparisons were statistically significant, and of these, only 63% got the sign of the comparison correct. Multiple comparisons corrections are clearly necessary here if we want to avoid making unreliable statements. By comparison, only 0.5% of the Bayesian comparisons were statistically significant (with 89% getting the sign correct). The shrinkage of the Bayesian analysis has already essentially done a multiple comparisons correction.
To look at it another way: The classical analysis found at least one statistically significant comparison in 47% of our 1,000 simulations; with the Bayesian estimates, this occurred only 5% of the time. The Bayesian analysis here uses a uniform prior distribution on the hyperparameters—the mean and standard deviation of the school effects—and so it uses no more information than the classical analysis. As with a classical multiple comparisons procedure, the Bayesian inference recognizes the uncertainty in the estimates and correspondingly reduces the number of statistically significant comparisons.
Harder problems arise when modeling multiple comparisons that have more structure. For example, suppose we have five outcome measures, three varieties of treatments, and subgroups classified by two sexes and four racial groups. We would not want to model this 2 × 3 × 4 × 5 structure as 120 exchangeable groups. Even in these more complex situations, we think multilevel modeling should and will eventually take the place of classical multiple comparisons procedures. After all, classical multiple comparisons strategies themselves assume exchangeability in the sense of treating all the different comparisons symmetrically. And so, in either case, further work is needed for the method to match the problem structure. For large problems, there can be more data for estimating variance parameters in multilevel models (this is sometimes called the blessing of dimensionality). Similarly, classical procedures may have the potential to adaptively vary tuning parameters in large, complex structures.
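As an illustration of what matching the structure might look like (our sketch, not a prescription from the paper), one could give each factor and each interaction its own batch of partially pooled effects:

$$\theta_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_l + (\alpha\beta)_{ij} + \cdots, \qquad \alpha_i \sim N(0, \sigma_\alpha^2),\; \beta_j \sim N(0, \sigma_\beta^2),\; \ldots$$

where i indexes sex, j treatment, k racial group, and l outcome. Each variance component is estimated from the data, so each batch of comparisons is shrunk at its own rate rather than treating all 120 cells as exchangeable.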
#statistics #bayesian
Multiple comparisons in hierarchical models are discussed in this video (starting about 7:50 in, with a summary around 13:30): http://youtu.be/YyohWpjl6KU (Jan 29, 2013)