
"Site Selection Bias in Program Evaluation", Allcott 2014 (on one reason programs 'fade out'; of particular concern in effective altruism); excerpts:

""Site selection bias" occurs when the probability that partners adopt or evaluate a program is correlated with treatment effects. I test for site selection bias in the context of the Opower energy conservation programs, using 111 randomized control trials (RCTs) involving 8.6 million households across the United States. Predictions based on rich microdata from the first ten replications substantially overstate efficacy in the next 101 sites. There is evidence of two positive selection mechanisms. First, local populations with stronger preferences for environmental conservation both encourage utilities to adopt the program and are more responsive to the treatment. Second, program managers initially target treatment at the most responsive consumer sub-populations, meaning that efficacy drops when utilities expand the program. While it may be optimal to initially target an intervention toward the most responsive populations, these results show how analysts can be systematically biased when extrapolating experimental results, even after many replications. I augment the Opower results by showing that microfinance institutions (MFIs) that run RCTs differ from the global population of MFIs and that hospitals that host clinical trials differ from the national population of hospitals.

When generalizing empirical results, we either implicitly or explicitly make an assumption I call external unconfoundedness: that there are no unobservables that moderate the treatment effect and differ between sample and target. As formalized by Hotz, Imbens, and Mortimer (2005) and Hotz, Imbens, and Klerman (2006), this type of assumption mirrors the unconfoundedness assumption required for internal validity (Rosenbaum and Rubin 1983). When we have results from only one site, this assumption amounts to assuming away the possibility of unexplained site-level treatment effect heterogeneity. Because this is often unrealistic, we value replication in additional sites. After enough replications, external unconfoundedness implies only that the distribution of treatment effects in sample sites can predict the distribution of effects in target sites. Put simply, if an intervention works well in enough different trials, we might advocate that it be scaled up. Formally, this logic requires that sample sites are as good as randomly selected from the population of target sites.
In practice, there are many reasons why sites are selected for empirical study. For example, because RCTs require an implementing partner with managerial ability and operational expertise, the set of actual partners may be able to run more effective programs than the typical potential partner. As another example, partners that are already running programs that they know are effective are more likely to be open to independent impact estimates (Pritchett 2002). Both of these mechanisms would cause a positive site selection bias: treatment effects from sample sites would be larger than in the full set of targets. Alternatively, partners that are particularly innovative and willing to test new programs may also be running many other effective programs in the same area. If there are diminishing returns, the new program with an actual partner might have lower impact than with a typical potential partner. This would cause negative site selection bias. Site selection bias implies that even with a large number of internally valid impact estimates, policymakers could still draw systematically biased inference about a program's impact at full scale.
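[The positive-selection mechanism described above can be made concrete with a small simulation. This is an illustrative sketch, not the paper's data or model: it assumes a hypothetical adoption rule in which sites with larger true effects are more likely to partner on an RCT, and shows that the mean effect among evaluated sites then overstates the mean effect in the full target population.]

```python
import random

random.seed(0)

# True site-level effects: percent electricity savings, mean 1.0, sd 0.5
# (invented numbers for illustration).
sites = [random.gauss(1.0, 0.5) for _ in range(10_000)]

# Hypothetical adoption mechanism: the probability that a site partners on
# an RCT rises with its true treatment effect (e.g. conservation-minded
# populations both push adoption and respond more).
def adopts(effect):
    p = min(max(0.1 + 0.4 * effect, 0.0), 1.0)
    return random.random() < p

evaluated = [e for e in sites if adopts(e)]

mean_all = sum(sites) / len(sites)
mean_eval = sum(evaluated) / len(evaluated)
print(f"mean effect, all target sites: {mean_all:.2f}%")
print(f"mean effect, evaluated sites:  {mean_eval:.2f}%")
```

[The evaluated-site mean comes out higher than the target-population mean: positive site selection bias, even though every individual estimate is internally valid.]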

[There might also be a 'diminishing returns' effect: the sites that need a program the most will be most likely to be willing to try it; hence, when rolled out to less-needing sites, returns will have diminished.]

Given the cost of RCTs, it is unusual for the same intervention to be rigorously evaluated at more than a small handful of sites. By contrast, as in LaLonde (1986), Dehejia and Wahba (1999), Heckman, Ichimura, Smith, and Todd (1998), Smith and Todd (2004), and many other studies, providing evidence on individual-level selection bias and an estimator's internal validity is much less onerous, as it simply requires a large sample of individuals.
The Opower energy conservation program provides an exceptional opportunity to study the site selection process. The treatment is to mail "Home Energy Reports" to residential electricity consumers that provide energy conservation tips and compare their energy use to that of their neighbors. Electric and natural gas utilities have partnered with Opower largely because the program helps to comply with state-level energy conservation mandates. As of February 2013, the program had been implemented using 111 randomized control trials involving 8.6 million households at 58 utilities across the United States.

I begin by using microdata from Opower's first ten replications to predict aggregate nationwide effects. This is a highly promising opportunity for out-of-sample prediction: there are very large samples, with 512,000 households in treatment and control, ten different replications spread throughout the country, internally-valid estimates, and a useful set of individual level covariates to adjust for differences between sample and target populations. Using the non-parametric test of treatment effect heterogeneity introduced by Crump, Hotz, Imbens, and Mitnik (2008), I show that treatment effects are larger for households that use more electricity and also vary in intuitive ways conditional on four other features of a home's physical capital stock. I then use two standard "off-the-shelf" approaches to extrapolation: linear prediction and the re-weighting procedure introduced by Hellerstein and Imbens (1999). Results from these first ten replications predict retail electricity cost savings of about 1.7 percent, or $2.3 billion, in the first year of a nationally-scaled program.
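[The re-weighting idea can be sketched with a toy post-stratification example in the spirit of the Hellerstein-Imbens procedure the paper uses: re-weight the sample so its covariate (usage) distribution matches the target population's, then average per-stratum effects. All shares and per-bin effects below are invented; the only feature taken from the text is that effects rise with electricity usage.]

```python
# Usage strata: the sample over-represents heavy users (who respond more),
# so the unweighted sample ATE overstates the target-population ATE.
sample_share = {"low": 0.2, "mid": 0.4, "high": 0.4}   # hypothetical
target_share = {"low": 0.4, "mid": 0.4, "high": 0.2}   # hypothetical
bin_ate = {"low": 0.8, "mid": 1.5, "high": 2.5}        # effect rises with usage

# Unweighted sample estimate vs. estimate re-weighted to target shares.
naive = sum(sample_share[b] * bin_ate[b] for b in bin_ate)
weighted = sum(target_share[b] * bin_ate[b] for b in bin_ate)
print(f"unweighted sample ATE: {naive:.2f}%")
print(f"re-weighted to target: {weighted:.2f}%")
```

[Re-weighting corrects for selection on *observed* covariates like usage; the paper's point is that it cannot correct for unobserved site-level moderators, which is where site selection bias survives.]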
Aside from the microdata, I also have Opower's "metadata": impact estimates and standard errors from all 111 RCTs at all 58 different utility partners that began before February 2013. As an "in-sample" test of external unconfoundedness, I use microdata from the initial ten sites to predict first-year effects at the 101 later sites. The microdata over-predict efficacy by approximately 0.5 percentage points, or $690 million worth of retail electricity. In other words, early sites were strongly positively selected through mechanisms associated with the treatment effect.

There is statistically and economically significant evidence of selection through what one might call "population preferences": consumers in areas with high income, high education, and stronger preferences for environmental conservation both encourage utilities to adopt the Opower program and are more responsive to the intervention once it is implemented.

A simple scatterplot of first-year treatment effects against site start date shows clearly declining efficacy for later sites. This downward trend exists both between utilities that partner earlier vs. later and within utility at earlier vs. later customer sub-populations. A substantial portion of the within-utility trend is explained by the fact that a utility's earlier sites involve higher-usage consumers who are more responsive. Typically, if the program works well in an initial customer sub-population, the utility will contract with Opower to implement at additional sites within their service territory. Conditioning on site level observables in the form of the mechanism scores explains just over one-quarter of the between-utility trend.

This paper also does not argue that RCTs are not useful and important; in the Opower context, that would be particularly untrue: as shown by Allcott (2011), non-experimental approaches to evaluating Opower programs give highly misleading estimates.

- Allcott, Hunt (2011). "Social Norms and Energy Conservation". Journal of Public Economics 95(9-10), October: 1082-1095. http://dspace.mit.edu/bitstream/handle/1721.1/51712/2009-014.pdf?sequence=1

The appendix also provides brief suggestive evidence from two other domains on how RCT sites differ systematically from policy-relevant target sites. First, I study microfinance institutions (MFIs) that have partnered to carry out randomized trials with three academic groups: the Jameel Poverty Action Lab, Innovations for Poverty Action, and the Financial Access Initiative. I show that partner MFIs differ from the global population of MFIs on characteristics that might moderate effects of various interventions, including correlates of default rates, organizational structure, and monitoring and implementation ability. Second, I study hospitals that are the sites of clinical trials for new drugs and surgical procedures. I show that clinical trial sites tend to be larger, more experienced in surgical procedures, offer a wider range of technologies and patient services, and are generally higher-quality than the average US hospital. Because both microfinance RCTs and clinical trials test a variety of different interventions, it is not possible to correlate selection probability with consistently-defined treatment effects as one can for the Opower experiments.
However, these additional examples suggest that site selection bias is probably not unique to energy conservation RCTs. This approach of comparing RCT partner sites to non-partner sites has also been implemented by Blair, Iyengar, and Shapiro (2013), who show that field experiments in economics and political science are more likely to be carried out in wealthy, democratic countries that spend more on citizen welfare. The MFI results are related to Brigham et al. (2013), who experimentally test the willingness of MFIs to partner on RCTs.

Opower is a private company that partners with utilities to mail Home Energy Reports to residential electricity and natural gas consumers. Utilities partner with Opower and run other energy conservation programs for several reasons. Most importantly, there are 27 states with Energy Efficiency Resource Standards (EERS), which require utilities to reduce energy use by a given amount, typically about one percent per year. In the absence of an EERS or other regulatory mechanism, for-profit investor-owned utilities (IOUs) have little incentive to reduce demand for the product they sell. Rural electric cooperatives and other utilities owned by municipalities or other government entities should maximize welfare instead of profits, so they often run energy efficiency programs if they believe the programs benefit customers...Some small utilities choose to target the entire residential consumer base, while others target heavy users who might be most responsive to the intervention, and others target local geographic areas where conservation could help to delay costly upgrades to distribution infrastructure. To be eligible for the program, a customer must have at least one year of valid pre-experiment energy use data and satisfy some additional technical conditions.

Due to contractual restrictions, Opower cannot share microdata from many of their recent partners. Instead, they have provided their site-level metadata, including average treatment effects and standard errors, number of reports sent, and attrition for each post-treatment month of each RCT. Consistent with the theoretical framework in Section 3, I define a "site" as a group of households where one experiment takes place. Some utilities have multiple "sites," because they began with one customer sub-population and then added other sub-populations in separate randomized control trials at a later date. As of February 2014, there were 111 sites with at least one year of post-treatment data at 58 different utilities.

Data in Table 1 suggest that the variation in effects across sites is larger than can be explained by sampling error: the standard deviation of the 111 site-level ATEs is 0.45 percent of electricity usage, while the average standard error is only 0.18 percent. More formally, Cochran's Q test rejects that the effects are homogeneous with a p-value of less than 0.001. The I^2 statistic (Higgins and Thompson 2002) shows that 86.6 percent of the variation in effect sizes is due to true heterogeneity instead of sampling variation. Effectively none of this variation is due to variation in treatment intensity as measured by frequency: the standard deviation of frequency-adjusted ATEs and their mean standard error are 0.44 percent and 0.18 percent, respectively, and the I^2 is 85.6 percent.

One measure of economic significance is the dollar magnitude of the variation in predicted effects at scale. Figure 2 presents a horizontally-oriented forest plot of the predicted electricity cost savings in the first year of a program that is expanded "nationally," i.e. to all households at all potential partner utilities. Each dot reflects the prediction using the percent ATE from each site, multiplied by national annual electricity costs. The point estimates of first-year savings vary by a factor of 5.2, from $695 million to $3.62 billion, and the standard deviation is $618 million. A second measure of economic significance is the variation in cost effectiveness, as presented in Figure 3. While there are many ways to calculate cost effectiveness, I present the simplest: the ratio of program cost to kilowatt-hours conserved during the first two years. As Allcott and Rogers (2014) point out, cost effectiveness improves substantially when evaluating over longer time horizons; I use two years here to strike a balance between using longer time horizons to calculate more realistic levels vs. using shorter time horizons to include more sites with sufficient post-treatment data.
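[Cochran's Q and I^2 are simple to compute from the site-level ATEs and standard errors. The sketch below uses simulated data calibrated to the figures quoted above (true-heterogeneity sd 0.45, average standard error 0.18), not the actual Opower metadata.]

```python
import random

random.seed(1)
n = 111
# Simulated site-level effects: true heterogeneity plus sampling error,
# with magnitudes taken from the summary statistics in the text.
true_effects = [random.gauss(1.3, 0.45) for _ in range(n)]
ses = [0.18] * n
ates = [random.gauss(t, s) for t, s in zip(true_effects, ses)]

# Inverse-variance-weighted pooled effect.
weights = [1 / s**2 for s in ses]
pooled = sum(w * a for w, a in zip(weights, ates)) / sum(weights)

# Cochran's Q: weighted squared deviations from the pooled effect.
Q = sum(w * (a - pooled) ** 2 for w, a in zip(weights, ates))
df = n - 1

# I^2 (Higgins and Thompson 2002): share of variation in effect sizes
# attributable to true heterogeneity rather than sampling variation.
I2 = max(0.0, (Q - df) / Q) * 100
print(f"Q = {Q:.1f} on {df} df, I^2 = {I2:.1f}%")
```

[With these calibrated inputs, I^2 comes out in the mid-80s, consistent with the 86.6 percent reported for the real data.]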
I make a boilerplate cost assumption of $1 per report. The variation is again quite substantial. The most cost effective (0.88 cents/kWh) is 14 times better than the least cost effective, and the 10th percentile is four times better than the 90th percentile. The site on the right of the figure with outlying poor cost effectiveness is a small program with extremely weak ATE and high cost due to frequent reports.
This variation is economically significant in the sense that it can cause program adoption errors: program managers at a target site might make the wrong decision if they extrapolate cost effectiveness from another site in order to decide whether to implement the program. Alternative energy conservation programs have been estimated to cost approximately five cents per kilowatt-hour (Arimura, Li, Newell, and Palmer 2011) or between 1.6 and 3.3 cents per kilowatt-hour (Friedrich et al. 2009). These three values are plotted as horizontal lines on Figure 3. Whether an Opower program at a new site has cost effectiveness at the lower or upper end of the range illustrated in Figure 3 therefore could change whether a manager would or would not want to adopt.
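[The adoption decision above reduces to a ratio. The sketch below implements the paper's simplest cost-effectiveness definition (program cost over kWh conserved in the first two years) with the stated $1-per-report assumption; the per-household report count, savings rate, and baseline usage are hypothetical, and the 5 cents/kWh threshold is the Arimura et al. (2011) benchmark cited in the text.]

```python
def cost_per_kwh(reports_sent, ate_pct, baseline_kwh_2yr, cost_per_report=1.0):
    """Cents per kWh conserved over the first two years."""
    cost_dollars = reports_sent * cost_per_report
    kwh_saved = (ate_pct / 100.0) * baseline_kwh_2yr
    return 100.0 * cost_dollars / kwh_saved

# Hypothetical site: monthly reports for two years, a 1.5% ATE, and
# ~21,600 kWh of two-year baseline usage per household.
ce = cost_per_kwh(reports_sent=24, ate_pct=1.5, baseline_kwh_2yr=21_600)
print(f"{ce:.1f} cents/kWh")

# Compare against the ~5 cents/kWh alternative-program benchmark.
adopt = ce < 5.0
print("adopt?", adopt)
```

[Small changes in the ATE or report frequency move a site across the benchmark lines in Figure 3, which is exactly why extrapolating cost effectiveness from another site can produce adoption errors.]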

Using these standard approaches and assuming external unconfoundedness, the predicted nationwide effects would be about 1.7 percent of electricity use in the first year of the program. This amounts to 21 terawatt-hours, or about the annual output of three large coal power plants. At retail prices, the electricity cost savings would be $2.3 billion in the first year. The results from the 101 "later sites" provide an opportunity to explicitly test the external unconfoundedness assumption. The right panel of Figure 4 shows the linear and weighted fits for the average of the 101 sites, along with the unweighted mean. The predicted effects are 0.42 and 0.52 percentage points larger than the true effects. When scaled to the national level, a misprediction of 0.47 percentage points would amount to an overstatement of the effects by 5.9 terawatt-hours in the program's first year, or $650 million in retail electricity cost savings. These differences reflect site selection bias: the failure of the external unconfoundedness assumption even after 10 replications. This bias exists despite what approaches a "best case scenario" for extrapolation: internally valid results, large samples, a set of observables that moderate the treatment effect, and a relatively large number of replications. The next section explores how this bias came about.

Before presenting the estimates of Equation (11), I first provide statistical results for the trend in efficacy illustrated by Figure 5. Column 1 of Table 7 estimates the slope of the line in Figure 5, regressing the frequency-adjusted ATE for each of the 111 sites on its start date. Sites that start one year later average 0.173 percentage point smaller ATEs. The next two columns divide this into between-utility and within-utility trends. Column 2 limits the sample to each utility's first site, showing a between-utility downward trend. Several of the 58 utilities have multiple sites that start on the same date, so this regression has 66 observations. Column 3 includes indicators for each of the 58 utilities and reports the association between the frequency-adjusted ATE τ^e_s and site start number M_s. On average, a utility's next site performs 0.091 percentage points worse than its previous site.
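[The Column-1 regression is a bivariate OLS of site ATE on start date. The sketch below runs it on simulated sites whose true slope is set to the paper's estimate (-0.173 percentage points per year); the intercept, date range, and noise level are invented.]

```python
import random

random.seed(2)
n = 111
# Simulated sites: start date in years since the first site, and an ATE
# that declines by 0.173 pp per year plus noise (assumed sd 0.3).
start_year = [random.uniform(0, 5) for _ in range(n)]
ate = [1.8 - 0.173 * t + random.gauss(0, 0.3) for t in start_year]

# OLS slope: cov(x, y) / var(x).
xbar = sum(start_year) / n
ybar = sum(ate) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(start_year, ate)) \
        / sum((x - xbar) ** 2 for x in start_year)
print(f"estimated slope: {slope:.3f} pp per year")
```

[The within-utility version in Column 3 adds utility indicators (fixed effects), which in this bivariate sketch would amount to demeaning both variables within each utility before computing the same slope.]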

How can researchers address site selection bias? At the partner recruitment stage, we can hypothesize potential U's (moderators that are econometrically "unobserved" in sample data) and try to replicate in additional sites with different U's. Even if it is not possible to estimate the relationship between U and the treatment effect, such a strategy would produce a more realistic distribution of site-level heterogeneity. A second way to minimize site selection bias is to intensify partner recruitment efforts on exactly the types of partners who are less interested in RCTs. This approach is analogous to using intensive follow-up for all or some individuals to minimize and quantify the effects of individual-level sample attrition, as in the Moving to Opportunity experiment (DiNardo, McCrary, and Sanbonmatsu 2006).
Several steps can also be taken when reporting results of RCTs. First, the researcher can explicitly define the target population and provide information on characteristics of individuals and partners in both sample and target, just as in Tables 3 and 5. Second, when target site characteristics are available, site selection probabilities may be useful, both in a test of site selection on observables and as a way to parsimoniously control for observables when extrapolating."