How often does correlation=causality? Fraker & Maynard 1987 offers an interesting datapoint: they exploited a government jobs program which included a randomized experiment, plus national survey data (the Current Population Surveys) from which comparison groups could be constructed, letting them calculate both the true average causal effect and a number of alternative regression estimates (like those an analyst would normally produce given that dataset minus the experimental portion).
You will not be surprised that the regressions spit out completely wrong and even systematically biased estimates (roughly: 0 of 12 estimates were reasonably similar to the causal effect for one target group (youth) and 4 of 12 for the other (AFDC recipients); and the estimates were systematically far too negative for the youth group but less biased for the AFDC group).
Was the effect of the program extremely harmful, harmful, neutral, or positive? If you'd settled for a correlation, you could have wound up being extremely wrong...
"The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs", Fraker & Maynard 1987 https://pdf.yt/d/AmuYucmgh2MghBok / https://www.dropbox.com/s/3799hv83u14yudn/1987-fraker.pdf
"This study investigates empirically the strengths and limitations of using experimental versus nonexperimental designs for evaluating employment and training programs. The assessment involves comparing results from an experimental-design study - the National Supported Work Demonstration - with the estimated impacts of Supported Work based on analyses using comparison groups constructed from the Current Population Surveys. The results indicate that nonexperimental designs cannot be relied on to estimate the effectiveness of employment programs. Impact estimates tend to be sensitive both to the comparison group construction methodology and to the analytic model used. There is currently no way a priori to ensure that the results of comparison group studies will be valid indicators of the program impacts.
Beginning with the OEO-sponsored Negative Income Tax Experiments conducted in the late 1960s and early 1970s, the use of randomized experiments gained wide acceptance in evaluations of health, education, welfare, and labor policies and programs. However, contrary to the strong recommendations of leading labor economists in support of experimentation (see, for example, Ashenfelter 1975), researchers continued to use nonexperimental designs, especially in evaluations of ongoing programs like WIN and CETA and in evaluations with limited funding. Virtually all of the evaluations of the major ongoing federal employment and training programs (i.e., CETA, WIN, and Job Corps) have relied on analytic methodologies that use no comparison group or that have defined comparison groups ex-post from existing sampling frames (see, for example, Ashenfelter 1979, Westat 1980, King and Geraci 1982, Bassi 1984, Bassi et al. 1984, Dickinson et al. 1984, Kiefer 1979, Maller et al. 1982, and Ketron, Inc. 1980). These nonexperimental studies suffer from one major limitation: the integrity of their results rests on untestable assumptions about the adequacy of the analytic model and the unmeasurable characteristics of the participant and comparison groups. Furthermore, the net impact estimates vary widely across studies of a given program due to the use of different model specifications and/or comparison groups. For example, estimates of the net impacts of CETA on the earnings of youth range from large negative impacts to essentially no impacts; those on the earnings of women range from no impacts to large positive impacts; and those on the earnings of adult males range from small positive to large negative impacts (see Barnow 1987, LaLonde and Maynard 1986).
This high variability in the program impact estimates based on comparison group designs has prompted several analyses aimed at assessing the merits of experimental versus nonexperimental designs for employment-training evaluations (see, for example, Ashenfelter and Card 1985, LaLonde 1984, and Burtless and Orr 1986). Our response was to undertake an empirical assessment of the reliability of program impact estimates generated through the nonexperimental methodologies that have been used widely in employment-training evaluations during the past decade. Central to our assessment is a comparison of results from two evaluations of the net impacts of the National Supported Work Demonstration. One set of results is based upon control groups that were selected in accordance with the demonstration's experimental design, while the other set is based upon comparison groups constructed from the Current Population Surveys. The results of our study indicate that nonexperimental design evaluations cannot be relied on to estimate the effectiveness of programs like Supported Work with sufficient precision (and in some cases unbiasedness).
...We demonstrate that program impact estimates may differ substantially between those generated using randomly selected control groups and those generated using comparison groups. We also observe that the impact estimates tend to be sensitive to both the method used to construct comparison groups and the specification of the analytic model.
In conducting this study, we have taken advantage of a unique opportunity provided by a major national experiment, the National Supported Work Demonstration, to explore the adequacy of nonexperimental study designs for evaluating employment and training programs.
The National Supported Work Demonstration, conducted between 1975 and 1979, was a field test of the effectiveness of a highly structured work experience program in mitigating the employment problems of four groups of persons with severe employment disabilities: young school dropouts, AFDC recipients, ex-drug addicts, and ex-offenders. Based on a control-group methodology, Supported Work was found to have increased significantly the employment and earnings of all four target groups during the period of program participation (see Hollister et al. 1984). However, only the AFDC recipients showed evidence of postprogram earnings gains. These longer-term impacts for the AFDC recipients were in the range of 5 to 10 percentage point increases in employment rates and $50 to $80 increases in average monthly earnings.
Our study of the sensitivity of net impact estimates to the evaluation design methodology focuses on the Supported Work Demonstration for three reasons. First, the intervention was similar to the work experience treatments within other employment-training programs. Second, the Supported Work data are sufficiently similar to those used in the previous MDTA and CETA evaluations to permit us to replicate the nonexperimental methodologies used in those prior studies, while offering us the advantage of the control group that can be used to obtain unbiased impact estimates for use as the assessment criteria. Third, there is substantial overlap in the target populations that were served by the Supported Work Demonstration, MDTA, CETA, and currently are served by JTPA.
We further focused our assessment on only two of the four Supported Work target groups: AFDC recipients and youth. These two groups are similar in important respects to youth and adult female participants in MDTA, CETA, and JTPA, and nominally similar individuals can be identified in data sets from which comparison groups might be selected. In contrast, data on the defining attributes of the Supported Work ex-offender and ex-addict target groups - their criminal histories and their drug use - are not available on the data bases that are potential sources of comparison groups.
A. The Supported Work Sample and Data
The Supported Work sample includes 1,244 school dropouts (566 experimentals and 678 controls) ages 17 to 20 years and 1,602 female long-term recipients of AFDC, none of whom had a child younger than age 6 (800 experimentals and 802 controls)...The typical person in the AFDC sample was 34 years old, was black, had ten years of schooling, had two dependents, and had a youngest child between the ages of six and twelve. The average welfare payment was about $280 per month, with an additional average food stamp bonus value of about $70. The women had received welfare for almost nine years on average, and the average length of time since the last regular job for those with some prior work experience was nearly four years (see Masters and Maynard 1984).
The reliability of program impact estimates depends critically on obtaining good estimates of what the outcomes for the participant group would have been had this group not received program services. The best way to obtain such estimates is through the use of an experimental design whereby a random subset of the eligible program applicants is assigned to a no-treatment control group.
The availability of such a control group is one of the unique features of the National Supported Work Demonstration. In this demonstration, program impacts can be measured quite simply by comparing the mean values of the outcomes for experimentals and controls.
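[An editorial sketch, not from the paper: the experimental estimator really is just a difference of group means, and under random assignment it is unbiased for the average impact even when an unobserved factor like "earnings potential" varies widely across people. A toy simulation with a known, entirely hypothetical $300 effect:]

```python
# Toy simulation (hypothetical numbers): under random assignment, the
# difference in mean outcomes between experimentals and controls recovers
# the true average program impact.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 300.0                       # hypothetical impact, in dollars

ability = rng.normal(5_000, 1_500, n)     # unobserved earnings potential
treated = rng.random(n) < 0.5             # random assignment: coin flip
earnings = ability + true_effect * treated + rng.normal(0, 500, n)

impact_hat = earnings[treated].mean() - earnings[~treated].mean()
print(round(impact_hat))                  # ≈ 300: randomization balances
                                          # 'ability' across the two groups
```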
However, as noted above, most employment-training evaluations have not had the benefit of a control group and, therefore, have relied on comparison groups constructed from nonprogram data bases.
We constructed a "basic" comparison group for both the Supported Work youth and AFDC samples using cell-matching procedures similar to those used in much of the prior CLMS research (see, for example, Westat 1980 and Dickinson et al. 1984). Essentially, we selected cases from the CPS that met the key target-group eligibility criteria: age 16 to 20 and school dropout for youth, and AFDC recipient with no child younger than six for AFDC recipients.
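[The cell-matching step amounts to filtering survey records into the eligibility "cells" that define each target group. A minimal sketch; the field names and records below are hypothetical, not the authors' actual CPS variables:]

```python
# Sketch of cell matching (hypothetical field names): keep only survey
# records falling inside the eligibility cell for each target group.

def match_youth(person):
    # Youth cell: age 16 to 20 and a school dropout.
    return 16 <= person["age"] <= 20 and person["dropout"]

def match_afdc(person):
    # AFDC cell: welfare recipient with no child younger than six.
    return person["afdc"] and person["youngest_child_age"] >= 6

survey = [  # toy stand-in for a CPS extract; 99 = no children
    {"age": 18, "dropout": True,  "afdc": False, "youngest_child_age": 99},
    {"age": 34, "dropout": False, "afdc": True,  "youngest_child_age": 8},
    {"age": 19, "dropout": False, "afdc": False, "youngest_child_age": 99},
]

youth_comparison = [p for p in survey if match_youth(p)]
afdc_comparison = [p for p in survey if match_afdc(p)]
print(len(youth_comparison), len(afdc_comparison))  # 1 1
```

[The catch, as the paper goes on to show, is that matching on these observables says nothing about the unobservables that drove program enrollment.]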
...Although the control and comparison groups have been defined such that their preprogram characteristics (weighted) are similar, their postprogram earnings paths tend to differ, especially for the youth samples. As seen in Figure 1A, the earnings of the control and comparison group youth are very similar and show small annual increases during the preprogram period (1972 through 1974). However, the earnings paths diverge significantly during the enrollment years (1975 through early 1977) and the follow-up period (1978, 1979), with the comparison group exhibiting a much steeper age-earnings profile than the control group. For the AFDC control group, Figure 1B shows fairly constant earnings levels during the pre-enrollment period, with larger annual increases beginning near the middle of the enrollment period.
...The basic model assumes that earnings are a function of prior earnings, personal characteristics, and environmental factors, as well as program participation. Program impacts are measured by the estimated coefficient on the program participation variable. Underlying this model are two critical assumptions: (1) that the control variables fully account for factors that are correlated with both program participation and the outcome of interest (earnings, in our case); and (2) that the underlying behavioral models of the determinants of earnings are similar for the participant and comparison groups... The results,
summarized in Table 1, indicate that, had we been constrained to use comparison group methods for the original Supported Work evaluation and had we chosen the "basic" comparison-group construction procedure and analytic model, we would have arrived at qualitatively similar conclusions to the experimental study findings for AFDC recipients - that the program had relatively large positive effects. However, comparison group methods would have led to quite misleading conclusions about the effects of Supported Work on youth. In essence, while Supported Work led to significant short-run increases in earnings of youth as a result of the program jobs (the 1977 results) and no long-run effect (the 1978 and 1979 results), we would have concluded that Supported Work had significant, large negative effects on the earnings of youth, both during their Supported Work employment period and subsequently.
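[The "basic" earnings model is an OLS regression whose impact estimate is the coefficient on a participation dummy. A toy sketch (my construction; every number and variable is hypothetical) shows what happens when the paper's assumption (1) fails - when an unobservable drives both participation and earnings:]

```python
# Toy simulation (hypothetical numbers): OLS "impact" estimate when an
# unobserved trait drives both enrollment and earnings.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
true_effect = 300.0                          # hypothetical program impact

motivation = rng.normal(0, 1, n)             # unobserved; raises earnings
prior_earnings = rng.normal(4_000, 1_000, n)

# Nonrandom "participation": low prior earnings and low motivation both
# raise the chance of enrolling, as in a program for the disadvantaged.
participate = (prior_earnings + 800 * motivation
               + rng.normal(0, 500, n)) < 4_000

earnings = (0.8 * prior_earnings + 600 * motivation
            + true_effect * participate + rng.normal(0, 300, n))

# "Basic" model: earnings ~ intercept + prior earnings + participation.
X = np.column_stack([np.ones(n), prior_earnings, participate])
beta, *_ = np.linalg.lstsq(X, earnings, rcond=None)
impact_hat = beta[2]
print(round(impact_hat))   # far below the true 300; here the omitted
                           # 'motivation' even makes the program look harmful
```

[This mirrors, in caricature, the youth result above: a genuinely positive short-run effect showing up in the regression as a large negative one.]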
The results are especially striking for youth: not only do the magnitudes of the net-impact estimates from analyses using the control group differ substantially from those based on all of the comparison groups, but also, for many of the estimates, the qualitative judgments about the impacts of the demonstration differ among the comparison-group estimates. For example, reading across the top row in Table 3, we see that, relying on the experimental design, the estimated impact on 1977 earnings is a statistically significant increase of $313. The point estimate of the impact is $166 (not statistically significant) using the random sample of youth in the CPS, and the point estimates range from -$388 to -$774 for the other comparison groups, with two of these negative estimates being statistically significant. Based on the two statistical match comparison groups, we would conclude that Supported Work had no effects for any time period; with the long-list cell match sample, we would conclude that the program had negative impacts in all time periods; and with all of the other comparison groups, we would conclude that it had negative impacts in most time periods and no effects in others...For the AFDC sample, the qualitative conclusions one would draw are roughly comparable across the estimates based on the various comparison samples. In each case, the weight of the evidence is that the program had large impacts on earnings during the in-program period and that, although the size of the impacts diminished over time, they were still relatively large as late as 1979. However, the nonexperimental estimates range from 73 percent smaller to 129 percent larger than the experimental estimates in the latter two years and, more often than not, the comparison group estimates are not statistically significant, whereas all of the experimental estimates are statistically significant.
[table 3: of the 6 alternative analyses which do not use the randomized control group but correspond to correlational groups, none of the 12 Youth earning estimates are similar to the 2 randomized Youth earning estimates, and 4 of the 12 AFDC earning estimates are similar to the 2 randomized AFDC earning estimates]
The results presented in Table 4 demonstrate some sensitivity of the impact estimates to the particular analytic model used. However, these results tend to vary less than do those based on different comparison groups. As expected, the estimates based on the control group design are not very sensitive to model specification, taking account of the correspondence between calendar years 1977, 1978, and 1979 used in the simple differences of means and the basic earnings models and the program years used in the fixed effects model. Similarly, when the comparison groups are utilized, there is a fair degree of correspondence between the estimated impacts based on simple differences in earning gains and those estimated from the basic earnings model. However, the fixed-effects estimates based upon the comparison groups tend to differ substantially from the estimates generated by the other two analytic models. A comparison of estimates based on the comparison group methodology to estimates based on the control group methodology does not clearly show whether one analytic model consistently leads to better impact estimates than the others.
Regardless of the analytic model, the performance of comparison group methods for youth is overwhelmingly poor. Possible explanations are that the earnings models are simply misspecified and/or that the underlying behavioral models differ between the participant and comparison groups in ways that cannot be controlled for statistically. Although using an F-test we could reject the similarity of the underlying models for the youth comparison and control samples in only a few instances, it is notable that the coefficients differ considerably both across samples and over time, suggesting that the models may nevertheless differ (also, refer to Figure 2 above and Fraker and Maynard 1984).
One puzzling result is the consistently better performance of the comparison group methods for AFDC recipients as compared with their performance for youth. One factor that undoubtedly contributes to this differential in performance is the greater heterogeneity among the youth sample, as evidenced by higher variances of earnings in the preprogram period and the higher rate of increase in the variability of earnings over time among youth as compared to AFDC recipients, in general. This implies that there is much more room for biased selection into the program and, hence, the task of defining a comparison group and an analytic model to compensate for the biased selection is more challenging. A second factor that might contribute to the differential in the results between the two groups is the predictability of the time paths of earnings growth for the two groups. We speculate that pre-enrollment earnings are a more powerful predictor of future earnings for the AFDC sample than for the youth sample. If true, this may explain why for the AFDC cases we were able to select better comparison groups and specify analytic models that better control for differences in earnings potential between comparison and experimental cases.
In an independent analysis, LaLonde (1984 and 1986) undertook an examination of the quality of impact estimates generated from comparison group methodologies also using the Supported Work Demonstration data. LaLonde defined comparison groups for two subsets of the Supported Work sample - the AFDC target group and males who enrolled in the youth, ex-addict, or ex-offender target groups - by taking random subsets of AFDC recipients and males, respectively, in the 1976 CPS sample who were in the labor force in March 1976 and whose nominal income in 1975 was less than $20,000 (household income was less than $30,000). Using these comparison groups and the Supported Work control group, LaLonde estimated program impacts on annual earnings based on several analytic models: a simple earnings gains model; a difference between postprogram earnings, controlling for preprogram earnings; a model similar to the "basic" earnings model described above, controlling for preprogram earnings and many other observed characteristics; and a "basic" earnings-type of model that includes a participation selection-bias correction factor.
LaLonde's results corroborate several of the findings from our study. First, he found that when using the control group, the analytic models and econometric methods used have little effect on program impact estimates; but when using the comparison samples, the analytic models affect significantly the impact results. Second, he found that comparison groups work better for AFDC recipients than they do for males. However, it is important to note that he came close to replicating the experimental results for AFDC recipients only with the analysis model that controlled explicitly for the participation decision. LaLonde's results also demonstrate two other important points. First, they show that controlling for preprogram earnings differences is very important and, second, they show that including a nonlinear control for the program participation decision will tend to reduce bias relative to other model specifications.
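[LaLonde's first point - that controlling for preprogram earnings matters - can be illustrated with another toy sketch (mine, with hypothetical numbers): when selection into the program operates on an *observed* variable like prior earnings, the raw participation coefficient is badly biased, but adding the prior-earnings control recovers the true effect:]

```python
# Toy simulation (hypothetical numbers): bias from selection on prior
# earnings disappears once prior earnings are included as a control.
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
true_effect = 300.0                       # hypothetical program impact

prior = rng.normal(4_000, 1_000, n)
# Selection on an observed variable: lower prior earnings -> enrollment.
participate = (prior + rng.normal(0, 500, n)) < 4_000
earnings = 0.8 * prior + true_effect * participate + rng.normal(0, 300, n)

def ols_impact(with_prior):
    # OLS of earnings on a participation dummy, with or without the
    # prior-earnings control; returns the participation coefficient.
    cols = [np.ones(n), participate] + ([prior] if with_prior else [])
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, earnings, rcond=None)
    return beta[1]

print(round(ols_impact(False)))  # strongly negative: raw comparison is
                                 # contaminated by who enrolls
print(round(ols_impact(True)))   # ≈ 300 once prior earnings are controlled
```

[The harder case, per the sketch earlier, is selection on *unobservables*, which no amount of controlling for observed covariates fixes - hence LaLonde's need for an explicit participation-decision correction.]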
For several reasons, we believe that the results of this study apply more generally to other program evaluations that must rely on comparison group designs, especially evaluations of programs such as WIN, Work/Welfare, CETA, and JTPA. First, as was discussed above, individuals in the Supported Work youth and AFDC samples are similar to participants in these other programs in that they tend to have experienced severe employment problems. They differ in that both samples have less attachment to the work force than the typical participant in these larger-scale programs and in that a higher proportion of the Supported Work youth sample exhibits characteristics that are associated with exceptionally low levels of employment and earnings - minority ethnic/racial composition, low educational levels, and limited employment experience.
Second, the key element of the Supported Work treatment (supervised employment) is similar in many respects to that of CETA on-the-job-training, work-experience, and public-service-employment positions. Third, and perhaps most notable, the range of net impact estimates generated for Supported Work using the various comparison groups generally spans the range of estimates from other evaluations using comparison group designs of employment programs targeted on similar segments of the population (youth and disadvantaged women). For example, as seen in Table 5, the program impacts estimated for youth groups (including the Supported Work youth) using comparison group methodologies applied to CPS data are generally large, negative, and often statistically significant; those for welfare recipients are uniformly positive but range widely in magnitude.
Yet, we have strong evidence based on the control group that Supported Work had no long-term impact for youth and modest positive impacts for welfare recipients."
#statistics #economics #causality
You will not be surprised that the regressions spit out completely wrong and even systematically biased estimates (roughly: 0 of 12 estimates were reasonably similar to the causal effect for one job program and 4/12 for another job program; and the estimates were systematically far too negative with the first job program but less biased for the second).
Was the effect of those job programs extremely harmful, harmful, neutral, or positive? If you'd settled for a correlation, you could have wound up being extremely wrong...
"The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs", Fraker & Maynard 1987 https://pdf.yt/d/AmuYucmgh2MghBok / https://www.dropbox.com/s/3799hv83u14yudn/1987-fraker.pdf
"This study investigates empirically the strengths and limitations of using experimental versus nonexperimental designs for evaluating employment and training programs. The assessment involves comparing results from an experimental-design study - the National Supported Work Demonstration - with the estimated impacts of Supported Work based on analyses using comparison groups constructed from the Current Population Surveys. The results indicate that nonexperimental designs cannot be relied on to estimate the effectiveness of employment programs. Impact estimates tend to be sensitive both to the comparison group construction methodology and to the analytic model used. There is currently no way a priori to ensure that the results of comparison group studies will be valid indicators of the program impacts.
Beginning with the OEO-sponsored Negative Income Tax Experiments conducted in the late 1960s and early 1970s, the use of randomized experiments gained wide acceptance in evaluations of health, education, welfare, and labor policies and programs. However, contrary to the strong recommendations of leading labor economists in support of experimentation (see, for example, Ashenfelter 1975), researchers continued to use nonexperimental designs, especially in evaluations of ongoing programs like WIN and CETA and in evaluations with limited funding. Virtually all of the evaluations of the major ongoing federal employment and training programs (i.e., CETA, WIN, and Job Corps) have relied on analytic methodologies that use no comparison group or that have defined comparison groups ex-post from existing sampling frames (see, for example, Ashenfelter 1979, Westat 1980, King and Geraci 1982, Bassi 1984, Bassi et al. 1984, Dickinson et al. 1984, Kiefer 1979, Maller et al. 1982, and Ketron, Inc. 1980). These nonexperimental studies suffer from one major limitation: the integrity of their results rests on untestable assumptions about the adequacy of the analytic model and the unmeasurable characteristics of the participant and comparison groups. Furthermore, the net impact estimates vary widely across studies of a given program due to the use of different model specifications and/or comparison groups. For example, estimates of the net impacts of CETA on the earnings of youth range from large negative impacts to essentially no impacts; those on the earnings of women range from no impacts to large positive impacts; and those on the earnings of adult males range from small positive to large negative impacts (see Barnow 1987, LaLonde and Maynard 1986).
This high variability in the program impact estimates based on comparison group designs has prompted several analyses aimed at assessing the merits of experimental versus nonexperimental designs for employment-training evaluations (see, for example, Ashenfelter and Card 1985, LaLonde 1984, and Burtless and Orr 1986). Our response was to undertake an empirical assessment of the reliability of program impact estimates generated through the nonexperimental methodologies that have been used widely in employment-training evaluations during the past decade. Central to our assessment is a comparison of results from two evaluations of the net impacts of the National Supported Work Demonstration. One set of results is based upon control groups that were selected in accordance with the demonstration's experimental design, while the other set is based upon comparison groups constructed from the Current Population Surveys. The results of our study indicate that nonexperimental design evaluations cannot be relied on to estimate the effectiveness of programs like Supported Work with sufficient precision (and in some cases unbiasedness)
...We demonstrate that program impact estimates may differ substantially between those generated using randomly selected control groups and those generated using comparison groups. We also observe that the impact estimates tend to be sensitive to both the method used to construct comparison groups and the specification of the analytic model.
In conducting this study, we have taken advantage of a unique opportunity provided by a major national experiment, the National Supported Work Demonstration, to explore the adequacy of nonexperimental study designs for evaluating employment and training programs.
The National Supported Work Demonstration, conducted between 1975 and 1979, was a field test of the effectiveness of a highly structured work experience program in mitigating the employment problems of four groups of persons with severe employment disabilities: young school dropouts, AFDC recipients, ex-drug addicts, and ex-offenders. Based on a control-group methodology, Supported Work was found to have increased significantly the employment and earnings of all four target groups during the period of program participation (see Hollister et al. 1984). However, only the AFDC recipients showed evidence of postprogram earnings gains. These longer-term impacts for the AFDC recipients were in the range of 5 to 10 percentage point increases in employment rates and $50 to $80 increases in average monthly earnings.
Our study of the sensitivity of net impact estimates to the evaluation design methodology focuses on the Supported Work Demonstration for three reasons. First, the intervention was similar to the work experience treatments within other employment-training programs. Second, the Supported Work data are sufficiently similar to those used in the previous MDTA and CETA evaluations to permit us to replicate the nonexperimental methodologies used in those prior studies, while offering us the advantage of the control group that can be used to obtain unbiased impact estimates for use as the assessment criteria. Third, there is substantial overlap in the target populations that were served by the Supported Work Demonstration, MDTA, CETA, and currently are served by JTPA.
We further focused our assessment on only two of the four Supported Work target groups: AFDC recipients and youth. These two groups are similar in important respects to youth and adult female participants in MDTA, CETA, and JTPA, and nominally similar individuals can be identified in data sets from which comparison groups might be selected. In contrast, data on the defining attributes of the Supported Work ex-offender and ex-addict target groups - their criminal histories and their drug use - are not available on the data bases that are potential sources of comparison groups.
A. The Supported Work Sample and Data
The Supported Work sample includes 1,244 school dropouts (566 experimentals and 678 controls) ages 17 to 20 years and 1,602 female long-term recipients of AFDC, none of whom had a child younger than age 6 (800 experimentals and 802 controls)...The typical person in the AFDC sample was 34 years old, was black, had ten years of schooling, had two dependents, and had a youngest child between the ages of six and twelve. The average welfare payment was about $280 per month, with an additional average food stamp bonus value of about $70. The women had received welfare for almost nine years on average, and the average length of time since the last regular job for those with some prior work experience was nearly four years (see Masters and Maynard 1984).
The reliability of program impact estimates depends critically on obtaining good estimates of what the outcomes for the participant group would have been had this group not received program services. The best way to obtain such estimates is through the use of an experimental design whereby a random subset of the eligible program applicants is assigned to a no-treatment control group.10
The availability of such a control group is one of the unique features of the National Supported Work Demonstration. In this demonstration, program impacts can be measured quite simply by comparing the mean values of the outcomes for experimentals and controls.
However, as noted above, most employment-training evaluations have not had the benefit of a control group and, therefore, have relied on comparison groups constructed from non0program data bases.
We constructed a "basic" comparison group for both the Supported Work Youth and AFDC samples using cell matching procedures similar to those used in much of the prior CLMS research (see, for example, Westat 1980 and Dickinson et al. 1984). Essentially, we selected cases from the CPS that met the key target group eligibility criteria (age 16 to 20) and school drop-out for youth, and AFDC recipient and no child younger than six for AFDC recipients
...Although the control and comparison groups have been defined such that their preprogram characteristics (weighted) are similar, their postprogram earnings paths tend to differ, especially for the youth samples. As seen in Figure 1A, the earnings of the control and comparison group youth are very similar and show small annual increases during the preprogram period (1972 through 1974). However, the earnings paths diverge significantly during the enrollment years (1975 through early 1977) and the follow-up period (1978, 1979), with the comparison group exhibiting a much steeper age-earnings profile than the control group. For the AFDC control group, Figure 1B shows fairly constant earnings levels during the pre-enrollment period, with larger annual increases beginning near the middle of the enrollment period.
...The basic model assumes that earnings are a function of prior earnings, personal characteristics, and environmental factors,'2 as well as program participation. Program impacts are measured by the estimated coefficient on the program participation variable. Underlying this model are two critical assumptions: (1) that the control variables fully account for factors that are correlated with both program participation and the outcome of interest (earnings, in our case); and (2) that the underlying behavioral models of the determinants of earnings are similar for the participant and comparison groups...3 The results,
summarized in Table 1, indicate that, had we been constrained to use comparison group methods for the original Supported Work evaluation and had we chosen the "basic" comparison-group construction procedure and analytic model, we would have arrived at qualitatively similar conclusions to the experimental study findings for AFDC recipients: that the program had relatively large positive effects. However, comparison group methods would have led to quite misleading conclusions about the effects of Supported Work on youth. In essence, while Supported Work led to significant short-run increases in earnings of youth as a result of the program jobs (the 1977 results) and no long-run effect (the 1978 and 1979 results), we would have concluded that Supported Work had significant, large negative effects on the earnings of youth, both during their Supported Work employment period and subsequently.
The results are especially striking for youth: not only do the magnitudes of the net-impact estimates from analyses using the control group differ substantially from those based on all of the comparison groups, but also, for many of the estimates, the qualitative judgments about the impacts of the demonstration differ among the comparison-group estimates. For example, reading across the top row in Table 3, we see that, relying on the experimental design, the estimated impact on 1977 earnings is a statistically significant increase of $313. The point estimate of the impact is $166 (not statistically significant) using the random sample of youth in the CPS, and the point estimates range from -$388 to -$774 for the other comparison groups, with two of these negative estimates being statistically significant. Based on the two statistical match comparison groups, we would conclude that Supported Work had no effects for any time period; with the long-list cell match sample, we would conclude that the program had negative impacts in all time periods; and with all of the other comparison groups, we would conclude that it had negative impacts in most time periods and no effects in others...For the AFDC sample, the qualitative conclusions one would draw are roughly comparable across the estimates based on the various comparison samples. In each case, the weight of the evidence is that the program had large impacts on earnings during the in-program period and that, although the size of the impacts diminished over time, they were still relatively large as late as 1979. However, the nonexperimental estimates range from 73 percent smaller to 129 percent larger than the experimental estimates in the latter two years and, more often than not, the comparison group estimates are not statistically significant, whereas all of the experimental estimates are statistically significant.
[table 3: of the 6 alternative analyses which do not use the randomized control group but correspond to correlational groups, none of the 12 Youth earning estimates are similar to the 2 randomized Youth earning estimates, and 4 of the 12 AFDC earning estimates are similar to the 2 randomized AFDC earning estimates]
The results presented in Table 4 demonstrate some sensitivity of the impact estimates to the particular analytic model used. However, these results tend to vary less than do those based on different comparison groups. As expected, the estimates based on the control group design are not very sensitive to model specification, taking account of the correspondence between calendar years 1977, 1978, and 1979 used in the simple differences of means and the basic earnings models and the program years used in the fixed effects model. Similarly, when the comparison groups are utilized, there is a fair degree of correspondence between the estimated impacts based on simple differences in earning gains and those estimated from the basic earnings model. However, the fixed-effects estimates based upon the comparison groups tend to differ substantially from the estimates generated by the other two analytic models. A comparison of estimates based on the comparison group methodology to estimates based on the control group methodology does not clearly show whether one analytic model consistently leads to better impact estimates than the others.
Regardless of the analytic model, the performance of comparison group methods for youth is overwhelmingly poor. Possible explanations are that the earnings models are simply misspecified and/or that the underlying behavioral models differ between the participant and comparison groups in ways that cannot be controlled for statistically. Although using an F-test we could reject the similarity of the underlying models for the youth comparison and control samples in only a few instances, it is notable that the coefficients differ considerably both across samples and over time, suggesting that the models may nevertheless differ (also, refer to Figure 2 above and Fraker and Maynard 1984).
One puzzling result is the consistently better performance of the comparison group methods for AFDC recipients as compared with their performance for youth. One factor that undoubtedly contributes to this differential in performance is the greater heterogeneity among the youth sample, as evidenced by higher variances of earnings in the preprogram period and the higher rate of increase in the variability of earnings over time among youth as compared to AFDC recipients, in general. This implies that there is much more room for biased selection into the program and, hence, the task of defining a comparison group and an analytic model to compensate for the biased selection is more challenging. A second factor that might contribute to the differential in the results between the two groups is the predictability of the time paths of earnings growth between the two groups. We speculate that pre-enrollment earnings are a more powerful predictor of future earnings for the AFDC sample than for the youth sample. If true, this may explain why for the AFDC cases we were able to select better comparison groups and specify analytic models that better control for differences in earnings potential between comparison and experimental cases.
In an independent analysis, LaLonde (1984 and 1986) undertook an examination of the quality of impact estimates generated from comparison group methodologies also using the Supported Work Demonstration data. LaLonde defined comparison groups for two subsets of the Supported Work sample (the AFDC target group, and males who enrolled in the youth, ex-addict, or ex-offender target groups) by taking random subsets of AFDC recipients and males, respectively, in the 1976 CPS sample who were in the labor force in March 1976 and whose nominal income in 1975 was less than $20,000 (household income was less than $30,000). Using these comparison groups and the Supported Work control group, LaLonde estimated program impacts on annual earnings based on several analytic models: a simple earnings gains model; a difference between postprogram earnings, controlling for preprogram earnings; a model similar to the "basic" earnings model described above, controlling for preprogram earnings and many other observed characteristics; and a "basic" earnings-type of model that includes a participation selection-bias correction factor.
LaLonde's results corroborate several of the findings from our study. First, he found that when using the control group, the analytic models and econometric methods used have little effect on program impact estimates; but when using the comparison samples, the analytic models significantly affect the impact results. Second, he found that comparison groups work better for AFDC recipients than they do for males. However, it is important to note that he came close to replicating the experimental results for AFDC recipients only with the analysis model that controlled explicitly for the participation decision. LaLonde's results also demonstrate two other important points. First, they show that controlling for preprogram earnings differences is very important and, second, they show that including a nonlinear control for the program participation decision will tend to reduce bias relative to other model specifications.
For several reasons, we believe that the results of this study apply more generally to other program evaluations that must rely on comparison group designs, especially evaluations of programs such as WIN, Work/Welfare, CETA, and JTPA. First, as was discussed above, individuals in the Supported Work youth and AFDC samples are similar to participants in these other programs in that they tend to have experienced severe employment problems. They differ in that both samples have less attachment to the work force than the typical participant in these larger-scale programs and in that a higher proportion of the Supported Work youth sample exhibits characteristics that are associated with exceptionally low levels of employment and earnings: minority ethnic/racial composition, low educational levels, and limited employment experience.
Second, the key element of the Supported Work treatment (supervised employment) is similar in many respects to that of CETA on-the-job-training, work-experience, and public-service-employment positions. Third, and perhaps most notable, the range of net impact estimates generated for Supported Work using the various comparison groups generally spans the range of estimates from other evaluations using comparison group designs of employment programs targeted on similar segments of the population (youth and disadvantaged women). For example, as seen in Table 5, the program impacts estimated for youth groups (including the Supported Work youth) using comparison group methodologies applied to CPS data are generally large, negative, and often statistically significant; those for welfare recipients are uniformly positive but range widely in magnitude.
Yet, we have strong evidence based on the control group that Supported Work had no long-term impact for youth and modest positive impacts for welfare recipients."
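The mechanism behind the divergent youth estimates (self-selection into the program by people with unobservably poor earnings prospects) can be sketched in a toy simulation. All numbers below are invented for illustration and are not taken from the paper; the point is only that a naive comparison-group contrast goes badly wrong while randomization recovers the true (here, zero) effect:

```python
import math
import random
import statistics as stats

random.seed(0)
N = 50_000
TRUE_EFFECT = 0.0  # assume, as for the Supported Work youth, no long-run effect

def earnings(potential, in_program):
    """Observed earnings depend on unobserved potential, noise, and the program."""
    return 10 + 2 * potential + TRUE_EFFECT * in_program + random.gauss(0, 1)

# Non-experimental setting: people with worse prospects are likelier to enroll,
# so the enrollees and the CPS-style comparison group differ on an unmeasured trait.
enrollees, comparison = [], []
for _ in range(N):
    potential = random.gauss(0, 1)
    p_enroll = 1 / (1 + math.exp(2 * potential))  # low potential -> likely enrollee
    if random.random() < p_enroll:
        enrollees.append(earnings(potential, True))
    else:
        comparison.append(earnings(potential, False))
naive_estimate = stats.mean(enrollees) - stats.mean(comparison)

# Experimental setting: a coin flip, independent of potential, assigns treatment.
treat, control = [], []
for _ in range(N):
    potential = random.gauss(0, 1)
    if random.random() < 0.5:
        treat.append(earnings(potential, True))
    else:
        control.append(earnings(potential, False))
rct_estimate = stats.mean(treat) - stats.mean(control)

print(f"naive comparison-group estimate: {naive_estimate:+.2f}")  # strongly negative
print(f"randomized estimate:             {rct_estimate:+.2f}")    # near the true 0
```

No regression adjustment on observed covariates can rescue the first estimate here, because the confounder (`potential`) is never measured; that is exactly the "untestable assumptions about... unmeasurable characteristics" the abstract warns about.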
#statistics #economics #causality
+gwern branwen
It would be TOTALLY unethical to assign people into different communities at random. These are people's lives we are talking about. Not lab rats.
In reality what we did for a proposed second runway when I worked for public health is perform modelling of noise contours from aircraft movements and use standards based on epidemiological studies elsewhere in the world to determine where housing should not be permitted, and where existing housing should receive mitigation measures to reduce the impact of environmental noise.
Sure, if natural experiments occur, then great. But this is a commercial operation. You can't shut down a huge part of the economy of a city for a day for the sake of a scientist's whim.
Regarding the "because of an airport" comment. I agree I wouldn't introduce "because of an airport" into any qualitative research... for example, in doing qualitative interviewing I found one person who liked noise-afflicted environments. You have to be open to all sorts of things when you run a focus group and aware of your own biases. It is a completely different set of skills to quantitative research.
I'm not aware of the precautionary principle as applied to medicine (not my field) only in relation to environmental pollution. In that case, the precautionary principle says "we don't know quite how hazardous x is, so let's avoid exposure to it".
I do agree however that there is a lot of medicine that is merely received wisdom and needs a serious critical review. Oct 11, 2014
> It would be TOTALLY unethical to assign people into different communities at random.
You're right, clearly we should be using approaches which get the right answer a third of the time. If we're lucky. Yes, that's definitely ethical. Very ethical.
> These are people's lives we are talking about. Not lab rats.
So why do you value getting the right answer so little? After all, they aren't worthless lab rats - they're humans.
(But I suppose that's how it goes: if one person dies in research, that's a tragedy; if a million die over subsequent decades as conflicting results come out due to the inherent limits of correlational research, well, that's just a statistic, as Dr Stalin might've said. Everyone's hearts were in the right place.)
> Sure if natural experiments occur, then great. But this is a commercial operation. You can't shut down a huge part of the economy of a city for a day for the sake of a scientists' whim.
Societies and governments shut down plenty of things for lesser reasons.
> I agree I wouldn't introduce "because of an airport" into any qualitative research... for example in doing qualitative interviewing I found that one person who liked noise afflicted environments.
I don't agree and I never said that. My point was that in a discussion of a common epistemological error, you made exactly that error without realizing it. Given that it's common and you just made it, I seriously doubt that you have always been scrupulous about avoiding language with connotations of causality or explicit about the difference.
> In that case, the precautionary principle says "we don't know quite how hazardous x is, so let's avoid exposure to it".
And yet, just as I said, the status quo gets grandfathered in. Where are environmental pollution research's lead experiments? After all, lead was merely being dumped into the skies to brain-damage everyone for decades before it finally got restricted; why wasn't there (and still isn't!) a precise, focused experiment to definitively end the debate and establish what level we need to clamp it down to to have no effects on children? Well, the precautionary principle doesn't apply to the status quo, and gosh, an experiment with lead - that might be... 'unethical'.
The precautionary principle solves all the wrong problems and is destructive.
> I do agree however that there is alot of medicine that is merely received wisdom and needs a serious critical review.
From my perspective, the environmental pollution research I've seen is just as bad as medical research; it may suffer less from pre-1900 beliefs and practices, but in exchange it's much more contaminated by politics and quasi-religious motivations. 6 of one, half a dozen of the other. Oct 11, 2014
+gwern branwen There simply isn't the money to do what you want, and if there was, it would be spent on extra hip operations, prenatal care or diabetes treatment, or other pressing needs. In NZ, public health services are penny-pinching.
When the airport negotiations were occurring I was a twenty-something female, with a whole range of other responsibilities, in a room full of male CEOs and the Mayor. It was hard enough to get public health even recognised on the available evidence, let alone get funding for further research. There was no funding available for me to draw on... and the legislation available to protect the community was weak. It was as much as I could achieve to get the airport to model noise footprints and agree to some mitigation for the houses that would now be noise afflicted. There was only one runway, so closing it wasn't an option (even if I had sufficient influence lol). We designed a substantial research proposal but couldn't get funding (the research I referred to was financed by me, as a pilot, as part of my studies).
Precautionary principle has clearly been applied differently in your country than in New Zealand. e.g. Exposure to lead is highly controlled. However, environmental degradation has accelerated under the current government and there are plans to remove more environmental protections. :-( Oct 11, 2014
> There simply isn't the money to do what you want, and if there was, it would be spent on extra hip operations, prenatal care or diabetes treatment, or other pressing needs.
Yet, there is money to fund the literally millions of scientific papers published each year worldwide (a large fraction of them in the US). There is money, but it's going to quantity rather than quality. Oct 12, 2014
+gwern branwen Not in New Zealand. On a per capita basis, we spend a fraction of what other OECD countries spend. :-( Oct 12, 2014
Think of it not as a problem but an opportunity: since so few randomized trials get run in some areas, if NZ preferred to fund those and could get over the 'ethics', it'll reap a lot more citations per $! It's a huge inefficiency in the 'research market' which a small sovereign country could reap! (I'd appeal to saving lives, but I know that doesn't work. Citations and money are what actually make the world go 'round.) Oct 12, 2014