"The harm done by tests of significance", Hauer 2004; via http://andrewgelman.com/2013/03/25/the-harm-done-by-tests-of-significance/ ; alas, the paper gives no estimates of how many deaths the delays caused by NHST amount to. Excerpts:
"Three historical episodes in which the application of null hypothesis significance testing (NHST) led to the mis-interpretation of data are described. It is argued that the pervasive use of this statistical ritual impedes the accumulation of knowledge and is unfit for use.
Most university students are taught the rudiments of statistical null hypothesis significance testing (NHST for short). As a result, later in life, either as users of scientific knowledge or as its creators, they tend to regard NHST to be the hallmark of sound science, an effective safeguard against spurious findings. That the logical foundation and the scientific merit of NHST is questioned by prominent statisticians and scientists is not mentioned in text books and courses on introductory statistics; therefore it is not common knowledge. Yet, volumes have been written about the ‘significance controversy’ (see, e.g. books by Chow, 1996; Harlow et al., 1997). I have written about the paralyzing effect of statistical significance on road safety research a long time ago (Hauer, 1983) and did not plan to return to this topic again. However, the road safety literature is a constant reminder of the continuing real harm done by NHST. The harm is that of using sound data to reach unsound conclusions thereby giving sustenance to non-sensical beliefs. In the end, these non-sensical beliefs cause needless loss of life and limb.
Episode 1: the right-turn-on-red story...The practice of allowing ‘right-turn-on-red’ (or RTOR) at signalized intersections started in California in 1937 (some say that it started earlier, in New York City). For a long time, it was frowned upon by engineers in other states who had safety concerns. ...Our story begins in 1976 when a consultant submitted a report about the safety repercussions of RTOR to the Governor and General Assembly of Virginia. The studies then extant were deemed deficient and the consultant did his own before–after study at 20 intersections with the results in Table 1. Looking at the data in Table 1, persons without training in statistics would think that after RTOR was allowed, these intersections were somewhat less safe. However, the consultant concluded, quite correctly, that the change was not statistically significant.
...More published studies followed. One study in 1977 found that there were 19 crashes involving right turning vehicles before and 24 after allowing RTOR and “this increase in accidents is not statistically significant, and therefore it cannot be said that this increase in RTOR accidents is attributable to RTOR”. And so the sequence of small studies all pointing in the same direction but with statistically not significant results continued to accumulate, till that last study which I followed was published in 1983. While 287 crashes to right turning vehicles were expected, 313 were counted. The authors concluded, once again, that there was no significant difference in vehicular crashes. Similarly for pedestrians. In one state, 74 were expected and 92 occurred; in another state 81 were expected and 87 occurred. And yet, the authors concluded “. . . that there is no statistically significant difference . . . (in) pedestrian accidents before and after RTOR”. ...After RTOR became nearly universally used in North America, several large data sets became available and the adverse effect of RTOR could be established (Zador et al., 1982; Preusser et al., 1982).
The problem is clear. Researchers obtain real data which, while noisy, time and again point in a certain direction. However, instead of saying: “here is my estimate of the safety effect, here is its precision, and this is how what I found relates to previous findings”, the data is processed by NHST, and the researcher says, correctly but pointlessly: “I cannot be sure that the safety effect is not zero”. Occasionally, the researcher adds, this time incorrectly and unjustifiably, a statement to the effect that: “since the result is not statistically significant, it is best to assume the safety effect to be zero”. In this manner, good data are drained of real content, the direction of empirical conclusions reversed, and ordinary human and scientific reasoning is turned on its head for the sake of a venerable ritual. As to the habit of subjecting the data from each study to the NHST separately, as if no previous knowledge existed, Edwards (1976, p. 180) notes that “it is like trying to sink a battleship by firing lead shot at it for a long time”.
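Hauer's pooling point can be made concrete with the counts quoted above (313 vehicular crashes observed vs. 287 expected; pedestrian counts of 92 vs. 74 and 87 vs. 81 in two states). The sketch below is not the original authors' analysis; it applies a rough one-sided Poisson test via the normal approximation. Taken separately, the comparisons give only weak-to-borderline evidence, but pooling all the counts, the excess is hard to dismiss:

```python
from math import sqrt, erfc

def poisson_upper_p(observed, expected):
    """One-sided p-value for seeing at least `observed` crashes when
    `expected` are expected, via a normal approximation to the Poisson
    (adequate for counts this large)."""
    z = (observed - expected) / sqrt(expected)
    return 0.5 * erfc(z / sqrt(2))

studies = {  # (observed, expected) counts quoted in the excerpt
    "vehicles":      (313, 287),   # p ~ 0.06: 'not significant'
    "pedestrians A": (92, 74),     # p ~ 0.02
    "pedestrians B": (87, 81),     # p ~ 0.25: 'not significant'
}

for name, (obs, exp) in studies.items():
    print(f"{name}: p = {poisson_upper_p(obs, exp):.3f}")

# Pooling the counts, as a cumulative analysis would:
obs = sum(o for o, _ in studies.values())
exp = sum(e for _, e in studies.values())
print(f"pooled: p = {poisson_upper_p(obs, exp):.4f}")  # p < 0.01
```

Each study alone is lead shot; the battleship sinks only when the shots are added up.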
...The RTOR story shows how easy it is for laymen to confuse ‘not significant’ in the statistical sense with ‘not important’ in the common sense. The confusion is not merely semantic and is not confined to persons without statistical education. As is evident from the shoulder-paving episode, even the statistically sophisticated believed that a non-rejection of the ‘no-effect’ null hypothesis amounts to some kind of confirmation that the data show that there was no effect.
...Episode 2: the safety effect of paving shoulders...The overall impression is what one might expect. That is, that the addition of a two-foot paved shoulder has reduced all crash types, and the addition of a four-foot paved shoulder has reduced crashes even more (except in the ‘same direction’ and the ‘opposite direction’ categories where crashes were so few that the results are simply erratic). The authors provide the following interpretation of the data:
> None of the differences, however, between the actual and expected crash rates were found to be statistically significant (p. 37).
...The authors are right in saying that none of the differences were statistically significant. However, the authors are entirely wrong to spin this into meaning that shoulder paving cannot be justified by its safety effect. They did not do any cost-effect calculation. They seemed to have assumed that because the estimates in the rightmost column of Table 2 are not statistically significant, it is good form to take them to be zero. This makes no sense. It is the estimates in the rightmost column of Table 2, not zero, which represent the most likely safety effect of shoulder paving when based on this study. The absence of statistical significance does not mean and should never be taken to mean that 0 is the most likely estimate....Unfortunately, neither the readers of the professional literature nor many contributors to it are clear about this distinction. Therefore, if what has been presented are instances of mis-application, one must conclude, at least, that NHST is given to common mis-application in the hands of many users. Advocates of NHST could perhaps argue that what is needed are better educated users. This is a vain hope. As will be shown next, not only readers and contributors to learned journals are given to mis-application of the NHST, prominent statisticians suffer from the same affliction.
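The distinction between 'not significant' and 'best estimate is zero' is easy to illustrate. The shoulder-paving tables are not reproduced in this excerpt, so the counts below are hypothetical, chosen only to show the pattern: a nonzero point estimate whose interval happens to straddle zero.

```python
from math import sqrt

# Hypothetical counts (the paper's Table 2 is not reproduced here):
# 80 crashes expected without paving, 68 observed after paving.
expected, observed = 80.0, 68.0

effect = observed / expected - 1   # point estimate of the safety effect
se = sqrt(observed) / expected     # rough Poisson standard error of the ratio
lo, hi = effect - 1.96 * se, effect + 1.96 * se

print(f"best estimate: {effect:+.0%}")           # -15%, not 0%
print(f"95% interval:  [{lo:+.0%}, {hi:+.0%}]")  # straddles 0: 'not significant'
```

The interval containing zero licenses only the statement "the effect might be zero"; the most likely value, given these data, remains the -15% point estimate.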
Episode 3: speed limit increases
... In their paper there is a graph of the monthly time series of fatal crashes from 1975 to 1998 for rural interstates in Arizona and, referring to this graph, the authors say (p. 6) that:
> “We see a significant increase in the level around 1987 but none around 1995. . . . Statistically it is estimated that the 1987 speed-limit increase resulted in a 41% increase in rural interstate crashes in Arizona. There is no statistical evidence that the 1995 speed-limit increase has any additional effect on the number of crashes.”
That is, failure to reject the null hypothesis of zero effect at the 10% level of significance was equated with the absence of statistical evidence for an increase in the expected number of crashes. In all these cases, 0.0 was entered in the table. Thus, the table contains two kinds of entries: either estimates of percentage change when the increase was statistically significant, or 0.0 by NHST convention but unsupported by either data or prior-knowledge when the increase was not statistically significant.
...Similarly if in all 34 states where speed-limits increased in 1995 the expected increase in expected accidents was +11.0% and the estimates had a standard deviation of 15.4% (this being the sample mean of the RMSEs for the 34 states where speed-limits increased) one should expect about eight negative estimates. Six have been observed. That is, the results obtained are perfectly consistent with the possibility that on rural interstates the speed-limit increases in 1987 and in 1995 were associated with an increase in the expected number of accidents in all states. Indeed, this is the hypothesis that is best supported by the data!
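Hauer's back-of-the-envelope count of expected negative estimates checks out under a simple normal model of the estimates (a sketch using only the figures quoted above):

```python
from math import sqrt, erfc

mean, sd, n_states = 11.0, 15.4, 34  # figures quoted in the text

# Probability that a single state's estimate comes out negative when the
# true change is +11.0% and estimates scatter with standard deviation 15.4%:
p_negative = 0.5 * erfc((mean / sd) / sqrt(2))

print(f"P(negative estimate) = {p_negative:.3f}")   # ~ 0.24
print(f"expected negatives among {n_states} states = {n_states * p_negative:.1f}")  # ~ 8
```

About eight negative estimates are expected even if the true effect is +11% everywhere; observing six is therefore entirely consistent with a uniform increase.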
Why then do the authors say that their results “cast doubt on the blanket claim that higher speed-limits and higher fatalities are related”? The opposite would have been the more sensible conclusion to draw since the data provided extraordinarily strong evidence that following the 1987 and the 1995 speed-limit increases the expected number of accidents increased in all states. As in the previous episodes, data painted a characteristically fuzzy but reasonable reflection of reality. However, when good data is passed through the NHST filter, a negative tends to emerge; black turns to white and white to black.
The Balkin and Ord paper has been formally discussed (pp. 13–26) by several prominent statisticians (J. Ledolter, M.D. Fontaine, T.T. Qu, K. Zimmerman, C.H. Spiegelman and A. Harvey). They comment extensively about several aspects of the statistical approach. Surprisingly, no question was raised about the use of NHST, about the appropriateness of subjecting each state separately to a NHST, or about the legitimacy of conclusions drawn in this manner. The use of NHST has received no comment."
The link to the PDF is now broken... (Mar 26, 2013)
/shrug Gelman did something to it: http://andrewgelman.com/2013/03/25/the-harm-done-by-tests-of-significance/#comment-144039 (Mar 26, 2013)