"_p_ Values are not Error Probabilities", Hubbard & Bayarri 2003:
"...researchers erroneously believe that the interpretation of such tests is prescribed by a single coherent theory of statistical inference. This is not the case: Classical statistical testing is an anonymous hybrid of the competing and frequently contradictory approaches formulated by R.A. Fisher on the one hand, and Jerzy Neyman and Egon Pearson on the other. In particular, there is a widespread failure to appreciate the incompatibility of Fisher’s evidential p value with the Type I error rate, α, of Neyman–Pearson statistical orthodoxy. The distinction between evidence (p’s) and error (α’s) is not trivial. Instead, it reflects the fundamental differences between Fisher’s ideas on significance testing and inductive inference, and Neyman–Pearson views of hypothesis testing and inductive behavior. Unfortunately, statistics textbooks tend to inadvertently cobble together elements from both of these schools of thought, thereby perpetuating the confusion. So complete is this misunderstanding over measures of evidence versus error that it is not viewed as even being a problem among the vast majority of researchers.
...First, we outline the marked differences in the conceptual foundations of the Fisherian and Neyman–Pearson statistical testing approaches. Whenever possible, we let the protagonists speak for themselves. ...Second, we show how the rival ideas from the two schools of thought have been unintentionally mixed together. Curiously, this has taken place despite the fact that Neyman–Pearson, and not Fisherian, theory is regarded as classical statistical orthodoxy (Hogben 1957; Royall 1997; Spielman 1974). In particular, we illustrate how this mixing of statistical testing methodologies has resulted in widespread confusion over the interpretation of p values (evidential measures) and α levels (measures of error). We demonstrate that this confusion was a problem between the Fisherian and Neyman–Pearson camps, is not uncommon among statisticians, is prevalent in statistics textbooks, and is well nigh universal in the pages of leading (marketing) journals.
...In a sense, Fisher used some kind of casual, generic, unspecified, alternative when computing p values, somehow implicit when identifying the test statistic and “more extreme outcomes” to compute p values, or when talking about the “sensitivity” of an experiment. But he never explicitly defined nor used specific alternative hypotheses. In the merging of the two schools of thought, it is often taken that Fisher’s significance testing implies an alternative hypothesis which is simply the complement of the null, but this is difficult to formalize in general. For example, what is the complement of a N(0,1) model? Is it the mean differing from 0, the variance differing from 1, the model not being Normal? Formally, Fisher only had the null model in mind and wanted to check if the data were compatible with it. In Neyman–Pearson theory, therefore, the researcher chooses a (usually) point null hypothesis and tests it against the alternative hypothesis. Their framework introduced the probabilities of committing two kinds of errors based on considerations regarding the decision criterion, sample size, and effect size. These errors were false rejection (Type I error) and false acceptance (Type II error) of the null hypothesis. The former probability is called α, while the latter probability is designated β.
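The two Neyman–Pearson error probabilities can be made concrete with a small Monte Carlo sketch (not from the paper); the sample size, the .05 level, and the particular alternative (mu = 0.5) below are assumptions chosen purely for illustration:

```python
import math
import random

random.seed(0)

def z_test_reject(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test of H0: mu = mu0 at alpha = .05 (critical value 1.96)."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return abs(z) > 1.96

n, trials = 25, 20000

# alpha: probability of false rejection, estimated by sampling under H0 (mu = 0)
alpha_hat = sum(z_test_reject([random.gauss(0.0, 1.0) for _ in range(n)])
                for _ in range(trials)) / trials

# beta: probability of false acceptance, estimated under one alternative (mu = 0.5)
beta_hat = sum(not z_test_reject([random.gauss(0.5, 1.0) for _ in range(n)])
               for _ in range(trials)) / trials

print(f"estimated alpha ~ {alpha_hat:.3f}")  # sits near the nominal .05
print(f"estimated beta  ~ {beta_hat:.3f}")   # depends on effect size and n
```

Note that β (and hence power, 1 − β) is only defined once a specific alternative is fixed, which is exactly the ingredient Fisher's framework leaves unspecified.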
...Moreover, while Fisher claimed that his significance tests were applicable to single experiments (Johnstone 1987a; Kyburg 1974; Seidenfeld 1979), Neyman–Pearson hypothesis tests do not allow an inference to be made about the outcome of any specific hypothesis that the researcher happens to be investigating. The latter were quite specific about this: “We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis” (Neyman and Pearson 1933, pp. 290-291). But since scientists are in the business of gleaning evidence from individual studies, this limitation of Neyman–Pearson theory is severe.
Johnstone (1986) remarks that statistical testing usually follows Neyman–Pearson formally, but Fisher philosophically. For instance, Fisher’s idea of disproving the null hypothesis is taught in tandem with the Neyman–Pearson concepts of alternative hypotheses, Type II errors, and the power of a statistical test. In addition, textbook descriptions of Neyman–Pearson theory often refer to the Type I error probability as the “significance level” (Goodman 1999; Kempthorne 1976; Royall 1997). As a prime example of the bewilderment arising from the mixing of Fisher’s views on inductive inference with the Neyman–Pearson principle of inductive behavior, consider the widely unappreciated fact that the former’s p value is incompatible with the Neyman–Pearson hypothesis test in which it has become embedded (Goodman 1993). Despite this incompatibility, the upshot of this merger is that the p value is now inextricably entangled with the Type I error rate, α. As a result, most empirical work in the applied sciences is conducted along the following approximate lines: The researcher states the null (H0) and alternative (HA) hypotheses, the Type I error rate/significance level, α, and supposedly—but very rarely—calculates the statistical power of the test (e.g., t). These procedural steps are entirely consistent with Neyman–Pearson convention. Next, the test statistic is computed for the sample data, and in an attempt to have one’s cake and eat it too, an associated p value (significance probability) is determined. The p value is then mistakenly interpreted as a frequency-based Type I error rate, and simultaneously as an incorrect (i.e., p < α) measure of evidence against H0.
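The hybrid recipe just described can be written out in a few lines; the data, the effect size, and the use of a z rather than a t statistic here are assumptions made purely for illustration:

```python
import math
import random

random.seed(3)

# Neyman-Pearson steps: state H0 (mu = 0), HA (mu != 0), and alpha in advance.
mu0, sigma, alpha = 0.0, 1.0, 0.05
data = [random.gauss(0.3, sigma) for _ in range(30)]  # hypothetical sample

# Fisherian step: compute an evidential p value from the observed statistic.
z = (sum(data) / len(data) - mu0) / (sigma / math.sqrt(len(data)))
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p for a standard-normal statistic

# The hybrid (and conceptually muddled) step: treat the data-dependent p value
# as if it were the pre-specified long-run error rate, rejecting when p < alpha.
print(f"z = {z:.2f}, p = {p:.3f}, reject H0 at alpha = {alpha}: {p < alpha}")
```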
Fisher was insistent that the significance level of a test had no ongoing sampling interpretation. With respect to the .05 level, for example, he emphasized that this does not indicate that the researcher “allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained” (Fisher 1929, p. 191). For Fisher, the significance level provided a measure of evidence for the “objective” disbelief in the null hypothesis; it had no long-run frequentist characteristics.
Indeed, interpreting the significance level of a test in terms of a Neyman–Pearson Type I error rate, α, rather than via a p value, infuriated Fisher who complained:
> “In recent times one often-repeated exposition of the tests of significance, by J. Neyman, a writer not closely associated with the development of these tests, seems liable to lead mathematical readers astray, through laying down axiomatically, what is not agreed or generally true, that the level of significance must be equal to the frequency with which the hypothesis is rejected in repeated sampling of any fixed population allowed by hypothesis. This intrusive axiom, which is foreign to the reasoning on which the tests of significance were in fact based seems to be a real bar to progress....” (Fisher 1945, p. 130).
And he periodically reinforced these sentiments: “The attempts that have been made to explain the cogency of tests of significance in scientific research, by reference to supposed frequencies of possible statements, based on them, being right or wrong, thus seem to miss the essential nature of such tests” (Fisher 1959, p. 41). Here, Fisher is categorically denying the equivalence of p values and Neyman–Pearson α levels, i.e., long-run frequencies of rejecting H0 when it is true. Fisher captured a major distinction between his and Neyman–Pearson’s notions of statistical tests when he pronounced:
> “This [Neyman–Pearson] doctrine, which has been very dogmatically asserted, makes a truly marvellous mystery of the tests of significance. On the earlier view, held by all those to whom we owe the first examples of these tests, such a test was logically elementary. It presented the logical disjunction: Either the hypothesis is not true, or an exceptionally rare outcome has occurred” (Fisher 1960, p. 8).
Despite the admonitions about the p value not being an error rate, Casella and Berger (1987, p. 133) voiced their concern that “there are a great many statistically naïve users who are interpreting p values as probabilities of Type I error....” Unfortunately, such misinterpretations are not confined to the naïve users of statistical tests. On the contrary, Kalbfleisch and Sprott (1976) allege that statisticians commonly make the mistake of equating p values with Type I error rates. And their allegations find ready support in the literature. For example, Gibbons and Pratt (1975, p. 21), in an article titled “P Values: Interpretation and Methodology,” erroneously state: “Reporting a P-value, whether exact or within an interval, in effect permits each individual to choose his own level of significance as the maximum tolerable probability of a Type I error.” Barnard (1985, p. 7) is similarly at fault when he remarks, “For those who need to interpret probabilities as [long run] frequencies, a P-value ‘measures’ the possibility of an ‘error of the first kind,’ arising from rejection of H0 when it is in fact true.” Again, Hung, O’Neill, Bauer, and Köhne (1997, p. 12) note that the p value is a measure of evidence against the null hypothesis, but then go on to confuse p values with Type I error rates: “The α level is a preexperiment Type I error rate used to control the probability that the observed P value in the experiment of making an error rejection of H0 when in fact H0 is true is α or less.” Or consider Berger and Sellke’s response to Hinkley’s (1987) comments on their paper: ...In sum, although p’s and α’s have very different meanings, Bayarri and Berger (2000) nevertheless contend that among statisticians there is a near ubiquitous misinterpretation of p values as frequentist error probabilities. And inevitably, this fallacy shows up in statistics textbooks, as when Canavos and Miller (1999, p. 255) stipulate: “If the null hypothesis is true, then a type I error occurs if (due to sampling error) the P-value is less than or equal to α.”
Instead, the Neyman–Pearson framework focuses on decision rules with a priori stated error rates, α and β, which are limiting frequencies based on long-run repeated sampling. If a result falls into the critical region, H0 is rejected and HA is accepted; otherwise, H0 is accepted and HA is rejected. Interestingly, this last assertion contradicts Fisher’s (1966, p. 16) adage that “the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation.” Otherwise expressed, the familiar claim that “one can never accept the null hypothesis, only fail to reject it” is a characteristic of Fisher’s significance test, and not the Neyman–Pearson hypothesis test. In the latter’s paradigm one can indeed “accept” the null hypothesis. ...As noted, Fisher’s use of 5% and 1% levels was similarly adopted, and ultimately institutionalized, by Neyman–Pearson. And Fisher (1959, p. 42) rebuked them for doing so, explaining: “...no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.” Despite this rebuke, it is small wonder that many researchers confuse Fisher’s evidential p values with Neyman–Pearson behavioral error rates when both concepts are commonly employed at the 5% and the 1% levels.
Devore and Peck’s (1993, p. 451) statistics textbook illustrates Goodman’s point: “The smallest α for which H0 could be rejected is determined by the tail area captured by the computed value of the test statistic. This smallest α is the P-value.” Or consider in this context another erroneous passage from a statistics textbook:
> “We sometimes take one final step to assess the evidence against H0. We can compare the P-value with a fixed value that we regard as decisive. This amounts to announcing in advance how much evidence against H0 we will insist on. The decisive value of P is called the **significance level**. We write it as α, the Greek letter alpha” (Moore 2000, p. 326; original emphasis).
...This transition from (and mixing of) α levels to p values is typically seamless, as if it constitutes a natural progression through different parts of the same coherent statistical whole. It is revealed in the following passage from one such textbook: “In the next subsection we illustrate testing a hypothesis by using various values of α, and we see that this leads to defining the p value....” (Bowerman et al., 2001, p. 300).
...To underscore this point, in commenting on various issues surrounding the interpretation of p values, Berger and Sellke (1987, p. 135) unequivocally spelled out that: “These are not dead issues, in the sense of being well known and thoroughly aired long ago; although the issues are not new, we have found the vast majority of statisticians to be largely unaware of them.” (Our emphasis). Schervish’s (1996) article almost a decade later, tellingly entitled “P values: What They Are and What They Are Not,” suggests that confusion remains in this regard within the statistics community.
...two randomly selected issues of each of three leading marketing journals—the Journal of Consumer Research, Journal of Marketing, and Journal of Marketing Research—were analyzed for the eleven-year period 1990 through 2000 in order to assess the number of empirical articles and notes published therein. This procedure yielded a sample of 478 empirical papers. These papers were then examined to see whether classical statistical tests had been used in the data analysis. Some 435, or 91.0%, employed such testing. Although the evidential p value from a significance test violates the orthodox Neyman–Pearson behavioral hypothesis testing schema, Table 1 shows that p values are commonplace in marketing’s empirical literature. Conversely, α levels are in short supply.
...Neyman–Pearson theory has the advantage of its clear interpretation: Of all the tests being carried out around the world at the .05 level, at most 5% of them result in a false rejection of the null. (The frequentist argument does not require repetition of the exact same experiment. See, for instance, Berger 1985, p. 23, and references there). Its main drawback is that the performance of the procedure is always the prespecified level. Reporting the same “error,” .05 say, no matter how incompatible the data seem to be with the null hypothesis is clearly worrisome in applied situations, and hence the appeal of the data-dependent p values in research papers. On the other hand, for quality control problems, a strict Neyman–Pearson analysis is appropriate. The chief methodological advantage of the p value is that it may be taken as a quantitative measure of the “strength of evidence” against the null. However, while p values are very good as relative measures of evidence, they are extremely difficult to interpret as absolute measures. What exactly “evidence” of around, say, .05 (as measured by a p value) means is not clear.
...This can be done trivially on the web, even at the undergraduate level, with an applet available at http://www.stat.duke.edu/~berger. The applet simulates repeated normal testing, retains the tests providing p values in a given range, and counts the proportion of those for which the null is true. The exercise is revealing. For example, if in a long series of tests on, say, no effect of new drugs (against AIDS, baldness, obesity, common cold, cavities, etc.) we assume that about half the drugs are effective (quite a generous assumption), then of all the tests resulting in a p value around .05 it is fairly typical to find that about 50% of them come, in fact, from the null (no effect) and 50% from the alternative. These percentages depend, of course, on the way the alternatives behave, but an absolute lower bound, for any way the alternatives could arise in the situation above, is about 22%. The upshot for applied work is clear. Most notably, about half (or at the very least over 1/5) of the times we see a p value around .05, it is actually coming from the null. That is, a p value of .05 provides, at most, very mild evidence against the null. When practitioners (and students) are not aware of this, they very likely interpret a .05 p value as much greater evidence against the null (like 1 in 20).
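A bare-bones version of that applet exercise is easy to reproduce; the 50/50 mixture of null and effective drugs and the N(0, 2²) spread of true effects below are assumptions, and the resulting fraction moves with them:

```python
import math
import random

random.seed(1)

null_hits = alt_hits = 0
for _ in range(400_000):
    is_null = random.random() < 0.5                     # half the drugs have no effect
    theta = 0.0 if is_null else random.gauss(0.0, 2.0)  # assumed spread of real effects
    z = random.gauss(theta, 1.0)                        # observed test statistic
    p = math.erfc(abs(z) / math.sqrt(2))                # two-sided p value
    if 0.04 <= p <= 0.06:                               # keep only p values "around .05"
        if is_null:
            null_hits += 1
        else:
            alt_hits += 1

frac_null = null_hits / (null_hits + alt_hits)
print(f"fraction of p ~ .05 results coming from a true null: {frac_null:.2f}")
```

With these particular assumptions the fraction lands near one third: comfortably above the roughly 22% lower bound quoted above, and far above the 1-in-20 reading most users give a .05 result.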
This is clearly presented in Berger (2002). The intuitive notion behind it is that one should report conditional error probabilities. That is, reports that retain the unambiguous frequency interpretation, but that are allowed to vary with the observed data. The specific proposal is to condition on data that have the same “strength of evidence” as measured by p values. We see this as the ultimate reconciliation between the two opposing camps. Moreover, it has an added bonus: the conditional error probabilities can be interpreted as posterior probabilities of the hypotheses, thus guaranteeing easy computation as well as marked simplifications in sequential scenarios. A very easy, approximate, calibration of p values is given in Sellke, Bayarri, and Berger (2001). It consists of computing, for an observed p value, the quantity (1 + [−e·p·log(p)]^(−1))^(−1) and interpreting this as a lower bound on the conditional Type I error probability. For example, a p value of .05 results in a conditional α of at least .289. This is an extremely simple formula, and it provides the correct order of magnitude for interpreting a p value. (The calibration −e·p·log(p) can be interpreted as a lower bound on the Bayes factor.)"
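The calibration is one line of code; evaluating it for a few p values reproduces the .289 figure cited above (the formula is intended for p < 1/e):

```python
import math

def conditional_alpha(p):
    """Sellke-Bayarri-Berger calibration (1 + [-e p log p]^-1)^-1:
    a lower bound on the conditional Type I error probability."""
    bayes_bound = -math.e * p * math.log(p)   # lower bound on the Bayes factor
    return 1.0 / (1.0 + 1.0 / bayes_bound)

for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: conditional alpha >= {conditional_alpha(p):.3f}")
```

So an observed p of .05 corresponds to a conditional Type I error of at least .289, an order of magnitude larger than the naive "1 in 20" reading.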
See also http://lesswrong.com/lw/g13/against_nhst/
#statistics