I see; the problem is that psychological theories rarely make predictions more specific than greater-than/less-than (only those two categories, since the 'equals' hypothesis is never true in the real world). Hence there's a prior probability of 50% that picking a direction at random will be correct, so even infinitely precise confirmation of one direction or the other provides little evidence for a particular theory over a 'random theory'. In physics, by contrast, predictions are point values with effectively an infinite number of falsifying alternatives (all the other possible reals), so the prior probability of a chance hit approaches zero and confirmation is genuinely informative. It definitely helps to think in Bayesian terms here to understand the argument.
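The asymmetry can be made concrete in a few lines. This is a minimal sketch, not anything from Meehl's paper: it assumes a hypothetical "random theory" that guesses a direction by coin flip, versus a hypothetical point prediction with a small tolerance band on a parameter known only to lie in a range of width 100.

```python
import random

random.seed(0)

# Directional prediction: under the assumption (argued above) that two
# population means are never exactly equal, a "theory" that guesses the
# direction by coin flip is correct with probability 1/2.
trials = 100_000
true_diffs = [random.gauss(0, 1) for _ in range(trials)]  # hypothetical true differences
directional_hits = sum(d > 0 for d in true_diffs) / trials

# Point prediction: "the parameter lies within +/- tol of the predicted
# value," for a parameter lying somewhere in a range of width 100.
# A random point guess succeeds with probability ~ 2*tol/100.
tol = 0.01
point_prior = 2 * tol / 100

print(f"P(random directional prediction correct) ~= {directional_hits:.3f}")
print(f"P(random point prediction correct)       ~= {point_prior:.4f}")
```

The directional guesser is right about half the time before any data are seen; the point predictor almost never is, which is why surviving a point prediction carries so much more evidential weight.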
"Theory-testing in psychology and physics: a methodological paradox", Meehl 1967 http://www.fisme.science.uu.nl/staff/christianb/downloads/meehl1967.pdf
"Because physical theories typically predict numerical values, an improvement in experimental precision reduces the tolerance range and hence increases corroborability. In most psychological research, improved power of a statistical design leads to a prior probability approaching 1/2 of finding a significant difference in the theoretically predicted direction. Hence the corroboration yielded by "success" is very weak, and becomes weaker with increased precision. "Statistical significance" plays a logical role in psychology precisely the reverse of its role in physics. This problem is worsened by certain unhealthy tendencies prevalent among psychologists, such as a premium placed on experimental "cuteness" and a free reliance upon ad hoc explanations to avoid refutation.
The puzzle, sufficiently striking (when clearly discerned) to be entitled to the designation "paradox," is the following: In the physical sciences, the usual result of an improvement in experimental design, instrumentation, or numerical mass of data, is to increase the difficulty of the "observational hurdle" which the physical theory of interest must successfully surmount; whereas, in psychology and some of the allied behavior sciences, the usual effect of such improvement in experimental precision is to provide an easier hurdle for the theory to surmount. Hence what we would normally think of as improvements in our experimental method tend (when predictions materialize) to yield stronger corroboration of the theory in physics, since to remain unrefuted the theory must have survived a more difficult test; by contrast, such experimental improvement in psychology typically results in a weaker corroboration of the theory, since it has now been required to survive a more lenient test [3] [9] [10].
On the basis of a substantive psychological theory T in which he is interested, a psychologist derives (often in a rather loose sense of 'derive') the consequence that an observable variable x will differ as between two groups of subjects. Sometimes, as in most problems of clinical or social psychology, the two groups are defined by a property the individuals under study already possess, e.g., social class, sex, diagnosis, or measured I.Q. Sometimes, as is more likely to be the case in such fields as psychopharmacology or psychology of learning, the contrasted groups are defined by the fact that the experimenter has subjected them to different experimental influences, such as a drug, a reward, or a specific kind of social pressure. Whether the contrasted groups are specified by an "experiment of nature" where the investigator takes the organisms as he finds them, or by a true "experiment" in the more usual sense of the word, is not crucial for the present argument; although, as will be seen, the implications of my puzzle for theory-testing are probably more perilous in the former kind of research than in the latter.
According to the substantive theory T, the two groups are expected to differ on variable x, but it is recognized that errors of (a) measurement and (b) random sampling will, in general, produce some observed difference between the averages of the groups studied, even if their total population did not differ in the true value of x [ = mean of x].
Example: We are interested in the question whether girls are brighter than boys (i.e., that μ_g - μ_b = δ_gb > 0). We do not have perfectly reliable measures of intelligence, and we are furthermore not in a position to measure the intelligence of all boys and girls in the hypothetical population about which we desire to make a general statement. Instead we must be content with fallible I.Q. scores, and with a sample of school children drawn from the hypothetical population. Each of these sources of error, measurement error and random sampling error, contributes to an untrustworthiness in the computed value we obtain for the average intelligence x̄_b of the boys and also for x̄_g, that of the girls. If we observe a difference of, say, d = 5 I.Q. points in a sample of 100 boys and 100 girls, we must have some method to infer whether this obtained observational difference between the two groups reflects a real difference or one which is merely apparent, i.e., due to the combined effect of errors of measurement and sampling. We do this by means of a "statistical significance test," the mathematics of which is not relevant here, except to say that by combining the principles of probability with a characterization of the procedure by which the samples were constituted, and quantifying the variation in observed intelligence score within each of the two groups being contrasted, it is possible to employ a formula which utilizes the observed averages together with the observed variations and sample sizes so as to answer certain relevant kinds of questions. Among such questions is the following: "If there were, in fact, no real difference in average I.Q. between the population of boys and girls, with what relative frequency would an investigator find a difference (in relation to the observed intragroup variation) of the magnitude our observations have actually found?"
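The arithmetic of such a test can be sketched with a normal approximation to the two-sample t-test. The numbers (a 5-point difference, an SD of 15, 100 children per group) are hypothetical illustrations in the spirit of the example, not figures from the paper:

```python
import math

def two_sample_z(mean1, mean2, sd1, sd2, n1, n2):
    """Two-sample test of a mean difference, using the normal approximation
    to the t distribution (adequate for n ~ 100 per group)."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)  # standard error of the difference
    z = (mean1 - mean2) / se
    # Two-sided p: relative frequency of a difference at least this large,
    # in either direction, if the population difference were zero.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Hypothetical numbers: girls average 105, boys 100, SD = 15, n = 100 each.
z, p = two_sample_z(105, 100, 15, 15, 100, 100)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

With these assumed figures the 5-point difference would arise by chance only about 2% of the time under the null, so it would conventionally be declared "significant".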
The statistical hypothesis that there is no population difference between boys and girls in I.Q., which is called the "null hypothesis" (H_0: μ_g = μ_b), is used to generate a random sampling distribution of the statistic ("t-test") employed in testing the presence of a significant difference. If the observed data would be very improbable on the hypothesis that H_0 obtained, we abandon H_0 in favor of its alternative. We conclude that since H_0 is false, its alternative, i.e., that there exists a real average difference between the sexes, obtains. In the past, it was customary to deal with what may be called the "point-null" hypothesis, which says that there is zero difference between the two averages in the populations. In recent years it has been more explicitly recognized that what is of theoretical interest is not the mere presence of difference (i.e., that H_0 is false, i.e., that μ_g ≠ μ_b) but rather the presence of a difference in a certain direction (in this case, that μ_g > μ_b). It is therefore increasingly frequent that the behavior scientist employs the so-called "directional null hypothesis," say H_02, instead of the point-null hypothesis H_0. If our substantive theory T involves the prediction that the average I.Q. of girls in the entire population exceeds that of boys, we test the alternative to this statistical hypothesis about the population, i.e., that either the average I.Q. of boys exceeds that of girls (H_2) or that there is no difference (H_0). That is, we adopt for statistical test (with the anticipation of refuting it) a disjunction of the old-fashioned point-null hypothesis H_0 with the hypothesis H_2 that H_0 is false and it is false in a direction opposite to that implied by our substantive theory. However, this directional null hypothesis (H_02: μ_g ≤ μ_b), unlike the old-fashioned point-null hypothesis (H_0: μ_g = μ_b), does not generate a theoretically expected distribution, because it is not precise, i.e., it does not specify a point-value for the unknown parameter (μ_girls - μ_boys). However, we can employ it as we do the point-null hypothesis, by reasoning that if the point-null hypothesis H_0 obtained in the state of nature, then an observed difference (in the direction that our substantive theory predicts) of such-and-such magnitude has a calculable probability; and that calculable probability is an upper bound upon the desired (but unknown) probability based on H_02: μ_g ≤ μ_b. That is to say, if the probability of the observed girl-over-boy difference (x̄_g - x̄_b) arising through random error is p, given the point-null hypothesis H_0: μ_g = μ_b, then the probability of the observed difference arising randomly given any of the point-hypotheses constituting H_2: μ_g < μ_b will, of course, be less than p.
Hence p is an upper bound on this probability for the inexact directional null hypothesis (H_02: μ_g ≤ μ_b). Proceeding in this way directs our interest to only one tail of the theoretical random sampling distribution instead of both tails, which has given rise to a certain amount of controversy among statisticians, but that controversy is not relevant here. (For an excellent clarifying discussion, see Kaiser [6].) Suffice it to say that having formulated a directional null hypothesis H_02 which is the alternative to the statistical hypothesis of interest H_1, and which includes the point-null hypothesis H_0 as one (very unlikely) possibility for the state of nature, we then carry out the experiment with the anticipation of refuting this directional null hypothesis, thereby confirming the alternative statistical hypothesis of interest (H_1), and, since H_1 in turn was implied by the substantive theory T, of corroborating T. In such a situation we know in advance that we are in danger of making either of two sorts of "errors," not in the sense of committing scientific mistakes but in the sense of (rationally) inferring what is objectively a false conclusion. If the null hypothesis (point or directional) is in fact true, but due to the combination of measurement and sampling errors we obtain a value which is so improbable upon H_0 or H_02 that we decide in favor of their alternative H_1, we will then have committed what is known as an error of the first kind or Type I Error. An error of the first kind is a statistical inference that the null hypothesis is false, when in the state of nature it is actually true. This means we will have concluded in favor of a statistical statement H_1 which flowed as a consequence of our substantive theory T, and therefore we will believe ourselves to have obtained empirical support for T, whereas in reality this statistical conclusion is false and, consequently, such support for the substantive theory is objectively lacking.
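The upper-bound reasoning (that the p computed under the point-null bounds the probability under every point hypothesis inside the directional null) can be checked with a small Monte Carlo sketch. The observed difference, SD, and sample size below are hypothetical:

```python
import math
import random

random.seed(1)

def p_upper_tail(observed_diff, true_diff, sd, n, sims=20_000):
    """Monte Carlo frequency of a sample mean-difference >= observed_diff
    when the true population difference is true_diff (per-group SD sd,
    n subjects per group; normal model)."""
    se = sd * math.sqrt(2 / n)  # standard error of the difference of means
    hits = sum(random.gauss(true_diff, se) >= observed_diff for _ in range(sims))
    return hits / sims

obs, sd, n = 5.0, 15.0, 100                    # hypothetical: 5 points, SD 15
p_point_null = p_upper_tail(obs, 0.0, sd, n)   # point-null: true difference = 0
p_reversed   = p_upper_tail(obs, -3.0, sd, n)  # one point hypothesis with boys ahead

print(f"one-tailed p under true diff  0: {p_point_null:.4f}")
print(f"one-tailed p under true diff -3: {p_reversed:.4f}")
# Any point hypothesis with true difference <= 0 yields a probability no
# larger than the point-null figure, which is why p serves as an upper bound.
```

The further the true difference sits inside the "wrong" direction, the smaller the chance of the observed result, so the boundary case (exact equality) gives the largest, and hence bounding, value.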
Measurement and sampling error may, of course, also result in a sampling deviation in the opposite direction; or, the true difference δ may be so small that even if our sample values were to coincide exactly with the true ones, the sheer algebra of the significance test would not enable us to reach the pre-specified level of statistical significance. If we conclude until further notice that the directional null hypothesis H_02 is tenable, on the grounds that we have failed to refute it by our investigation, then we have failed to support its statistical alternative H_1, and therefore failed to confirm one of the predictions of the substantive theory T. Retention of the null hypothesis H_02 when it is in fact false is known as an error of the second kind or Type II Error.
In the biological and social sciences there has been widespread adoption of the probabilities .01 or .05 as the allowable theoretical frequency of Type I errors. These values are called the 1% and 5% "levels of significance." It is obvious that there is an inverse relationship between the probabilities of the two kinds of errors, so that if we adopt a significance level which increases the frequency of Type I errors, such a policy will lead to a greater number of claims of statistically significant departure from the null hypothesis; and, therefore, in whatever unknown fraction of all experiments performed the null hypothesis is in reality false, we will more often (correctly) conclude its falsity, i.e., we will thereby be reducing the proportion of Type II errors.
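The inverse trade-off between the two error rates can be illustrated with a normal-approximation power calculation; the effect size and sample size below are hypothetical:

```python
import math

def power_one_sided(true_diff, sd, n, z_crit):
    """P(a one-sided test rejects the null) when the true mean difference is
    true_diff; normal approximation, per-group SD sd, n per group."""
    se = sd * math.sqrt(2 / n)
    # Reject when (observed difference)/se >= z_crit; the observed
    # difference is ~ N(true_diff, se), so:
    return 0.5 * math.erfc((z_crit - true_diff / se) / math.sqrt(2))

# Hypothetical true effect: 3 I.Q. points, SD 15, 100 per group.
power_05 = power_one_sided(3.0, 15.0, 100, 1.645)  # alpha = .05 (one-sided)
power_01 = power_one_sided(3.0, 15.0, 100, 2.326)  # alpha = .01 (one-sided)

print(f"power at alpha=.05: {power_05:.2f}")
print(f"power at alpha=.01: {power_01:.2f}")
# Tolerating more Type I errors (alpha .05 vs .01) buys fewer Type II errors.
```

Loosening α raises the chance of crying wolf when the null is true, but, for any real effect, it also raises the chance of detecting it, which is exactly the inverse relationship described above.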
It is important to keep clear the distinction between the substantive theory of interest and the statistical hypothesis which is derived from it [2]. In the I.Q. example there was almost no substantive theory or a very impoverished one; i.e., the question being investigated was itself stated as a purely statistical question about the average I.Q. of the two sexes. In the great majority of investigations in psychology the situation is otherwise. Normally, the investigator holds some substantive theory about unconscious mental processes,or physiological or genetic entities, or perceptual structure, or about learning influences in the person's past, or about current social pressures, which contains a great deal more content than the mere statement that the population parameter of an observational variable is greater for one group of individuals than for another. While no competent psychologist is unaware of this obvious distinction between a substantive psychological theory T and a statistical hypothesis H implied by it, in practice there is a tendency to conflate the substantive theory with the statistical hypothesis, thereby illicitly conferring upon T somewhat the same degree of support given H by a successful refutation of the null hypothesis. Hence the investigator,upon finding an observed difference which has
an extremely small probability of occurring on the null hypothesis, gleefully records the tiny probability number "p < .001," and there is a tendency to feel that the extreme smallness of this probability of a Type I error is somehow transferable to a small probability of "making a theoretical mistake." It is as if, when the observed statistical result would be expected to arise only once in a thousand times through a Type I statistical error given H_0, therefore one's substantive theory T, which entails the alternative H_1, has received some sort of direct quantitative support of magnitude around .999 [= 1 - .001].
One reason why the directional null hypothesis (H_02: μ_g ≤ μ_b) is the appropriate candidate for experimental refutation is the universal agreement that the old point-null hypothesis (H_0: μ_g = μ_b) is [quasi-] always false in biological and social science. Any dependent variable of interest, such as I.Q., or academic achievement, or perceptual speed, or emotional reactivity as measured by skin resistance, or whatever, depends mainly upon a finite number of "strong" variables characteristic of the organisms studied (embodying the accumulated results of their genetic makeup and their learning histories) plus the influences manipulated by the experimenter. Upon some complicated, unknown mathematical function of this finite list of "important" determiners is then superimposed an indefinitely large number of essentially "random" factors which contribute to the intragroup variation and therefore boost the error term of the statistical-significance test. In order for two groups which differ in some identified properties (such as social class, intelligence, diagnosis, racial or religious background) to differ not at all in the "output" variable of interest, it would be necessary that all determiners of the output variable have precisely the same average values in both groups, or else that their values should differ by a pattern of amounts of difference which precisely counterbalance one another to yield a net difference of zero. Now our general background knowledge in the social sciences, or, for that matter, even "common-sense" considerations, makes such an exact equality of all determining variables, or a precise "accidental" counterbalancing of them, so extremely unlikely that no psychologist or statistician would assign more than a negligibly small probability to such a state of affairs.
Everyone familiar with psychological research knows that numerous "puzzling, unexpected" correlations pop up all the time, and that it requires only a moderate amount of motivation-plus-ingenuity to construct very plausible alternative theoretical explanations for them. These armchair considerations are borne out by the finding that in psychological and sociological investigations involving very large numbers of subjects, it is regularly found that almost all correlations or differences between means are statistically significant. See, for example, the papers by Bakan [1] and Nunnally [8]. Data currently being analyzed by Dr. David Lykken and myself, derived from a huge sample of over 55,000 Minnesota high school seniors, reveal statistically significant relationships in 91% of pairwise associations among a congeries of 45 miscellaneous variables such as sex, birth order, religious preference, number of siblings, vocational choice, club membership, college choice, mother's education, dancing, interest in woodworking, liking for school, and the like. The 9% of non-significant associations are heavily concentrated among a small minority of variables having dubious reliability, or involving arbitrary groupings of non-homogeneous or non-monotonic sub-categories. The majority of variables exhibited significant relationships with all but three of the others, often at a very high confidence level (p < 10^-6).
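The reason nearly everything reaches significance at n = 55,000 follows from the algebra of the correlation test. A back-of-the-envelope sketch, where the r = .02 "crud" value is an assumption for illustration:

```python
import math

n = 55_000
# Smallest sample correlation "significant" at the two-sided .05 level,
# using the large-sample approximation z = r * sqrt(n):
r_crit = 1.96 / math.sqrt(n)

# A trivially small but real correlation -- the hypothetical r = .02 stands
# in for the ubiquitous background associations Meehl describes:
r = 0.02
z = r * math.sqrt(n)
p = math.erfc(z / math.sqrt(2))  # two-sided p, normal approximation

print(f"n = {n}: any |r| > {r_crit:.4f} is 'significant' at .05")
print(f"r = {r}: z = {z:.1f}, p ~= {p:.1e}")
```

At this sample size any correlation above about .008 clears the .05 bar, so even an association explaining 0.04% of the variance shows up as a highly "significant" finding.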
So that, for example, while there is no very "interesting" psychological theory that links hunger drive with color-naming ability, I myself would confidently predict a significant difference in color-naming ability between persons tested after a full meal and persons who had not eaten for 10 hours, provided the sample size were sufficiently large and the color-naming measurements sufficiently reliable, since one of the effects of the increased hunger drive is heightened "arousal," and anything which heightens arousal would be expected to affect a perceptual-cognitive performance like color-naming.
Let us now conceive of a large "theoretical urn" containing counters designating the indefinitely large class of actual and possible substantive theories concerning a certain domain of psychology (e.g., mammalian instrumental learning). Let us conceive of a second urn, the "experimental-design" urn, containing counters designating the indefinitely large set of possible experimental situations which the ingenuity of man could devise. (If anyone should object to my conceptualizing, for purposes of methodological analysis, such a heterogeneous class of theories or experiments, I need only remind him that such a class is universally presupposed in the logic of statistical significance testing.) Since the point-null hypothesis H_0 is [quasi-] always false, almost every one of these experimental situations involves a non-zero difference on its output variable (parameter). Whichever group we (arbitrarily) designate as the "experimental" group and the "control" group, in half of these experimental settings the true value of the dependent variable difference (experimental minus control) will be positive, and in the other half negative.
...
We now perform a random pairing of the counters from the "theory" urn with the counters from the "experimental" urn, and arbitrarily stipulate (quite irrationally) that a "successful" outcome of the experiment means that the difference favors the experimental group [μ_E - μ_C > 0]. This preposterous model, which is of course much worse than anything that can exist even in the most primitive of the social sciences, provides us with a lower bound for the expected frequency of a theory's successfully predicting the direction in which the null hypothesis fails, in the state of nature (i.e., we are here not considering sampling problems, and therefore we neglect errors of either the first or the second kind). It is obvious that if the point-null hypothesis H_0 is [quasi-] always false, and there is no logical connection between our theories and the direction of the experimental outcomes, then if we arbitrarily assign one of the two directional hypotheses H_1 or H_2 to each theory, that hypothesis will be correct half of the time, i.e., in half of the arbitrary urn-counter-pairings. Since even my late, uneducated grandmother's common-sense psychological theories had non-zero verisimilitude, we can safely say that the value p = 1/2 is a lower bound on the success-frequency of experimental "tests," assuming our experimental design had perfect power.
...I conclude that the effect of increased precision, whether achieved by improved instrumentation and control, greater sensitivity in the logical structure of the experiment, or increasing the number of observations, is to yield a probability approaching 1/2 of corroborating our substantive theory by a significance test, even if the theory is totally without merit.
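That conclusion (a quasi-always-false null plus randomly assigned directional predictions gives a "confirmation" rate climbing toward 1/2 as power grows) can be simulated directly. All the distributional choices below are hypothetical:

```python
import math
import random

random.seed(3)

def confirmed(true_diff, sd, n, predicted_sign):
    """One experiment: does a one-sided .05-level test come out significant
    in the theory's (randomly assigned) predicted direction?"""
    se = sd * math.sqrt(2 / n)
    observed = random.gauss(true_diff, se)
    return predicted_sign * observed / se >= 1.645

def confirmation_rate(n, experiments=5_000):
    hits = 0
    for _ in range(experiments):
        true_diff = random.gauss(0, 5)  # the null is quasi-always false
        sign = random.choice([-1, 1])   # theory's direction: a coin flip
        hits += confirmed(true_diff, 15.0, n, sign)
    return hits / experiments

rates = {n: confirmation_rate(n) for n in (25, 100, 10_000)}
for n, r in rates.items():
    print(f"n = {n:>6}: confirmation rate ~= {r:.2f}")
# As n grows, power -> 1 and the rate climbs toward 1/2, never above it.
```

At small n the worthless theories fail mostly through lack of power; improving the design removes that hurdle, and the success rate of pure direction-guessing rises toward the coin-flip ceiling.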
...It seems unlikely that most social science investigators would think in their usual way about a theory in meteorology which "successfully predicted" that it would rain on the 17th of April, given the antecedent information that it rains (on the average) during half the days in the month of April!
...Many experimental articles in the behavioral sciences, and, even more strangely, review articles which purport to survey the current status of a particular theory in the light of all available evidence, treat the confirming instances and the disconfirming instances with equal methodological respect, as if one could, so to speak, "count noses," so that if a theory has somewhat more confirming than disconfirming instances, it is in pretty good shape evidentially. Since we know that this is already grossly incorrect on purely formal grounds, it is a mistake a fortiori when the so-called "confirming instances" have themselves a prior probability, as argued above, somewhere in the neighborhood of 1/2, quite apart from any theoretical considerations.
This methodological paradox would exist for the psychologist even if he played his own statistical game fairly. The reason for its existence is obvious, namely, that most psychological theories, especially in the so-called "soft" fields such as social and personality psychology, are not quantitatively developed to the extent of being able to generate point-predictions. In this respect, then, although this state of affairs is surely unsatisfactory from the methodological point of view, and stands in great need of clarification (and, hopefully, of constructive suggestions for improving it) from logicians and philosophers of science, one might say that it is "nobody's fault," it being difficult to see just how the behavior scientist could extricate himself from this dilemma without making unrealistic attempts at the premature construction of theories which are sufficiently quantified to generate point-predictions for refutation. However, there are five social forces and intellectual traditions at work in the behavior sciences which make the research consequences of this situation even worse than they may have to be, considering the state of our knowledge. In addition to (a) failure to recognize the marked evidential asymmetry between confirmation and modus tollens refutation of theories, and (b) inadequate appreciation of
the extreme weakness of the hurdle provided by the mere directional significance test, there exists among psychologists (c) a fairly widespread tendency to report experimental findings with a liberal use of ad hoc explanations for those that didn't "pan out." This last methodological sin is especially tempting in the "soft" fields of (personality and social) psychology, where the profession highly rewards a kind of "cuteness" or "cleverness" in experimental design, such as a hitherto untried method for inducing a desired emotional state, or a particularly "subtle" gimmick for detecting its influence upon behavioral output. The methodological price paid for this highly-valued "cuteness" is, of course, (d) an unusual ease of escape from modus tollens refutation. For, the logical structure of the "cute" component typically involves use of complex and rather dubious auxiliary assumptions, which are required to mediate the original prediction and are therefore readily available as (genuinely) plausible "outs" when the prediction fails. It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (and challenged ad hoc in order to avoid the latter's modus tollens impact on the theory) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of "an integrated research program," without ever once refuting or corroborating so much as a single strand of the network.
Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence.
Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the "exactitude" of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship.
Lest the philosophical reader wonder (quite appropriately) whether these impressions of the psychological literature ought perhaps to be dismissed as mere "sour grapes" from an embittered, low-publication psychologist manqué, it may be stated that the author (a past president of the American Psychological Association) has published over 70 technical books or articles in both "hard" and "soft" fields of psychology, is a recipient of the Association's Distinguished Scientific Contributor Award, also of the Distinguished Contributor Award of the Division of Clinical Psychology, has been elected to Fellowship in the American Academy of Arts and Sciences, and is actively engaged in both theoretical and empirical research at the present time. He's not mad at anybody, but he is a bit distressed at the state of psychology."
#psychology #statistics #philosophyofscience #bayesian
"Theory-testing in psychology and physics: a methodological paradox", Meehl 1967 http://www.fisme.science.uu.nl/staff/christianb/downloads/meehl1967.pdf
"Because physical theories typically predict numerical values,an improvement in experimental precision reduces the tolerance range and hence increases corroborability. In most psychological research, improved power of a statistical design leads to a prior probability approaching 1/2 of finding a significant difference in the theoretically predicted direction. Hence the corroboration yielded by "success" is very weak,and becomes weaker with increased precision."Statistical significance" plays a logical role in psychology precisely the reverse of its role in physics.This problem is worsened by certain unhealthy tendencies prevalent among psychologists, such as a premium placed on experimental "cuteness" and a free reliance upon ad hoc explanations to avoid refutation.
The puzzle, sufficiently striking (when clearly discerned) to be entitled to the designation "paradox," is the following: In the physical sciences, the usual result of an improvement in experimental design, instrumentation,or numerical mass of data, is to increase the difficulty of the "observational hurdle" which the physical theory of interest must successfully surmount;whereas, in psychology and some of the allied behavior sciences, the usual effect of such improvement in experimental precision is to provide an easier hurdle for the theory to surmount. Hence what we would normally think of as improvements in our experimental method tend (when predictions materialize) to yield stronger corroboration of the theory in physics, since to remain unrefuted the theory must have survived a more difficult test; by contrast, such experimental improvement in psychology typically results in a weaker corroboration of the theory, since it has now been required to survive a more lenient test [3] [9] [10].
On the basis of a substantive psychological theory T in which he is interested, a psychologist derives (often in a rather loose sense of 'derive') the consequence that an observable variable x will differ as between two groups of subjects. Sometimes, as in most problems of clinical or social psychology, the two groups are defined by a property the individuals under study already possess, e.g., social class, sex, diagnosis, or measured I.Q. Sometimes, as is more likely to be the case in such fields as psychopharmacology or psychology of learning, the contrasted groups are defined by the fact that the experimenter has subjected them to different experimental influences, such as a drug, a reward, or a specific kind of social pressure. Whether the contrasted groups are specified by an "experiment of nature" where the investigator takes the organisms as he finds them, or by a true "experiment" in the more usual sense of the word, is not crucial for the present argument; although, as will be seen, the implications of my puzzle for theory-testing are probably more perilous in the former kind of research than in the latter.
According to the substantive theory T, the two groups are expected to differ on variable x, but it is recognized that errors of (a) measurement and (b) random sampling will, in general, produce some observed difference between the averages of the groups studied, even if their total population did not differ in the true value of x [ = mean of x].
Example: We are interested in the question whether girls are brighter than boys (i.e., that μg − μb = δ > 0). We do not have perfectly reliable measures of intelligence, and we are furthermore not in a position to measure the intelligence of all boys and girls in the hypothetical population about which we desire to make a general statement. Instead we must be content with fallible I.Q. scores, and with a sample of school children drawn from the hypothetical population. Each of these sources of error, measurement error and random sampling error, contributes to an untrustworthiness in the computed value we obtain for the average intelligence x̄b of the boys and also for x̄g, that of the girls. If we observe a difference of, say, d = 5 I.Q. points in a sample of 100 boys and 100 girls, we must have some method to infer whether this obtained observational difference between the two groups reflects a real difference or one which is merely apparent, i.e., due to the combined effect of errors of measurement and sampling. We do this by means of a "statistical significance test," the mathematics of which is not relevant here, except to say that by combining the principles of probability with a characterization of the procedure by which the samples were constituted, and quantifying the variation in observed intelligence score within each of the two groups being contrasted, it is possible to employ a formula which utilizes the observed averages together with the observed variations and sample sizes so as to answer certain relevant kinds of questions. Among such questions is the following: "If there were, in fact, no real difference in average I.Q. between the population of boys and girls, with what relative frequency would an investigator find a difference (in relation to the observed intragroup variation) of the magnitude our observations have actually found?"
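The arithmetic of the significance test just described can be sketched in a few lines. The numbers are the hypothetical ones from the example (100 per group, a 5-point observed difference); the within-group SD of 15 is an assumed figure typical for I.Q. scales, and a normal approximation stands in for the exact t distribution:

```python
import math

def two_sample_t(mean_diff, sd1, sd2, n1, n2):
    """t statistic for the difference of two independent sample means."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)  # standard error of the difference
    return mean_diff / se

def normal_sf(z):
    """Upper-tail probability of the standard normal (large-n approximation)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical values from the girls-vs-boys example: d = 5 IQ points,
# assumed SD = 15 in each group, n = 100 per group.
t = two_sample_t(5.0, 15.0, 15.0, 100, 100)
p_two_sided = 2 * normal_sf(abs(t))
print(f"t = {t:.2f}, two-sided p ~= {p_two_sided:.3f}")
```

With these illustrative numbers the observed 5-point gap clears the conventional .05 hurdle comfortably, which is exactly the kind of "success" whose evidential weight the paper goes on to question.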
The statistical hypothesis, that there is no population difference between boys and girls in I.Q., which is called the "null hypothesis" (H0: μg = μb), is used to generate a random sampling distribution of the statistic ("t-test") employed in testing the presence of a significant difference. If the observed data would be very improbable on the hypothesis that H0 obtained, we abandon H0 in favor of its alternative. We conclude that since H0 is false, its alternative, i.e., that there exists a real average difference between the sexes, obtains. In the past, it was customary to deal with what may be called the "point-null" hypothesis, which says that there is zero difference between the two averages in the populations. In recent years it has been more explicitly recognized that what is of theoretical interest is not the mere presence of a difference (i.e., that H0 is false, i.e., that μg ≠ μb) but rather the presence of a difference in a certain direction (in this case, that μg > μb). It is therefore increasingly frequent that the behavior scientist employs the so-called "directional null hypothesis," say H02, instead of the point-null hypothesis H0. If our substantive theory T involves the prediction that the average I.Q. of girls in the entire population exceeds that of boys, we test the alternative to this statistical hypothesis about the population, i.e., that either the average I.Q. of boys exceeds that of girls (H2) or that there is no difference (H0). That is, we adopt for statistical test (with the anticipation of refuting it) a disjunction of the old-fashioned point-null hypothesis H0 with the hypothesis H2 that H0 is false and is false in a direction opposite to that implied by our substantive theory. However, this directional null hypothesis (H02: μg ≤ μb), unlike the old-fashioned point-null hypothesis, does not generate a theoretically expected distribution, because it is not precise, i.e., it does not specify a point-value for the unknown parameter (μgirls − μboys). However, we can employ it as we do the point-null hypothesis, by reasoning that if the point-null hypothesis H0 obtained in the state of nature, then an observed difference (in the direction that our substantive theory predicts) of such-and-such magnitude has a calculable probability; and that calculable probability is an upper bound upon the desired (but unknown) probability based on H02: μg ≤ μb. That is to say, if the probability of the observed girl-over-boy difference (x̄g − x̄b) arising through random error is p, given the point-null hypothesis H0: μg = μb, then the probability of the observed difference arising randomly given any of the point-hypotheses constituting H2: μg < μb will, of course, be less than p.
Hence p is an upper bound on this probability for the inexact directional null hypothesis (H02: μg ≤ μb). Proceeding in this way directs our interest to only one tail of the theoretical random sampling distribution instead of both tails, which has given rise to a certain amount of controversy among statisticians, but that controversy is not relevant here. (For an excellent clarifying discussion, see Kaiser [6].) Suffice it to say that having formulated a directional null hypothesis H02 which is the alternative to the statistical hypothesis of interest H1, and which includes the point-null hypothesis H0 as one (very unlikely) possibility for the state of nature, we then carry out the experiment with the anticipation of refuting this directional null hypothesis, thereby confirming the alternative statistical hypothesis of interest (H1), and, since H1 in turn was implied by the substantive theory T, of corroborating T. In such a situation we know in advance that we are in danger of making either of two sorts of "errors," not in the sense of committing scientific mistakes but in the sense of (rationally) inferring what is objectively a false conclusion. If the null hypothesis (point or directional) is in fact true, but due to the combination of measurement and sampling errors we obtain a value which is so improbable upon H0 or H02 that we decide in favor of their alternative H1, we will then have committed what is known as an error of the first kind or Type I Error. An error of the first kind is a statistical inference that the null hypothesis is false, when in the state of nature it is actually true. This means we will have concluded in favor of a statistical statement H1 which flowed as a consequence of our substantive theory T, and therefore we will believe ourselves to have obtained empirical support for T, whereas in reality this statistical conclusion is false and, consequently, such support for the substantive theory is objectively lacking.
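The upper-bound reasoning above can be checked by brute force. This Monte-Carlo sketch, continuing the I.Q. example with illustrative parameter values, estimates the one-tailed probability of an extreme observed difference under the point-null H0 and under one representative member of the directional null H02 (a true difference of −2 points):

```python
import random

def prob_diff_at_least(true_diff, observed_diff, n=100, sd=15.0,
                       trials=20000, seed=1):
    """Estimate P(sample mean difference >= observed_diff | true_diff)
    by simulating the sampling distribution of a difference of two means."""
    rng = random.Random(seed)
    se = sd * (2.0 / n) ** 0.5  # SE of the difference of two group means
    hits = sum(rng.gauss(true_diff, se) >= observed_diff
               for _ in range(trials))
    return hits / trials

p_point_null = prob_diff_at_least(0.0, 5.0)   # H0: zero true difference
p_opposite   = prob_diff_at_least(-2.0, 5.0)  # one point-hypothesis in H02
print(p_point_null, p_opposite)
```

As the text argues, the probability computed under the point-null exceeds the probability under any point-hypothesis in which the true difference runs the "wrong" way, so the point-null p serves as a conservative bound for the whole directional null.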
Measurement and sampling error may, of course, also result in a sampling deviation in the opposite direction; or, the true difference δ may be so small that even if our sample values were to coincide exactly with the true ones, the sheer algebra of the significance test would not enable us to reach the pre-specified level of statistical significance. If we conclude until further notice that the directional null hypothesis H02 is tenable, on the grounds that we have failed to refute it by our investigation, then we have failed to support its statistical alternative H1, and therefore failed to confirm one of the predictions of the substantive theory T. Retention of the null hypothesis H02 when it is in fact false is known as an error of the second kind or Type II Error.
In the biological and social sciences there has been widespread adoption of the probabilities .01 or .05 as the allowable theoretical frequency of Type I errors. These values are called the 1% and 5% "levels of significance." It is obvious that there is an inverse relationship between the probabilities of the two kinds of errors, so that if we adopt a significance level which increases the frequency of Type I errors, such a policy will lead to a greater number of claims of statistically significant departure from the null hypothesis; and, therefore, in whatever unknown fraction of all experiments performed the null hypothesis is in reality false, we will more often (correctly) conclude its falsity, i.e., we will thereby be reducing the proportion of Type II errors.
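The inverse relation between the two error rates can be made concrete with a standard normal-approximation power calculation; the effect size, SD, and sample size below continue the illustrative I.Q. numbers and are not from the paper:

```python
import math

def normal_sf(z):
    """Upper-tail probability of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def power(true_diff, n=100, sd=15.0, z_alpha=1.645):
    """One-tailed power: P(reject H0 | true_diff), normal approximation
    for a difference of two group means with n subjects per group."""
    se = sd * math.sqrt(2.0 / n)
    return normal_sf(z_alpha - true_diff / se)

# Tightening alpha from .05 to .01 (fewer Type I errors) raises beta,
# the Type II error rate, for the same true effect.
beta_05 = 1 - power(5.0, z_alpha=1.645)  # one-tailed alpha = .05
beta_01 = 1 - power(5.0, z_alpha=2.326)  # one-tailed alpha = .01
print(f"Type II rate at alpha=.05: {beta_05:.3f}; at alpha=.01: {beta_01:.3f}")
```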
It is important to keep clear the distinction between the substantive theory of interest and the statistical hypothesis which is derived from it [2]. In the I.Q. example there was almost no substantive theory, or a very impoverished one; i.e., the question being investigated was itself stated as a purely statistical question about the average I.Q. of the two sexes. In the great majority of investigations in psychology the situation is otherwise. Normally, the investigator holds some substantive theory about unconscious mental processes, or physiological or genetic entities, or perceptual structure, or about learning influences in the person's past, or about current social pressures, which contains a great deal more content than the mere statement that the population parameter of an observational variable is greater for one group of individuals than for another. While no competent psychologist is unaware of this obvious distinction between a substantive psychological theory T and a statistical hypothesis H implied by it, in practice there is a tendency to conflate the substantive theory with the statistical hypothesis, thereby illicitly conferring upon T somewhat the same degree of support given H by a successful refutation of the null hypothesis. Hence the investigator, upon finding an observed difference which has an extremely small probability of occurring on the null hypothesis, gleefully records the tiny probability number "p < .001," and there is a tendency to feel that the extreme smallness of this probability of a Type I error is somehow transferable to a small probability of "making a theoretical mistake." It is as if, when the observed statistical result would be expected to arise only once in a thousand times through a Type I statistical error given H0, therefore one's substantive theory T, which entails the alternative H1, has received some sort of direct quantitative support of magnitude around .999 [= 1 − .001].
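The head note's Bayesian reading of this mistake can be put in two lines of arithmetic: if a merely directional prediction has roughly a 1/2 chance of being confirmed even when the substantive theory T is worthless, then Bayes' rule gives T only a modest boost after a "significant" result, nothing like .999. The prior and the likelihoods below are illustrative assumptions, chosen only to show the shape of the update:

```python
def posterior_for_theory(prior_T, p_confirm_given_T=0.9,
                         p_confirm_given_not_T=0.5):
    """P(T | confirmation in the predicted direction) by Bayes' rule.
    p_confirm_given_not_T ~= 0.5 reflects the directional-prediction
    argument: a worthless theory guesses the sign correctly half the time."""
    num = prior_T * p_confirm_given_T
    den = num + (1 - prior_T) * p_confirm_given_not_T
    return num / den

# With an (arbitrary) prior of 0.1 on T, confirmation moves us only
# from 0.10 to about 0.17 -- far from the .999 the "p < .001" suggests.
print(posterior_for_theory(0.1))
```

The smallness of p constrains the Type I error rate, not the likelihood ratio in favor of T, which is what the posterior actually depends on.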
One reason why the directional null hypothesis (H02: μg ≤ μb) is the appropriate candidate for experimental refutation is the universal agreement that the old point-null hypothesis H0 (μg = μb) is [quasi-] always false in biological and social science. Any dependent variable of interest, such as I.Q., or academic achievement, or perceptual speed, or emotional reactivity as measured by skin resistance, or whatever, depends mainly upon a finite number of "strong" variables characteristic of the organisms studied (embodying the accumulated results of their genetic makeup and their learning histories) plus the influences manipulated by the experimenter. Upon some complicated, unknown mathematical function of this finite list of "important" determiners is then superimposed an indefinitely large number of essentially "random" factors which contribute to the intragroup variation and therefore boost the error term of the statistical-significance test. In order for two groups which differ in some identified properties (such as social class, intelligence, diagnosis, racial or religious background) to differ not at all in the "output" variable of interest, it would be necessary that all determiners of the output variable have precisely the same average values in both groups, or else that their values should differ by a pattern of amounts of difference which precisely counterbalance one another to yield a net difference of zero. Now our general background knowledge in the social sciences, or, for that matter, even "commonsense" considerations, makes such an exact equality of all determining variables, or a precise "accidental" counterbalancing of them, so extremely unlikely that no psychologist or statistician would assign more than a negligibly small probability to such a state of affairs.
Everyone familiar with psychological research knows that numerous "puzzling, unexpected" correlations pop up all the time, and that it requires only a moderate amount of motivation-plus-ingenuity to construct very plausible alternative theoretical explanations for them. These armchair considerations are borne out by the finding that in psychological and sociological investigations involving very large numbers of subjects, it is regularly found that almost all correlations or differences between means are statistically significant. See, for example, the papers by Bakan [1] and Nunnally [8]. Data currently being analyzed by Dr. David Lykken and myself, derived from a huge sample of over 55,000 Minnesota high school seniors, reveal statistically significant relationships in 91% of pairwise associations among a congeries of 45 miscellaneous variables such as sex, birth order, religious preference, number of siblings, vocational choice, club membership, college choice, mother's education, dancing, interest in woodworking, liking for school, and the like. The 9% of non-significant associations are heavily concentrated among a small minority of variables having dubious reliability, or involving arbitrary groupings of non-homogeneous or non-monotonic sub-categories. The majority of variables exhibited significant relationships with all but three of the others, often at a very high confidence level (p < 10^-6).
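A quick calculation shows why nearly everything comes out significant at n ≈ 55,000: even a trivially small true correlation produces an enormous test statistic. This sketch uses the standard Fisher z approximation for testing r = 0; the r values are illustrative, not Meehl and Lykken's:

```python
import math

def correlation_z(r, n):
    """Approximate z statistic for testing rho = 0, via the Fisher
    z-transform atanh(r) with standard error 1/sqrt(n - 3)."""
    return 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)

# At n = 55,000, correlations far too small to matter practically
# blow well past the 1.96 two-tailed .05 criterion.
for r in (0.01, 0.02, 0.05):
    print(f"true r = {r}: z = {correlation_z(r, 55000):.1f}")
```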
So that, for example, while there is no very "interesting" psychological theory that links hunger drive with color-naming ability, I myself would confidently predict a significant difference in color-naming ability between persons tested after a full meal and persons who had not eaten for 10 hours, provided the sample size were sufficiently large and the color-naming measurements sufficiently reliable, since one of the effects of the increased hunger drive is heightened "arousal," and anything which heightens arousal would be expected to affect a perceptual-cognitive performance like color-naming.
Let us now conceive of a large "theoretical urn" containing counters designating the indefinitely large class of actual and possible substantive theories concerning a certain domain of psychology (e.g., mammalian instrumental learning). Let us conceive of a second urn, the "experimental-design" urn, containing counters designating the indefinitely large set of possible experimental situations which the ingenuity of man could devise. (If anyone should object to my conceptualizing, for purposes of methodological analysis, such a heterogeneous class of theories or experiments, I need only remind him that such a class is universally presupposed in the logic of statistical significance testing.) Since the point-null hypothesis H0 is [quasi-] always false, almost every one of these experimental situations involves a non-zero difference on its output variable (parameter). Whichever group we (arbitrarily) designate as the "experimental" group and the "control" group, in half of these experimental settings the true value of the dependent variable difference (experimental minus control) will be positive, and in the other half negative.
...
We now perform a random pairing of the counters from the "theory" urn with the counters from the "experimental" urn, and arbitrarily stipulate (quite irrationally) ...
...I conclude that the effect of increased precision, whether achieved by improved instrumentation and control, greater sensitivity in the logical structure of the experiment, or increasing the number of observations, is to yield a probability approaching 1/2 of corroborating our substantive theory by a significance test, even if the theory is totally without merit.
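Meehl's urn argument can be simulated directly: give every "experiment" a small non-zero true effect of random sign, let the "theory" predict a direction at random, and watch the one-tailed confirmation rate climb toward 1/2 as sample size grows. All distributional choices below (effect sizes drawn from N(0, 3), SD = 15) are illustrative assumptions, not Meehl's:

```python
import random

def confirm_rate(n, sd=15.0, trials=4000, seed=7):
    """Fraction of random theory/experiment pairings 'confirmed' by a
    one-tailed .05 test, with n subjects per group."""
    rng = random.Random(seed)
    se = sd * (2.0 / n) ** 0.5
    confirms = 0
    for _ in range(trials):
        true_diff = rng.gauss(0, 3)           # non-zero true effect, sign random
        predicted_sign = rng.choice((-1, 1))  # theory picks a direction blindly
        obs = rng.gauss(true_diff, se)        # observed mean difference
        if predicted_sign * obs / se > 1.645: # one-tailed .05 "success"
            confirms += 1
    return confirms / trials

# As n grows the test detects the (real but theoretically irrelevant)
# difference almost surely, and the blind directional guess is right
# half the time, so the rate approaches 1/2.
for n in (25, 100, 10000):
    print(n, confirm_rate(n))
```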
...It seems unlikely that most social science investigators would think in their usual way about a theory in meteorology which "successfully predicted" that it would rain on the 17th of April, given the antecedent information that it rains (on the average) during half the days in the month of April!
...Many experimental articles in the behavioral sciences, and, even more strangely, review articles which purport to survey the current status of a particular theory in the light of all available evidence, treat the confirming instances and the disconfirming instances with equal methodological respect, as if one could, so to speak, "count noses," so that if a theory has somewhat more confirming than disconfirming instances, it is in pretty good shape evidentially. Since we know that this is already grossly incorrect on purely formal grounds, it is a mistake a fortiori when the so-called "confirming instances" have themselves a prior probability, as argued above, somewhere in the neighborhood of 1/2, quite apart from any theoretical considerations.
This methodological paradox would exist for the psychologist even if he played his own statistical game fairly. The reason for its existence is obvious, namely, that most psychological theories, especially in the so-called "soft" fields such as social and personality psychology, are not quantitatively developed to the extent of being able to generate point-predictions. In this respect, then, although this state of affairs is surely unsatisfactory from the methodological point of view, and stands in great need of clarification (and, hopefully, of constructive suggestions for improving it) from logicians and philosophers of science, one might say that it is "nobody's fault," it being difficult to see just how the behavior scientist could extricate himself from this dilemma without making unrealistic attempts at the premature construction of theories which are sufficiently quantified to generate point-predictions for refutation. However, there are five social forces and intellectual traditions at work in the behavior sciences which make the research consequences of this situation even worse than they may have to be, considering the state of our knowledge. In addition to (a) failure to recognize the marked evidential asymmetry between confirmation and modus tollens refutation of theories, and (b) inadequate appreciation of
the extreme weakness of the hurdle provided by the mere directional significance test, there exists among psychologists (c) a fairly widespread tendency to report experimental findings with a liberal use of ad hoc explanations for those that didn't "pan out." This last methodological sin is especially tempting in the "soft" fields of (personality and social) psychology, where the profession highly rewards a kind of "cuteness" or "cleverness" in experimental design, such as a hitherto untried method for inducing a desired emotional state, or a particularly "subtle" gimmick for detecting its influence upon behavioral output. The methodological price paid for this highly-valued "cuteness" is, of course, (d) an unusual ease of escape from modus tollens refutation. For, the logical structure of the "cute" component typically involves use of complex and rather dubious auxiliary assumptions, which are required to mediate the original prediction and are therefore readily available as (genuinely) plausible "outs" when the prediction fails. It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (and challenged ad hoc in order to avoid the latter's modus tollens impact on the theory) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of "an integrated research program," without ever once refuting or corroborating so much as a single strand of the network.
Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence.
Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the "exactitude" of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship.
Lest the philosophical reader wonder (quite appropriately) whether these impressions of the psychological literature ought perhaps to be dismissed as mere "sour grapes" from an embittered, low-publication psychologist manqué, it may be stated that the author (a past president of the American Psychological Association) has published over 70 technical books or articles in both "hard" and "soft" fields of psychology, is a recipient of the Association's Distinguished Scientific Contributor Award and of the Distinguished Contributor Award of the Division of Clinical Psychology, has been elected to Fellowship in the American Academy of Arts and Sciences, and is actively engaged in both theoretical and empirical research at the present time. He's not mad at anybody, but he is a bit distressed at the state of psychology.
#psychology #statistics #philosophyofscience #bayesian