
"Bayesian data analysis", Kruschke 2010

Bayesian methods have garnered huge interest in cognitive science as an approach to models of cognition and perception. On the other hand, Bayesian methods for data analysis have not yet made much headway in cognitive science against the institutionalized inertia of 20th century null hypothesis significance testing (NHST). Ironically, specific Bayesian models of cognition and perception may not long endure the ravages of empirical verification, but generic Bayesian methods for data analysis will eventually dominate. It is time that Bayesian data analysis became the norm for empirical methods in cognitive science. This article reviews a fatal flaw of NHST and introduces the reader to some benefits of Bayesian data analysis. The article presents illustrative examples of multiple comparisons in Bayesian analysis of variance and Bayesian approaches to statistical power.

This brief article assumes that you, dear reader, are a practitioner of null hypothesis significance testing, hereafter abbreviated as NHST. In collecting data, you take care to insulate the data from your intentions. For example, double-blind procedures in clinical trials insulate the data from experimenter intentions. As another example, in field research, observers construct elaborate ‘duck blinds’ to minimize the impact of the observer on the data. After carefully collecting the data, you then go through the ritual invocation of p < 0.05. Did you know that the computation of the p value depends crucially on the covert intentions of the analyst, or on the analyst’s interpretation of the unknowable intentions of the data collector? As will be shown below, this is true despite the data collector’s care to keep the data unaffected by his or her intentions. Moreover, for any set of data, an intention can be found for which p is not less than 0.05.

To make the issue concrete, consider an example. You have a scintillating hypothesis about the effect of some different treatments on a metric dependent variable. You collect some data (carefully insulated from your hopes about differences between groups) and compute a t statistic for two of the groups. The computer program that tells you the value of t also tells you the value of p, which is the probability of getting a t value at least that extreme by chance if the null hypothesis were true. You want the p value to be less than 5%, so that you can reject the null hypothesis and declare that your observed effect is significant.
What is wrong with that procedure? Notice the seemingly innocuous step from t to p. The p value, on which your entire claim to significance rests, is conjured by the computer program under an assumption about your intentions when you ran the experiment. The computer assumes you intended, in advance, to fix the sample sizes in the groups. In a little more detail, and this is important to understand, the computer figures out the probability that a t value as extreme as yours would occur, under the null hypothesis, if the intended experiment were replicated many, many times.
...The null hypothesis sets the two underlying populations as normal populations with identical means and variances. If your data happen to have six scores per group, then, in every simulated replication of the experiment, the computer randomly samples exactly six data values from each underlying population, and computes the t value for that random sample. Usually t is nearly zero, because the sample comes from a null hypothesis population in which there is zero difference between groups. By chance, however, sometimes the sample t value will be fairly far above or below zero.
...The arrow in Figure 1 marks the critical value t_crit at which the probability of getting a t value more extreme is 5%. We reject the null hypothesis if |t_obs| > t_crit. In this case, when N = 6 is fixed for both groups, t_crit = 2.23. This is the critical value shown in standard textbook t tables for a two-tailed t-test with 10 degrees of freedom.
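The fixed-N sampling distribution is easy to reproduce by simulation. The sketch below (Python with NumPy; an illustration, not code from the article) draws many replications of the intended experiment under the null hypothesis and recovers the two-tailed critical value, which should land near the textbook 2.23 for 10 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(1)
reps, n = 200_000, 6  # simulated replications; fixed sample size per group

# Under the null hypothesis, both groups come from the same normal population.
x = rng.normal(size=(reps, n))
y = rng.normal(size=(reps, n))

# Pooled-variance two-sample t statistic for each simulated replication
# (equal group sizes, so the pooled variance is the simple average).
sp2 = (x.var(axis=1, ddof=1) + y.var(axis=1, ddof=1)) / 2
t = (x.mean(axis=1) - y.mean(axis=1)) / np.sqrt(sp2 * (2 / n))

# Two-tailed critical value: only 5% of |t| values are more extreme.
t_crit = np.quantile(np.abs(t), 0.95)
print(round(t_crit, 2))  # close to the textbook 2.23 for df = 10
```

The same machinery, with only the sampling intention changed, generates the other distributions discussed in this article.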
In computing p, the computer assumes that you did not intend to collect data for some time period and then stop; you did not intend to collect more or less data based on an analysis of the early results; you did not intend to have any lost data replaced by additional collection. Moreover, you did not intend to run any other conditions ever again, or compare your data with any other conditions. If you had any of these other intentions, or if the analyst believes you had any of these other intentions, the p value can change dramatically.

Example: In most of my research, I have only a rough sample size in mind, and I collect data for a period of time until that rough sample size is achieved. For example, I will post session times for which volunteers can sign up, and the posted times span, say, a 2-week period. I expect some typical rate of subject recruitment during that span of time, hoping to get a sample size in the desired range.
It is easy to generate a sampling distribution for t under these intentions. Specifically, suppose that the mean rate of subject sign-ups is six per week, with the actual number randomly generated by a simple Poisson process, as is often used to model the arrival of customers in a queue.2 The Poisson distribution generates an integer value between zero and infinity, with a mean, in this case, of 12 for a 2-week duration. Each subject is randomly assigned to one of the groups by a flip of a fair coin. Thus, in some random replications, we will happen to get six subjects in each group, but in other replications we will get, say, five subjects in one group and eight in the other. (On those rare occasions when this process allocates fewer than two subjects to a group, the number of subjects is promoted to two.) For every simulated replication, the t value is computed. The resulting distribution of t values is shown in the lower panel of Figure 1. The value of t, at which only 5% of the distribution is more extreme, is t_crit = 2.44.
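The 2-week stopping intention can be simulated in the same spirit. The sketch below (Python with NumPy) follows the process described above; the exact handling of the promotion rule for undersized groups is my assumption, so the resulting critical value is illustrative. It comes out noticeably larger than the fixed-N 2.23; the article reports 2.44 for its version of this process.

```python
import numpy as np

rng = np.random.default_rng(2)
reps = 50_000
ts = np.empty(reps)

for i in range(reps):
    # Total sample size over 2 weeks: Poisson with mean 6 sign-ups per week.
    total = rng.poisson(12)
    # Each subject is assigned to a group by a flip of a fair coin.
    n1 = rng.binomial(total, 0.5)
    n2 = total - n1
    # Promote any group with fewer than two subjects to two (an assumption
    # about how the promotion rule in the text is implemented).
    n1, n2 = max(n1, 2), max(n2, 2)
    # Under the null hypothesis, both groups share one normal population.
    x, y = rng.normal(size=n1), rng.normal(size=n2)
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    ts[i] = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Two-tailed critical value under the 2-week stopping intention; typically
# larger than the fixed-N value of 2.23 (the article reports 2.44).
t_crit = np.quantile(np.abs(ts), 0.95)
print(round(t_crit, 2))
```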
In summary, if the intention was to collect six subjects per group, then the null hypothesis predicts t values distributed according to the upper panel of Figure 1, but if the intention was to collect data for 2 weeks, with a mean rate of six subjects per week, then the null hypothesis predicts t values distributed according to the lower panel of Figure 1.
Suppose we are handed some data from two groups, with six values per group; we compute t and find that t = 2.35. Do we reject the null hypothesis? According to NHST, we can only answer that question when we ascertain the intention of the experimenter. We ask the research assistant who collected the data. The assistant says, ‘I just collected data for 2 weeks. It is my job. I happened to get six subjects in each group’. We ask the graduate student who oversaw the assistant. The student says, ‘I knew we needed six subjects per group, so I told the assistant to run for 2 weeks. We usually get about six subjects per week’. We ask the lab director, who says, ‘I told my graduate student to collect six subjects per group’. Therefore, for the lab director, t = 2.35 rejects the null hypothesis, but for the research assistant who actually collected the data, t = 2.35 fails to reject the null hypothesis.

The only application in which standard textbooks regularly acknowledge the role of intent is multiple comparisons. When there are multiple groups, the analyst has the option of many different comparisons of different groups and combinations of groups. With every comparison, there is new opportunity to commit a false alarm, i.e., to have a t value that is spuriously or accidentally larger than the critical t value, even though there is no actual difference between groups. For example, consider a situation in which there are four groups, with intentionally fixed N = 6 per group. Suppose the analyst computes a t value for comparing groups 1 and 2. If the underlying populations are not actually different, then there is a 5% chance that the sample |t| will exceed t_crit = 2.23, i.e., there is a 5% false alarm rate. Suppose the analyst also computes the t value for comparing groups 3 and 4. Again there is a 5% chance that the sample |t| will exceed t_crit = 2.23 if the null hypothesis is true. If the analyst conducts both tests, however, there is a greater chance of committing a false alarm, because a false alarm can arise if either test happens to spuriously exceed the critical value.
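The inflation is easy to quantify for this example. Because the comparison of groups 1 versus 2 and the comparison of groups 3 versus 4 involve disjoint data, the two tests are independent, and the chance of at least one false alarm among k independent tests, each conducted at the 5% level, is 1 − 0.95^k. A minimal sketch (not from the article):

```python
# Familywise false alarm rate for k independent tests, each at alpha = 0.05:
# the chance that at least one test spuriously exceeds its critical value.
alpha = 0.05
for k in (1, 2, 6):
    familywise = 1 - (1 - alpha) ** k
    print(f"{k} tests: {familywise:.4f}")
```

With two tests the rate is already 0.0975; with all six pairwise comparisons of four groups it exceeds 26% (treating the tests as independent, which overlapping comparisons are not exactly, so for them the formula is only an approximation).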
...
Now, suppose we actually run the experiment. We randomly assign 30 people to the five groups, six people per group. The first group gets the placebo, and the other four groups get the corresponding four drugs. We are careful to make this a double-blind experiment: neither the subjects nor experimenters know who is getting which treatment. Moreover, no one knows whether any other person is even in the experiment. We collect the data. Our first question is to compare the placebo and the first drug, i.e., group 1 versus group 2. We compute the t statistic for the data from the two groups and find that t = 2.95. Do we decide that the two treatments had significantly different effects?
The answer, bizarrely, depends on the intentions of the person we ask. Suppose, for instance, that we handed the data from the first two groups to a research assistant, who is asked to test for a difference between groups. The assistant runs a t-test and finds t = 2.95, declaring it to be highly significant because it greatly exceeds the critical value of 2.23 for a two-group t-test. Suppose, on the other hand, that we handed the data from all five groups to a different research assistant, who is asked to compare the first group against each of the other four. This assistant runs a t-test of group 1 versus group 2 and finds t = 2.95, declaring it to be marginally significant because it only just exceeds the larger critical value required for these four planned comparisons. Suppose, on yet another hand, that we handed the data from all five groups to a different research assistant, who is told to conduct all pairwise comparisons post hoc, because we have no strong hypotheses about which treatments will have beneficial, detrimental, or neutral effects. This assistant runs a t-test of group 1 versus group 2 and finds t = 2.95, declaring it to be not significant because it fails to exceed the critical value of 3.43 that is used for post hoc pairwise comparisons. Notice that regardless of which assistant analyzed the data, the t value for the two groups stayed the same, because the data of the two groups stayed the same.

Some practitioners of NHST give the impression that many problems would be solved if researchers reported a confidence interval and not only a p value.3 A confidence interval does convey more information than a point estimate and a p value, but it is rarely acknowledged that confidence intervals depend on the experimenter’s intentions in the same way that p values do.

It is trivial to make any observed difference between groups non-significant. Just imagine a large set of other groups that should be compared with the two initial groups, and earnestly intend to compare the two groups with all the other groups once you eventually collect the data. The false alarm rate skyrockets, and any observed difference between the first two groups becomes non-significant. The analogous result holds for confidence intervals: the confidence interval becomes huge merely by intending to compare the first two groups with lots of other groups.
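How intended comparisons widen confidence intervals can be illustrated with a Bonferroni-style correction (a standard correction, used here purely for illustration; the article does not prescribe it). Holding the familywise error rate at 5% across k intended comparisons means testing each at level 0.05/k, and the corresponding critical value, and hence the confidence interval half-width (critical value times standard error), grows without bound as k grows. A large-sample (z approximation) sketch in Python:

```python
from statistics import NormalDist

# Bonferroni-corrected two-tailed critical value for k intended comparisons,
# under the large-sample normal approximation. The confidence interval
# half-width is this critical value times the standard error, so merely
# intending more comparisons widens the interval without limit.
alpha = 0.05
for k in (1, 10, 100, 1000):
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    print(f"k={k}: critical z = {z:.2f}")
```

For k = 1 this recovers the familiar 1.96; by k = 1000 the critical value has already more than doubled.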