Problems in null-hypothesis testing: in real-world data, the distribution of p-values under the null is rarely uniform - there are systematic biases throughout your data and data-collection processes, there are always plenty of correlations adding noise, and a cutoff of 0.05 may be wrong on its own. Fortunately, you can run placebo or control tests: take variables you strongly believe the null hypothesis is true for, and see how often they blow through the nominal 0.05 cutoff - if you test 100 nulls and get 10 results with p<0.05, you know you need to lower your cutoff to something like p<0.01 to preserve your 5-false-positive error rate. The results from your 100 control tests become an empirical null distribution: you have empirically (in your particular dataset + analysis setup) estimated how many extreme results there are (distribution) under the (null) hypothesis.
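A minimal sketch of the idea (invented data and an invented bias model, purely for illustration): simulate 100 negative-control tests in a biased pipeline, count how many blow the nominal 0.05 cutoff, then lower the cutoff until only ~5 of the known nulls pass.

```python
import random

random.seed(0)

def control_test_pvalues(n_controls=100):
    """Simulate p-values from negative-control tests in a biased analysis:
    mostly uniform p-values, mixed with a slice of spuriously small ones
    (a hypothetical model of systematic bias)."""
    pvals = []
    for _ in range(n_controls):
        if random.random() < 0.10:          # ~10% of nulls corrupted by bias
            pvals.append(random.random() * 0.05)
        else:
            pvals.append(random.random())   # well-behaved null: uniform p
    return pvals

pvals = control_test_pvalues()
false_pos = sum(p < 0.05 for p in pvals)
print(f"nominal 0.05 cutoff: {false_pos}/100 known nulls 'significant'")

# Tighten the cutoff until only ~5% of the known nulls pass:
cutoff = 0.05
while sum(p < cutoff for p in pvals) > 5:
    cutoff *= 0.9
print(f"empirical cutoff preserving a 5-false-positive rate: {cutoff:.4f}")
```

The empirical cutoff lands wherever the bias in your particular setup puts it - which is the point: it is estimated from your data, not assumed.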

"Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis", Efron 2004 http://www.stat.cmu.edu/~jiashun/Teaching/F08STAT756/Lectures/Efron.pdf lays out the logic of the argument, and "Interpreting observational studies: why empirical calibration is needed to correct p-values", Schuemie et al 2012, provides a nice medical statistics application - in datamining real-world medical records just like a real-world epidemiologist, how often can we show bogus side-effects for drugs? Very often, and to get an actual alpha control of 0.05, we need something more like 0.01 for any claims of harm. (And interestingly from the graph, the analyses seem to be biased towards finding harm rather than benefit, so while harms need to be tightened down to 0.01, benefits can actually be loosened up a bit from 0.05.)

Excerpts:

"Often the literature makes assertions of medical product effects on the basis of 'p < 0.05'. The underlying premise is that at this threshold, there is only a 5% probability that the observed effect would be seen by chance when in reality there is no effect. In observational studies, much more than in randomized trials, bias and confounding may undermine this premise. To test this premise, we selected three exemplar drug safety studies from literature, representing a case-control, a cohort, and a self-controlled case series design. We attempted to replicate these studies as best we could for the drugs studied in the original articles. Next, we applied the same three designs to sets of negative controls: drugs that are not believed to cause the outcome of interest. We observed how often p < 0.05 when the null hypothesis is true, and we fitted distributions to the effect estimates. Using these distributions, we compute calibrated p-values that reflect the probability of observing the effect estimate under the null hypothesis, taking both random and systematic error into account. An automated analysis of scientific literature was performed to evaluate the potential impact of such a calibration. Our experiment provides evidence that the majority of observational studies would declare statistical significance when no effect is present. Empirical calibration was found to reduce spurious results to the desired 5% level. Applying these adjustments to literature suggests that at least 54% of findings with p < 0.05 are not actually statistically significant and should be reevaluated.

With these potential advantages comes the recognition that observational studies can suffer from various biases and that results might not always be reliable. Results from observational studies often cannot be replicated [1, 2]. For example, two recent independent observational studies investigating oral bisphosphonates and the risk of esophageal cancer produced results leading to conflicting conclusions [3, 4], despite the fact that the two studies analyzed the same database over approximately the same period. A systematic analysis has suggested that the majority of observational studies return erroneous results [5]. The main source of these problems is that observational studies are more vulnerable than RCTs to systematic error such as bias and confounding. In RCTs, randomization of the exposure helps to ensure that exposed and unexposed populations are comparable. Observational studies, by definition, merely observe clinical practice, and exposure is no longer the only potential explanation for observed differences in outcomes. Many statistical methods exist that aim to reduce this systematic error, including self-controlled designs [6] and propensity score adjustment [7], but it is unclear to what extent these solve the problem.
Despite the fact that the problem of residual systematic error is widely acknowledged (often in the discussion section of articles), the results are sometimes misinterpreted as if this error did not exist. Most important, statistical significance testing, which only accounts for random error, is widely used in observational studies...Although we believe that most researchers are aware of the fact that traditional p-value calculations do not adequately take systematic error into account, likely because of a lack of a better alternative, the notion of statistical significance based on the traditional p-value is widespread in the medical literature.

In this research, we focus on the fundamental notion of statistical significance in observational studies, testing the degree to which observational analyses generate significant findings in situations where no association exists. To accomplish this, we selected two publications: one investigating the relationship between isoniazid and acute liver injury [8] and one investigating sertraline and upper gastrointestinal (GI) bleeding [9]. These two publications represent three popular study designs in observational research: the first publication used a cohort design, and the second paper used both a case-control design and a self-controlled case series (SCCS). We replicated these studies as best we could, closely following the specific design choices. However, because we did not have access to the same data, we used suitable substitute databases. For each study, we identified a set of negative controls (drug-outcome pairs for which we have confidence that there is no causal relationship) and explored the performance of the study designs. We show that the designs frequently yield biased estimates, misrepresent the p-value, and lead to incorrect inferences about rejecting the null hypothesis. We introduce a new empirical framework, based on modeling the observed null distribution for the negative controls that yields properly calibrated p-values for observational studies. Using this approach, we observe that about 5% of drug-outcome negative controls have p < 0.05, as is expected and desired. By applying this framework to a large set of historical effect estimates under various assumptions of bias, we show that for the majority of estimates currently considered statistically significant, we would not be able to reject the null hypothesis after calibration. Our framework provides an explicit formula for estimating the calibrated p-value using the traditionally estimated relative risk and standard error.

For our negative controls, we could either pick different drugs that are known not to cause the outcome of interest or pick outcomes that are known not to be caused by our drug of interest [17]. Picking different outcomes would be more difficult in observational studies because some study designs such as case-control are focused on outcomes and would require resampling of subjects and because outcomes are often more difficult to extract from observational data, requiring complex algorithms that need to be validated. On the other hand, different drug exposures are usually easily and fairly accurately identified in prescription tables, and we therefore have opted for using negative control drugs. We focus our analysis on the two outcomes in our example studies: acute liver injury and upper GI bleeding. Both outcomes arise frequently in drug safety studies. We attempted to perform an exhaustive search to define exposure controls for these two outcomes by starting with all drugs with an active structured product label. Subsequently, we selected only those drugs meeting the following criteria:
(1) The outcome of interest could not be listed in any section of the Food and Drug Administration structured product label, nor could any related outcome be listed.
(2) The drug could not be listed as a 'causative agent' for the outcome in the book Drug-Induced Diseases: Prevention, Detection and Management [18].
(3) A manual review of the literature found no studies showing the drug caused the outcome.
For acute liver injury, we found 37 negative controls, while for upper GI bleeding, 67 drugs met these criteria.

Traditional significance testing utilizes a theoretical null distribution that requires a number of assumptions to ensure its validity. Our proposed approach instead derives an empirical null distribution from the actual effect estimates for the negative controls. These negative control estimates give us an indication of what can be expected when the null hypothesis is true, and we use them to estimate an empirical null distribution. We fitted a Gaussian probability distribution to the estimates, taking into account the sampling error of each estimate

Our replication of the [original] published studies produced similar results [as published].

We applied the three study designs to the negative controls for the respective health outcomes of interest. Figure 1 shows the estimated odds ratios and incidence rate ratios, which can also be found in tabular form in the Supporting information, Appendix C. For the case-control and SCCS designs, applying the same study design to other drugs was straightforward. For the cohort method, most drugs had different comparator groups of patients and required recomputing propensity scores. For three of the negative controls for acute liver injury and 23 of the negative controls for upper GI bleeding, there were not enough data to compute an estimate, for instance, because none of the cases and none of the controls were exposed to the drug. However, an initial investigation in the minimum number of required controls showed the remaining number sufficed (Supporting information, Appendix E). Note that the number of exposed subjects varies greatly from drug to drug, from 67 subjects being exposed to neostigmine to 884,644 individuals having exposure to fluticasone (Supporting information, Appendix C). These differences account for the majority in variation of the widths of the confidence intervals.
From Figure 1, it is clear that traditional significance testing fails to capture the diversity in estimates that exists when the null hypothesis is true. Despite the fact that all the featured drug-outcome pairs are negative controls, a large fraction of the null hypotheses are rejected. We would expect only 5% of negative controls to have p < 0.05. However, in Figure 1A (cohort method), 17 of the 34 negative controls (50%) are either significantly protective or harmful. In Figure 1B (case-control), 33 of 46 negative controls (72%) are significantly harmful. Similarly, in Figure 1C (SCCS), 33 of 46 negative controls (72%) are significantly harmful, although not the same 33 as in Figure 1B.
These numbers cast doubts on any observational study that would claim statistical significance using traditional p-value calculations. Consider, for example, the odds ratio of 2.2 that we found for sertraline using the case-control method; we see in Figure 1B that many of the negative controls have similar or even higher odds ratios. The estimate for sertraline was highly significant (p < 0.001), meaning the null hypothesis can be rejected on the basis of the theoretical model. However, on the basis of the empirical distribution of negative controls, we can argue that we cannot reject the null hypothesis so readily.

For the three designs we considered, Table I provides the maximum likelihood estimates for the means and variances of the empirical null distributions. Interestingly, in our study, while the cohort method has nearly zero bias on average, the case-control and SCCS methods are positively biased on average. It is important to note that for all three designs, σ̂ is not equal to zero, meaning that the bias in an individual study may deviate considerably from the average.
When eliminating the six drugs where we expect confounding by indication, the estimated parameters for the case-control design change slightly to μ̂ = 0.76 and σ̂ = 0.22.
Figure 2 shows for every level of α the fraction of negative controls for which the p-value is below α, for both the traditional p-value calculation and the calibrated p-value using the empirically established null distribution. For the calibrated p-value, a leave-one-out design was used: for each negative control, the null distribution was estimated using all other negative controls. A well-calibrated p-value calculation should follow the diagonal: for negative controls, the proportion of estimates with p < α should be approximately equal to α. Most significance testing uses an α of 0.05, and we see in Figure 2 that the calibrated p-value leads to the desired level of rejection of the null hypothesis. For the cohort method, case-control, and SCCS, the number of significant negative controls after calibration is 2 of 34 (6%), 5 of 46 (11%), and 3 of 46 (5%), respectively. Applying the calibration to our three example studies, we find that only the cohort study of isoniazid reaches statistical significance: p = 0.01. The case-control and SCCS analysis of sertraline produced p-values of 0.71 and 0.84, respectively.

The medical literature features many observational studies that use traditional significance testing to assert whether an effect was observed. Assuming that these studies have similar null distributions as our three example studies, we can test whether for historical significant findings, we can still reject the null hypothesis after calibration. Using an elaborate PubMed query (Supporting information, Appendix F), we identified 31,386 papers published in the last 10 years that applied a cohort, case-control, or SCCS design in a study using observational healthcare data. Through an automated text-mining procedure, we extracted 4970 articles where a relative risk, hazard, odds, or incidence rate ratio estimate was mentioned in the abstract...allowing us to recompute the calibrated p-value under various assumptions of bias. The full list of estimates and recomputed p-values can be found in the Supporting information, Appendix G.
Figure 4 shows the number of estimates per publication year. The vast majority of these estimates (82% of all estimates) are statistically significant under the traditional assumption of no bias. But even with the most modest assumption of bias (mean = 0, SD = 0.25), this number dwindles to less than half (38% of all estimates). This suggests that at least 54% of significant findings would be deemed non-significant after calibration. With an assumption of medium size bias (mean = 0.25, SD = 0.25), the number of significant findings decreases further (33% of all estimates), and assuming a larger but still realistic level of bias leaves only a few estimates with p < 0.05 (14% of all estimates).

even when it was clear that a method was highly positively biased, we found highly prevalent drugs with effect estimates barely above one were still deemed statistically significant using these methods because of the large original z-value or small original p-value. Our intuition is that bias is irrespective of sample size and would remain present even in an infinitely large sample. We have therefore chosen to model our null distribution on the basis of the effect estimate, taking standard error into account as a measure of uncertainty.

The method proposed here aims to correct the type I error (erroneously rejecting the null hypothesis) level, most likely at the cost of vastly increasing the number of type II errors (erroneously rejecting the alternative hypothesis). Ideally, we would improve our study designs to better control for bias, which would result in μ̂ and σ̂ approaching 0, and thereby maximizing statistical power after calibration. In that case, our approach would no longer be needed for calibration, only to show that bias has been dealt with. However, as shown here, the study designs currently pervading literature fall short of this goal, and more work is needed to reach this (potentially unobtainable) goal. We recommend that observational studies always include negative controls to derive an empirical null distribution and use these to compute calibrated p-values."

#statistics  