Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy are seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising.
It can be proven that most claimed research findings are false.
Several methodologists have pointed out [9–11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. “Negative” research is also very useful. “Negative” is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.
As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of “true relationships” to “no relationships” among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R/(R − βR + α). A research finding is thus more likely true than false if (1 − β)R > α.
Since usually the vast majority of investigators depend on α = 0.05, this means that a research finding is more likely true than false if (1 − β)R > 0.05.
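As a concrete illustration, the PPV formula can be evaluated directly. The following minimal sketch (in Python; the function name and the example values are ours, chosen for illustration) also checks that PPV exceeds 50% exactly when (1 − β)R > α.

```python
def ppv(R, beta, alpha=0.05):
    """Post-study probability that a claimed finding is true:
    PPV = (1 - beta) * R / (R - beta * R + alpha)."""
    return (1 - beta) * R / (R - beta * R + alpha)

# Illustrative values: 80% power (beta = 0.20), 1:1 pre-study odds.
print(ppv(R=1.0, beta=0.20))  # ~0.94
# PPV > 0.5 holds exactly when (1 - beta) * R > alpha; both print True:
print(ppv(R=1.0, beta=0.20) > 0.5, (1 - 0.20) * 1.0 > 0.05)
```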
What is less well appreciated is that bias and the extent of repeated independent testing by different teams of investigators around the globe may further distort this picture and may lead to even smaller probabilities of the research findings being indeed true. We will try to model these two factors in the context of similar 2 × 2 tables.
In the presence of bias, let u be the proportion of probed analyses that would not have been “research findings” but nevertheless end up presented and reported as such because of bias. Then (Table 2) one gets PPV = ([1 − β]R + uβR)/(R + α − βR + u − uα + uβR), and PPV decreases with increasing u, unless 1 − β ≤ α, i.e., 1 − β ≤ 0.05 for most situations. Thus, with increasing bias, the chances that a research finding is true diminish considerably.
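A small extension of the same sketch (again illustrative; the parameter sweep is ours) shows how quickly even modest bias erodes the PPV under this formula.

```python
def ppv_with_bias(R, beta, u, alpha=0.05):
    """PPV when a proportion u of analyses that would not otherwise be
    significant is nevertheless reported as findings (bias):
    PPV = ([1-beta]R + u*beta*R) / (R + alpha - beta*R + u - u*alpha + u*beta*R)."""
    num = (1 - beta) * R + u * beta * R
    den = R + alpha - beta * R + u - u * alpha + u * beta * R
    return num / den

# With 80% power and 1:1 pre-study odds, PPV drops steadily as bias grows:
for u in (0.0, 0.05, 0.10, 0.20, 0.50):
    print(u, round(ppv_with_bias(R=1.0, beta=0.20, u=u), 3))
# approx. 0.941, 0.893, 0.850, 0.778, 0.632
```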
The probability that at least one study, among several done on the same question, claims a statistically significant research finding is easy to estimate. For n independent studies of equal power, the 2 × 2 table is shown in Table 3: PPV = R(1 − βⁿ)/(R + 1 − [1 − α]ⁿ − Rβⁿ) (not considering bias).
With increasing number of independent studies, PPV tends to decrease, unless 1 − β < α, i.e., typically 1 − β < 0.05. This is shown for different levels of power and for different pre-study odds in Figure 2. For n studies of different power, the term βⁿ is replaced by the product of the terms βᵢ for i = 1 to n, but inferences are similar.
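The same kind of sketch applies to the multiple-testing formula; the choice of n values below is ours, chosen only to show the trend.

```python
def ppv_n_studies(R, beta, n, alpha=0.05):
    """PPV when a claim rests on at least one significant result among
    n independent studies of equal power, with no bias:
    PPV = R(1 - beta**n) / (R + 1 - (1 - alpha)**n - R * beta**n)."""
    return R * (1 - beta**n) / (R + 1 - (1 - alpha)**n - R * beta**n)

# Spurious "hits" accumulate faster with n than genuine ones, so PPV
# falls (here 80% power, 1:1 pre-study odds):
for n in (1, 2, 5, 10):
    print(n, round(ppv_n_studies(R=1.0, beta=0.20, n=n), 3))
# approx. 0.941, 0.908, 0.815, 0.714
```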
In the described framework, a PPV exceeding 50% is quite difficult to get. Table 4 provides the results of simulations using the formulas developed for the influence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specific study designs and settings. A finding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is eventually true about 85% of the time. A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies where pooling is used to “correct” the low power of single studies is probably false if R ≤ 1:3. Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in five chance of being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], the PPV for each claimed relationship is extremely low, even with considerable standardization of laboratory and statistical methods, outcomes, and reporting thereof to minimize bias.
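As a rough plausibility check on two of the scenarios above, the bias-adjusted PPV formula reproduces the quoted magnitudes. The bias levels u used below are assumptions on our part (the text quotes only the resulting PPVs), so this is a sketch, not a reproduction of Table 4.

```python
def ppv_with_bias(R, beta, u, alpha=0.05):
    # Same bias-adjusted PPV formula as in the earlier sketch.
    num = (1 - beta) * R + u * beta * R
    den = R + alpha - beta * R + u - u * alpha + u * beta * R
    return num / den

# The u values below are illustrative assumptions, not stated in the text.
# Adequately powered RCT, 1:1 pre-study odds, modest bias:
print(round(ppv_with_bias(R=1.0, beta=0.20, u=0.10), 2))  # ~0.85 ("about 85%")
# Underpowered early-phase trial, 1:5 pre-study odds, more bias:
print(round(ppv_with_bias(R=0.2, beta=0.80, u=0.20), 2))  # ~0.23 ("about one in four")
```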