The Vindication of Magnitude-Based Inference (draft 2)
Will G Hopkins, Alan M Batterham
Sportscience 22, 19-27, 2018 (sportsci.org/2018/mbivind.htm)

The first version of this article contained several
misinterpretations of Sainani's assertions. Updated versions will incorporate
points raised in comments. To forestall claims that we have not accounted for
certain theoretical frequentist and Bayesian issues, we have provided some technical
notes.

Magnitude-based inference (MBI) is an approach to making a decision
about the true or population value of an effect statistic, taking into
account the uncertainty in the magnitude of the statistic provided by a
sample of the population. In response to concerns about error rates with the
decision process (Welsh
and Knight, 2015), we
recently showed that MBI is superior to the traditional approach to
inference, null-hypothesis significance testing (NHST) (Hopkins and Batterham, 2016). Specifically,
the error rates are comparable with, and often lower than, those of NHST, the
publishability rates with small samples are higher, and the potential for
publication bias is negligible.

A statistician from Stanford University, Kristin Sainani, has
now attempted to refute our claims about the superiority of MBI to NHST (Sainani, 2018). We
acknowledge the effort expended in her detailed scrutiny and welcome the
opportunity to discuss the points raised in the spirit of furthering
understanding. Sainani argues that MBI should not be used, and that we should
instead "adopt a fully Bayesian analysis" or merely interpret the
standard confidence interval as a plausible range of effect magnitudes
consistent with the data and model. We have no objection to researchers using
either of these two approaches, if they so wish. Nevertheless, we have shown
before and show here again that MBI is a valid, robust approach that has
earned its place in the statistical toolbox.

The title of Sainani's critique refers to "the
problem" with magnitude-based inference (MBI), but in the abstract she
claims that there are several problems with the Type-I and Type-II error
rates. In the article itself, she begins her synopsis of MBI with another
apparent problem: that the probabilistic statements in MBI about the
magnitude of the true effect are invalid. Throughout the critique are numerous
inconsistencies and mistakes. We resolve here all her perceived problems,
highlight her inconsistencies, and correct her mistakes.

Should researchers make probabilistic assertions about the true
(population) value of effects? Absolutely, especially for clinically
important effects, where implementation of a possibly beneficial effect in a
clinical or other applied setting carries with it the risk of harm.

Sainani states early on that she "completely agree[s] with and
applaud[s]" the approach of interpreting the range of magnitudes of an
effect represented by its upper and lower confidence limit, when reaching a
decision about a clinically important effect. But, according to Sainani,
"where Hopkins and Batterham's method breaks down is when they go beyond
simply making qualitative judgments like this and advocate translating
confidence intervals into probabilistic statements, such as the effect of the
supplement is 'very likely trivial' or 'likely beneficial.' This requires
interpreting confidence intervals incorrectly, as if they were Bayesian
credible intervals." We have addressed this concern previously (Hopkins and Batterham, 2016). The
usual confidence interval is congruent with a Bayesian credibility interval
with a minimally informative prior (Burton, 1994; Burton et al.,
1998; Spiegelhalter et al., 2004). As
such, it is an objective estimate of the likely range of the true value, and the
associated probabilistic statements of MBI are Bayesian posterior
probabilities with a minimally informative prior.

Unfortunately, full Bayesians disown us, because we prefer not
to turn belief into an informative subjective prior. Meanwhile, NHST-trained
statisticians disown us, because we do not test hypotheses. MBI is therefore
well placed to be a practical haven between a Bayesian rock and an NHST hard
place. (Others have attempted hybrids of Bayes and NHST, albeit with
different goals. See the technical notes.) From Bayesians we adapt valid
probabilistic statements about the true effect, based on a minimally
informative prior. From NHST we adapt straightforward computational methods
and assumptions, and we compute error rates for decisions based on
sufficiently low or high probabilities for the true effect. Whether these
error rates are acceptable is an issue we will address shortly.

There is a logical inconsistency in Sainani's "qualitative
judgment" of confidence intervals. In her view, it is not appropriate to
make a probabilistic assertion about the true magnitude of the effect, but it
is appropriate to interpret the magnitude of the lower and upper confidence
limits. The problem with this approach is that it all depends on the level of
the confidence interval, so she is in fact making a probabilistic assertion,
if only implicitly.

There is a further inconsistency with Sainani's applause for
qualitative judgments based on the confidence interval: the fact that her
concerns about error rates in MBI would apply to such judgments. Consider,
for example, a confidence interval that overlaps trivial and substantial
magnitudes. What is her qualitative judgment? The effect could be trivial or
substantial, of course. Where is the error in that pronouncement? If the true
effect is trivial, we say there is none, but she says there is an
unacceptable ill-defined Type-I error rate. The only way she can keep a
well-defined NHST Type-I error rate is to make a qualitative judgment about
the null alone, ignoring the magnitude thresholds.

Sainani is also inconsistent when she makes the following
statement: "Hopkins and Batterham's logic is that as long as you
acknowledge even a small chance (5-25%) that the effect might be trivial when
it is [truly trivial], then you haven't made a Type I error… But this seems
specious. Is concluding that an effect is 'likely' positive really an
error-free conclusion when the effect is in fact trivial?" Consider the
confidence-interval equivalent of Sainani's statement. A small chance that
the effect could be trivial corresponds to a confidence interval covering
mostly substantial values, with a slight overlap into trivial values, such
that the probability of a trivial true effect is only 6%, for example. Hence
we say the effect could be trivial, so no Type-I error occurs (Figure 1).
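For readers who want to see the arithmetic, such probabilistic statements are easy to reproduce. The following sketch (Python, with illustrative numbers rather than values from either article) computes the flat-prior posterior probabilities of harm, triviality, and benefit from an estimate, its standard error, and the smallest important effect, assuming a normal posterior:

```python
from statistics import NormalDist

def mbi_probabilities(estimate, se, smallest_important):
    # Flat-prior (normal) posterior for the true effect: centred on the
    # sample estimate, with SD equal to the standard error.
    posterior = NormalDist(mu=estimate, sigma=se)
    p_harm = posterior.cdf(-smallest_important)           # true effect below -threshold
    p_trivial = posterior.cdf(smallest_important) - p_harm
    p_benefit = 1 - posterior.cdf(smallest_important)     # true effect above +threshold
    return p_harm, p_trivial, p_benefit

# An illustrative effect whose confidence interval covers mostly
# substantial values, with a slight overlap into trivial values:
p_harm, p_trivial, p_benefit = mbi_probabilities(2.55, 1.0, 1.0)
print(f"harm {p_harm:.1%}, trivial {p_trivial:.1%}, benefit {p_benefit:.1%}")
# harm 0.0%, trivial 6.0%, benefit 93.9%

# The chance of benefit equals 1 minus the one-sided p value for a test
# against the smallest important effect (not the null):
one_sided_p = 1 - NormalDist(mu=1.0, sigma=1.0).cdf(2.55)
assert abs(p_benefit - (1 - one_sided_p)) < 1e-9
```

The final assertion illustrates the equivalence, noted later in this article, between the chance of benefit and a one-sided test against the smallest important effect rather than the null.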
Now consider what happens in NHST. If the 95% confidence
interval overlaps the null only slightly, with p=0.06, then a Type-I error
has not occurred (Figure 1). In other words, it's the same kind of decision
process as for MBI, except that in MBI the null is replaced with the smallest
important effect. The same argument could be mounted for Type-II errors:
Sainani does not specifically call our logic here specious, but she does show
later that our definitions "wildly underestimate" the traditional
Type-II error rates. We will not be held accountable for error rates based on
the null hypothesis.

Sainani offers a novel solution to her perceived problem with
the definition of MBI Type-I error: allow for "degrees of error",
which inevitably inflates the Type-I error rates. But a similar inflation of
error rates would occur with NHST, if degrees of error were assigned to p
values that approach significance. We doubt if her solution would solve the
problems of the p value that are increasingly voiced in the literature; in
any case, we do not see the need for it with MBI. When an effect is possibly
trivial and possibly substantially positive, that is what the researcher has
found: it's on the way to being substantially positive. Furthermore, for
effects with true values that are close to the smallest important effect, the
outcome with even very large sample sizes will usually be an effect that
could be trivial or substantial.

Turning now to the problem of error rates in MBI, we find some
agreement and some disagreement with Sainani about the definitions of error.
We consider that we made a breakthrough with our definitions, because they
focus on trivial and substantial magnitudes rather than the null. As we
stated in our Sports Medicine article (Hopkins and Batterham, 2016), a
valid head-to-head comparison of NHST and MBI requires definitions of Type-I (false-positive)
and Type-II (false-negative) error rates that can be applied to both
approaches. In the traditional definition of a Type-I error, a truly null
effect turns out to be statistically significant. Sample-size estimation in
NHST is all about getting significance for substantial effects, so we argued
that a Type-I error must also occur when any truly trivial effect, not just
a truly null effect, turns out to be statistically significant.

In her opening statement on definitions of error, Sainani states
that "Hopkins and Batterham are confused about what to call cases in
which there is a true non-trivial effect, but an inference is made in the
wrong direction (i.e., inferring that a beneficial effect is harmful or that
a harmful effect is beneficial). In the text, they switch between calling
these Type I and Type II errors." Yes, we may have caused confusion with
the following statement: "…implementation of a harmful effect
represents a more serious error than failure to implement a beneficial
effect. Although these two kinds of error are both false-negative type II
errors, they are analogous to the statistical type I and II errors of NHST,
so they are denoted as clinical type I and type II errors,
respectively." They are denoted as clinical Type-I and Type-II errors in
the spreadsheet for sample-size estimation at the Sportscience site, but they
are correctly identified as Type-II errors in our figure defining the errors,
in the text earlier in the article, and in the figures summarizing error
rates. Sainani goes on to state that "in their calculations, they treat
them both as Type II errors (Table 1a). But they can't both be Type II errors
at the same time." We do not understand this assertion, or her
justification of it involving one-tailed tests (but see the technical notes). She
concludes with "inferring that a beneficial effect is harmful is a Type
II error," with which we agree, "whereas inferring that a harmful
effect is beneficial is a Type I error," with which we disagree. When a
true harmful effect is inferred not to be harmful, it is a Type-II error.
Sainani also notes that a true substantial effect inferred to be substantial
of opposite sign can be called a Type-III error, but we see no need for this
additional complication. That said, we do see the need to control the error
rate when truly harmful effects are inferred to be potentially beneficial.
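The definitions we defend here can be stated compactly. The sketch below (Python; the string labels are ours, purely for illustration) encodes our scheme, in which unclear outcomes and correct inferences are not errors, a truly trivial effect inferred substantial is a Type-I error, and every misclassified truly substantial effect, including a harmful effect inferred beneficial, is a Type-II error:

```python
def classify_error(true_mag, inferred_mag):
    """Classify a decision error under magnitude-based definitions:
    Type I  = a truly trivial effect inferred to be substantial;
    Type II = a truly substantial effect inferred to be trivial, or
              substantial of the opposite sign.
    Magnitudes: "harmful", "trivial", "beneficial", or "unclear"."""
    if inferred_mag == "unclear" or inferred_mag == true_mag:
        return None                 # no decision error is counted
    if true_mag == "trivial":
        return "Type I"             # false positive
    return "Type II"                # false negative

print(classify_error("harmful", "beneficial"))   # Type II (Sainani: Type I)
print(classify_error("trivial", "beneficial"))   # Type I
```

The first call is the contested case: Sainani would label it a Type-I error, whereas under our definitions any misinference about a truly substantial effect is a false negative.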
Our rebuttal of Sainani's assertions about error rates might not
satisfy fundamentalist adherents of NHST. Figure 2 shows our original figure from
the Sports Medicine article and an enlargement of the Type-I rates. We did
not misrepresent these rates in the text, but arguably we presented them in a
manner that favored MBI: "For null and
positive trivial values, the type I rates for clinical MBI exceeded those for
NHST for a sample size of 50+50 (~15–70 % versus ~5–40 %), while for the
largest sample size, the type I rates for clinical MBI (~2–75 %) were
intermediate between those of conservative NHST (~0.5–50 %) and conventional
NHST (5–80 %)." These error rates are consistent with those presented by
Sainani, but the changes of scale for the different true-effect magnitudes in
her figure give an unfavorable impression of the
MBI rates.

We gave an honest account of the higher Type-I error rates with
odds-ratio MBI, which Sainani did not address. Our justification for keeping
this version of MBI in the statistical toolbox along with clinical MBI seems
reasonable. From the Sports Medicine article: "The Type-I rates for
clinical MBI were substantially higher than those for NHST for null and
positive true values with a sample size of 50+50. The probabilistic
inferences for the majority of these errors were only possibly beneficial, so
a clinician would make the decision to use a treatment based on the effect,
knowing that there was not a high probability of benefit. Type-I error rates
for odds-ratio MBI were the largest of all the inferential methods for null
and positive trivial effects, but for the most part these rates were due to
outcomes where the chance of benefit was rated unlikely or very unlikely, but
the risk of harm was so much lower that the odds ratio was >66. Inspection
of the confidence intervals for such effects would leave the clinician with
little expectation of benefit if the effect were implemented, so the high
Type-I error rates should not be regarded as a failing of this
approach."

In her discussion, Sainani asserts: "Whereas standard
hypothesis testing has predictable Type I error rates, MBI has Type I error
rates that vary greatly depending on the sample size and choice of thresholds
for harm/benefit. This is problematic because unless researchers calculate
and report the Type I error for every application, this will always be hidden
to readers." But the "well-defined" Type-I rate for NHST is
only for the null; for trivial true effects it also varies widely with sample
size and choice of magnitude thresholds, and this variation is also hidden
from readers. The fact that the Type-I error rate for MBI peaks at the
optimum sample size (the minimum sample size for practically all outcomes to
be clear) is no cause for concern, because sample-size estimation in MBI is
based on controlling the Type-II rates. She goes on with this particularly
galling assertion: "Furthermore, the dependence on the thresholds for
harm/benefit makes it easy to game the system. A researcher could tweak these
values until they get an inference they like." This is a fatuous charge
to level against MBI. Any system of inference is open to abuse, if
researchers are so minded. A researcher who assesses the importance of
a statistically significant or non-significant outcome can choose the value
of the smallest important effect at that stage to suit the outcome obtained
with the sample. Researchers also game the NHST system by providing a
justification for sample size based on moderate effects. Sainani presumably
has the same concerns about full (subjective) Bayesians gaming not only the
smallest important effect but also the prior to get the most pleasing or
publishable outcome.

Sainani's only remaining substantial concern about our
definition of error rates is not so easily dismissed. MBI provides a new
category of inferential outcome: the unclear effect, for which the sample
provides insufficient precision for a decision.

Sainani concludes her critique with the following solution to
fix what she regards as the MBI Type-I error problem: "…a one-sided
null hypothesis test for benefit–interpreted alongside the corresponding
confidence interval–would achieve most of the objectives of clinical MBI
while properly controlling Type I error." We disagree. First, we do not
wish to conduct "tests" of any kind; we embrace uncertainty and
prefer estimation to "testimation", to
borrow from Ziliak and McCloskey (2008). Secondly, the p value from her proposed
one-sided test against the non-zero null given by the minimum clinically
important difference is precisely equivalent to 1 minus the probability of
benefit from MBI. If the one-sided test is conducted at a conventional 5%
alpha level, the implication is that Sainani requires >95% chance of
benefit to declare a treatment effective, equivalent to our threshold for a
very likely beneficial effect and a far more stringent criterion than
clinical MBI requires.

Before we leave the issue of error rates, it is important to
note that the theoretical basis of NHST is now held to be untrustworthy by
some highly cited establishment statisticians. Consider, for example, the
following comments of two contributors to the American Statistical
Association's policy statement on p values (Wasserstein and Lazar, 2016; see the supplement):
"we should advise today’s students of statistics that they should avoid
statistical significance testing (Ken Rothman)" and "hypothesis
testing as a concept is perhaps the root cause of the problem (Roderick
Little)." If they are right, it follows that the traditional definitions
of Type-I and Type-II errors, both of which are based on the null hypothesis,
are themselves unrealistic and untrustworthy. Our definitions deserve more
recognition as a possible way forward.

In her criticisms of the theory of MBI, Sainani claims that the
three references we cited in our Sports Medicine
article to support the sound theoretical
basis of MBI "do not provide such
evidence." We will now show that her claim is misleading
or incorrect for all three references.

The first reference is Gurrin et al. (2000), from
which she quotes correctly: "Although the use of a uniform prior
probability distribution provides a neat introduction to the Bayesian
process, there are a number of reasons why the uniform prior distribution
does not provide the foundation on which to base a bold new theory of
statistical analysis!" However, she neglects to point out that later in
the same article Gurrin et al. make this statement:
"One of the problems with Bayesian analysis is that it is often a
non-trivial problem to combine the prior information and the current data to
produce the posterior distribution… The congruence between conventional
confidence intervals and Bayesian credible intervals generated using a
uniform prior distribution does, however, provide a simple way to obtain
inferences in Bayesian form which can be implemented using standard software
based on the results and output of a conventional statistical analysis… Our
approach [effectively MBI] is straightforward to implement, offers the
potential to describe the results of conventional analyses in a manner that
is more easily understood…"

The second reference supporting MBI is Shakespeare et al. (2001).
Sainani states that this article "just provides general information on
confidence intervals, and does not address anything directly related to
MBI." On this point she is also wrong. The method presented by
Shakespeare et al. to derive what they refer to as "confidence
levels" uses precisely the same computations as MBI to derive the probability
of benefit beyond a threshold for the minimum clinically important
difference. For example, the authors present the following re-analysis of a
previously published study using their method: "The study found a
survival benefit of 28% favoring immediate nodal
dissection (hazard ratio 0·72, 95% CI 0·49–1·04). There is a… 94% level of
confidence [i.e., chance of benefit] that the survival benefit is clinically
relevant (improvement in survival of 3% or more)."

The third reference that she claims does not provide evidence
supporting MBI is our letter to the editor (Batterham and Hopkins, 2015) in
response to the article by Welsh and Knight (2015). By her
account, this reference "is a short letter in which they point to
empirical evidence from a simulation that I believe is a preliminary version
of the simulations reported in Sports Science [sic]." But the issue here
is the theoretical basis of MBI, which indeed we had argued succinctly in the
letter. Hence this claim also is wrong.

Finally, the overarching negative tone of Sainani's critique
deserves attention. We counted three occasions in the article where she gives
any credit to our achievement with MBI, but each is immediately followed by
an assertion that we were misguided or mistaken. She is the one who is
misguided or mistaken. It is deeply disappointing and discouraging when
someone in her position of influence fails to notice or acknowledge the
value of what we have achieved with MBI.

There is still room for debate that could result in improvements
in MBI. The most obvious debatable feature is the set of rules we have devised for
deciding when effects are clear in clinical and non-clinical settings–in
other words, the rules for acceptable uncertainty in the two settings. These
rules in turn depend on the threshold probabilities that define the terms
possibly, likely, very likely, and most likely.

We have demonstrated that the error rates in MBI are acceptable overall. However, those wishing to use MBI, but who remain concerned about error rates, could present an additional statistic with excellent error control: the second-generation p value (SGPV) (Blume et al., 2018). Briefly, this statistic is based on an interval null hypothesis equivalent to the trivial region in MBI. The SGPV is not a probability; rather, it is the proportion of hypotheses supported by the data and model that are trivial. If the SGPV is 0, the data support only clinically meaningful hypotheses; if the SGPV is 1, the data support only trivial hypotheses. Values between 0 and 1 reflect the degree of support for clinically meaningful or trivial hypotheses, with an SGPV of 0.5 indicating that the data are strictly inconclusive.

In conclusion, MBI represents an honest mechanism for getting
smaller-scale studies into print without misrepresenting uncertainty in the
outcomes. Indeed, the uncertainty is represented by well-defined qualitative
categories of probability. It beggars belief that any journal reviewer or
editor could take exception to publication of an effect as being, for
example, likely beneficial.
## Technical notes
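A note on computing the second-generation p value mentioned in the conclusion: the sketch below (Python) follows our reading of Blume et al. (2018), in which the SGPV is the fraction of the interval of effects supported by the data (in practice the confidence interval) that lies in the trivial region, capped at 0.5 when the interval is more than twice the width of that region:

```python
def sgpv(ci_lo, ci_hi, null_lo, null_hi):
    # Length of the overlap between the interval estimate (the CI) and
    # the interval null hypothesis (the trivial region).
    overlap = max(0.0, min(ci_hi, null_hi) - max(ci_lo, null_lo))
    ci_len = ci_hi - ci_lo
    null_len = null_hi - null_lo
    if ci_len > 2 * null_len:
        # Very wide (imprecise) intervals are capped at 0.5: inconclusive.
        return overlap / (2 * null_len)
    return overlap / ci_len

print(sgpv(0.5, 3.0, -1.0, 1.0))   # 0.2: mostly substantial values supported
print(sgpv(-0.5, 0.5, -1.0, 1.0))  # 1.0: only trivial values supported
```

A fully trivial interval gives an SGPV of 1, no overlap gives 0, and an interval much wider than the trivial region can never exceed 0.5, which is the "strictly inconclusive" value.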
Some
full Bayesians have previously taken exception to the non-informative or
"flat" prior of MBI, by invoking two arguments. First, representing
such a prior mathematically is an intractable problem (Barker and Schofield, 2008). We
delighted in parodying this argument by calling the flat prior an imaginary
Bayesian monster (Hopkins and
Batterham, 2010): the
argument is easily dismissed simply by making the prior minimally informative,
which makes the prior tractable but makes no substantial difference to the
posterior. The second argument is that a uniform flat or minimally
informative prior must become non-uniform, if the dependent variable is
transformed, for example using logarithms or any of the transformations in
generalized linear modeling (e.g., Gurrin et al., 2000). Again,
this argument is easily dismissed: the flat or minimally informative prior is
applied on the scale of the transformed dependent variable, in the model that
produces the least non-uniformity of effect and error compared with any other
transformation (including no transformation) and model. What happens to
the prior with these other transformations and models is irrelevant.

Interestingly,
if we were full Bayesians, we might not be expected to concern ourselves with
error control, as some full Bayesians distinguish "beliefs" from
estimates of "true" values; for them, frequentist notions such as
Type-I errors do not exist (Ventz and Trippa, 2015). A full
Bayesian–with the caveat that more than 30 years ago there were already
46,656 kinds (Good, 1982)–might
say, for example, that "75% of the credible values exceed the minimum
clinically important threshold for benefit", whereas the MBI exponent would
claim that "the probability that the true value of the treatment exceeds
the threshold for benefit is 75%; that is, the treatment is likely
beneficial." In MBI, adopting a least-informative prior and making
decisions based on a posterior distribution equivalent to the likelihood
arguably requires us to give due consideration to error control, which we
have done. The general notion of Bayesian inference with a model chosen to
yield inferences with good frequency properties has been described as
"Calibrated Bayes" (Little, 2006; Little, 2011). Other
attempts at reconciling Bayesian and frequentist paradigms include
"Constrained Optimal Bayesian" designs (Ventz and Trippa, 2015).

Meanwhile, to make probabilistic statements, Sainani
recommends we adopt a full Bayesian analysis, in which there is no apparent
requirement for error control, while lambasting MBI for having higher error
rates in some scenarios. Her position once again is inconsistent.
Draft 2 published 14 May 2018.