MBD Includes Superiority and Equivalence Tests: an Update on Magnitude-Based Decisions as Hypothesis Tests
Will G Hopkins
Sportscience 24, sportsci.org/2020/MBDtestComments/wgh.htm, 2020

Summary: My article on MBD as hypothesis tests has been updated to make it clear that MBD automatically includes superiority and equivalence tests. In this commentary I give the background to the update and explain these tests.

A journal editor recently included the
following message accompanying reviewers' comments for a manuscript on which
I am a co-author. "Considering the recent criticisms that
magnitude-based decisions (MBD) has received from the statistics community
and the increasing number of sports science and medicine journals that are no
longer accepting articles that include MBD, are the authors willing to
revisit their statistical analyses and perhaps consider using minimal-effects
testing (MET) or equivalence testing (ET)? I am not a statistician and I
certainly do appreciate the intention of MBD, particularly in an elite-sport
setting, but it has been recently suggested that MET or ET should be used
instead of non-clinical MBD. (Please see this article at the SportRxiv site for further information: https://osf.io/preprints/sportrxiv/pn9s3/.)"
The short answer to this comment is that MBD
automatically includes these two tests, and I have now updated my article
(Hopkins, 2020a) to make
that point clear. In this commentary I will provide a plain-language
explanation. I have also explained the tests in my article on
sample-size estimation (Hopkins, 2020b). The
article referred to by the editor is a preprint by Janet Aisbett and others
showing the equivalence of MBD with hypothesis testing, and it has yet to be
accepted by a peer-reviewed journal.

A minimum-effects
test is Daniel Lakens' preferred name for a superiority test (Lakens et al., 2018). The idea is to test the hypothesis that the
effect is non-superior (e.g., not substantially positive or beneficial, i.e.,
anything less than the smallest important positive or beneficial effect). If
you can reject the hypothesis, then you have shown that the effect is substantial, in other words more
than some minimum value (the smallest important), or superior for an effect
representing a comparison of treatments, hence the names for the test. An
effect reported in MBD as very likely substantial is equivalent to performing this test and rejecting the underlying hypothesis of non-superiority at the 0.05 level. Why? Because very likely substantial implies that more than 95% of the area of the sampling distribution for the effect lies above (for a positive effect) the smallest important value; hence the area of the tail of the distribution extending below the smallest important is less than 5%, and the area of this tail is the p value for the test of the hypothesis that the effect is not substantially positive. Equivalently, the 90% compatibility interval falls entirely above the smallest important value, and therefore, to put it in the strict frequentist terms favored by Sander Greenland (Gelman & Greenland, 2019), less than 5% of the values of the effect compatible with the data and the model are below the smallest important.
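To make the correspondence concrete, here is a minimal sketch in Python, assuming a normally distributed effect statistic and purely illustrative numbers (an estimate of 2.0 units, a standard error of 0.8, and a smallest important value of 0.2). The one-sided p value, the chance of a substantial effect, and the 90% compatibility interval all lead to the same decision.

from scipy.stats import norm

effect = 2.0      # observed effect estimate (illustrative)
se = 0.8          # standard error of the estimate (illustrative)
smallest = 0.2    # smallest important positive value (illustrative)

# p value for the hypothesis of non-superiority (true effect <= smallest
# important): the tail of the sampling distribution below that value.
p_nonsuperior = norm.cdf(smallest, loc=effect, scale=se)

# Chance the effect is substantially positive: the complementary area.
chance_substantial = 1 - p_nonsuperior

# 90% compatibility interval.
lower, upper = norm.interval(0.90, loc=effect, scale=se)

print(f"p (non-superiority) = {p_nonsuperior:.3f}")        # 0.012, < 0.05
print(f"chance substantial  = {chance_substantial:.3f}")   # 0.988, very likely
print(f"90% CI              = ({lower:.2f}, {upper:.2f})") # (0.68, 3.32), above 0.2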
An equivalence test is perhaps better referred to as a triviality test. The idea now is to test the hypothesis that the effect is
non-trivial, which means any value outside the region of trivial values
defined by the smallest important positive and negative values. If you can
reject this hypothesis, then you have shown that the effect is trivial, or equivalent for an
effect representing a comparison of treatments, and hence the names for this
test. In MBD, a very likely trivial effect has a 90% compatibility interval
falling entirely in trivial values, so the hypothesis that the effect is
substantially positive and the hypothesis that the effect is substantially
negative can both be rejected at the 0.05 level, so you can conclude that the
effect is indeed trivial. A subtle point here is that a 90% interval that falls just inside trivial values is represented in MBD as only likely trivial, because the area of the sampling distribution spanning trivial values is only just over 90%. In other words, some likely trivial effects in MBD may be considered decisively trivial from the perspective of hypothesis tests. I prefer to think of effects as decisively trivial only when they are at least very likely trivial.
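Here is a minimal sketch of the equivalence test as two one-sided tests, again assuming a normally distributed effect statistic, with illustrative numbers chosen so that the 90% compatibility interval only just fits inside trivial values.

from scipy.stats import norm

effect = 0.0      # observed effect estimate (illustrative)
se = 0.118        # standard error chosen so the 90% CI just fits (illustrative)
smallest = 0.2    # trivial values span -0.2 to +0.2 (illustrative)

# p values for the two substantial hypotheses.
p_positive = 1 - norm.cdf(smallest, loc=effect, scale=se)  # substantially positive
p_negative = norm.cdf(-smallest, loc=effect, scale=se)     # substantially negative

# Chance the effect is trivial: area of the sampling distribution
# between the smallest important negative and positive values.
chance_trivial = 1 - p_positive - p_negative

lower, upper = norm.interval(0.90, loc=effect, scale=se)

print(f"p (substantially +) = {p_positive:.3f}")           # 0.045, < 0.05
print(f"p (substantially -) = {p_negative:.3f}")           # 0.045, < 0.05
print(f"chance trivial      = {chance_trivial:.3f}")       # 0.910, only likely
print(f"90% CI              = ({lower:.2f}, {upper:.2f})") # (-0.19, 0.19)

# Both substantial hypotheses are rejected at the 0.05 level, so the
# tests declare the effect decisively trivial, yet the chance the effect
# is trivial is only just over 90%: likely, not very likely, trivial.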
Note that when you state that an effect is very likely substantial or very likely
trivial, you are using the original and valid reference-Bayesian version of
MBD, in which your prior information or belief about the true value of the
effect is so weakly informative that it has negligible effect on the posterior
probability. The equivalent hypothesis tests in MBD are meant to satisfy
statisticians of the frequentist persuasion that MBD also has a valid theoretical
basis in hypothesis testing. Whether you need to state the outcomes in terms
of hypothesis tests will depend on the requirements of the journal, and the
appendix in my article has more guidance on this issue. I favor the Bayesian
version, because it is up-front with the probabilities, whereas the
frequentist version ends up painting the outcomes in black and white: here,
the effect is or isn't decisively substantial or trivial. When you state very likely substantial or very likely trivial, you are reminding yourself and the reader that nothing is certain.
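Here is a minimal sketch of that Bayesian reading, again with a normally distributed effect statistic and illustrative numbers: with an effectively flat prior, the posterior distribution for the true effect is simply the sampling distribution centered on the observed estimate, so the chances of substantial and trivial values are areas of that distribution, reported with the qualitative terms of the MBD scale.

from scipy.stats import norm

effect, se, smallest = 0.5, 0.25, 0.2   # illustrative estimate, SE, smallest important

p_pos = 1 - norm.cdf(smallest, loc=effect, scale=se)  # chance substantially positive
p_neg = norm.cdf(-smallest, loc=effect, scale=se)     # chance substantially negative
p_triv = 1 - p_pos - p_neg                            # chance trivial

def mbd_term(p):
    # Qualitative probabilistic terms of the MBD scale.
    for cut, term in [(0.995, "most likely"), (0.95, "very likely"),
                      (0.75, "likely"), (0.25, "possibly"),
                      (0.05, "unlikely"), (0.005, "very unlikely")]:
        if p > cut:
            return term
    return "most unlikely"

print(f"positive: {p_pos:.3f}, {mbd_term(p_pos)}")    # 0.885, likely
print(f"trivial:  {p_triv:.3f}, {mbd_term(p_triv)}")  # 0.113, unlikely
print(f"negative: {p_neg:.3f}, {mbd_term(p_neg)}")    # 0.003, most unlikely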
The only advantage of stating decisively is that it allows computation of error rates: if we allow that very likely means a black-and-white decisively, then the rates of getting it wrong can be calculated. As Alan Batterham and I
showed, the error rates are acceptable (Hopkins & Batterham, 2016). This
claim applies also to clinical MBD, where possibly beneficial (and most
unlikely harmful) is considered evidence for potential implementation: if you
recommend implementation, you will incur an error if the true value of the
effect turns out to be trivial, but you should always remember and report
that the effect is only possibly beneficial.
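As a minimal sketch of how such an error rate can be calculated (an illustration of the principle, not the method of Hopkins & Batterham, 2016), here is a simulation with illustrative numbers, assuming a normally distributed effect statistic with known standard error; the true effect sits exactly on the smallest important value, the worst case for wrongly declaring a non-substantial effect decisively (very likely) substantial.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
true_effect, smallest, se = 0.2, 0.2, 0.3   # illustrative values
estimates = rng.normal(true_effect, se, size=100_000)  # simulated study outcomes

# Chance each study's effect is substantially positive (flat prior).
chances = 1 - norm.cdf(smallest, loc=estimates, scale=se)

# Rate of decisive (very likely substantial) calls: with the true effect
# on the boundary, this approaches 0.05, the nominal rate of the
# underlying non-superiority test.
print(f"error rate: {np.mean(chances > 0.95):.3f}")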
Gelman A, Greenland S. (2019). Are confidence intervals better termed “uncertainty intervals”? BMJ 366, l5381.
Hopkins WG. (2020a). Magnitude-based decisions as hypothesis tests. Sportscience 24, 1-16.
Hopkins WG. (2020b). Sample-size estimation for various inferential methods. Sportscience 24, 17-27.
Hopkins WG, Batterham AM. (2016). Error rates, decisive outcomes and publication bias with several inferential methods. Sports Medicine 46, 1563-1573.
Lakens D, Scheel AM, Isager PM. (2018). Equivalence testing for psychological research: a tutorial. Advances in Methods and Practices in Psychological Science 1, 259-269.
First published 29 August 2020. ©2020