MBD Includes Superiority and Equivalence Tests: an Update on Magnitude-Based Decisions as Hypothesis Tests
Will G Hopkins
Sportscience 24, sportsci.org/2020/MBDtestComments/wgh.htm, 2020

Summary: My article on MBD as hypothesis tests has been updated to make it clear that MBD automatically includes superiority and equivalence tests. In this commentary I give the background to the update and explain these tests.

A journal editor recently included the
following message accompanying reviewers' comments for a manuscript on which
I am a co-author. "Considering the recent criticisms that
magnitude-based decisions (MBD) has received from the statistics community
and the increasing number of sports science and medicine journals that are no
longer accepting articles that include MBD, are the authors willing to
revisit their statistical analyses and perhaps consider using minimal-effects
testing (MET) or equivalence testing (ET)? I am not a statistician and I
certainly do appreciate the intention of MBD, particularly in an elite-sport
setting, but it has been recently suggested that MET or ET should be used
instead of non-clinical MBD. (Please see this article at the SportRxiv site for further information: https://osf.io/preprints/sportrxiv/pn9s3/.)"
The short answer to this comment is that MBD
automatically includes these two tests, and I have now updated my article
(Hopkins, 2020a) to make
that point clear. In this commentary I will provide a plain-language
explanation. I have also explained the tests in my article on
sample-size estimation (Hopkins, 2020b). The
article referred to by the editor is a preprint by Janet Aisbett and others
showing the equivalence of MBD with hypothesis testing, and it has yet to be
accepted by a peer-reviewed journal.

A minimum-effects
test is Daniel Lakens' preferred name for a superiority test (Lakens et al., 2018). The idea is to test the hypothesis that the
effect is non-superior (e.g., not substantially positive or beneficial, i.e.,
anything less than the smallest important positive or beneficial effect). If
you can reject the hypothesis, then you have shown that the effect is substantial, in other words more
than some minimum value (the smallest important), or superior for an effect
representing a comparison of treatments, hence the names for the test. An
effect reported in MBD as very likely substantial is equivalent to performing this test and rejecting the underlying hypothesis of non-superiority at the 0.05 level. Why? Because very likely substantial implies that more than 95% of the area of the sampling distribution for the effect lies above (for a positive effect) the smallest important value; hence the area of the tail of the distribution extending below the smallest important is less than 5%, and the area of this tail is the p value for the test of the hypothesis that the effect is not substantially positive. Equivalently, the 90% compatibility interval falls entirely above the smallest important value, and therefore, to put it in the strict frequentist terms favored by Sander Greenland (Gelman & Greenland, 2019), less than 5% of the values of the effect compatible with the data and the model are below the smallest important.
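To make the correspondence concrete, here is a minimal sketch in Python, assuming a normally distributed effect statistic and purely illustrative numbers (an estimate of 2.0 units, a standard error of 0.8, and a smallest important value of 0.2). The one-sided p value, the chance of a substantial effect, and the 90% compatibility interval all lead to the same decision.

from scipy.stats import norm

effect = 2.0      # observed effect estimate (illustrative)
se = 0.8          # standard error of the estimate (illustrative)
smallest = 0.2    # smallest important positive value (illustrative)

# p value for the hypothesis of non-superiority (true effect <= smallest
# important): the tail of the sampling distribution below that value.
p_nonsuperior = norm.cdf(smallest, loc=effect, scale=se)

# Chance the effect is substantially positive: the complementary area.
chance_substantial = 1 - p_nonsuperior

# 90% compatibility interval.
lower, upper = norm.interval(0.90, loc=effect, scale=se)

print(f"p (non-superiority) = {p_nonsuperior:.3f}")        # 0.012, < 0.05
print(f"chance substantial  = {chance_substantial:.3f}")   # 0.988, very likely
print(f"90% CI              = ({lower:.2f}, {upper:.2f})") # (0.68, 3.32), above 0.2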
An equivalence test is perhaps better referred to as a triviality test. The idea now is to test the hypothesis that the effect is
non-trivial, which means any value outside the region of trivial values
defined by the smallest important positive and negative values. If you can
reject this hypothesis, then you have shown that the effect is trivial, or equivalent for an
effect representing a comparison of treatments, and hence the names for this
test. In MBD, a very likely trivial effect has a 90% compatibility interval
falling entirely in trivial values, so the hypothesis that the effect is
substantially positive and the hypothesis that the effect is substantially
negative can both be rejected at the 0.05 level, so you can conclude that the
effect is indeed trivial. A subtle point here is that a 90% interval that falls just inside trivial values is represented in MBD as only likely trivial, because the area of the sampling distribution spanning trivial values is only just over 90%. In other words, some likely trivial effects in MBD may be considered decisively trivial from the perspective of hypothesis tests. I prefer to think of effects as decisively trivial only when they are at least very likely trivial.
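Here is a minimal sketch of the equivalence test as two one-sided tests, again assuming a normally distributed effect statistic, with illustrative numbers chosen so that the 90% compatibility interval only just fits inside trivial values.

from scipy.stats import norm

effect = 0.0      # observed effect estimate (illustrative)
se = 0.118        # standard error chosen so the 90% CI just fits (illustrative)
smallest = 0.2    # trivial values span -0.2 to +0.2 (illustrative)

# p values for the two substantial hypotheses.
p_positive = 1 - norm.cdf(smallest, loc=effect, scale=se)  # substantially positive
p_negative = norm.cdf(-smallest, loc=effect, scale=se)     # substantially negative

# Chance the effect is trivial: area of the sampling distribution
# between the smallest important negative and positive values.
chance_trivial = 1 - p_positive - p_negative

lower, upper = norm.interval(0.90, loc=effect, scale=se)

print(f"p (substantially +) = {p_positive:.3f}")           # 0.045, < 0.05
print(f"p (substantially -) = {p_negative:.3f}")           # 0.045, < 0.05
print(f"chance trivial      = {chance_trivial:.3f}")       # 0.910, only likely
print(f"90% CI              = ({lower:.2f}, {upper:.2f})") # (-0.19, 0.19)

# Both substantial hypotheses are rejected at the 0.05 level, so the
# tests declare the effect decisively trivial, yet the chance the effect
# is trivial is only just over 90%: likely, not very likely, trivial.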
Note that when you state that an effect is very likely substantial or very likely
trivial, you are using the original and valid reference-Bayesian version of
MBD, in which your prior information or belief about the true value of the
effect is so weakly informative that it has negligible effect on the posterior
probability. The equivalent hypothesis tests in MBD are meant to satisfy
statisticians of the frequentist persuasion that MBD also has a valid theoretical
basis in hypothesis testing. Whether you need to state the outcomes in terms
of hypothesis tests will depend on the requirements of the journal, and the
appendix in my article has more guidance on this issue. I favor the Bayesian
version, because it is up-front with the probabilities, whereas the
frequentist version ends up painting the outcomes in black and white: here,
the effect is or isn't decisively substantial or trivial. When you state very likely substantial or very likely trivial, you are reminding yourself and the reader that nothing is certain.
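Here is a minimal sketch of that Bayesian reading, again with a normally distributed effect statistic and illustrative numbers: with an effectively flat prior, the posterior distribution for the true effect is simply the sampling distribution centered on the observed estimate, so the chances of substantial and trivial values are areas of that distribution, reported with the qualitative terms of the MBD scale.

from scipy.stats import norm

effect, se, smallest = 0.5, 0.25, 0.2   # illustrative estimate, SE, smallest important

p_pos = 1 - norm.cdf(smallest, loc=effect, scale=se)  # chance substantially positive
p_neg = norm.cdf(-smallest, loc=effect, scale=se)     # chance substantially negative
p_triv = 1 - p_pos - p_neg                            # chance trivial

def mbd_term(p):
    # Qualitative probabilistic terms of the MBD scale.
    for cut, term in [(0.995, "most likely"), (0.95, "very likely"),
                      (0.75, "likely"), (0.25, "possibly"),
                      (0.05, "unlikely"), (0.005, "very unlikely")]:
        if p > cut:
            return term
    return "most unlikely"

print(f"positive: {p_pos:.3f}, {mbd_term(p_pos)}")    # 0.885, likely
print(f"trivial:  {p_triv:.3f}, {mbd_term(p_triv)}")  # 0.113, unlikely
print(f"negative: {p_neg:.3f}, {mbd_term(p_neg)}")    # 0.003, most unlikely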
The only advantage of stating decisively is that it allows computation of error rates: if we allow that very likely means a black-and-white decisively, then the rates of getting it wrong can be calculated. As Alan Batterham and I
showed, the error rates are acceptable (Hopkins & Batterham, 2016). This
claim applies also to clinical MBD, where possibly beneficial (and most
unlikely harmful) is considered evidence for potential implementation: if you
recommend implementation, you will incur an error if the true value of the
effect turns out to be trivial, but you should always remember and report
that the effect is only possibly beneficial.
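As a minimal sketch of how such an error rate can be calculated (an illustration of the principle, not the method of Hopkins & Batterham, 2016), here is a simulation with illustrative numbers, assuming a normally distributed effect statistic with known standard error; the true effect sits exactly on the smallest important value, the worst case for wrongly declaring a non-substantial effect decisively (very likely) substantial.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
true_effect, smallest, se = 0.2, 0.2, 0.3   # illustrative values
estimates = rng.normal(true_effect, se, size=100_000)  # simulated study outcomes

# Chance each study's effect is substantially positive (flat prior).
chances = 1 - norm.cdf(smallest, loc=estimates, scale=se)

# Rate of decisive (very likely substantial) calls: with the true effect
# on the boundary, this approaches 0.05, the nominal rate of the
# underlying non-superiority test.
print(f"error rate: {np.mean(chances > 0.95):.3f}")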
Gelman A, Greenland S. (2019). Are confidence intervals better termed “uncertainty intervals”? BMJ 366, l5381.
Hopkins WG. (2020a). Magnitude-based decisions as hypothesis tests. Sportscience 24, 1-16.
Hopkins WG. (2020b). Sample-size estimation for various inferential methods. Sportscience 24, 17-27.
Hopkins WG, Batterham AM. (2016). Error rates, decisive outcomes and publication bias with several inferential methods. Sports Medicine 46, 1563-1573.
Lakens D, Scheel AM, Isager PM. (2018). Equivalence testing for psychological research: a tutorial. Advances in Methods and Practices in Psychological Science 1, 259-269.
First published 29 August 2020. ©2020