In the past two decades, striking examples of allegedly inferior
likelihood ratio tests (LRT) have appeared in the statistical literature. These
examples, which arise in multiparameter hypothesis testing problems, have
several common features. In each case the null hypothesis is composite, the
size α LRT is not similar and hence biased, and competing size α tests can be
constructed that are less biased, or even unbiased, and that dominate the LRT
in the sense of being everywhere more powerful. It is therefore asserted that
in these examples and, by implication, many other testing problems, the LR
criterion produces “inferior,” “deficient,” “undesirable,” or “flawed”
statistical procedures.
This message, which appears to be proliferating, is wrong. In each
example it is the allegedly superior test that is flawed, not the LRT. At
worst, the “superior” tests provide unwarranted
and inappropriate inferences and have been deemed scientifically unacceptable
by applied statisticians. This reinforces the well-documented but oft-neglected
fact that the Neyman-Pearson theory desideratum of a more or most
powerful size α test may be scientifically inappropriate; the same is true for
the criteria of unbiasedness and α-admissibility. Although the LR criterion is
not infallible, we believe that it remains a generally reasonable first option
for non-Bayesian parametric hypothesis-testing problems.