>> Paul Wakim: Hello,
I'm Paul Wakim.
I'm chief of Biostatistics
and Clinical Epidemiology
Service
at the NIH Clinical Center
and this is the last part,
part five,
of the module
on hypothesis testing.
And in this segment, we're going to talk about two topics that are kind of important, and I thought I'd mention them very briefly. And that's what we're going to go over: the superiority, non-inferiority, and equivalence tests, and multiple comparisons, also called multiplicity adjustment.
So, let's get started.
Superiority. When we talk about superiority, we really mean going either way. So, even if you're talking about medication and placebo, we call it two-sided -- superiority is a two-sided test.
So, it's possible
that the medication
will be better than placebo,
or we allow the placebo
to be better than medication.
So, in superiority,
the clinical hypothesis
is that
the experimental treatment
is more effective than
the control treatment -- or the other way around, theoretically.
But we're hoping
that the experimental
is going to be more effective.
That's the clinical hypothesis.
So, when do we do a superiority trial? When we want to find that the new medication is better than placebo, or better than treatment as usual.
The statistical hypothesis,
in more technical terms --
the null hypothesis is that
the experimental is equal
to the control
and the alternative hypothesis
is that the experimental
is different from the control.
So, again,
we allow it either way.
And we expect, or hope,
to reject the null hypothesis
in favor of the alternative.
We're hoping
that the new medication
is going to be better than
the old or than the placebo.
So, here's an illustration of what we're talking about. If the difference is zero, that means no difference; and basically, to the right is a good outcome. And I'm going to show you a 95 percent confidence interval around the difference. So, we're talking about the difference here, and zero means no difference.
So, you do your clinical trial, you get the confidence interval, and if the confidence interval is to the right of zero, then you call it superior. Again, remember we talked about the minimal clinically important difference; let's keep that aside for a minute.
But in this context,
if it is to the right of zero,
we say it is superior.
If it is to the left of zero,
we say it's inferior,
or we say
the placebo is superior.
And if it covers zero,
it's inconclusive.
And so, these are
the four possibilities
that we can see
in superiority trials.
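As a quick sketch -- with made-up interval endpoints, and assuming a higher outcome is better -- the superiority decision rule looks like this in Python:

```python
def superiority_conclusion(ci_lower, ci_upper):
    """Classify a 95% confidence interval for (experimental - control)."""
    if ci_lower > 0:
        return "superior"      # whole interval to the right of zero
    if ci_upper < 0:
        return "inferior"      # whole interval to the left of zero
    return "inconclusive"      # interval covers zero

# Hypothetical confidence intervals from three trials:
print(superiority_conclusion(0.4, 1.9))    # superior
print(superiority_conclusion(-2.0, -0.5))  # inferior
print(superiority_conclusion(-1.2, 0.3))   # inconclusive
```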
In non-inferiority trials, the clinical hypothesis is that the experimental treatment is not less effective than the control treatment.
So, a typical example is generic drugs.
We want to show
that the generic drug
is not less effective
than the brand drug.
Or a very similar drug
is not less effective
than an approved drug.
And that's when we would do
a non-inferiority trial.
The statistical hypothesis: the null hypothesis is that the experimental is less than the control minus delta. So, delta is the difference that makes the call -- if the difference is within delta, we say, well, the difference is not really that big; we can assume the treatments are about the same. Delta is called the non-inferiority margin. So, if the difference is within that delta, we say it's pretty much the same, or it's non-inferior.
But if the difference
is more than that,
then we say it is inferior.
So, the alternative is that the experimental is greater than the control minus delta.
And again, we expect
and hope to reject H0
in favor of the alternative.
And here, we have a line at zero, and we also have a line at minus delta, the non-inferiority margin.
And so, what we're saying is,
if the confidence interval
is above zero,
we say it's superior, fine.
And if it is below minus delta, we say it's inferior, because it's worse than the control by an important difference, which is the little delta.
But if the confidence interval is still to the right of minus delta, we say it's non-inferior. And if it covers everything, it's inconclusive.
But now, think about it this way: the confidence interval is completely to the left of zero but completely to the right of minus delta. The conclusion is that it's non-inferior, because we started the trial saying, you know, if the difference is within the small delta, it's still just as effective. So, that's called non-inferior. There's a question mark, though, because the whole confidence interval is to the left of zero.
And if it covers
the minus delta,
we say it's inconclusive, although the whole confidence interval is to the left of zero.
So, these are possibilities
in non-inferiority trials.
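The same kind of sketch works for the non-inferiority decision rule; delta here is the non-inferiority margin, and the interval endpoints are again made up:

```python
def noninferiority_conclusion(ci_lower, ci_upper, delta):
    """Classify a 95% CI for (experimental - control) against margin -delta."""
    if ci_lower > 0:
        return "superior"       # whole interval above zero
    if ci_upper < -delta:
        return "inferior"       # worse than control by more than the margin
    if ci_lower > -delta:
        return "non-inferior"   # interval entirely to the right of -delta
    return "inconclusive"       # interval covers -delta

# Entirely left of zero, but entirely right of -2: still non-inferior.
print(noninferiority_conclusion(-1.5, -0.2, 2.0))  # non-inferior
```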
Equivalence trials say the experimental treatment is just as effective as the control treatment. So, we have two therapies, or two medications, and we want to do comparative effectiveness research, and we want to show that they're pretty much the same -- as effective as each other. Are they as effective? That's really what we're asking.
In statistical terms,
the null hypothesis says
that the experimental
is either clearly worse
or clearly better.
That's the null.
But the alternative is that the experimental is within a range of minus delta to plus delta of the control, and they are about as effective.
And it doesn't matter
which one --
if we have experimental
and control
or just two medications.
And we hope to reject H0
in favor of the alternative.
And so here we have the zero
of no difference,
but we have a minus delta and a plus delta. And if the confidence interval is within minus delta and plus delta,
we say they're just about
as effective.
The two treatments
are just about as effective.
So, this is equivalent, and this is equivalent. There's a question mark because, even though the interval is clearly to the right or to the left of zero, it's still considered equivalent.
Completely to the right is superior. Completely to the left is inferior. And if it covers several thresholds, it's inconclusive. It's also inconclusive if it covers the minus delta threshold or the plus delta threshold, even though it's completely to the right or to the left of zero.
And so, these are
the possible conclusions.
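And the equivalence decision rule, as a sketch with a made-up margin and made-up intervals:

```python
def equivalence_conclusion(ci_lower, ci_upper, delta):
    """Classify a 95% CI for the difference against margins -delta and +delta."""
    if -delta < ci_lower and ci_upper < delta:
        return "equivalent"     # whole interval inside (-delta, +delta)
    if ci_lower > delta:
        return "superior"       # entirely above +delta
    if ci_upper < -delta:
        return "inferior"       # entirely below -delta
    return "inconclusive"       # interval crosses one or both margins

# Entirely left of zero but inside the margins: still equivalent.
print(equivalence_conclusion(-1.8, -0.3, 2.0))  # equivalent
```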
Okay, so, that's for
superiority, non-inferiority,
and equivalence tests.
I want to now switch to a completely different topic, and that's multiplicity adjustment.
And the first time I heard this analogy with the dartboard was from Michael Proschan, who is also a biostatistician at NIH. He made this analogy between the dartboard game and multiplicity adjustment.
So, here's the analogy.
Think of two players
playing this game.
And you have player one
that starts to throw darts
at this board.
And this is what happens. So, player one comes and throws darts, and these are the places he hits. Okay? And then, after 20 shots, he hits the bullseye. That's player one.
Then comes player two, and after one shot, he hits the bullseye.
Would you say that player one
is just as good
at this game as player two?
They both eventually
hit the bullseye.
But would you say they're as
good at this game as each other?
I would think no. The first one, it took him or her 20 shots to hit the bullseye. The second one hit the bullseye after just one shot.
It's the same idea
with multiplicity adjustment.
Remember I told you there are two ways to get small p-values, guaranteed? One way was just to increase the sample size, and eventually you'll get a small p-value. Well, the second way is to just keep on trying different statistical tests -- with different assessments, with different subgroups -- keep on trying test after test after test, and eventually you're going to find a small p-value.
So, you're just going to throw darts, and eventually, by pure chance, you're going to hit the bullseye.
But it doesn't mean
that you have really
an important finding.
So, in the words of Motulsky -- an author who writes about statistical concepts in very plain English -- he basically says in these two books:
"If you calculate many p-values,
some are likely to be small
just by random chance.
Therefore, it is impossible
to interpret small p-values
without knowing how
many comparisons were made."
If you make 13 independent
comparisons,
with all null hypotheses true --
in other words,
if you make 13 independent
comparisons with no difference,
there's about
a 50 percent chance
that one or more p-values
will be less than .05 --
that you will find a statistically significant result. It goes back to -- remember those diagrams with the 1,000 clinical trials? Eventually, some of them are going to be statistically significant. Same idea.
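That 50 percent figure is easy to check: with 13 independent comparisons and all null hypotheses true, the chance of at least one p-value below .05 is 1 minus 0.95 to the 13th power.

```python
alpha = 0.05
m = 13  # independent comparisons, all null hypotheses true
chance_at_least_one = 1 - (1 - alpha) ** m
print(round(chance_at_least_one, 3))  # 0.487 -- about a 50 percent chance
```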
So, with multiple comparisons
it's easy to be misled
by small p-values.
So, what does it mean
to do multiplicity adjustment?
To do a multiplicity adjustment
means you have to adjust
the p-value
to account for the multiple
tests that you have done.
So, basically,
you have to raise the p-value
that you got
or lower the threshold.
But since we're getting away from thresholds, you have to raise the p-value.
So, situations that need multiplicity adjustment are when you have multiple endpoints, multiple time points, subgroups, geographical areas, predictions, ways to select variables in multiple regression, methods of pre-processing data, or multiple statistical analyses. All of these are examples where you need multiplicity adjustment.
When you do, basically,
several tests --
whatever those tests are --
you need to adjust
for multiplicity.
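One simple, conservative way to do such an adjustment -- the Bonferroni correction, which this segment doesn't name but which is a standard method -- is to multiply each p-value by the number of tests, capping at 1, which is equivalent to dividing the alpha threshold by the number of tests. A minimal sketch with made-up p-values:

```python
def bonferroni_adjust(p_values):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.03, 0.20, 0.004]  # hypothetical p-values from 3 tests
print([round(p, 3) for p in bonferroni_adjust(raw)])  # [0.09, 0.6, 0.012]
```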
So, here's an example. Say you have two endpoints, drug use and retention. They're both important in illicit drug treatment: you want to lower drug use, and you want to keep retention high.
Each endpoint is tested
at the 5 percent alpha level.
The experimental treatment is considered beneficial if either or both endpoints are significant.
So, without multiplicity adjustment -- if you don't do this adjustment -- the chance of the treatment being found beneficial when it is not, a type I error, can be as high as 10 percent, not 5 percent, because only one of those endpoints has to be significant.
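That "as high as 10 percent" can be checked directly: if the two endpoints are independent (a simplifying assumption for this calculation), the chance that at least one is significant by chance alone is 1 minus 0.95 squared.

```python
alpha = 0.05
# Either or both of two independent endpoints significant by chance:
chance_false_benefit = 1 - (1 - alpha) ** 2
print(round(chance_false_benefit, 4))  # 0.0975 -- close to 10 percent, not 5
```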
So, multiplicity adjustment
is a way to control
for false positive conclusions.
Claiming you've got a positive result when you don't. It's a way to control the familywise, or study-wise, rate of false positive conclusions.
So, the key words --
I'm sorry --
the key words
are familywise or study-wise.
So, when do you do multiplicity
adjustment?
When is it necessary?
This is my personal opinion now. You do it formally -- using one of the many, many methods for adjusting p-values, which have their pluses and minuses -- whenever there is more than one primary endpoint, more than two treatment conditions, more than one dose versus placebo, or more than one time point.
So, I'm going to give you
an example of when it is needed
because it's
the primary analysis.
It's the main objective -- the purpose of the trial. If you look at the abstract of this paper, it says its primary objective is to test
the effectiveness of two
interventions
compared to usual care,
in decreasing
attitudinal barriers
to cancer pain management,
decreasing pain intensity,
and improving functional status
and quality of life.
So, right there, the primary objective of the clinical trial is to compare two interventions versus usual care. Okay?
And then you've got
four primary endpoints:
decreasing attitudinal barriers
to cancer pain management,
decreasing pain intensity,
improving functional status
and quality of life.
So, in this case,
because it's a primary analysis,
you need to do it formally.
Informally, whenever there is more than one secondary analysis, including subgroup analyses. So, when you're doing secondary analyses, subgroup analyses -- again, it's my personal opinion -- the way I like to do it when it is informal -- and what I mean by informal is, you basically tell the reader in the paper: look -- remember that table with so many subgroups -- you basically highlight the fact that you've done a lot of subgroup analyses. Here are the confidence intervals, here are the p-values, but you kind of warn the reader that these p-values really have to be mentally adjusted.
In other words, if you're really going to use a threshold -- which you shouldn't -- lower the threshold. Make it harder. If you see a p-value of .03 when you've done 20 tests, don't take it too seriously. I mean, that's the bottom line.
So, when is multiplicity
adjustment not necessary?
When all the primary endpoints have to be statistically significant in order to claim treatment benefit -- e.g., to get FDA approval.
Again, this is a recommendation from the European agency in 2002, which is way, way before the ASA 2019 recommendation. It talks about statistical significance. And it goes back also to the point that we need to make changes to the thinking -- institutional changes.
And in the previous example, we'd say the experimental treatment is considered beneficial only if both illicit drug use reduction and retention are found to be statistically significant. If you need both to claim efficacy or effectiveness, then you don't need to do a multiplicity adjustment.
So, here are the references.
In summary, in a superiority
trial the objective is to show
that one treatment
is more effective than another.
In a non-inferiority trial,
the objective is to show
that one treatment
is not less effective
than another.
In an equivalence trial
the objective is to show
that two treatments are not
too different in effectiveness.
And, multiplicity adjustment
is a way to control
for overall false-positive
conclusions.
And my questions to you are: when is multiplicity adjustment necessary?
Why is multiplicity
adjustment necessary?
And when is multiplicity
adjustment not necessary?
Thank you for watching.