As I occasionally do, I’ve selected a particularly pertinent and/or complicated reader comment to respond to as a self-contained post. “Mayod” has commented on my post, “Orphans, this could be your chance,” which touched on the statistical issues relevant to including the nine non-Richardson PSP subtypes in clinical trials. The comment essentially asks for clarification of terms as I and other medical writers use them, especially the term “statistical power.”
Hi “Mayod”:
First, for the benefit of my other readers, I’ll point out that you are a very prominent academic expert on the philosophy and theory of statistics. That’s pretty scary, so I’ll avoid trying to match that level of sophistication and simply reply to your question at a level comprehensible to most intelligent people who have never formally studied statistics. (Full disclosure: I’ve never had a statistics course myself. I’ve just picked stuff up along the way.)
In the context of a drug trial, the “power” is the probability of correctly rejecting the null hypothesis. The null hypothesis, which we all hope will be disproved by the trial, posits that the drug works no better than placebo. At a more technical level, the study’s power is 1-β, where β is the false negative rate, or the likelihood of failing to identify a true benefit of the active drug relative to placebo. β is also called the “Type II error” rate. Typically, β is set at 20%, sometimes 10%. That would make the power 80% or 90%.
Another number needed to calculate a trial’s power is the α, which is the greatest tolerable likelihood of falsely rejecting the null hypothesis, which is concluding that the drug works better than placebo when it really doesn’t. It’s also called the false positive rate, or Type I error rate, and is typically set at 5%, sometimes 1%. As a drug trial designer you have to balance the α and the β: for a given number of patients, demanding fewer false positives raises the risk of false negatives, and demanding fewer false negatives raises the risk of false positives.
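To see how α, β, and the sample size interact, here is a minimal sketch of a power calculation for comparing two group means, using the standard normal approximation. The 3-point difference and standard deviation of 7 are illustrative assumptions I’ve made up for the example, not figures from any actual trial.

```python
from statistics import NormalDist

def power_two_sample(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sample comparison of means
    (normal approximation, two-sided test).

    delta: true mean difference between the groups (assumed)
    sigma: common standard deviation of the outcome (assumed)
    n_per_group: number of patients in each arm
    alpha: Type I error rate (false positive rate)
    """
    # Critical value for the chosen alpha
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two group means
    se = sigma * (2 / n_per_group) ** 0.5
    # Probability the observed difference clears the critical value
    # when the drug effect (delta) is real; this is 1 - beta
    return NormalDist().cdf(delta / se - z_alpha)

# Illustrative: a 3-point true difference, SD of 7, 100 patients per arm
print(power_two_sample(3, 7, 100))
```

Note that tightening α (say, from 0.05 to 0.01) lowers the returned power, which is exactly the α-versus-β trade-off described above: fewer false positives at the cost of more false negatives, unless you add patients.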
The other number required to determine the trial’s power is the “effect size.” In a PSP trial, that’s the detectable reduction in the average rate of worsening over the duration of the trial for the active drug group relative to that in the placebo group. For PSP, the effect size is typically set somewhere between 20% and 40%, though we’d all like it to turn out to be much higher than that. As an example, let’s say the placebo group and active group each start the trial with an average PSP Rating Scale score of 30. At the end of the trial, the placebo group has progressed to an average of 40, while the active drug group has progressed to an average of 37. That’s a difference of 10 points vs 7 points, or a 30% slowing of progression (a 30% effect size).
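The arithmetic in that example can be written out in a few lines of code (using the same made-up PSP Rating Scale numbers, which are purely illustrative):

```python
# Worked arithmetic for the effect-size example (illustrative numbers only)
baseline = 30       # both groups' average PSP Rating Scale score at entry
placebo_end = 40    # placebo group's average score at trial end
active_end = 37     # active drug group's average score at trial end

placebo_worsening = placebo_end - baseline   # 10 points
active_worsening = active_end - baseline     # 7 points

# Fractional slowing of progression in the active arm vs placebo
effect_size = (placebo_worsening - active_worsening) / placebo_worsening
print(effect_size)  # 0.3, i.e. a 30% effect size
```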
As a brief aside: The previous paragraph’s use of the word “average” usually means “mean” for drug trials, but there’s now a movement toward comparing not the two groups’ means, but the frequency in each group of having worsened by a pre-determined amount over the trial period. Those two frequencies and their confidence intervals are then compared. That pre-determined amount is where the “minimal clinically important difference” for that specific medical condition comes in. The confidence interval is the span of values expected to contain each group’s true frequency 95% of the time (typically; occasionally 90%). (Nerd alert: The “95% CI” measures the uncertainty in an estimated frequency, just as the standard error measures the uncertainty in an estimated mean.)
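A sketch of that frequency-based comparison, using the simple normal-approximation (Wald) confidence interval for a proportion. The responder counts below are hypothetical numbers chosen just to show the mechanics:

```python
from statistics import NormalDist

def proportion_ci(count, n, level=0.95):
    """Normal-approximation (Wald) confidence interval for a frequency.

    count: number of patients meeting the threshold (e.g., worsened
           by at least the minimal clinically important difference)
    n: total patients in the group
    level: confidence level (0.95 by default)
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = count / n
    half_width = z * (p * (1 - p) / n) ** 0.5
    return p - half_width, p + half_width

# Hypothetical: of 100 patients per arm, 60 placebo patients worsened
# by at least the MCID, versus 42 active drug patients
lo_p, hi_p = proportion_ci(60, 100)
lo_a, hi_a = proportion_ci(42, 100)
print(f"placebo: {lo_p:.2f}-{hi_p:.2f}, active: {lo_a:.2f}-{hi_a:.2f}")
```

If the two intervals overlap substantially, the trial hasn’t clearly separated drug from placebo; formal comparisons use a test on the difference in proportions, but the idea is the same.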
Getting back to maximizing a study’s power: One way to do that is to choose an outcome measure with as little random “noise” as possible. Such noise could arise from ambiguous wording in the scoring definitions, poor rater training, inclusion of medically irrelevant items in the scoring, poor fidelity between rating definitions and the true natural history of the disease, and many other factors.
But another good way to reduce random noise in the results is simply to increase the number of patients in the study. That’s why the medical literature often expresses the “power” of a clinical trial as the number of patients (the “N”) required to minimize the noise to the point where both the α and the β are acceptable. Obviously, the greater the N, the greater the power.
“Mayod,” I hope that answers your question, and as for the rest of you, thanks for powering through this far.