PSP and power

As I occasionally do, I’ve selected a particularly pertinent and/or complicated reader comment to respond to as a self-contained post.  “Mayod” has commented on my post, “Orphans, this could be your chance,” which touched on the statistical issues relevant to including the nine non-Richardson PSP subtypes in clinical trials.  The comment essentially asks for clarification of terms as I and other medical writers use them, especially the term “statistical power.”

Hi “Mayod”:

First, for the benefit of my other readers, I’ll point out that you are a very prominent academic expert on the philosophy and theory of statistics.  That’s pretty scary, so I’ll avoid trying to match that level of sophistication and simply reply to your question at a level comprehensible to most intelligent people who have never formally studied statistics.  (Full disclosure: I’ve never had a statistics course myself.  I’ve just picked stuff up along the way.)

In the context of a drug trial, the “power” is the probability of correctly rejecting the null hypothesis.  The null hypothesis, which we all hope will be disproved by the trial, posits that the drug works no better than placebo.  At a more technical level, the study’s power is 1-β, where β is the false-negative rate: the likelihood of failing to identify a true benefit of the active drug relative to placebo.  It’s also called “Type II error.”  Typically, β is set at 20%, sometimes 10%, making the power 80% or 90%.

Another ingredient in a trial’s design is the α, which is the greatest tolerable likelihood of falsely rejecting the null hypothesis, i.e., of concluding that the drug works better than placebo when it really doesn’t.  It’s also called the false-positive rate, or Type I error, and is typically set at 5%, sometimes 1%.  As a drug trial designer, you have to balance the α against the β: pushing the false-positive rate too low risks elevating the false negatives, and pushing the false-negative rate too low risks elevating the false positives.

The other number required to determine the trial’s power is the “effect size.”  In a PSP trial, that’s the detectable reduction in the average rate of worsening over the duration of the trial for the active-drug group relative to that in the placebo group.  For PSP, the effect size is typically set somewhere between 20% and 40%, though we’d all like it to turn out to be much higher than that.  As an example, let’s say the placebo group and active group each start the trial with an average PSP Rating Scale score of 30.  At the end of the trial, the placebo group has progressed to an average of 40, while the active-drug group has progressed to an average of 37.  That’s a worsening of 10 points vs 7 points, or a 30% slowing of progression (a 30% effect size).
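For the numerically inclined, the arithmetic in that example boils down to just a few lines.  Here’s a sketch in Python, using the hypothetical scores above:

```python
# Hypothetical numbers from the example above: both groups start with an
# average PSP Rating Scale score of 30 (higher = worse).
baseline = 30
placebo_end = 40      # placebo group worsened by 10 points
active_end = 37       # active-drug group worsened by 7 points

placebo_decline = placebo_end - baseline   # 10 points
active_decline = active_end - baseline     # 7 points

# Effect size = proportional slowing of progression in the active group
effect_size = (placebo_decline - active_decline) / placebo_decline
print(f"Effect size: {effect_size:.0%}")   # prints "Effect size: 30%"
```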

As a brief aside: The previous paragraph’s “average” usually means “mean” for drug trials, but there’s now a movement toward comparing not the two groups’ means, but the frequency in each group of having worsened by a pre-determined amount over the trial period.  Those two frequencies and their confidence intervals are then compared.  That pre-determined amount is the motivation behind using the “minimal clinically important difference” for that specific medical condition.  The confidence interval is the range within which we can be 95% (typically; occasionally 90%) confident that the true frequency lies.  (Nerd alert: The “95% CI” measures the uncertainty of an estimated frequency, just as the standard error measures the uncertainty of an estimated mean.)
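Here’s a rough sketch, in Python, of how those two frequencies and their confidence intervals might be computed.  The patient counts are invented, and the simple normal-approximation formula stands in for whatever method a real trial’s statisticians would actually use:

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a frequency
    (z = 1.96 corresponds to the usual 95% level)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical counts: patients who worsened by at least the
# pre-determined amount (e.g., the MCID) during the trial.
placebo = proportion_ci(60, 100)   # 60 of 100 placebo patients worsened
active = proportion_ci(42, 100)    # 42 of 100 active-drug patients worsened

for name, (p, lo, hi) in [("placebo", placebo), ("active", active)]:
    print(f"{name}: {p:.0%} worsened (95% CI {lo:.0%}-{hi:.0%})")
```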

Getting back to maximizing a study’s power: One way to do that is to choose an outcome measure with as little random “noise” as possible.  Such noise could arise from ambiguous wording in the scoring definitions, poor rater training, inclusion of medically irrelevant items in the scoring, poor fidelity between rating definitions and the true natural history of the disease, and many other factors. 

But another good way to reduce random noise in the results is simply to increase the number of patients in the study.  That’s why the medical literature often expresses the “power” of a clinical trial in terms of the number of patients (the “N”) required to reduce the noise to the point where both the α and the β are acceptable.  For a given outcome measure, the greater the N, the greater the power; a more sensitive measure earns its keep by achieving the same power with a lower N.
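To make that concrete, here’s a back-of-the-envelope sample-size calculation in Python, using the standard two-sample normal-approximation formula.  The 3-point detectable difference and the two standard deviations are invented for illustration:

```python
import math

# z-scores for the usual design choices (standard textbook values)
Z_ALPHA = 1.96   # two-sided alpha = 0.05
Z_BETA = 0.84    # beta = 0.20, i.e., 80% power

def n_per_arm(delta, sd, z_alpha=Z_ALPHA, z_beta=Z_BETA):
    """Patients needed in EACH arm to detect a true mean difference
    `delta` between groups whose outcome has standard deviation `sd`
    (standard two-sample normal-approximation formula)."""
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Hypothetical PSP example: detect a 3-point difference in one-year
# PSPRS worsening (a 30% slowing of a 10-point average decline).
print(n_per_arm(delta=3, sd=8))   # a noisier outcome measure
print(n_per_arm(delta=3, sd=5))   # a less noisy measure needs fewer patients
```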

“Mayod,” I hope that answers your question, and as for the rest of you, thanks for powering through this far.

Orphans, this could be your chance

Back in 2023, I posted an explanation of the ten PSP subtypes.  The archetypal subtype, PSP-Richardson syndrome, accounts for about half of all PSP and, in contrast to most of the other subtypes, has a rapid progression rate, a validated rating scale, and highly accurate diagnostic criteria.  All of these features have led clinical trial sponsors to maximize their trials’ sensitivity and minimize their costs by restricting admission to people with PSP-Richardson.  But developing better outcome measures for non-Richardson forms of PSP could change that practice.

A big step toward realizing this goal was published last week in the journal Neurology by a group at the Mayo Clinic in Rochester, MN.  Led by first author Dr. Mahesh Kumar and senior authors Drs. Jennifer Whitwell and Keith Josephs, the study found that a good outcome measure for clinical neuroprotection trials in all PSP subtypes was to combine a measure of atrophy by MRI with a measure of clinical disability.  This is a major advance.

The researchers performed brain MRIs at the start and end of a one-year period in 88 people with PSP and 32 age-matched controls.  Of those with PSP, 50 had PSP-Richardson, 18 had “PSP-cortical” (three of the other nine subtypes) and 20 had “PSP-subcortical” (the remaining six).  They had to lump the non-Richardson subjects by their subtypes’ general anatomical predilections because most of the subtypes were too rare to analyze on their own.

Calculating how much each of ten important PSP-involved brain regions had atrophied over the one-year interval allowed the researchers to identify which region(s) might best serve as markers of progression for each of the three groups when coupled with standard clinical measures.  Those measures include such familiar instruments as the PSP Rating Scale and the Unified Parkinson’s Disability Rating Scale’s motor section as well as less familiar scales specific for cognition, gait, eye movement and speech.  All the scales were administered concurrently with each of the two MRIs. 

They expressed the sensitivity to one-year progression not by some abstract statistic, but by the number of patients needed in a double-blind trial to demonstrate with at least 80% certainty that patients on active drug enjoyed a 20% slowing of progression relative to the placebo group. (These specifications are typical for PSP clinical trials.)  The better the measure’s performance, the fewer patients are needed.

And the award for Best Performance by an Outcome Measure in a PSP Neuroprotection Trial goes to . . . a combination of the rate of atrophy of whatever brain region shrinks fastest in the patient’s specific subtype and the PSP Rating Scale score.

The real significance of this study’s result is that using an outcome measure customized to each participant’s PSP subtype could allow trials to enroll not just people with PSP-Richardson, but also those with any of the other subtypes.  That’s because the trial’s measure of success could be to compare each patient’s rate of progression during the trial to that of patients in the placebo group with the same PSP subtype. 

This could double the number of people eligible to enroll in PSP trials, which means cutting the enrollment period in half, with commensurate reduction in costs for the sponsor.  The hybrid measure is more sensitive to progression than the PSP Rating Scale alone, thereby reducing the number of patients required even more. 

Both factors could lower the financial barrier confronting a company hoping to mount a trial for a promising PSP drug.  That may be the most important bottleneck right now in the development of a treatment to prevent or slow the progression of PSP.

That’s why this news is huge for PSP in general and for the “orphans” in particular.

How much PSP is “important”?

I could use your input right now.  (Actually, I could always use your input, but only occasionally do I ever specifically ask for it.)

A few days ago, I attended a two-day conference in Washington, DC on the tau protein sponsored jointly by the Alzheimer’s Association, The Rainwater Foundation and CurePSP.  A talk on clinical trials in the non-Alzheimer tau disorders mentioned the well-known difficulties in recruiting adequate numbers of patients with rare conditions like PSP.  In the Q/A, I asked if there’s a realistic possibility of reducing the number of patients required for a trial by using a new approach called “personalized endpoints.”  Afterwards, the editor of a journal introduced himself and asked me to write a review article/opinion piece on that issue. I said OK, but now I could use your help.

Here’s the background to my question at the conference, though many of you already know what’s in the next two paragraphs:

The typical Phase II or III clinical trial divides the patients into active treatment and placebo groups.  Trials of chronic, progressive disorders like PSP measure the signs and symptoms for each patient at “baseline,” i.e., the first visit after the screening visit, using a battery of scales and tests.  One of those, for PSP almost always the PSP Rating Scale, is deemed the “primary outcome measure.”  Other measures of the drug’s effect are called “secondary” outcomes and still others under evaluation for future use are called “exploratory” outcomes.

At the end of the double-blind period, typically one year for PSP, the battery of outcome measures is repeated, many of them having been repeated at interim visits as well.  Then, for each treatment group, disease progression is measured as the rate of decline (i.e., PSPRS points per year) from baseline to endpoint.  If that rate is lower, on average, for the active group than for the placebo group, and the difference is large enough that the likelihood of its having occurred by chance is less than 5 percent, then the result is deemed “statistically significant.”  If that result is reinforced by similar results in at least some of the secondary outcome measures and if the side effects are justified by the efficacy given the disorder’s severity and availability of other treatments, the drug will then be considered for approval by government regulators.
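As an illustration, here’s that baseline-to-endpoint comparison in miniature (Python, with invented per-patient progression rates):

```python
from statistics import mean

# Hypothetical annualized progression rates (PSPRS points per year),
# one number per patient, computed from baseline and endpoint visits.
placebo_rates = [11.2, 9.8, 12.5, 10.1, 11.9, 10.6]
active_rates = [7.9, 8.4, 6.5, 9.1, 7.2, 8.8]

slowing = 1 - mean(active_rates) / mean(placebo_rates)
print(f"Mean placebo rate: {mean(placebo_rates):.1f} pts/yr")
print(f"Mean active rate:  {mean(active_rates):.1f} pts/yr")
print(f"Observed slowing:  {slowing:.0%}")
# Whether a difference like this is statistically significant (p < 0.05)
# would then be checked with, e.g., a two-sample t-test.
```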

So what’s the problem?   

First of all, “statistically significant” is not the same as “clinically significant.”  That means that a result too small to make a difference to the individual patient can, because of a study’s large size, reach statistical significance.  The FDA knows this, of course, and relies on secondary outcome measures to verify clinical significance.  But the secondary measures may under-perform statistically, or may measure only a single aspect of the disease such as cognition or balance, or may lie far from the patient’s lived experience, as, for example, an MRI or a blood test does.
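A quick numerical illustration of that first point, in Python: the same trivially small difference between groups becomes “statistically significant” once the study is large enough.  The 0.5-point difference and 8-point standard deviation are invented:

```python
import math

def z_for_mean_difference(delta, sd, n_per_group):
    """z statistic for a difference in group means (normal approximation,
    equal group sizes, common standard deviation)."""
    return delta / (sd * math.sqrt(2 / n_per_group))

# Hypothetical: a 0.5-point PSPRS difference, far below any clinically
# important change, with an outcome standard deviation of 8 points.
for n in (100, 10000):
    z = z_for_mean_difference(delta=0.5, sd=8, n_per_group=n)
    print(f"n = {n:>5} per group: z = {z:.2f}")
# With enough patients, even this trivial difference crosses z = 1.96
# (p < 0.05): statistically significant, but not clinically significant.
```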

Secondly, averaging the entire active-drug group and the entire placebo group is a very coarse measure.  That means that demonstrating a given treatment effect with statistical significance requires large numbers of patients and/or a longer study.  Both issues mean more expense for drug companies, which means that fewer drugs will get tested and that effective drugs may appear to be ineffective (called a false-negative result, or Type II error).  None of those things is in anyone’s best interest.

So what’s the solution?

The PSP Rating Scale is far from perfect and various improvements have been published.  While each improves upon the original in some way, none is more sensitive to change.  The only outcome measure confirmed to be more sensitive to change than the PSPRS is the MRI-based measurement of brain atrophy, and that’s too far removed from actual symptoms and disability.  So, we need new study designs that can squeeze more information out of fewer patients.

A more sensitive way to assess a drug’s benefit uses “personalized outcomes.”  That’s where each patient being enrolled is assigned an expected endpoint PSP Rating Scale score based on how much they are likely to progress over the following year according to published research.  Relevant baseline data includes things like age, sex, progression since onset, baseline PSPRS score and subsets thereof, certain MRI abnormalities, and levels of certain chemical markers in the spinal fluid or blood.  At the end of the double-blind period, the active drug and placebo groups are compared with regard to how many patients did better than their own pre-defined expectation.
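Here’s a bare-bones sketch, in Python, of what that comparison could look like.  The expected and observed scores are invented, and a real trial would apply a proper statistical test to the two proportions rather than a simple count:

```python
# Minimal sketch of a "personalized endpoints" comparison.  Each patient
# has a pre-specified expected one-year PSPRS worsening (hypothetical
# numbers; in practice predicted from baseline age, sex, MRI findings,
# fluid markers, etc.) and an observed worsening at the end of the trial.

def beat_expectation(patients):
    """Count patients who worsened less than their own prediction."""
    return sum(1 for expected, observed in patients if observed < expected)

# (expected worsening, observed worsening) in PSPRS points
placebo = [(10, 11), (12, 12), (8, 7), (11, 13), (9, 10)]
active = [(10, 7), (12, 9), (8, 8), (11, 8), (9, 6)]

print(f"Placebo: {beat_expectation(placebo)}/{len(placebo)} beat their expected score")
print(f"Active:  {beat_expectation(active)}/{len(active)} beat their expected score")
```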

But how much better is “better”?  Enter a concept called “minimal clinically important difference.”  That’s exactly what it says: The smallest change in a test score that corresponds to a change that makes a difference to the patient or family.  The trick is how to obtain this information in some sort of reliable, standardized way.  For PSP, the only attempt to do this to my knowledge was published in 2016 by Dr. Sarah Hewer and colleagues, mostly from Alfred Hospital in Melbourne, Australia.  They mined data from a completed, negative PSP trial that used a secondary outcome measure widely employed in clinical research, the “Clinical Global Impression of Change” scale (CGI-c).  This simple, seven-point scale asks the study neurologist to decide whether, overall, compared to baseline, the patient is “very much,” “much,” or “minimally” improved; unchanged; or “minimally,” “much,” or “very much” worse.  Hewer et al. calculated the average degree of PSPRS worsening over the course of the trial year for patients rated “minimally worse” on the CGI-c.  (Of course, the CGI-c uses the neurologist’s opinion, not the patient’s or family’s, but hopefully, the neurologist relied on their input.  I certainly always did whenever I completed a CGI-c.)
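To show the mechanics of the Hewer et al. approach, here’s a toy calculation in Python.  The one-year PSPRS changes are made up (chosen to land near the published figure), and a real analysis would use the t distribution rather than the 1.96 normal cutoff:

```python
from statistics import mean, stdev
import math

# Hypothetical one-year PSPRS changes for patients the study neurologist
# rated "minimally worse" on the CGI-c (invented numbers).
changes = [5.0, 6.5, 4.8, 6.1, 5.5, 6.9, 4.9, 5.8, 6.2, 5.3]

m = mean(changes)
se = stdev(changes) / math.sqrt(len(changes))   # standard error of the mean
lo, hi = m - 1.96 * se, m + 1.96 * se           # normal-approximation 95% CI

print(f"Estimated MCID: {m:.1f} PSPRS points (95% CI {lo:.2f}-{hi:.2f})")
```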

The minimal clinically important difference in the 100-point PSPRS turned out to be 5.7 points, with a 95% confidence interval (the range within which we can be 95% confident the true value lies) of 4.83–6.51.  Now, 5.7 points on the PSPRS represents about six months’ decline for the average patient with PSP-Richardson syndrome, with most of the other subtypes declining a little more slowly.

Finally, here’s my question for you: 

As the seven steps in the CGI-c may be too coarse or too fine, is a six-month decline really the “minimal” worsening that’s important to you?  Or would a smaller or larger decline be the least you’d consider important?  You or a helper can judge this in terms of your overall comfort, or your ability to perform daily activities, or a combination of the two.  Note that I’m asking how many months’ worth of decline is important, not merely noticeable, which I assume would be less.

Please respond in the comments feature or by email: ligolbe@gmail.com.  Thank you!