Although no single recommended method exists for estimating thresholds of meaningful within-patient change, in practice researchers tend to use the anchor-based mean approach as the primary method and distribution-based approaches as supportive evidence. The median anchor-based method is often preferred instead when the COA change scores or anchor-measure distributions are skewed [e.g., 22, 23]. Using data generated for changes in PROMIS PF SF 20a T-scores, our simulation study compared four widely recognized anchor-based methods and two distribution-based methods for estimating thresholds of meaningful within-patient change under conditions designed to mimic realistic clinical and observational studies.
As expected, among the anchor-based methods, the optimal choice depended on the characteristics of the clinical data. Although the results supported the common application of the mean or median anchor-based methods, they also identified scenarios in which the other methods should be strongly considered. Specifically, when ≥ 50% of participants were true responders and the PROMIS change scores were approximately normally distributed, the predictive modeling method performed best overall in controlling bias, maximizing precision and accuracy, and exceeding individual measurement errors. Although this method did not always yield the smallest bias on average, its variability around the mean estimates was nearly the smallest among the anchor-based methods. This high precision is consistent with the simulation finding of Terluin et al. [6] that the 95% CI for the ROC curve method was wider than that obtained by the predictive modeling method under 50% improvement prevalence and a normal distribution of target COA change. The likely reason for this finding is that both logistic regression methods use the entire sample to locate the threshold estimate based on sensitivity, specificity, or odds, whereas the mean and median methods focus on the group at one anchor level (e.g., "minimally improved"). Therefore, the higher precision (low CV) of the two logistic methods, especially at larger sample sizes, was not surprising.
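To make the contrast between the two whole-sample logistic approaches concrete, the sketch below illustrates common forms of the ROC curve method (cut-point maximizing Youden's J) and the predictive modeling method (change score at which the fitted probability of improvement is 0.5). The simulated means, SD, responder proportion, and sample size are illustrative assumptions echoing the normal-distribution condition described above, not the study's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
improved = rng.binomial(1, 0.5, n)         # anchor-defined responder indicator
change = rng.normal(7.0 * improved, 3.5)   # simulated T-score change (illustrative)

# ROC curve method: candidate cut-points are the observed change scores;
# choose the one maximizing Youden's J = sensitivity + specificity - 1
cuts = np.unique(change)
sens = np.array([(change[improved == 1] >= c).mean() for c in cuts])
spec = np.array([(change[improved == 0] < c).mean() for c in cuts])
roc_threshold = cuts[np.argmax(sens + spec)]

# Predictive modeling method: fit logit(P(improved)) = b0 + b1 * change by
# Newton-Raphson, then solve for the change score where P = 0.5
X = np.column_stack([np.ones(n), change])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (improved - p))
pm_threshold = -beta[0] / beta[1]
```

Because every observation contributes to the fitted curve or the sensitivity/specificity trade-off, both estimators borrow strength from the whole sample, which is consistent with their lower CVs at larger n.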
With < 50% (e.g., 30%) responders under normal distributions of T-score change, method preference shifted toward the mean and median anchor-based methods, which yielded the smallest RBs and, most of the time, satisfactory protection against measurement error. One major reason for this preference, as shown in Table 3 and Fig. 2, is that the mean and median methods showed smaller increases in bias than the two logistic methods for the 30%-improvement group when the 50%-improvement group was used as the reference. At first glance, this finding seemed to conflict with the simulation findings of Terluin et al. [6], in which changing the "prevalence of improvement" alone did not affect the estimates of the two logistic-based methods. However, the current study and Terluin et al. [6] applied different simulation conditions. In the current study, the population percentage of improvement simulated for the anchor-based methods affected the true threshold, or responder definition, whereas the "prevalence of improvement" in Terluin et al. [6] may not have matched the underlying responder percentage. In Terluin et al. [6], the true threshold was fixed at 3.5 when the prevalence changed from 50 to 70%, but in the current study, the true thresholds varied with the population improvement percentage of the PGIC.
For skewed T-score change distributions, the median method and the ROC curve method performed best under the conditions of 30% and 50% improvement, respectively. As shown in Table 3 and Fig. 2, this finding was likely related to the smaller positive increases in bias due to skewed distributions, and the countervailing negative increase in bias due to 30% improvement, for these two methods, in contrast to the larger positive effects of both predictors on the mean and predictive modeling methods. In the 70%-improvement condition, the countereffects were observed for the predictive modeling and mean methods, while the combined positive increases further inflated the bias of the other two methods.
Among the conditions investigated, the setting most suitable for minimizing rRMSE (hence reducing bias and increasing precision overall) combined a normal distribution (7.0, 3.5), 50% improvement, ρ = 0.70, and n = 300. As a result of the PROMIS IRT-based calibration, the SEM method consistently demonstrated much smaller CV values than the anchor-based methods and the half-SD method; the median within-sample percentage of subjects with individual RCs not greater than the anchor-based estimated thresholds was at least 95%. These findings highlight the importance of selecting a reliable (small random variance in measurement) and valid (adequate relationship with the anchor measure) COA, in addition to identifying a robust data source (in which both responders and nonresponders are well represented), when conducting analyses to identify a meaningful within-patient change threshold. For example, if researchers intend to use interim data cuts of ongoing trials to establish the meaningful within-person change threshold, it is sensible, where feasible for the related therapeutic areas, to wait until approximately 50% of the subjects can be considered responders based on multiple anchor measures or external gold standards (where bias tends to be minimal and precision and accuracy tend to be maximized across methods). For literature reviews or meta-analyses of meaningful change, greater weight can be placed on thresholds estimated when approximately 50% of the participants were responders. Not surprisingly, this study's results further emphasize the need for a strong responsiveness correlation; however, this does not imply that the correlation must be perfect, because the unique value of the target COA (beyond the anchor measures) is established in theory and qualitatively.
To maximize estimation precision, wise decisions must be made with respect to item selection, calibration, and scoring rule (i.e., valid, reliable, discriminative, highly intercorrelated items; raw versus pattern scoring; weekly versus monthly scores; and missing-data rule). As always, a larger sample and normal distribution of target COA change are desirable.
Finally, the half-SD and SEM methods generally underestimated the thresholds in most of the settings examined. This finding confirms their role, along with the RC value, as supportive estimates that identify the minimal value when reporting a range of thresholds.
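The distribution-based quantities discussed here have standard textbook forms, sketched below for reference; the baseline SD and reliability values in the usage example are illustrative assumptions, not results from this study.

```python
import math

def half_sd(sd_baseline: float) -> float:
    # Half a standard deviation of the baseline score distribution
    return 0.5 * sd_baseline

def sem(sd_baseline: float, reliability: float) -> float:
    # Standard error of measurement: SD * sqrt(1 - reliability)
    return sd_baseline * math.sqrt(1 - reliability)

def reliable_change(sd_baseline: float, reliability: float, z: float = 1.96) -> float:
    # Smallest individual change exceeding measurement error at the given
    # confidence level: z * sqrt(2) * SEM
    return z * math.sqrt(2) * sem(sd_baseline, reliability)

# Illustrative values: baseline SD = 10 (T-score metric), reliability = 0.90
hsd = half_sd(10.0)
s = sem(10.0, 0.90)
rc = reliable_change(10.0, 0.90)
```

Because the RC incorporates the error of two measurements, it typically exceeds the SEM and half-SD values, consistent with the latter two marking the lower end of a reported threshold range.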
Limitations and Future Research
Although this study was designed to generalize to typical applications, there are limitations. This research focused on thresholds for detecting improvement in a COA; therefore, the results cannot be easily applied to COA thresholds for use in clinical trials or observational studies aimed at mitigating the progression (worsening) of a condition.
In addition, the correlation between PROMIS change and the PGIC was simulated as a Spearman correlation to avoid assumptions of a linear relationship or a normal distribution of the target COA change. Readers should be cautious when directly applying these findings to situations involving other correlation types (e.g., Pearson).
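One standard way to induce a target Spearman correlation between a continuous change score and an ordinal anchor is a Gaussian copula, using the identity r = 2·sin(π·ρs/6) that maps a Spearman ρs to the Pearson correlation of the underlying bivariate normal. The sketch below is a minimal illustration under assumed parameters (mean change 7.0, SD 3.5, five anchor levels with arbitrary cut-points); it is not the study's simulation code.

```python
import numpy as np

def spearman_to_pearson(rho_s: float) -> float:
    # For bivariate normal latent variables, the Pearson correlation that
    # yields the target Spearman correlation: r = 2 * sin(pi * rho_s / 6)
    return 2 * np.sin(np.pi * rho_s / 6)

rng = np.random.default_rng(1)
r = spearman_to_pearson(0.70)
cov = np.array([[1.0, r], [r, 1.0]])
z = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

# Transform the latent normals to the observed scales
change = 7.0 + 3.5 * z[:, 0]                              # target COA change
anchor = np.searchsorted([-1.0, -0.3, 0.3, 1.0], z[:, 1])  # 5-level ordinal anchor
```

Because Spearman correlation is rank-based, the monotone transformations of the latent normals preserve ρs for the continuous margin, while categorizing the anchor attenuates it somewhat; the cut-points shown are purely illustrative.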
Another important consideration is that the simulation used a retrospective anchor measure with minimal measurement error (only from random sampling). In practice, retrospective anchors could be subject to additional measurement error due to response-shift bias or recall bias [24]. Fayers and Hays [24] recommend inclusion of both retrospective and concurrent anchors (e.g., global ratings of current severity) in clinical trial designs. Our simulated PGIC values could be considered the change between two administrations of Patient Global Impression of Severity (PGIS) rating scales. However, PGIS change likely would have provided more levels than our simulated PGIC, resulting in use of a different type of correlation. Similar caution would be required in settings using a continuous anchor measure but only two response classes, "responder" versus "nonresponder" (e.g., a biomarker with only one reference cutoff, or change in the 22-item Sinonasal Outcome Test using the recommended cutoff of − 8.9 [25]), which would allow more flexibility in correlation computation.
Regardless of anchor measure type (retrospective or concurrent), more measurement error is still possible in practice. This would not only undermine the responder classification but also attenuate the responsiveness correlation [25]. Hence, a correlation corrected for measurement error [25] and sensitivity analyses on the responder classification at different confidence limits of the anchor should be considered in these situations.
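The correction for attenuation referenced here has a simple closed form: the observed correlation divided by the geometric mean of the two instruments' reliabilities. A minimal sketch follows; the reliability and correlation values in the usage example are illustrative assumptions.

```python
def disattenuated_correlation(r_obs: float, rel_x: float, rel_y: float) -> float:
    # Correction for attenuation: estimated true-score correlation is the
    # observed correlation divided by sqrt(reliability_x * reliability_y)
    return r_obs / (rel_x * rel_y) ** 0.5

# Illustrative: observed r = 0.50, COA reliability = 0.85, anchor reliability = 0.70
r_true = disattenuated_correlation(0.50, 0.85, 0.70)
```

With a noisy anchor (reliability 0.70 here), the corrected correlation is noticeably larger than the observed one, which is why an uncorrected responsiveness correlation can understate the anchor's suitability.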
Finally, due to computational limitations, the current study did not model the relationship between the baseline score and follow-up change in the target COA and did not allow for varying true thresholds or responder percentages conditioned on baseline scores. These knowledge gaps can be addressed by future studies to facilitate discussions about how to thoughtfully estimate responder thresholds under different clinical data characteristics.