The log-rank test is one of the most commonly used methods in survival analysis and is regarded as the most powerful test for comparing two survival curves under the proportional hazards (PH) assumption. However, in IO therapy trials, the observed data often clearly violate the PH assumption because of delayed treatment effects, cure rates, crossing hazards, or a mixture of these phenomena [1].
The hazard ratio (HR) has been widely used to quantify the treatment effect under the PH assumption. However, when this assumption is violated, the resulting HR estimate is difficult to interpret as a measure of the treatment effect [2]. In an interview, Professor David R. Cox, the originator of the Cox proportional hazards model, stated, “Of course, another issue is the physical or substantive basis for the proportional hazards model. I think that’s one of its weaknesses…” [3]. Without PH, a single HR may not be a good estimand for the treatment difference [4]: the estimated HR is not a simple average of the true HR over time but a weighted average of the time-varying HR on the log scale [5]. In the Cox regression model, the weights depend on the censoring distribution and therefore on the accrual pattern, the length of follow-up, and early dropout in a randomized clinical trial. Consequently, trials with identical underlying survival curves can yield different results and parameter estimates, no matter how large the sample sizes are [5, 6]. In addition, the median survival time may not be estimable when there is long-term survival. When a clinical trial is designed under non-PH, the way the between-group difference varies over time will most likely be mis-specified, so the HR-based procedure may fail to detect a true difference between groups. The HR is thus a non-robust measure of the difference between two survival curves under non-proportional hazards [6]. Even when the PH assumption is reasonable, the HR may not be a useful summary of the treatment difference for decision making, because it is reported without a reference hazard value, and the same HR can have a completely different interpretation under different reference hazard values [6].
Although alternative tests have been proposed to replace the log-rank test and improve power, they do not resolve the HR interpretation issue. The estimand associated with the analysis therefore remains ambiguous, which departs from the recommendation in ICH E9(R1) [7] because coherence between testing and estimation is lost [8].
Chappell and Zhu [9] explored many different ways of describing differences between survival curves, including the HR, median survival, the ratio of landmark survival, and the ratio of restricted mean survival times (RMST, the life expectancy within a given restricted time period). The authors concluded that none of these measures is uniformly superior and that all deserve consideration. The choice of method for comparing two treatment regimens depends on scientific and clinical necessity.
The RMST offers an intuitive, clinically meaningful interpretation without requiring model assumptions such as PH [10, 11, 12, 13]. However, the comparison must be restricted to a specified interval, since censoring prevents reliable estimation of the unrestricted mean lifetime. Statistically, the RMST is the mean survival time within a specific time window, which equals the area under the survival curve over that window; in practice, it can be viewed as the life expectancy within the window. The procedure for estimating the difference in RMST between two treatment groups is valid without any model assumptions. The RMST is more stable than the estimate of the median survival time [3], has a valid and clearly defined estimand, and yields consistent results between hypothesis testing and estimation. Because the RMST captures the whole survival curve within the time window, it is more informative when a long-term survival plateau is present, as in several immunotherapy RCTs, whereas neither the median nor the HR estimate can detect such a plateau in the right tail of the curve [14, 15, 16].
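To make the area-under-the-curve interpretation concrete, the quantities described above can be written as follows. The notation is an assumption introduced here for illustration: T denotes the survival time, S(t) the survival function, τ the restriction time, and the subscripts 1 and 0 the treatment and control groups.

```latex
\begin{align*}
\mathrm{RMST}(\tau) &= E\left[\min(T, \tau)\right] = \int_0^{\tau} S(t)\,dt, \\
\Delta(\tau) &= \mathrm{RMST}_1(\tau) - \mathrm{RMST}_0(\tau) = \int_0^{\tau} \left[S_1(t) - S_0(t)\right] dt, \\
R(\tau) &= \frac{\mathrm{RMST}_1(\tau)}{\mathrm{RMST}_0(\tau)}.
\end{align*}
```

The difference Δ(τ) is expressed in units of time, so it can be read as the gain in life expectancy over the window [0, τ], while the ratio R(τ) is unitless.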
The RMST provides an absolute measure on the time scale, whereas the HR is a relative, unitless parameter. Trinquart et al. [17] empirically compared the treatment effects measured by the HR and by the difference (and ratio) of RMST in 54 randomized oncology trials. On average, the HR gave significantly larger treatment effect estimates than the ratio of RMST (calculated at the latest follow-up time). The authors recommended that RMST-based measures be routinely reported in randomized trials with time-to-event outcomes.
Huang and Kuan [5] conducted extensive simulations under various scenarios and design parameter settings to compare the log-rank test with RMST-based tests. When the Kaplan-Meier (KM) survival curves show an evident separation favoring one treatment arm at most time points, the log-rank test is generally powerful, and the RMST test performs similarly. However, when the PH assumption is violated and the survival curves separate late, the RMST-based test outperforms the log-rank test, provided that the pre-specified truncation time τ defining the RMST is reasonably close to the tail of the observed curves.
The RMST is robust and readily interpretable for any survival distribution. Zhao et al. [18] suggested using an RMST curve, that is, the RMST as a function of time, to quantify the difference between two RMST curves within a specific time window. As noted earlier, because of right censoring, RMST inference is available only up to the minimum of the latest follow-up times of the two groups. Statistical inference can be obtained for the RMST difference within a prespecified time window. However, the choice of the time window is crucial, and the resulting confidence bands for the difference of the RMST curves depend on it. The same issue arises in simultaneous inference about the difference of two survival curves or about the hazard ratio.
The RMST is commonly calculated by integrating the KM curve from the start of the study to a pre-specified time point. However, the KM method has several limitations in practice. First, the curve cannot be extrapolated beyond the follow-up time. Second, the estimates may have large variance toward the tail because few patients remain at risk; as pointed out by Peto et al. [19], the standard error can be underestimated in this long flat region. Third, the curve is estimated as a step function, which is biologically unrealistic. These drawbacks limit the performance of the RMST: the survival comparison can be made only up to the last event or observation time, and the estimate at the last time point may have a large variance. In a typical clinical study, some prediction or extrapolation is useful, for example to determine when the next interim analysis should be performed or to assess whether additional follow-up is needed to demonstrate a treatment difference. The KM method cannot serve this purpose, so a parametric method is preferable.
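The KM-based calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a validated implementation: the function name `km_rmst` and its arguments are hypothetical, variance estimation is omitted, and τ is assumed not to exceed the largest observed time.

```python
import numpy as np

def km_rmst(time, event, tau):
    """Area under the Kaplan-Meier step curve on [0, tau] (illustrative sketch).

    time  : observed times (event or censoring)
    event : 1 if the event was observed, 0 if censored
    tau   : restriction time, assumed <= the largest observed time
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    # Distinct event times up to tau, in increasing order
    event_times = np.unique(time[(event == 1) & (time <= tau)])
    surv = 1.0     # current KM survival estimate
    prev_t = 0.0   # left end of the current step
    area = 0.0
    for t in event_times:
        area += surv * (t - prev_t)                  # rectangle under the current step
        at_risk = np.sum(time >= t)                  # subjects still at risk just before t
        deaths = np.sum((time == t) & (event == 1))  # events occurring at t
        surv *= 1.0 - deaths / at_risk               # product-limit update
        prev_t = t
    area += surv * (tau - prev_t)                    # final step, truncated at tau
    return area
```

Because the estimate is a sum of rectangles under a step function, it is defined only where the KM curve is defined, which is the source of the extrapolation limitation noted above.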
To address these limitations of the KM-based RMST method, in this paper we apply a flexible parametric mixture model to estimate the survival curves. From the fitted model, a dynamic RMST curve can be constructed over any time window of interest for the clinical study. A mixture model takes full advantage of a parametric form, so inference is not limited by the follow-up time. Liao and Liu [20] showed that mixture models are flexible enough to fit survival data from several oncology studies, in the sense that the estimated survival curves can be very close to the KM curves. The paper is organized as follows. In Section 2, we introduce the RMST method and describe how to derive the dynamic RMST curves from mixture Weibull models, where the RMST difference or ratio is computed over a range of values of the restriction point τ, tracing out a curve over time. Three real datasets are used to illustrate the performance of the proposed dynamic RMST curves in Section 3. A summary and discussion are provided in Section 4, with a conclusion in Section 5.
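As a rough preview of the idea, the following Python sketch traces an RMST curve over τ for a two-component Weibull mixture. Everything in it is an assumption made for illustration: the parameterization S(t) = p exp(-(t/scale1)^shape1) + (1 - p) exp(-(t/scale2)^shape2), the function names, and the parameter values, which are hypothetical and not fitted to any data. The model actually used in this paper is the one defined in Section 2.

```python
import numpy as np

def mixture_weibull_survival(t, p, shape1, scale1, shape2, scale2):
    """Survival function of a two-component Weibull mixture (assumed parameterization)."""
    t = np.asarray(t, dtype=float)
    s1 = np.exp(-(t / scale1) ** shape1)
    s2 = np.exp(-(t / scale2) ** shape2)
    return p * s1 + (1.0 - p) * s2

def rmst_curve(params, tau_max, n_grid=2001):
    """RMST(tau) on a grid of tau in [0, tau_max], by cumulative trapezoidal integration."""
    taus = np.linspace(0.0, tau_max, n_grid)
    surv = mixture_weibull_survival(taus, **params)
    steps = np.diff(taus) * 0.5 * (surv[1:] + surv[:-1])  # area of each trapezoid
    rmst = np.concatenate(([0.0], np.cumsum(steps)))      # RMST(0) = 0
    return taus, rmst

# Hypothetical parameter values, for illustration only (not estimated from data)
control = dict(p=0.8, shape1=1.2, scale1=10.0, shape2=1.0, scale2=60.0)
treated = dict(p=0.6, shape1=1.2, scale1=10.0, shape2=1.0, scale2=60.0)

taus, rmst_c = rmst_curve(control, tau_max=36.0)
_, rmst_t = rmst_curve(treated, tau_max=36.0)
rmst_diff = rmst_t - rmst_c    # RMST difference as a function of tau
rmst_ratio = np.divide(rmst_t[1:], rmst_c[1:])  # ratio, skipping tau = 0 where both are 0
```

Because the survival function has a closed form, `tau_max` can be set beyond the observed follow-up, which is not possible with the KM-based calculation; how far such extrapolation can be trusted depends on how well the fitted model describes the data.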