Motivating clinical trial
In 2001, Breast Cancer Trials, known at the time as the Australia and New Zealand Breast Cancer Trials Group, initiated the ANZ 0001 trial15 to explore chemotherapy options for women with advanced breast cancer who were considered unsuitable for more intensive chemotherapy.
ANZ 0001 was a multi-centre open-label trial that randomised participants 1:1:1 to receive intermittent capecitabine at 1000 mg/m2 twice daily on days 1 through 14 every 3 weeks, continuous metronomic capecitabine at 650 mg/m2 twice daily, or cyclophosphamide 100 mg/m2 on days 1 through 14 plus methotrexate 40 mg/m2 and fluorouracil 600 mg/m2 on days 1 and 8 every 4 weeks (CMF). The primary outcome was progression-free survival. It was pre-specified that the two capecitabine arms would be combined for analysis if the p-value for the difference in their treatment effects was greater than 0.05. Assuming that the capecitabine arms would be combined, 465 participants (155 per arm) would provide enough events to detect, with 80% power at the two-sided 5% significance level, a difference in median progression-free survival of 6 months in the CMF arm versus 8.15 months in the capecitabine arms.
The trial recruited 323 eligible participants over 4 years from July 2001 to June 2005, at an average rate of 6.73 per month, at which point recruitment was stopped by the trial management committee due to diminishing support. At the time of final analysis, 5 years after recruitment began, 7 participants (2.2%) had been lost to follow-up, corresponding to an average rate of 0.43% compounding per year. Median progression-free survival was 6.1 months in the combined capecitabine arms versus 7.1 months in the CMF arm.
Shortlisted fixed and adaptive designs
As a benchmark to compare the adaptive randomisation designs against, we considered how we would address the ANZ 0001 clinical question today using a fixed design. In the original ANZ 0001 trial, the two experimental arms were combined into a single arm, but we kept them as separate arms because previous simulation studies16, 17 have found adaptive randomisation to be most beneficial in multi-arm trials, and because some adaptive randomisation techniques18 are only applicable with three or more arms. We recalculated a sample size suitable for comparing a control arm against two separate experimental arms using the Kim and Tsiatis19 method with Schoenfeld’s formula20. Survival times were assumed to be exponentially distributed, with estimated median overall survival of 6 months in the control arm and 8.15 months in each of the experimental arms. It was assumed that participants would be recruited at an average rate of 6.75 per month, with a 0.5% chance of being lost to follow-up per year.
Assuming a maximum of 10 years for recruitment and follow-up, 552 participants (184 per arm), yielding 335 events, provided 80% power to detect a hazard ratio corresponding to the estimated survival difference at the two-sided 5% significance level. Hence, the three-arm fixed design would close to recruitment once 552 participants (184 per arm) had been recruited, and final analysis would occur when 335 events had accrued for each arm-wise comparison (that is, at least 335 events among arms 1 and 2, and at least 335 events among arms 1 and 3).
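Under these assumptions, the 335-event requirement can be reproduced from Schoenfeld’s formula. The sketch below is our own illustration, not the trial software: the function name and the hard-coded standard normal quantiles are ours, and the hazard ratio is taken as the ratio of the median survival times, which holds under the exponential survival assumption.

```cpp
#include <cmath>

// Required number of events from Schoenfeld's formula for a two-arm
// comparison with 1:1 allocation:
//   D = 4 * (z_{1-alpha/2} + z_{1-beta})^2 / (log HR)^2
double schoenfeld_events(double median_control, double median_experimental,
                         double z_alpha, double z_beta) {
    // Under exponential survival, the hazard ratio equals the inverse
    // ratio of the median survival times, so |log HR| = log of the
    // ratio of medians.
    double log_hr = std::log(median_experimental / median_control);
    double d = 4.0 * std::pow(z_alpha + z_beta, 2) / (log_hr * log_hr);
    return std::ceil(d);
}
```

With medians of 6 and 8.15 months, 80% power, and a two-sided 5% level (quantiles 1.959964 and 0.841621), this rounds up to the 335 events quoted above.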
We shortlisted several adaptive designs for the virtual re-execution of ANZ 0001. Like the fixed design, each adaptive design closed to recruitment after 552 participants had enrolled and performed final analysis when 335 events had accrued for each arm-wise comparison. At final analysis, each experimental arm was declared a success if the posterior probability that its mean survival time was larger than that of the control arm exceeded a pre-specified decision threshold for success, calculated using the exponential-inverse gamma model as described by Thall et al21. Participants were randomised to each arm in proportion to its corresponding probability of superiority, based on the interim data available at the time of randomisation. Randomisation probabilities were recalculated every 7 days. We explored several regularisation methods in various combinations to find the settings that would be most suitable for this trial:
- Clip: Probability clipping mandates that the probability of being assigned to any arm cannot fall below a minimum threshold11. For example, with a threshold of 0.1, randomisation probabilities below 0.1 become 0.1, while those above 0.9 become 0.9.
- Power transformation: In contrast to the “hard cut-off” of the clip method, power transformation shifts the randomisation probabilities towards equal randomisation by raising all probabilities to the power of a tuning parameter between 0 and 1 and renormalising. At the extremes, a value of 0 results in equal randomisation, while a value of 1 results in no regularisation. A particular intermediate setting is commonly suggested in the literature11.
- Burn: Burn-in is the use of equal randomisation for an initial fixed number of participants, followed by adaptive randomisation for the rest11. Given our time-to-event outcome, we instead apply burn-in until a fixed number of events has accrued.
- Natural lead-in: In contrast to the “hard cut-off” of the burn method, natural lead-in allocation smoothly transitions the tuning parameter from the value that yields equal randomisation at the start of the trial to another desired value as recruitment reaches the maximum sample size2.
- Information-weighted: Information-weighted allocation assigns randomisation probabilities in proportion to the probability of superiority, while also giving additional weight to arms for which less information is available (essentially, arms with smaller sample sizes)10.
- Protection: Simulation studies by Trippa et al18 found that the power loss of adaptive randomisation can be avoided by “protecting” the control arm, adjusting the randomisation probabilities such that the sample size of the control arm approximately matches the sample size of the largest experimental arm. We use a modification of Trippa et al’s formula in which the probability of being randomised to the control arm depends on the difference in sample size between the control arm and the largest experimental arm. This ensures a higher probability of being allocated to the control arm when its sample size is smaller than that of the larger experimental arm, and a lower probability when it is already the largest arm.
- Hazard minimisation: Zhang and Rosenberger22 describe a method to calculate the ideal ratio of participants allocated to each arm that minimises the total expected hazard. We use a modification of Trippa et al’s18 method to ensure that final sample sizes approximate the estimated ideal allocation ratios.
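As a rough illustration of how the first two regularisation methods act on a vector of randomisation probabilities, the following C++ sketch implements clipping and the power transformation. This is our own code, not the trial software; in particular, the renormalisation step after clipping is our assumption, since clipped probabilities need not sum to 1 when there are more than two arms.

```cpp
#include <cmath>
#include <vector>

// Clip: force every randomisation probability into [floor_prob, 1 - floor_prob],
// then renormalise so the probabilities again sum to 1.
std::vector<double> clip(std::vector<double> p, double floor_prob) {
    double total = 0.0;
    for (double& pi : p) {
        if (pi < floor_prob) pi = floor_prob;
        if (pi > 1.0 - floor_prob) pi = 1.0 - floor_prob;
        total += pi;
    }
    for (double& pi : p) pi /= total;
    return p;
}

// Power transformation: raise each probability to the power kappa
// (kappa = 0 gives equal randomisation, kappa = 1 leaves the
// probabilities unregularised) and renormalise.
std::vector<double> power_transform(std::vector<double> p, double kappa) {
    double total = 0.0;
    for (double& pi : p) { pi = std::pow(pi, kappa); total += pi; }
    for (double& pi : p) pi /= total;
    return p;
}
```

For example, with a clip threshold of 0.1 the two-arm probabilities (0.05, 0.95) become (0.1, 0.9), and a power transformation with exponent 0 maps any probability vector to equal randomisation.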
Software and simulation settings
Using the R and C++ programming languages23, we wrote custom software capable of simulating a range of different trial designs. Recruitment times were generated according to a Poisson process. Average recruitment rate, survival times, and dropout rates were simulated according to the assumptions described in the previous section.
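A minimal sketch of the two core simulation components described above, recruitment times from a Poisson process and exponentially distributed survival times, might look as follows in C++. This is our own illustration, not the actual trial software; the function names are ours.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Recruitment times under a homogeneous Poisson process: inter-arrival
// gaps are exponential with mean 1/rate, so arrival times are the
// cumulative sums of those gaps.
std::vector<double> recruitment_times(int n, double rate_per_month,
                                      std::mt19937& rng) {
    std::exponential_distribution<double> gap(rate_per_month);
    std::vector<double> times(n);
    double t = 0.0;
    for (int i = 0; i < n; ++i) { t += gap(rng); times[i] = t; }
    return times;
}

// Exponential survival time with a given median: the exponential rate
// parameter is log(2) / median.
double survival_time(double median_months, std::mt19937& rng) {
    std::exponential_distribution<double> surv(std::log(2.0) / median_months);
    return surv(rng);
}
```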
Selection of shortlisted designs
Table 1 lists the regularisation methods that were explored. Each possible combination of regularisation methods constituted a distinct design (thus, we explored 3900 different designs in total). Each design was simulated over the three main scenarios listed in Table 2:
- Null scenario: True median survival was 6 months in all 3 arms.
- Scenario A: One experimental treatment was effective; true median survival was 6 months in arm 1, 8.15 months in arm 2, and 6 months in arm 3.
- Scenario B: Both experimental treatments were effective; true median survival was 6 months in arm 1, and 8.15 months in arms 2 and 3.
For each scenario, operating characteristics such as average sample size were calculated by simulating the trial up to 100,000 times, in line with United States Food and Drug Administration recommendations24. Operating characteristics were considered desirable if they met each of the following selection criteria:
- Probability of success for each experimental arm was less than 0.025 in the null scenario, analogous to controlling the frequentist type I error.
- Probability of success for arm 2 was greater than 0.8 in scenario A.
- Probability of success for arms 2 and 3 were both greater than 0.8 in scenario B.
Among all designs satisfying all stated selection criteria, we shortlisted designs that, averaged across scenarios A and B, either maximised power or minimised the number of participants randomised to the control arm.
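The selection step amounts to a simple filter over the estimated operating characteristics. A C++ sketch, with struct and field names of our own invention:

```cpp
// Estimated probability of success for each experimental arm (arms 2
// and 3) in each scenario, for one candidate design.
struct OperatingCharacteristics {
    double p_null_arm2;  // null scenario, arm 2
    double p_null_arm3;  // null scenario, arm 3
    double p_A_arm2;     // scenario A, arm 2 (the effective arm)
    double p_B_arm2;     // scenario B, arm 2
    double p_B_arm3;     // scenario B, arm 3
};

// Apply the three selection criteria: type-I-error-like control in the
// null scenario, and power above 0.8 for each effective arm.
bool meets_criteria(const OperatingCharacteristics& oc) {
    return oc.p_null_arm2 < 0.025 && oc.p_null_arm3 < 0.025
        && oc.p_A_arm2 > 0.8
        && oc.p_B_arm2 > 0.8 && oc.p_B_arm3 > 0.8;
}
```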
Virtual re-execution of shortlisted designs
We re-executed each of the shortlisted designs using the real-world data from ANZ 0001. Participants were recruited at their original times, but arm allocations were re-randomised and clinical outcomes were randomly sampled with replacement from among the real outcomes of each respective arm. Bootstrapping was performed using 100,000 replicates. If the list of original recruitment times was exhausted in any virtual re-execution, additional recruitment times were generated according to a Poisson process, with an average recruitment rate equal to that observed for ANZ 0001.
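The outcome resampling described above is an ordinary bootstrap: for each virtual participant, an outcome is drawn with replacement from the observed outcomes of the arm to which they were re-randomised. A minimal C++ sketch (our own illustration; names are ours):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draw one bootstrap replicate of clinical outcomes: sample n values
// with replacement from the observed outcomes of the allocated arm.
std::vector<double> bootstrap_outcomes(const std::vector<double>& observed,
                                       int n, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, observed.size() - 1);
    std::vector<double> sample(n);
    for (int i = 0; i < n; ++i) sample[i] = observed[pick(rng)];
    return sample;
}
```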