Performance evaluation of interim analysis in bioequivalence studies.

doi:10.21203/rs.3.rs-3820940/v1


Research Article


This work is licensed under a CC BY 4.0 License

Journal Publication

published 24 May, 2024

Read the published version in Therapeutic Innovation & Regulatory Science →


Under current bioequivalence guidelines in Japan, bioequivalence must be established in a single pivotal study. Clinical trials with limited resources usually have a pre-defined maximum permissible number of participants. In this manuscript, we considered a trial design that allows bioequivalence to be evaluated at an interim analysis, with the total number of participants set to reflect the resource constraints. The available options for the interim analysis are group sequential designs and adaptive designs. A comparison of the performance of the two methods under a fixed maximum number of participants has not been conducted thus far, so we examined which method should be used by conducting a simulation study. Since bioequivalence is expected to be achieved at the interim analysis, a study design using a Pocock-type alpha spending function is preferable. Simulation results using a Pocock-type alpha spending function showed similar performance between group sequential and adaptive designs. Consequently, given the statistical and operational complexity of adaptive designs, group sequential designs are preferable for bioequivalence studies in Japan.

Japan guideline for bioequivalence study

group sequential design

adaptive design

Bioequivalence (BE) studies are a type of clinical trial conducted to compare the bioavailability of two products that are using the same active ingredient or molecule and verify how similar they are to each other. BE studies often utilize a cross-over study design for several compelling reasons. Firstly, this approach minimizes inter-subject variability by enabling each participant to serve as their own control. Consequently, the influence of individual differences (inter-subject variability) on the study results is greatly reduced. Secondly, given that each subject acts as their own control, fewer participants are necessary to attain the required statistical power. Lastly, this design enhances precision by mitigating the impact of variability within the study population. This heightened precision allows for more accurate and dependable conclusions to be drawn regarding bioequivalence.

In the context of BE studies, one of the simplest and most commonly employed designs is the 2x2 crossover design. A common method for bioequivalence assessment involves comparing different formulations of the same drug by analyzing pharmacokinetic parameters like area under the blood concentration curve (AUC) and maximum blood concentration (Cmax). The goal is to determine whether the observed differences between these formulations are statistically significant. Both the FDA and EMA recommend the use of single-dose 2-way crossover studies, typically involving at least 12 healthy volunteers, as a general practice for BE studies. In 2020, Japan implemented a significant revision to its guidelines for BE studies of generic products (PMDA, 2020). Prior to this revision, there existed various opportunities for BE evaluation, including the utilization of add-on subject studies without the need for multiplicity adjustment. However, post-revision, a notable shift occurred in the regulatory landscape. There emerged a stringent requirement to control the type 1 error, ensuring that it did not exceed a 5% threshold (one-sided) in a single pivotal BE study. As a consequence it is not acceptable to use pilot study data in the BE assessment (Q-10 (A)). Similarly, analyzing the pooled data obtained in the pivotal study and add-on subject studies conducted separately from the pivotal study is also not acceptable (Q-11 (A)). However, it is acceptable to design a protocol with the acquisition of additional data based on the results of interim analysis (Q-11 (A)). Taking these key points into consideration, the possible trial designs are either a group sequential design (GSD) or an adaptive design (AD).

Several manuscripts on two-stage designs in BE studies have been published. Potvin et al. (2008) evaluated four AD methods that use the assumed geometric mean ratio (GMR) and the observed coefficient of variation (CV) for the BE evaluation at the interim analysis. The two-stage designs they discussed offer insurance against a misspecified CV at the planning stage, resulting in an approximately 20% increase in average sample size when the planned CV is accurate. Karalis and Macheras (2013) adopted the best-performing method from Potvin et al. (2008), but used the observed GMR and CV at the interim analysis for sample size re-estimation and set an upper limit on the sample size. Their simulation results showed that the two-stage design keeps the type 1 error rate below 5%, likely owing to the upper sample size limit. Fuglsang (2014) investigated the impact of various futility rules on the type 1 error and power of the methods introduced by Potvin et al. (2008). Although these manuscripts focused only on adaptive designs, Kieser and Rauch (2015) compared GSD with AD and found that the power of GSD was similar to that of AD, while the average sample size of GSD was smaller. However, because the maximum sample sizes and the number of subjects targeted for interim analysis differed between the GSD and AD evaluated, it is necessary to compare their performance under matched conditions.

In actual clinical trials, it is anticipated that the maximum number of subjects for the trial is predefined due to limited resources. The maximum allowable number of subjects in a BE study is typically smaller than that in Phase 2 or Phase 3 studies. Our manuscript takes these circumstances into consideration and evaluates the performance of GSD and AD in BE studies, aiming to examine which method should be used. BE study designs allow for the possibility of declaring BE and stopping the study at the interim analysis timing if the actual within-subject variability is lower than expected or if the GMR is closer to 1 than anticipated.

In Section 2, we describe the methodology of GSD and AD applied for BE studies. Section 3 covers extensive simulation studies that reflect the scale of typical small-scale BE studies. We provide a brief discussion in Section 4.

When AUC and Cmax follow a log-normal distribution, the BE criterion for each parameter is defined as a ratio of the population geometric means of the test product and reference product, falling within the range of 0.80 to 1.25. BE is declared for a product if the 90% confidence interval of the difference in the average values of logarithmic parameters between the test and reference products is contained within the acceptable range of log(0.80) to log(1.25).

When the population geometric means, before log-transformation, of the parameter used to evaluate BE are denoted ${\mu }_{T}$ for the test product and ${\mu }_{R}$ for the reference product, the hypotheses of the BE test are:

$${H}_{0}: {\mu }_{T}/{\mu }_{R}\le {\theta }_{1} \text{ or } {\mu }_{T}/{\mu }_{R}\ge {\theta }_{2}$$

$${H}_{1}: {\theta }_{1}<{\mu }_{T}/{\mu }_{R}<{\theta }_{2}$$

The bioequivalence margins (${\theta }_{1}$, ${\theta }_{2}$) are ${\theta }_{1}=0.80$ and ${\theta }_{2}=1.25$. The null hypothesis can be evaluated with two one-sided t-tests, each at a 5% significance level. BE is declared when both of the following inequalities hold:

$$\frac{\log\left({\theta }_{2}\right)-\left({\bar{X}}_{T}-{\bar{X}}_{R}\right)}{s\sqrt{2/n}}>t\left(1-\alpha ,n-2\right)$$

$$\frac{\left({\bar{X}}_{T}-{\bar{X}}_{R}\right)-\log\left({\theta }_{1}\right)}{s\sqrt{2/n}}>t\left(1-\alpha ,n-2\right) \qquad \text{(1)}$$

Here ${\bar{X}}_{T}$ and ${\bar{X}}_{R}$ are the means of the log-transformed parameter for the test and reference products, $\alpha$ is the significance level, $n$ is the total sample size, $s$ is the standard deviation of the within-subject error, and $t(1-\alpha ,n-2)$ is the $(1-\alpha)$ quantile of the t distribution with $n-2$ degrees of freedom.
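The two one-sided tests are equivalent to checking that the $100(1-2\alpha)\%$ confidence interval for the GMR lies strictly inside the margins. The following is a minimal Python sketch of that check, not the authors' implementation. The function names are hypothetical, and because the Python standard library has no t quantile function, the critical value `t_crit` is assumed to be supplied by the caller from a table or a statistics package (for example, roughly 1.686 for $\alpha = 0.05$ and 38 degrees of freedom).

```python
from math import exp, log, sqrt


def be_confidence_interval(mean_log_diff, s_within, n_total, t_crit):
    """Confidence interval for the GMR on the original (ratio) scale.

    mean_log_diff: difference of log-scale means, test minus reference.
    s_within:      within-subject standard deviation on the log scale.
    n_total:       total number of subjects in the 2x2 crossover.
    t_crit:        t(1 - alpha, n_total - 2), supplied by the caller (assumption).
    """
    standard_error = s_within * sqrt(2.0 / n_total)
    half_width = t_crit * standard_error
    return exp(mean_log_diff - half_width), exp(mean_log_diff + half_width)


def is_bioequivalent(mean_log_diff, s_within, n_total, t_crit,
                     theta1=0.80, theta2=1.25):
    """True when both one-sided tests reject, i.e. the CI is inside (theta1, theta2)."""
    lower, upper = be_confidence_interval(mean_log_diff, s_within, n_total, t_crit)
    return theta1 < lower and upper < theta2


if __name__ == "__main__":
    # Illustrative values only: observed GMR 0.95, s = 0.25, n = 40.
    print(be_confidence_interval(log(0.95), 0.25, 40, 1.686))
    print(is_bioequivalent(log(0.95), 0.25, 40, 1.686))
```

Rearranging either inequality in (1) gives the corresponding confidence bound, which is why the interval check and the pair of t-tests lead to the same decision.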

In this manuscript we consider study designs that allow for the termination of the study if the BE criteria are met at the interim analysis. In the following section, we will explain the application of GSD and AD to the study design.

2.1 Group sequential design

Given the multiple hypothesis tests conducted in GSDs, it becomes necessary to adjust the significance level for each analysis to maintain the overall significance level 𝛼. These repeated analyses incorporate data from earlier interim analyses, resulting in correlated test statistics within GSDs. Various strategies exist for determining the interim-wise significance levels in GSDs. An early approach, as proposed by Pocock (1977), is to utilize the same level for all analyses. Another option is to select more stringent significance levels at earlier time points and less stringent levels at later stages, a concept put forth by O'Brien and Fleming (1979). Additionally, predefined alpha-spending functions for various fractions of the total sample size, as outlined by Lan and DeMets (1983), provide diverse methods for establishing appropriate local levels. The spending functions which approximate Pocock and O’Brien-Fleming designs have the following forms:

Pocock-type function: $\alpha \times \log\left(1+\left(e-1\right)t\right)$ (2)

O’Brien-Fleming-type function: $2\left(1-{\Phi }\left(\frac{{{\Phi }}^{-1}\left(1-\frac{\alpha }{2}\right)}{\sqrt{t}}\right)\right)$ (3)

Here $\alpha$ is the overall significance level of the study, $t$ ($0<t\le 1$) is the information fraction, i.e. the fraction of the total sample size available at the analysis, $e$ is Napier’s constant and ${\Phi }$ is the cumulative standard normal distribution function. Both functions equal $\alpha$ at $t=1$. BE is evaluated at the interim analysis using the resulting significance level in the t-tests in (1).
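To make the difference between the two spending functions concrete, the sketch below evaluates the standard Lan-DeMets forms, $\alpha \log(1+(e-1)t)$ and $2(1-\Phi(\Phi^{-1}(1-\alpha/2)/\sqrt{t}))$, using only the Python standard library. The function names are hypothetical. Note that these values are the cumulative alpha spent up to information fraction $t$; converting them into critical values for a later analysis requires accounting for the correlation between analyses, which this sketch does not do.

```python
from math import e, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def pocock_type_spending(alpha, t):
    """Cumulative alpha spent at information fraction t (0 < t <= 1)."""
    return alpha * log(1.0 + (e - 1.0) * t)


def obrien_fleming_type_spending(alpha, t):
    """Cumulative alpha spent at information fraction t (0 < t <= 1)."""
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z / sqrt(t)))


if __name__ == "__main__":
    for t in (0.25, 0.5, 0.75, 1.0):
        print(t,
              round(pocock_type_spending(0.05, t), 4),
              round(obrien_fleming_type_spending(0.05, t), 4))
```

At $t = 0.5$ and $\alpha = 0.05$ the Pocock-type function has already spent about 0.031, whereas the O’Brien-Fleming-type function has spent less than 0.006, which is why early stopping for BE is much more likely under the Pocock-type function.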

2.2 Adaptive design

Adaptive designs allow a flexible modification of design characteristics during an ongoing study while at the same time controlling the overall type I error rate. Adaptive designs offer the option of mid-course sample size recalculation based on interim results. There exist various approaches to construct adaptive designs that control the type I error rate also in case of sample size recalculation. One standard method is the inverse normal approach (Cui L et al., 1999; Lehmacher W and Wassmer G, 1999; Kieser and Rauch, 2015). In essence, it transforms p-values (${p}_{1}$ and ${p}_{2}$) from each stage of a two-stage design. When these p-values follow a uniform distribution under the null hypothesis, the inverse normal transformation converts them into standard normal random variables. For the inverse-normal combination test (Wassmer and Brannath, 2016), the test statistics ${Z}_{1}$ at the interim analysis and ${Z}_{2}$ at the final analysis are

$${T}_{1}^{*}: {Z}_{1}={{\Phi }}^{-1}(1-{p}_{1})$$

$${Z}_{2}={{\Phi }}^{-1}(1-{p}_{2})$$

Here ${T}_{1}^{*}$ denotes the test statistic used at the interim analysis, and ${p}_{1}$ and ${p}_{2}$ are the p-values from the first and second stage, respectively. The overall test statistic $Z$, denoted ${T}_{2}^{*}$, is:

$${T}_{2}^{*}: Z=\sqrt{w}{Z}_{1}+\sqrt{1-w}{Z}_{2}$$

In this manuscript, we set the weight to $w={n}_{1}/({n}_{1}+{n}_{2})$, so that $\sqrt{w}=\sqrt{{n}_{1}/({n}_{1}+{n}_{2})}$. $Z$ follows a standard normal distribution under the null hypothesis, so the critical values used in the GSD can also be used in the AD. Consequently, the probability of terminating the study at the interim analysis in the AD is the same as that in the GSD.
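The combination rule can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function name is hypothetical, and it assumes information-proportional weights $w = n_1/(n_1+n_2)$ based on the planned stage sizes, which is the choice consistent with the conditional power expression given in this section.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def inverse_normal_statistic(p1, p2, n1, n2):
    """Combine stage-wise p-values into one Z statistic.

    p1, p2: one-sided p-values from stage 1 and stage 2 data (each in (0, 1)).
    n1, n2: planned stage sizes, fixed in advance; they define the weights
            even if the actual second-stage size is later re-estimated.
    """
    w = n1 / (n1 + n2)  # assumed weight: planned information fraction
    z1 = _STD_NORMAL.inv_cdf(1.0 - p1)
    z2 = _STD_NORMAL.inv_cdf(1.0 - p2)
    return sqrt(w) * z1 + sqrt(1.0 - w) * z2


if __name__ == "__main__":
    # Two equally sized stages, each with a one-sided p-value of 0.025.
    print(inverse_normal_statistic(0.025, 0.025, 20, 20))
```

Because the weights are fixed at the planning stage, the combined statistic remains standard normal under the null hypothesis even when the second-stage sample size is changed at the interim analysis.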

Sample size re-estimation is a characteristic feature of AD, and involves adjusting the sample size during an ongoing clinical study based on the accumulating data to preserve or increase power. The conditional power that significance is achieved at the second stage conditioned on the test statistics at the first stage (at the interim analysis) is:

$$\text{Pr}\left({T}_{2}^{*}>{c}_{2}∣{T}_{1}^{*}\right)=\text{Pr}\left(\sqrt{w}{{\Phi }}^{-1}\left(1-{p}_{1}\right)+\sqrt{1-w}{{\Phi }}^{-1}\left(1-{p}_{2}\right)>{c}_{2}∣{p}_{1}\right) =\text{Pr}\left\{{p}_{2}<1-{\Phi }\left(\sqrt{\frac{{n}_{1}+{n}_{2}}{{n}_{2}}}{c}_{2}-\sqrt{\frac{{n}_{1}}{{n}_{2}}}{{\Phi }}^{-1}\left(1-{p}_{1}\right)\right)∣{p}_{1}\right\}$$

where ${c}_{2}$ is the critical value at the final analysis. From (5), the significance level at stage 2 (${\alpha }^{*}$) can be written as:

$$1-{\Phi }\left(\sqrt{\frac{{n}_{1}+{n}_{2}}{{n}_{2}}}{c}_{2}-\sqrt{\frac{{n}_{1}}{{n}_{2}}}{{\Phi }}^{-1}\left(1-{p}_{1}\right)\right).$$
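As a numerical illustration of this expression, the sketch below computes the stage-2 significance level ${\alpha }^{*}$ from the interim p-value. The function name and the example critical value of 1.96 are assumptions made for illustration only; in practice ${c}_{2}$ comes from the chosen alpha spending function.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def stage2_significance_level(p1, n1, n2, c2):
    """Significance level alpha* that the stage-2 p-value must fall below.

    p1: one-sided p-value observed at the interim analysis (in (0, 1)).
    n1, n2: planned first- and second-stage sample sizes.
    c2: critical value of the combined statistic at the final analysis.
    """
    z1 = _STD_NORMAL.inv_cdf(1.0 - p1)
    threshold = sqrt((n1 + n2) / n2) * c2 - sqrt(n1 / n2) * z1
    return 1.0 - _STD_NORMAL.cdf(threshold)


if __name__ == "__main__":
    # Hypothetical inputs: equal stages, c2 = 1.96, and three interim p-values.
    for p1 in (0.5, 0.1, 0.03):
        print(p1, stage2_significance_level(p1, 20, 20, 1.96))
```

A smaller interim p-value gives a larger ${\alpha }^{*}$: promising first-stage data leave the second stage with less to prove.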

Given ${\alpha }^{*}$, the conditional power for a second-stage sample size ${n}_{2}^{*}$ can be calculated using the following formula.

$${F}_{t}\left(\frac{\log\left(1.25/\text{GMR}\right)}{{\sigma }_{W}\sqrt{2/{n}_{2}^{*}}}-{t}_{1-{\alpha }^{*},{n}_{2}^{*}-2},{n}_{2}^{*}-2\right)-{F}_{t}\left(\frac{-\log\left(1.25\times \text{GMR}\right)}{{\sigma }_{W}\sqrt{2/{n}_{2}^{*}}}+{t}_{1-{\alpha }^{*},{n}_{2}^{*}-2},{n}_{2}^{*}-2\right)$$

Here ${F}_{t}(\cdot ,\nu )$ is the cumulative distribution function of the t distribution with $\nu$ degrees of freedom, ${t}_{1-{\alpha }^{*},\nu }$ is its $(1-{\alpha }^{*})$ quantile, and ${\sigma }_{W}$ is the within-subject standard deviation. GMR is calculated from the data at the interim analysis, and we search for ${n}_{2}^{*}$ for which the conditional power above exceeds $1-\beta$, subject to ${n}_{2}^{\text{min}}\le {n}_{2}^{*}\le {n}_{2}^{\text{max}}$, where ${n}_{2}^{\text{max}}$ is the maximum allowable sample size.
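The search for ${n}_{2}^{*}$ can be sketched as a simple loop. The sketch below is deliberately simplified and rests on several assumptions that are not in the manuscript: it replaces the t distribution in the conditional power formula with a standard normal approximation (so that it runs with the Python standard library alone), it returns the smallest candidate that reaches the target, it steps through even sample sizes only, and it falls back to the maximum when the target is never reached. All function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # normal approximation swapped in for the t distribution


def cv_to_sigma_w(cv):
    """Within-subject SD on the log scale from a CV (log-normal relationship)."""
    return sqrt(log(1.0 + cv ** 2))


def conditional_power_approx(gmr, sigma_w, n2, alpha_star):
    """Normal-approximation analogue of the conditional power formula."""
    standard_error = sigma_w * sqrt(2.0 / n2)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha_star)
    upper = _STD_NORMAL.cdf(log(1.25 / gmr) / standard_error - z_crit)
    lower = _STD_NORMAL.cdf(-log(1.25 * gmr) / standard_error + z_crit)
    return max(0.0, upper - lower)


def reestimate_n2(gmr, sigma_w, alpha_star, target_cp, n2_min, n2_max, step=2):
    """Smallest n2 in [n2_min, n2_max] reaching target_cp, else n2_max."""
    for n2 in range(n2_min, n2_max + 1, step):
        if conditional_power_approx(gmr, sigma_w, n2, alpha_star) >= target_cp:
            return n2
    return n2_max  # assumption: cap at the maximum when the target is unattainable


if __name__ == "__main__":
    # Hypothetical interim estimates: GMR = 1.0, CV = 0.2, alpha* = 0.05.
    print(reestimate_n2(1.0, cv_to_sigma_w(0.2), 0.05, 0.80, 2, 100))
```

With small second-stage sizes the normal approximation is optimistic relative to the t-based formula, so an exact implementation would tend to return somewhat larger values.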

3.1 Simulation study design

To understand the performance of the GSD and the AD in a small-scale clinical study, we conducted an extensive simulation study. To simplify the discussion, the simulations consider only a single endpoint, i.e. AUC or Cmax. We set GMRs no greater than 1 because performance is expected to be symmetric around 1 on the log scale (e.g. a GMR of 0.8 would give results equivalent to a GMR of 1.25). Table 1 shows the sample size ($n$) calculated to ensure 90% power under various assumed GMRs (GMR*) and CVs (CV*).

Table 1. Sample size (n) to ensure 90% of power under various GMR* and CV*.

	CV*
GMR*	0.2	0.25	0.3	0.35	0.4
0.9	52	78	110	146	186
0.95	26	38	52	70	90
1	20	30	40	54	68

As mentioned in Section 1, BE must be demonstrated in a single study, so the assumed GMR (GMR*) and CV (CV*) used for the sample size calculation are chosen somewhat conservatively. We then assume a situation where the true GMR (GMR0) is closer to 1 or the true CV (CV0) is smaller than assumed, so that BE is expected to be achieved at the interim analysis. With this in mind, we set up three scenarios. Scenario A uses an incorrectly assumed CV (CV*≠CV0) but a correctly assumed GMR (GMR*=GMR0) for the sample size calculation. Scenario B uses an incorrectly assumed GMR (GMR*≠GMR0) but a correctly assumed CV (CV*=CV0). Scenario C uses both an incorrectly assumed GMR (GMR*≠GMR0) and an incorrectly assumed CV (CV*≠CV0).

[Scenario A]

GMR*=GMR0 = 0.9, 0.95, 1
CV*=0.3, 0.4
CV0 = 0.2, 0.25, 0.3, 0.35, 0.4

[Scenario B]

CV*=CV0 = 0.3, 0.4
GMR*=0.9, 0.95, 1
GMR0 = 0.9, 0.95, 1

[Scenario C]

GMR*=0.9, 0.95, 1
GMR0 = 0.9, 0.95, 1
CV*=0.3, 0.4
CV0 = 0.2, 0.25, 0.3, 0.35, 0.4
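For readers who want to reproduce the simulation grid, the parameter combinations listed above can be enumerated as follows. This is a sketch under stated assumptions: the dictionary keys and function names are hypothetical, and the full Cartesian product of the listed values is generated, including combinations in which the assumed and true values happen to coincide.

```python
from itertools import product

GMR_LEVELS = (0.9, 0.95, 1.0)           # values used for both GMR* and GMR0
CV_ASSUMED = (0.3, 0.4)                 # CV*
CV_TRUE = (0.2, 0.25, 0.3, 0.35, 0.4)   # CV0


def _setting(gmr_assumed, gmr_true, cv_assumed, cv_true):
    return {"gmr_assumed": gmr_assumed, "gmr_true": gmr_true,
            "cv_assumed": cv_assumed, "cv_true": cv_true}


def scenario_a():
    """GMR* = GMR0; CV* and CV0 vary independently."""
    return [_setting(g, g, cv_a, cv_t)
            for g, cv_a, cv_t in product(GMR_LEVELS, CV_ASSUMED, CV_TRUE)]


def scenario_b():
    """CV* = CV0; GMR* and GMR0 vary independently."""
    return [_setting(g_a, g_t, cv, cv)
            for cv, g_a, g_t in product(CV_ASSUMED, GMR_LEVELS, GMR_LEVELS)]


def scenario_c():
    """GMR*, GMR0, CV* and CV0 all vary independently."""
    return [_setting(g_a, g_t, cv_a, cv_t)
            for g_a, g_t, cv_a, cv_t
            in product(GMR_LEVELS, GMR_LEVELS, CV_ASSUMED, CV_TRUE)]


if __name__ == "__main__":
    print(len(scenario_a()), len(scenario_b()), len(scenario_c()))
```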

In the simulation, we compared three methods: fixed design (FD), GSD and AD. The Pocock-type and O’Brien-Fleming-type alpha spending functions proposed by Lan and DeMets (1983) are used for GSD and AD, and the interim analysis is planned at 50% of the original sample size (${n}_{1}$). The sample size for GSD after the interim analysis (${n}_{2}=n-{n}_{1}$) does not change, whereas the sample size for AD after the interim analysis (${n}_{2}^{*}$: ${n}_{2}^{\text{min}}\le {n}_{2}^{*}\le {n}_{2}^{\text{max}}$) is recalculated so that the conditional power exceeds 80% or 90%. Considering the available resources, the maximum allowable sample size ${n}_{2}^{\text{max}}$ is ${n}_{2}$ and ${n}_{2}^{\text{min}}$ is 2. For each simulation, we evaluated the overall power (%) and the average sample size of FD, GSD and AD.

3.2 Simulation results

First, we discuss the simulation results using the Pocock-type alpha spending function. Figure 1 summarizes overall powers and average sample sizes by method (FD, GSD and AD (CP = 80%, 90%)) in Scenario A (Pocock). The figure for powers at the interim analysis is provided in the appendix, as are the corresponding figures for the other simulation scenarios. Overall power and average sample size are highest in FD, and those in GSD are almost the same as those in AD (CP = 80%, 90%). Figure 2 summarizes overall powers and average sample sizes by method in Scenario B (Pocock). As in Fig. 1, overall powers and average sample sizes are highest in FD, and those in GSD are almost the same as those in AD (CP = 80%, 90%).

From another perspective, we consider the results when the true CV0 was smaller than the assumed CV* (e.g., CV0 = 0.2 & CV*=0.4 in Fig. 1) or when the true GMR0 was closer to 1 than the assumed GMR* (e.g., GMR0 = 1 & GMR*=0.9 in Fig. 2). At this time, the powers in FD, GSD, and AD (CP = 80%, 90%) are almost the same, but the average sample size in FD is larger than that in GSD and AD (CP = 80%, 90%), and the difference became more variable. In Fig. 3 for overall powers and average sample sizes by method in Scenario C (Pocock), the findings are generally supported by results under incorrect assumed GMR and incorrect assumed CV (e.g. CV0 = 0.2 & CV*=0.4 & GMR0 = 1 & GMR*=0.9 in Fig. 3). This is thought to be because the probability of achieving BE has increased at the time of the interim analysis.

Next, we focus on the simulation results using the O’Brien-Fleming-type alpha spending function. Figure 4 summarizes overall powers and average sample sizes by method in Scenario A (O’Brien-Fleming). Overall powers are highest in FD and almost the same in GSD, followed by AD (CP = 90%) and then AD (CP = 80%). Average sample sizes are highest in FD, followed by GSD, AD (CP = 90%) and AD (CP = 80%). Figure 5 summarizes overall powers and average sample sizes by method in Scenario B (O’Brien-Fleming). The order of overall powers and average sample sizes across methods was almost the same as in Fig. 4. Similar results were obtained when the true CV0 was smaller than the assumed CV*, or when the true GMR0 was closer to 1 than the assumed GMR*. In Fig. 6, which shows overall powers and average sample sizes by method in Scenario C (O’Brien-Fleming), the findings are generally supported by the results under an incorrectly assumed GMR and an incorrectly assumed CV. However, in some cases (e.g. CV0 = 0.2 & CV*=0.4 & GMR0 = 1 & GMR*=0.9 in Fig. 6), the powers in FD, GSD, and AD (CP = 80%, 90%) are almost the same, but the average sample size in FD is larger than that in GSD and AD (CP = 80%, 90%) because the probability of achieving BE at the interim analysis is higher.

We now consider the reasons for the difference in performance between the Pocock-type and O'Brien-Fleming-type alpha spending functions. The difference can be attributed to the more stringent BE criterion at the interim analysis under the O'Brien-Fleming approach compared with the Pocock approach. Consequently, under the O'Brien-Fleming approach, the GSD is more likely to proceed to the final analysis. On the other hand, when the CP of the AD is calculated from the interim analysis data, a CP of 80% or 90% is more easily attained. This increases the number of cases in which ${n}_{2}^{*}<{n}_{2}^{\text{max}}$, which in turn leads to a small reduction in both the overall power and the average sample size in AD.

Under current bioequivalence guidelines in Japan, it is mandatory to establish bioequivalence using a single pivotal study. Also, there is typically a predetermined maximum allowable number of subjects in clinical trials with limited resources. In this manuscript, we set the total number of subjects in the clinical trial conservatively, taking into account resource constraints, and considered a trial design that would allow for bioequivalence evaluation at an interim analysis.

The results of our simulation study indicated that when the O'Brien-Fleming-type alpha spending function is used at the interim analysis, both the overall power and the average number of subjects in the group sequential design tended to be higher than those in the adaptive design. On the other hand, because bioequivalence is expected to be achieved at the interim analysis, a study design using a Pocock-type alpha spending function would be preferred.

Simulation results for the Pocock-type alpha spending function showed little difference in performance between the group sequential design and the adaptive design. Therefore, considering the statistical and operational complexity of adaptive designs, it may be preferable to choose group sequential designs for bioequivalence studies in Japan.

Author contributions

All of the authors listed made substantial contributions to the analysis and interpretation of the data described in this paper. All of the authors commented on and revised previous versions of this manuscript before reviewing and approving this final version. The authors agree to be accountable for all aspects of the work, ensuring that any questions related to the accuracy or integrity of the work are appropriately investigated and resolved.

Funding statement

This research was not supported by research funds from UCB Japan Co., UCB S.A., or the University of Tsukuba, to which the authors belong. The opinions expressed in this manuscript are solely those of the authors and do not express the views or opinions of our companies.

Conflict of Interest statement

No potential conflicts were declared.

Potvin, D., DiLiberti, C.E., Hauck, W.W., Parr, A.F., Schuirmann, D.J. and Smith, R.A. (2008). Sequential design approaches for bioequivalence studies with crossover designs. Pharmaceutical Statistics, 7, 245–262.
Karalis, V. and Macheras, P. (2013). An insight into the properties of a two-stage design in bioequivalence studies. Pharmaceutical Research, 30, 1824–1835.
Fuglsang, A. (2014). Futility rules in bioequivalence trials with sequential designs. The AAPS Journal, 16, 79-82.
Karalis, V. and Macheras, P. (2014). On the statistical model of the two-stage designs in bioequivalence assessment. Journal of Pharmacy and Pharmacology, 66, 48–52.
Kieser, M. and Rauch, G. (2015). Two-stage designs for crossover bioequivalence trials. Statistics in Medicine, 34, 2403–2416.
Schütz, H. (2015). Two-stage designs in bioequivalence trials. European Journal of Clinical Pharmacology, 71, 271–281.
Pocock, S.J. (1977). Group sequential methods in the design and analysis of clinical trials. Biometrika, 64, 191–199.
O’Brien, P.C. and Fleming, T.R. (1979). A multiple testing procedure for clinical trials. Biometrics, 35, 549–556.
Lan, K.K.G. and DeMets, D.L. (1983). Discrete sequential boundaries for clinical trials. Biometrika, 70, 659–663.

Supplementarymaterial.docx


Reviewers agreed at journal
28 Jan, 2024
Reviewers invited by journal
22 Jan, 2024
Editor assigned by journal
02 Jan, 2024
First submitted to journal
28 Dec, 2023
